Improved LLM unit tests for UNFLEE (trying to prevent failures for brave archer)

Status    Job ID                                        Name           Coverage

Build
canceled  #42136 (minecraft)                            build_mod

Test
skipped   #42137 (minecraft, allowed to fail, manual)   gpt-3.5-turbo
skipped   #42138 (minecraft, allowed to fail, manual)   gpt-4o
skipped   #42139 (minecraft, allowed to fail, manual)   llama3-8b