New experimental prompt strategy to only show valid and invalid combinations of…

New experimental prompt strategy that shows only valid and invalid combinations of behavior emojis, instead of example phrases. Works well with gpt-4o-mini and llama3-70b, but does not work well with gpt-3.5-turbo. Also added negative checks for behaviors to the LLM unit tests (FOLLOW and not ATTACK).
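As a rough illustration of what a negative behavior check (FOLLOW and not ATTACK) might look like, here is a minimal Python sketch. It is not the project's actual test code: the function names `extract_behaviors` and `check_behaviors`, the `<FOLLOW>` tag format in the sample responses, and the contents of `KNOWN_BEHAVIORS` are all assumptions made for this example. Only the behavior names FOLLOW and ATTACK come from the commit message.

```python
import re
from typing import Set

# Assumption: behaviors appear in the LLM response as uppercase words.
# Only FOLLOW and ATTACK are named in the commit message.
KNOWN_BEHAVIORS = {"FOLLOW", "ATTACK"}


def extract_behaviors(response: str) -> Set[str]:
    """Return the known behavior names that occur as whole uppercase words."""
    words = set(re.findall(r"\b[A-Z]+\b", response))
    return words & KNOWN_BEHAVIORS


def check_behaviors(response: str, expected: Set[str], forbidden: Set[str]) -> bool:
    """Positive and negative check in one call.

    Passes only if every expected behavior is present AND
    no forbidden behavior is present.
    """
    found = extract_behaviors(response)
    return expected <= found and not (forbidden & found)


# Hypothetical responses, written for this example only.
ok = check_behaviors("Sure, lead the way! <FOLLOW>", {"FOLLOW"}, {"ATTACK"})
bad = check_behaviors("<FOLLOW> <ATTACK>", {"FOLLOW"}, {"ATTACK"})
```

The point of the negative half is that a response which triggers the expected behavior but also triggers an unwanted one should fail the test, which a positive-only check would miss.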
4 jobs from emoji-behaviors in 1 minute 49 seconds (queued for 7 minutes 58 seconds)
Stage  Status  Job ID  Name           Labels                              Duration
Build  passed  #41987  build_mod      minecraft                           01:49
Test   manual  #41988  gpt-3.5-turbo  minecraft, allowed to fail, manual
Test   manual  #41989  gpt-4o         minecraft, allowed to fail, manual
Test   manual  #41990  llama3-8b      minecraft, allowed to fail, manual