New experimental prompt strategy to only show valid and invalid combinations of…

New experimental prompt strategy that shows only valid and invalid combinations of behavior emojis, instead of example phrases. Works well with gpt-4o-mini and llama3-70b. Does not work well with gpt-3.5-turbo. Also added negative checks to the LLM unit tests for behaviors (for example, expect FOLLOW and not ATTACK).
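A negative behavior check of the kind described above could look roughly like the following. This is only a sketch, not the project's actual test code: the `<FOLLOW>`-style tag format, the `KNOWN_BEHAVIORS` set, and the helper names `extract_behaviors` and `check_reply` are all assumptions made for illustration. The source names only the FOLLOW and ATTACK behaviors.

```python
import re

# Hypothetical vocabulary: the commit message only mentions FOLLOW and ATTACK.
KNOWN_BEHAVIORS = {"FOLLOW", "ATTACK"}


def extract_behaviors(reply):
    """Return the known behavior names found as <NAME> tags in an LLM reply.

    The angle-bracket tag format is an assumption for this sketch.
    """
    found = re.findall(r"<([A-Z]+)[^>]*>", reply)
    return {name for name in found if name in KNOWN_BEHAVIORS}


def check_reply(reply, must_have, must_not_have):
    """Pass only if every expected behavior is present (positive check)
    and no forbidden behavior is present (negative check)."""
    behaviors = extract_behaviors(reply)
    missing = set(must_have) - behaviors
    forbidden = set(must_not_have) & behaviors
    return not missing and not forbidden
```

With these helpers, a test for a "follow me" prompt would call `check_reply(reply, {"FOLLOW"}, {"ATTACK"})`, so a reply that triggers both behaviors fails even though the expected one is present.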
4 jobs from emoji-behaviors in 2 minutes 34 seconds (queued for 1 second)
Stage  Status  Job ID                                       Name           Duration
Build  passed  #41014 (minecraft)                           build_mod      02:34
Test   manual  #41015 (minecraft, allowed to fail, manual)  gpt-3.5-turbo
Test   manual  #41016 (minecraft, allowed to fail, manual)  gpt-4o
Test   manual  #41017 (minecraft, allowed to fail, manual)  llama3-8b