<aside> 💡
Prompts for evaluating the frontier capabilities of Large Language Models.
I use a selection of these prompts to probe the limits of a new model's performance.
Some of these prompts are only useful for testing smaller models because the frontier models always perform well on them.
By Dominik Lukeš. Updated 29 Mar 2025.
</aside>
All models fail at these tasks most of the time.
T t i s m
h e t e e
i x   c s
s t o r s
    n e a
h   e t g
a       e
s
  i
  n
V t m s L
e e a e L
r x k n M
t t e s s
i   s e
c
a n t
l o o
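The blocks above can be reproduced programmatically. Here is a simplified sketch that writes each word of a sentence straight down its own column (the originals also stack extra words lower in a column); the function name and spacing choices are mine, not from the original prompts:

```python
# Render a sentence with each word written vertically in its own column,
# in the style of the vertical-text prompts above.
def verticalize(sentence: str) -> str:
    words = sentence.split()
    height = max(len(w) for w in words)  # rows needed = length of longest word
    rows = []
    for r in range(height):
        # take the r-th letter of each word, or a space once the word has ended
        rows.append(" ".join(w[r] if r < len(w) else " " for w in words))
    return "\n".join(row.rstrip() for row in rows)

print(verticalize("Vertical text makes sense"))
```

Feeding the output back to a model is a quick way to generate fresh variants of this test without reusing a layout that may already be in training data.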
<aside> 💡
This prompt asks the model to construct plausible contexts for an unusual figure-ground arrangement: we normally locate the statue relative to the cathedral, not the other way around.
</aside>
Come up with contexts in which the sentence "The cathedral is in front of the statue" makes sense.
<aside> 💡
This single prompt tests the model's knowledge of readability formulas, its ability to implement them, and its capacity to output enough code to achieve this. It is also a good prompt for testing coding agents. Claude 3.7 Sonnet performs best at this in a single shot, with Gemini 2.5 a close second.
</aside>
Make a detailed, visually interesting readability analysis tool.
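As a reference point for judging a model's attempt, here is a minimal sketch of one formula such a tool would need to implement, Flesch Reading Ease; the syllable counter is a rough vowel-group heuristic of my own, not taken from any model's answer:

```python
import re

def count_syllables(word: str) -> int:
    # Approximate syllables as runs of vowels (including y); at least 1 per word.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    # Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

print(round(flesch_reading_ease("The cat sat on the mat."), 1))
```

A good model response should get the formula constants right and go beyond this heuristic, for example by handling silent final e or using a pronunciation dictionary.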
<aside> 💡
Most models cannot output text longer than about 3,000 words (even if the actual limit is higher). This prompt tests their ability to maintain coherence over the length of the output. Claude 3.7 Sonnet and Gemini 2.5 are the best at this, with Gemini 2.5 being the only one that can output the full 7,000 words, though it struggles to maintain academic style throughout.
</aside>
Write a full 7,000 word paper in the academic style including references about the dangers of placing the Earth's strategic icecream stockpile in the Sahara. Yes, this is a parody, but I want a high level of verisimilitude, including the length.
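When scoring responses to this prompt, a simple whitespace word count is enough to check the length claim; the variable names here are illustrative:

```python
# Approximate word count of a model's output by splitting on whitespace.
def word_count(text: str) -> int:
    return len(text.split())

draft = "word " * 2500          # stand-in for a saved model response
print(word_count(draft))        # well short of the requested 7,000
```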