<aside> 💡

Prompts to use to evaluate frontier capabilities of Large Language Models.

I use a selection of these prompts to probe the limits of a new model's performance.

Some of these prompts are now useful only for testing smaller models, because frontier models consistently handle them well.

By Dominik Lukeš. Updated 29 Mar 2025.

</aside>

Prompts to test spatial cognition

Recognition of vertical text

All models fail at these tasks most of the time.

T   t   i   s   m
h   e   t   e   e
i   x       c   s
s   t   o   r   s
        n   e   a
    h   e   t   g
    a           e
    s            

    i            
    n            

V t m s L
e e a e L
r x k n M
t t e s s
i   s e
c
a   n t
l   o o

I  t  e  d  t  t  t  l  g  m  L  M  t  t
f  h  n  i  h  e  h  o  e  o  a  o  o  h
   i  o  s  e  x  i  n  t  r  r  d     i
y  s  u  c  r  t  s  g  s  e  g  e  b  s
o     g  o  e     .  e        e  l  e  .
u  c  h  v     h     r  i  c            
   a     e  i  i  A     n  h  L  b  b   
l  r  y  r  s  d  s  a  c  a  a  u  e   
o  e  o        d     n  r  l  n  t  s   
o  f  u  t  a  e  i  d  e  l  g     t   
k  u     h  n  n  t     a  e  u  o      
   l  w  a           l  s  n  a  3  a   
a  l  i  t  a  i  g  o  i  g  g     t   
t  y  l     c  n  e  n  n  i  e  s      
      l     t     t  g  g  n     e      
            u     s  e  l  g     e      
            a        r  y        m      
            l              f     s      
                     i     o            
                     t     r            
                                        
                           a            
                           n            
                           y            
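
The layouts above can be produced mechanically, which makes it easy to create fresh variants a model has not seen before. The following is a minimal sketch, not necessarily how the prompts on this page were made: the function names `to_vertical` and `from_vertical` are hypothetical, and it assumes every column is a short run of words written top to bottom with a fixed gap between columns.

```python
from itertools import zip_longest


def to_vertical(columns: list[str], gap: str = "   ") -> str:
    """Lay out each string in `columns` as one column read top to bottom."""
    # Transpose: row r holds the r-th character of every column,
    # padded with spaces where a column has already ended.
    rows = zip_longest(*columns, fillvalue=" ")
    return "\n".join(gap.join(row).rstrip() for row in rows)


def from_vertical(grid: str, gap: str = "   ") -> str:
    """Recover the text by reading each column top to bottom (the answer key)."""
    step = len(gap) + 1  # distance between the starts of adjacent columns
    lines = grid.split("\n")
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    columns = [
        "".join(line[x] for line in padded).strip()
        for x in range(0, width, step)
    ]
    return " ".join(columns)


if __name__ == "__main__":
    grid = to_vertical(["Vertical", "text", "makes no", "sense to", "LLMs"], gap=" ")
    print(grid)
    print(from_vertical(grid, gap=" "))
```

With a single-space gap, the example call reproduces the second prompt in this section, and `from_vertical` gives the sentence a model should read back.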

Reasoning about spatial relationships encoded in language

<aside> 💡

This prompt asks the model to

</aside>

Come up with contexts in which the sentence "The cathedral is in front of the statue" makes sense.

Visual reasoning

Make an SVG of a bicycle.

Prompts to test coding capability and reasoning

<aside> 💡

This single prompt tests the model's knowledge of readability formulas, its ability to implement them, and its ability to output enough code to produce a working tool. It is also a good prompt for testing coding agents. Claude 3.7 Sonnet performs best at this in a single shot, with Gemini 2.5 a close second.

</aside>

Make a detailed, visually interesting readability analysis tool.
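
For reference when judging a model's output, here is a minimal sketch of one well-known readability formula, Flesch Reading Ease: 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). The helper names and the vowel-group syllable counter are assumptions made for illustration; the syllable heuristic is deliberately crude, and a good answer to the prompt would be expected to go well beyond this.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent ("make"), except in '-le' endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease; higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    print(round(flesch_reading_ease("The cat sat on the mat."), 3))
```

A quick check of a generated tool is to compare its score for a short sentence with this sketch; small differences usually come from how syllables and sentence boundaries are counted.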

Prompts to test metalanguage and anaphora

<aside> 💡

These two prompts test the ability to disambiguate anaphora. The first uses a single sentence; the second extends it into a longer passage with more competing referents.

</aside>

Make a table of all the pronouns in this sentence matched to what noun they refer to: Natasha was afraid that her mother would be worried and that's why she didn't tell her about her accident.

Make a table of all the pronouns in this text matched to what noun they refer to (for possessive pronouns include the noun and for subjects or objects include the predicate in the table to make it easier to read):

Natasha was afraid that her mother would be worried and that's why she didn't tell her about her accident. Her reaction, after she eventually told her about it, convinced her she had been right to wait because she was inconsolable until she promised her she would never drive in the dark again.