<aside> <img src="/icons/help-alternate_gray.svg" alt="/icons/help-alternate_gray.svg" width="40px" /> A poster from a conference on Causal Cognition in Humans and Machines, Oxford 2024.

</aside>

What embedding pictures obscure

Two-dimensional representations of multidimensional embeddings mislead us about the true nature of the problem space.

(Wolfram 2023)
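How much the flat picture throws away can be checked directly: project high-dimensional vectors down to 2-D and count how many nearest neighbours survive the projection. A minimal sketch with random stand-in vectors and plain PCA (everything here is illustrative, not Wolfram's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 300))        # stand-in for 300-d word embeddings

# PCA to 2 dimensions via SVD of the centred matrix
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T                     # the coordinates a 2-D plot would show

def nearest(M, i, k=10):
    d = np.linalg.norm(M - M[i], axis=1)
    return set(np.argsort(d)[1:k + 1])  # skip the point itself

# Fraction of each point's 10 nearest neighbours preserved in the picture
overlap = np.mean([len(nearest(Xc, i) & nearest(X2, i)) / 10
                   for i in range(100)])
print(f"neighbourhood overlap after projection: {overlap:.2f}")
```

On genuinely high-dimensional data the overlap comes out close to zero: most of the neighbourhood structure the 2-D picture implies is an artefact of the projection.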

We can see much more richness in corpus word sketches:

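A word sketch groups a word's collocates by grammatical relation and scores each pair, typically with logDice, the association measure used in Sketch Engine. A toy version over invented triples:

```python
import math
from collections import Counter

# (headword, grammatical relation, collocate) triples; counts are invented
corpus = [("dog", "subject_of", "bark"), ("dog", "subject_of", "bark"),
          ("dog", "object_of", "walk"), ("dog", "modifier", "loud"),
          ("cat", "subject_of", "purr")]

def log_dice(fxy, fx, fy):
    # logDice = 14 + log2(2 * f(x,y) / (f(x) + f(y)))
    return 14 + math.log2(2 * fxy / (fx + fy))

target = "dog"
fx = sum(w == target for w, _, _ in corpus)
pair_freq = Counter((r, c) for w, r, c in corpus if w == target)
coll_freq = Counter(c for _, _, c in corpus)

for (rel, coll), fxy in pair_freq.most_common():
    score = log_dice(fxy, fx, coll_freq[coll])
    print(f"{rel:12s} {coll:6s} logDice={score:.2f}")
```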

Embeddings capture all of these details, but inconsistently, as becomes apparent from a cross-linguistic perspective.

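One way to make the inconsistency concrete, assuming a pair of pretrained monolingual vector sets (the file names below are placeholders): compare the nearest neighbours of a word in one language with those of its standard translation in the other.

```python
from gensim.models import KeyedVectors

# Placeholder paths; any two monolingual embedding files would do
en = KeyedVectors.load_word2vec_format("cc.en.300.vec")
cs = KeyedVectors.load_word2vec_format("cc.cs.300.vec")

en_neigh = [w for w, _ in en.most_similar("because", topn=15)]
cs_neigh = [w for w, _ in cs.most_similar("protože", topn=15)]
print("EN neighbours of 'because':", en_neigh)
print("CS neighbours of 'protože':", cs_neigh)
# Even after translating one list into the other language, the two
# profiles rarely line up: each space encodes language-specific
# collocational and morphosyntactic detail.
```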

This gets even more complicated when we try to capture the distributional irregularities of complex morphology (see the tokenization sketch below).

(FrameNet)
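One concrete, checkable source of such irregularity, assuming a subword-based model: the inflected forms of a single lemma split into different token sequences, so one paradigm is scattered across many embedding rows. The tokenizer below is a stand-in, not GPT-4's own:

```python
from transformers import AutoTokenizer

# Stand-in tokenizer; GPT-4's actual tokenizer is different
tok = AutoTokenizer.from_pretrained("gpt2")
for form in ["žena", "ženy", "ženě", "ženou", "totiž"]:
    print(f"{form:8s} -> {tok.tokenize(form)}")
# Each inflected form of the lemma gets its own, often unrelated, subword
# split, so the "same word" has no single distributional representation.
```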

The case of Czech totiž (roughly 'because')

Totiž is a Czech causal discourse marker that means something like 'because' but is almost never translated that way; often it is not translated at all. When ChatGPT is asked to translate totiž in real examples, it is consistently successful (even when compared against professional translators), but when it is asked to give examples of totiž or to use it in a text, it is wrong 100% of the time.


All of these translations by ChatGPT (left) are correct and comparable to the translations produced by professional human translators (right).

Every single example generated in this direction, by contrast, is ungrammatical Czech or shows incorrect usage.
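The contrast is straightforward to probe. A minimal sketch using the OpenAI Python client; the model name, prompts, and example sentence are assumptions for illustration, not the exact materials behind the examples above:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",  # assumption; any chat model slots in here
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Direction 1: translating real usage (the direction that succeeds above)
print(ask("Translate into English: 'Nepřišel. Byl totiž nemocný.'"))

# Direction 2: generating usage (the direction that fails above)
print(ask("Write three Czech sentences that use the word 'totiž'."))
```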

Questions

In what two different spaces does GPT-4 represent totiž: one where it is clearly represented as a causal marker, and one where it is represented in its full morphosyntactic, semantic, and collocational space? At what layer, if anywhere, are the two representations brought together?
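GPT-4's internals are not inspectable, but the question can be piloted on an open model: extract the hidden state of the totiž token at every layer and track its similarity to an explicitly causal paraphrase. The model choice and sentences below are assumptions for the sketch:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"  # stand-in for a model we can open up
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

def layer_vectors(sentence: str, word: str):
    """Hidden state of the first subtoken of `word`, one vector per layer."""
    enc = tok(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    start = sentence.index(word)
    pos = next(i for i, (a, b) in enumerate(offsets) if a <= start < b)
    with torch.no_grad():
        out = model(**enc)
    return [h[0, pos] for h in out.hidden_states]

vecs = layer_vectors("Nepřišel. Byl totiž nemocný.", "totiž")
refs = layer_vectors("Nepřišel, protože byl nemocný.", "protože")
for i, (v, r) in enumerate(zip(vecs, refs)):
    sim = torch.cosine_similarity(v, r, dim=0).item()
    print(f"layer {i:2d}: cos(totiž, protože) = {sim:.2f}")
```

If the similarity to protože rises only in the upper layers, that would suggest the causal-marker reading emerges late, on top of the lower-level collocational representation.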

How is the causal relationship between the text components represented within ChatGPT during translation when totiž is not translated by a word or an identifiable phrase? What is the contribution of the attention mechanism?
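The attention question can be piloted the same way, by reading the attention weights out of the totiž position in an open stand-in model; GPT-4's own attention is not observable:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"  # stand-in; GPT-4's attention is hidden
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

sent = "Nepřišel. Byl totiž nemocný."
enc = tok(sent, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0].tolist()
start = sent.index("totiž")
pos = next(i for i, (a, b) in enumerate(offsets) if a <= start < b)

with torch.no_grad():
    out = model(**enc)

tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
# Last layer, averaged over heads; the row is attention FROM the totiž token
row = out.attentions[-1][0].mean(dim=0)[pos]
for t, a in sorted(zip(tokens, row.tolist()), key=lambda x: -x[1])[:5]:
    print(f"{t:12s} {a:.3f}")
```

Whether the totiž position attends across the sentence boundary to the clause it explains would be one observable trace of the causal link being carried without any translatable word.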