
19 Nov 2024

Why are Large Language Models so capable at understanding language?

#LLM #GenAI #theoretical
Despite the inherent unexplainability of LLMs, convincing attempts at explaining their brilliance are being made.

Table of contents

I. Introduction
II. Intentions and messages
III. Latent space model
IV. Inferring intention with marginal distribution of messages
V. In-context learning
VI. Chain-of-thought prompting
VII. Conclusion

I. Introduction

It goes without saying that LLMs (Large Language Models) are a revolutionary technology. Not only are their science fiction-like capabilities mesmerizing for users and promising for investors, but the apparent lack of limitations to their potential has also led to a seemingly never-ending wave of ever-growing and increasingly impressive models. The underlying scaling laws remain elusive, with the ongoing discussions around terms like emergent abilities (which are essentially attempts to formalize unpredictability) strongly indicating we’re not nearing any definitive conclusions.

However, researchers are tirelessly working towards the goal of understanding LLMs. Given the inherent lack of explainability in machine learning, many of the results revolve around empirical verification of what LLMs are capable of. Nevertheless, some work takes on the task of evaluating their architecture more closely (a task hampered by the prohibitive costs of training) to answer some of the why questions, even if the answers can only be speculative.

One of the theories possibly explaining the scaling laws of LLMs is that they are approximators of the latent space model. In this article, we’ll take a closer look at this idea (as defined in [1]), and briefly summarize how it might relate to the crucial properties of LLMs, such as in-context learning or chain-of-thought prompting.

II. Intentions and messages

Before LLMs, many attempts at automated generation of coherent text had been made, starting in the 1960s (e.g. ELIZA). Some of them revolved heavily around the concepts of messages and intentions, which we’ll also be using in this article.

A message is an instance of the simplest unit in a language sufficient to convey meaningful information (for instance, in a programming language this would be a statement, and in a spoken language this would be a sentence). An example of a message could be ‘I have a cat’.

An intention is an instance of the simplest unit of information that we would like to convey during communication. Intentions are expressed with messages. One intention can usually be expressed with many different messages. For example, the messages ‘I have a cat’, ‘I own a cat’, and ‘My pet is a cat’ all describe the same intention.

In some languages, the intention can be recovered from the message with certainty. This holds by design for programming languages (the action described by a statement is always clear, though not always meaningful). Such languages are called unambiguous. Other languages (most importantly, the spoken ones) are called ambiguous, and inferring intentions from their messages usually involves some degree of uncertainty.

III. Latent space model

While the syntactic rules of natural languages are usually precise (though sometimes a bit convoluted) and can be easily translated to a computer program, capturing the semantics of human communication is much less trivial. The key obstacle is that in order to convincingly continue the communication, we have to be able to infer the underlying intention from the message itself. This is difficult, as the space of possible intentions is seemingly dependent on the world we live in.

To capture this intuition, the latent space model of stochastic language generation has been introduced.

The main purpose of the latent space model is to sample sequences of $h$ messages, where $h$ is an integer parameter of the whole model. The probability of sampling an $h$-sequence $x_1, \ldots, x_h$ is denoted by $L(x_1, \ldots, x_h)$. For $k < h$, by $L(x_1, \ldots, x_k)$ we denote the probability that the sampled sequence starts with $x_1, \ldots, x_k$.

Building on those distributions, we can easily define contextual sampling of continuations. The probability that a message $x_k$ follows the context $x_1, \ldots, x_{k-1}$ is defined as

$$L(x_k | x_1, \ldots, x_{k-1}) = \frac{L(x_1, \ldots, x_{k})}{L(x_1, \ldots, x_{k-1})}.$$
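
As a quick worked example with purely illustrative, made-up numbers: if $L(\text{'I have a cat'}) = 0.3$ and $L(\text{'I have a cat'}, \text{'Her name is Mia'}) = 0.24$, then the probability that ‘Her name is Mia’ continues ‘I have a cat’ is $0.24 / 0.3 = 0.8$.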

The key assumption of the latent space model is that language generation is a 2-step process:

  • first, we sample an intention, which intuitively is “what we want to say”,
  • then, out of all messages conveying our sampled intention, we sample one to decide “how we say it”.

Formally, $h$-sequences of messages are sampled as follows:

  • we sample a sequence of intentions $\theta_1, \ldots, \theta_h$ from a distribution $I(\theta_1, \ldots, \theta_h)$. This distribution is another input parameter to the latent space model.
  • then, for every intention $\theta_i$, we sample one of the messages conveying this intention from the distribution $M(x_i | \theta_i)$. Similarly, the distributions $M(x | \theta)$ are input parameters of the latent space model.

Putting everything together, the probability of sampling a sequence of messages $x_1, \ldots, x_h$ is given by

$$L(x_1, \ldots, x_h) = \sum_{\theta_1, \ldots, \theta_h} I(\theta_1, \ldots, \theta_h) \prod_{i=1}^h M(x_i | \theta_i).$$
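
To make the two-step process concrete, below is a minimal Python sketch of a toy latent space model with $h = 2$. The intentions, messages, and probabilities are made up solely for illustration and do not come from [1]; the sketch samples sequences in two steps and computes the marginal $L$ by summing over intention sequences, exactly as in the formula above.

```python
# A toy latent space model with h = 2. All intentions, messages and
# probabilities below are made up purely for illustration.
import random

# I(theta_1, theta_2): joint distribution over intention sequences.
I = {
    ("HAS_CAT", "CAT_NAME"): 0.4,
    ("HAS_CAT", "HAS_DOG"):  0.1,
    ("HAS_DOG", "DOG_NAME"): 0.5,
}

# M(x | theta): distribution over messages conveying each intention.
M = {
    "HAS_CAT":  {"I have a cat": 0.6, "I own a cat": 0.4},
    "CAT_NAME": {"Her name is Mia": 1.0},
    "HAS_DOG":  {"I also have a dog": 1.0},
    "DOG_NAME": {"His name is Rex": 1.0},
}

def sample_sequence():
    """Two-step generation: sample an intention sequence, then one message per intention."""
    thetas = random.choices(list(I), weights=list(I.values()))[0]
    return tuple(random.choices(list(M[t]), weights=list(M[t].values()))[0] for t in thetas)

def L(*xs):
    """Marginal probability of a message prefix: sum over intention sequences
    of I(theta_1, theta_2) * prod_i M(x_i | theta_i)."""
    total = 0.0
    for thetas, p in I.items():
        for x, t in zip(xs, thetas):
            p *= M[t].get(x, 0.0)
        total += p
    return total

print(sample_sequence())
print(L("I have a cat"))                                         # (0.4 + 0.1) * 0.6 = 0.3
print(L("I have a cat", "Her name is Mia"))                      # 0.4 * 0.6 * 1.0 = 0.24
print(L("I have a cat", "Her name is Mia") / L("I have a cat"))  # contextual continuation, 0.8
```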

The ideal latent space model is achieved by using input distributions that accurately reflect real-life data, and it is the holy grail of automated natural language generation.

Over the course of the last 40 years, many attempts at implementing the ideal latent space model have been made ([2], [3]). Most of them assumed that explicit descriptions of the true-to-life distributions $I(\theta_1, \ldots, \theta_h)$ and $M(x | \theta)$ are necessary. For this reason, none of them was convincing. A precise understanding of these distributions is difficult to achieve, as the notion of intention is very abstract, and the possible nuances in long sequences of intentions are extremely difficult to capture, even if we limit ourselves to a single task. In addition, the underlying intention space is directly connected to the world we live in.

The LLM-based approach is different: it revolves around using a model to approximate the marginal distribution $L(x_1, \ldots, x_h)$ from the training data. What’s interesting is that despite having no intermediate step devoted to intention sampling, the estimate LLMs achieve is an approximation of the ideal latent space model. Formally, it can be proven that

Transformer-based LLMs trained on a text corpus sampled from the ideal latent space model are approximators of the distribution $L(x_1, \ldots, x_k)$.

The key assumption is that the training corpus is sampled from the ideal latent space model. However, it is a reasonable one: we train LLMs on text produced by actual humans (or at least we hope so).

Intuitively, an LLM with an infinite number of parameters, trained on an infinitely large text corpus, “knows” the exact values of $L(x_1, \ldots, x_h)$.

State-of-the-art LLMs have trillions of parameters and are trained on many trillions of tokens, so it is possible that they are within “touching distance” of infinity.
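
As a rough illustration of what “approximating the marginal distribution” looks like in practice, the sketch below scores a continuation $y$ given a prompt $x$ with an off-the-shelf autoregressive model by chaining next-token probabilities. It uses the Hugging Face transformers library with GPT-2 purely as an example; the model and the texts are arbitrary choices for demonstration, not something prescribed by [1].

```python
# Sketch: estimating log P(y | x) with an autoregressive LM by summing
# next-token log-probabilities over the tokens of the continuation y.
# GPT-2 is used purely as a small, convenient example model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_prob_of_continuation(x: str, y: str) -> float:
    prompt_ids = tokenizer(x, return_tensors="pt").input_ids
    full_ids = tokenizer(x + y, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # shape: (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # The token at position i is predicted by the logits at position i - 1.
    for i in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

print(log_prob_of_continuation("I have a cat.", " Her name is Mia."))
```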

In the following sections, we’ll explore how this property can be used to explain many strengths of large language models.

IV. Inferring intention with marginal distribution of messages

As we mentioned before, to maintain a natural conversation it is crucial to infer intentions. Empirically, LLMs are clearly excellent at this. Now, let’s go over a sketch of a theoretical proof of this fact.

Let $\Theta_i$ be the random variable describing the intention underlying the $i$-th message in a sequence randomly sampled from the ideal latent space model. By $L(x_1, \ldots, x_k, \Theta_i = \theta)$ we denote the probability of sampling the sequence $x_1, \ldots, x_k$ with the $i$-th intention in the underlying intention sequence (which is also sampled randomly!) equal to $\theta$.

We also set

$$L(x_k | x_1, \ldots, x_{k-1}, \Theta_1 = \theta) = \frac{L(x_1, \ldots, x_k, \Theta_1 = \theta)}{L(x_1, \ldots, x_{k-1}, \Theta_1 = \theta)}.$$

First, let’s focus on the case of unambiguous languages, in which the intention can be recovered from the message with certainty. Recall that this property holds for programming languages. This analysis serves as a direct justification of the applicability of LLMs to code generation.

Let $x$ be a message in an unambiguous language, and let $\theta_x$ be its unique underlying intention. We have

$$L(x, \Theta_1 = \theta) = \begin{cases} L(x) & \text{if } \theta = \theta_x \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

Consider a “perfect” LLM, whose distribution of messages $P(x_1, x_2)$ is equivalent to the marginal distribution in the ideal latent space model with $h = 2$. When prompted with a message $x$ that conveys the intention $\theta_x$, the probability of generating the continuation $y$ is

$$P(y | x) = \frac{L(x, y)}{L(x)}.$$

The probability of sampling the sequence $x, y$ can be rewritten as

$$L(x, y) = \sum_{\theta \in \Theta} L(x, y, \Theta_1 = \theta) = \sum_{\theta \in \Theta} L(x, \Theta_1 = \theta)\, L(y | x, \Theta_1 = \theta) = L(x)\, L(y | x, \Theta_1 = \theta_x).$$

The last transition follows from (1).

Altogether, we obtain

$$P(y | x) = L(y | x, \Theta_1 = \theta_x),$$

which means that the distribution of possible responses is the same as if the intention behind the prompt had been provided explicitly.
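
This identity is easy to verify numerically on a toy example. The sketch below builds a small unambiguous “language” (every message belongs to exactly one intention; all names and probabilities are invented for illustration) and checks that the “perfect” LLM’s conditional $P(y | x)$, computed from the marginal $L$ alone, coincides with the conditional where the intention behind the prompt is given explicitly.

```python
# Numeric sanity check of P(y | x) = L(y | x, Theta_1 = theta_x) for a toy
# unambiguous language. All intentions, messages and probabilities are made up.

# I(theta_1, theta_2): joint distribution over intention pairs (h = 2).
I = {
    ("ASK_TIME", "TELL_TIME"): 0.5,
    ("ASK_TIME", "REFUSE"):    0.1,
    ("GREET",    "GREET"):     0.4,
}

# Unambiguous M: the message sets of different intentions are disjoint.
M = {
    "ASK_TIME":  {"what_time()": 1.0},
    "TELL_TIME": {"time_is(5)": 0.7, "time_is_about(5)": 0.3},
    "REFUSE":    {"refuse()": 1.0},
    "GREET":     {"hello()": 1.0},
}

def L(xs, theta_1=None):
    """Probability of the message prefix xs; if theta_1 is given, restrict the
    sum to intention sequences whose first intention equals theta_1."""
    total = 0.0
    for thetas, p in I.items():
        if theta_1 is not None and thetas[0] != theta_1:
            continue
        for x, t in zip(xs, thetas):
            p *= M[t].get(x, 0.0)
        total += p
    return total

x, y, theta_x = "what_time()", "time_is(5)", "ASK_TIME"

p_perfect_llm = L([x, y]) / L([x])                    # uses only the marginal L
p_explicit    = L([x, y], theta_x) / L([x], theta_x)  # intention given explicitly

print(p_perfect_llm, p_explicit)  # both equal 0.35 / 0.6, roughly 0.583
```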

The notion of unambiguity isn’t applicable to natural languages. However, it is often assumed that natural languages are $\varepsilon$-ambiguous: in other words, there exists $\varepsilon \in [0, 1]$ such that the intention can be inferred from the message with probability of at least $1 - \varepsilon$ (for all messages!). This assumption is natural: while ambiguity is present in our communication, it is usually efficiently overcome with redundancy. There are no truly meaningless or gibberish phrases in spoken languages.

To accommodate ambiguity, we can generalize the result on intention inference to

$$|P(y | x) - L(y | x, \Theta_1 = \theta_x)| \leq \varepsilon(x),$$

where $\varepsilon(x)$ denotes the ambiguity of the prompt. For $\varepsilon$-ambiguous languages, it is always upper-bounded by $\varepsilon$.

It can also be proven that if we provide more messages conveying the same intention as inputs, then the ambiguity decreases exponentially. This formalizes the observation that redundancy in the prompt increases the reliability of the results.
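
A toy Bayesian calculation (not the proof from [1], just an illustration with made-up numbers) shows the same effect: with two candidate intentions and ambiguous messages that are merely more likely under the intended one, the residual probability of the wrong intention shrinks exponentially with the number of messages.

```python
# Toy illustration of redundancy reducing ambiguity. Two candidate intentions;
# each observed message is ambiguous, but more likely under the intended one.
# All probabilities are made up for illustration.
prior = {"theta_A": 0.5, "theta_B": 0.5}
message_likelihood = {"theta_A": 0.8, "theta_B": 0.2}  # P(message | intention)

for m in (1, 2, 4, 8):
    unnorm = {t: prior[t] * message_likelihood[t] ** m for t in prior}
    z = sum(unnorm.values())
    print(m, unnorm["theta_B"] / z)  # residual ambiguity, shrinks like (0.2 / 0.8) ** m
```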

On the other hand, in human communication, the ambiguity of a message is decreased by external factors, such as body language or shared experience and knowledge. There is clearly no way to replicate these circumstances in LLMs, so this might be considered a shortcoming.

V. In-context learning

Capitalizing on the results from the previous section, one can quickly derive a formula for the efficiency of few-shot prompting. Assume we are prompting the LLM for a completion of the input $i_{m+1}$ with instruction $x$, providing $m$ example input-output pairs $(i_1, o_1), \ldots, (i_m, o_m)$, which are all trying to convey the same intention $\theta_*$. For an $\varepsilon$-ambiguous language, it holds that

$$|P(y | x, i_1, o_1, \ldots, i_m, o_m, i_{m+1}) - L(y | x, \Theta_1 = \theta_*)| \leq \varepsilon^m \cdot \varepsilon(x) \cdot \varepsilon(i_{m+1}) \leq \varepsilon^{m+2},$$

which clearly underlines the significance of the examples, while demonstrating that the point of diminishing returns with respect to their number usually arrives relatively quickly.
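
Plugging in a concrete (made-up) ambiguity level makes the diminishing returns visible: the bound $\varepsilon^{m+2}$ collapses after just a handful of examples.

```python
# Evaluating the epsilon^(m + 2) bound for an arbitrary illustrative
# ambiguity level; epsilon = 0.3 is a made-up value.
epsilon = 0.3
for m in range(6):
    print(m, epsilon ** (m + 2))
# m = 0 -> 0.09, m = 1 -> 0.027, m = 2 -> 0.0081, m = 3 -> 0.00243, ...
```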

VI. Chain-of-thought prompting

Discussions surrounding the genuineness of the reasoning capabilities of LLMs are still ongoing, and the latent space model is somewhat orthogonal to these considerations (though it leans slightly towards the hypothesis of brute-force memorization of causal transitions). However, it does make it apparent how models are capable of capitalizing on their superior accuracy on simpler logical transitions.

By $I(\theta_k | \theta_1, \ldots, \theta_{k-1})$ we denote the probability of a logical transition, and we define it as

$$I(\theta_k | \theta_1, \ldots, \theta_{k-1}) = \frac{I(\theta_1, \ldots, \theta_k)}{I(\theta_1, \ldots, \theta_{k-1})}.$$

When $k < h$, $I(\theta_1, \ldots, \theta_k)$ is defined analogously to $L(x_1, \ldots, x_k)$.

For simplicity, consider the few-shot chain-of-thought prompting technique and assume we’re providing examples showcasing a coherent chain of thought $\theta_1 \rightarrow \ldots \rightarrow \theta_k$. Let $x_1$ be the input message, and let $x_k$ be an example message reaching the correct conclusion. When zero-shot prompting directly for the conclusion, the probability of the model returning the correct answer is

$$P(x_k | x_1) = \frac{L(x_1, x_k)}{L(x_1)} = \frac{I(\theta_1, \theta_k)\, M(x_1 | \theta_1)\, M(x_k | \theta_k)}{I(\theta_1)\, M(x_1 | \theta_1)} = I(\theta_k | \theta_1)\, M(x_k | \theta_k).$$

Most notably, if the transition $\theta_1 \rightarrow \theta_k$ is under-represented in the training corpus, this probability is low. This wouldn’t change even if we considered all correct $x_k$ simultaneously.

In the case of chain-of-thought, this probability changes to (at least)

$$P(x_k | x_1) \geq (1 - \varepsilon^{m+2})^{k-1} \cdot I(\theta_k | \theta_1, \ldots, \theta_{k-1}) \cdot M(x_k | \theta_k).$$

We control the number of examples, so we can easily assume that the first term is close to 1. The middle term captures the soundness of the showcased chain of thought with respect to the LLM’s training corpus. As empirical experiments show, for many different applications, finding such a chain of thought is perfectly viable.
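
The effect of the middle term is easy to see on a toy intention corpus (invented here purely for illustration, not taken from [1]): a direct jump from the problem statement to the answer is rare, while the same answer almost always follows once the intermediate step is present in the context.

```python
# Toy corpus of intention sequences, made up for illustration: the conclusion
# "ANSWER" rarely follows "PROBLEM" directly, but reliably follows "STEP".
corpus = (
    [("PROBLEM", "STEP", "ANSWER")] * 90   # worked solutions with a visible intermediate step
    + [("PROBLEM", "ANSWER")] * 5          # rare direct jumps to the answer
    + [("PROBLEM", "GIVE_UP")] * 5
)

def transition_prob(target, *context):
    """Empirical I(target | context): how often `target` immediately follows
    the given intention context in the toy corpus."""
    matches = [s for s in corpus if s[:len(context)] == context and len(s) > len(context)]
    hits = [s for s in matches if s[len(context)] == target]
    return len(hits) / len(matches)

print(transition_prob("ANSWER", "PROBLEM"))          # 0.05 (zero-shot jump)
print(transition_prob("ANSWER", "PROBLEM", "STEP"))  # 1.0  (after the chain of thought)
```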

VII. Conclusion

While explanations of the proficiency of LLMs are still speculative, mostly due to the insurmountable volume of parameters and training corpora, our understanding of their abilities and limitations is continuously increasing. A formal proof that transformer-based LLMs capture the marginal distributions of the ideal latent space model is a tangible confirmation of a breakthrough, as this is a decades-old problem with many documented, fruitless attempts. In addition, it provides a neat explanation of multiple advantages of LLMs, which might not be definitive, but is definitely intuitive.
