
19 Nov 2024

Why are Large Language Models so capable at understanding language?

#LLM #GenAI #theoretical
Despite the inherent unexplainability of LLMs, convincing attempts at explaining their brilliance are being made.

Table of contents

I. Introduction
II. Intentions and messages
III. Latent space model
IV. Inferring intention with marginal distribution of messages
V. In-context learning
VI. Chain-of-thought prompting
VII. Conclusion

I. Introduction

It goes without saying that LLMs (Large Language Models) are a revolutionary technology. Not only are their science fiction-like capabilities mesmerizing for users and promising for investors, but the apparent lack of limitations to their potential has also led to a seemingly never-ending wave of ever-growing and increasingly impressive models. The underlying scaling laws remain elusive, with the ongoing discussions around terms like emergent abilities (which are essentially attempts to formalize unpredictability) strongly indicating we’re not nearing any definitive conclusions.

However, researchers are tirelessly working towards the goal of understanding LLMs. Given the inherent lack of explainability in machine learning, many of the results revolve around empirical verification of what LLMs are capable of. Nevertheless, some work takes on the task of evaluating their architecture more closely (a task hampered by the prohibitive costs of training) to answer some of the why questions, even if the answers can only be speculative.

One of the theories possibly explaining the scaling laws of LLMs is that they are approximators of the latent space model. In this article, we’ll take a closer look at this idea (as defined in [1]), and briefly summarize how it might relate to the crucial properties of LLMs, such as in-context learning or chain-of-thought prompting.

II. Intentions and messages

Before LLMs, many attempts at automated generation of coherent text had been made, starting in the 1960s (e.g. ELIZA). Some of them revolved heavily around the concepts of messages and intentions, which we’ll also be using in this article.

A message is an instance of the simplest unit in a language sufficient to convey meaningful information (for instance, in a programming language this would be a statement, and in a spoken language this would be a sentence). An example of a message could be ‘I have a cat’.

An intention is an instance of the simplest unit of information that we would like to convey during communication. Intentions are expressed with messages. One intention can usually be expressed with many different messages. For example, the messages ‘I have a cat’, ‘I own a cat’, and ‘My pet is a cat’ all describe the same intention.

In some languages, the intention can be recovered from the message with certainty. This holds by design for programming languages (the action described by a statement is always clear, though not always meaningful). Such languages are called unambiguous. Other languages (most importantly, the spoken ones) are called ambiguous, and inferring intentions from their messages usually involves some degree of uncertainty.

III. Latent space model

While the syntactic rules of natural languages are usually precise (though sometimes a bit convoluted) and can be easily translated to a computer program, capturing the semantics of human communication is much less trivial. The key obstacle is that in order to convincingly continue the communication, we have to be able to infer the underlying intention from the message itself. This is difficult, as the space of possible intentions is seemingly dependent on the world we live in.

To capture this intuition, the latent space model of stochastic language generation has been introduced.

The main purpose of the latent space model is to sample sequences of $h$ messages, where $h$ is an integer parameter of the whole model. The probability of sampling an $h$-sequence $x_1, \ldots, x_h$ is denoted by $L(x_1, \ldots, x_h)$. For $k < h$, by $L(x_1, \ldots, x_k)$ we denote the probability that the sampled sequence starts with $x_1, \ldots, x_k$.

Building on those distributions, we can easily define contextual sampling of continuations. The probability that a message $x_k$ follows the context $x_1, \ldots, x_{k-1}$ is defined as

$$L(x_k | x_1, \ldots, x_{k-1}) = \frac{L(x_1, \ldots, x_{k})}{L(x_1, \ldots, x_{k-1})}.$$
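
As a quick worked example with purely illustrative, made-up numbers: if $L(\text{'I have a cat'}) = 0.3$ and $L(\text{'I have a cat'}, \text{'Her name is Mia'}) = 0.24$, then the probability that ‘Her name is Mia’ continues ‘I have a cat’ is $0.24 / 0.3 = 0.8$.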

The key assumption of the latent space model is that language generation is a 2-step process:

  • first, we sample an intention, which intuitively is “what we want to say”,
  • then, out of all messages conveying our sampled intention, we sample one to decide “how we say it”.

Formally, $h$-sequences of messages are sampled as follows:

  • we sample a sequence of intentions $\theta_1, \ldots, \theta_h$ from a distribution $I(\theta_1, \ldots, \theta_h)$. This distribution is another input parameter to the latent space model.
  • then, for every intention $\theta_i$, we sample one of the messages conveying this intention from the distribution $M(x_i | \theta_i)$. Similarly, the distributions $M(x | \theta)$ are input parameters of the latent space model.

Putting everything together, the probability of sampling a sequence of messages $x_1, \ldots, x_h$ is given by

$$L(x_1, \ldots, x_h) = \sum_{\theta_1, \ldots, \theta_h} I(\theta_1, \ldots, \theta_h) \prod_{i=1}^h M(x_i | \theta_i).$$
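
To make the two-step process concrete, below is a minimal Python sketch of a toy latent space model with $h = 2$. The intentions, messages, and probabilities are made up solely for illustration and do not come from [1]; the sketch samples sequences in two steps and computes the marginal $L$ by summing over intention sequences, exactly as in the formula above.

```python
# A toy latent space model with h = 2. All intentions, messages and
# probabilities below are made up purely for illustration.
import random

# I(theta_1, theta_2): joint distribution over intention sequences.
I = {
    ("HAS_CAT", "CAT_NAME"): 0.4,
    ("HAS_CAT", "HAS_DOG"):  0.1,
    ("HAS_DOG", "DOG_NAME"): 0.5,
}

# M(x | theta): distribution over messages conveying each intention.
M = {
    "HAS_CAT":  {"I have a cat": 0.6, "I own a cat": 0.4},
    "CAT_NAME": {"Her name is Mia": 1.0},
    "HAS_DOG":  {"I also have a dog": 1.0},
    "DOG_NAME": {"His name is Rex": 1.0},
}

def sample_sequence():
    """Two-step generation: sample an intention sequence, then one message per intention."""
    thetas = random.choices(list(I), weights=list(I.values()))[0]
    return tuple(random.choices(list(M[t]), weights=list(M[t].values()))[0] for t in thetas)

def L(*xs):
    """Marginal probability of a message prefix: sum over intention sequences
    of I(theta_1, theta_2) * prod_i M(x_i | theta_i)."""
    total = 0.0
    for thetas, p in I.items():
        for x, t in zip(xs, thetas):
            p *= M[t].get(x, 0.0)
        total += p
    return total

print(sample_sequence())
print(L("I have a cat"))                                         # (0.4 + 0.1) * 0.6 = 0.3
print(L("I have a cat", "Her name is Mia"))                      # 0.4 * 0.6 * 1.0 = 0.24
print(L("I have a cat", "Her name is Mia") / L("I have a cat"))  # contextual continuation, 0.8
```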

The ideal latent space model is achieved by using input distributions that accurately reflect real-life data, and it is the holy grail of automated natural language generation.

Over the course of the last 40 years, many attempts at implementing the ideal latent space model have been made ([2], [3]). Most of them assumed that explicit descriptions of the true-to-life distributions $I(\theta_1, \ldots, \theta_h)$ and $M(x | \theta)$ are necessary. For this reason, none of them was convincing. A precise understanding of these distributions is difficult to achieve, as the notion of intention is very abstract, and the possible nuances in long sequences of intentions are extremely difficult to capture, even if we limit ourselves to a single task. In addition, the underlying intention space is directly connected to the world we live in.

The LLM-based approach is different: it revolves around using a model to approximate the marginal distribution $L(x_1, \ldots, x_h)$ from the training data. What’s interesting is that despite having no intermediate step devoted to intention sampling, the estimate LLMs achieve is an approximation of the ideal latent space model. Formally, it can be proven that

Transformer-based LLMs trained on a text corpus sampled from the ideal latent space model are approximators of the distribution $L(x_1, \ldots, x_k)$.

The key assumption is that the training corpus is sampled from the ideal latent space model. However, it is a reasonable one: we train LLMs on text produced by actual humans (or at least we hope so).

Intuitively, an LLM with an infinite number of parameters, trained on an infinitely large text corpus, “knows” the exact values of $L(x_1, \ldots, x_h)$.

State-of-the-art LLMs have trillions of parameters and are trained on many trillions of tokens, so it is possible that they are within “touching distance” of infinity.
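
As a rough illustration of what “approximating the marginal distribution” looks like in practice, the sketch below scores a continuation $y$ given a prompt $x$ with an off-the-shelf autoregressive model by chaining next-token probabilities. It uses the Hugging Face transformers library with GPT-2 purely as an example; the model and the texts are arbitrary choices for demonstration, not something prescribed by [1].

```python
# Sketch: estimating log P(y | x) with an autoregressive LM by summing
# next-token log-probabilities over the tokens of the continuation y.
# GPT-2 is used purely as a small, convenient example model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_prob_of_continuation(x: str, y: str) -> float:
    prompt_ids = tokenizer(x, return_tensors="pt").input_ids
    full_ids = tokenizer(x + y, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # shape: (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # The token at position i is predicted by the logits at position i - 1.
    for i in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

print(log_prob_of_continuation("I have a cat.", " Her name is Mia."))
```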

In the following sections, we’ll explore how this property can be used to explain many strengths of large language models.

IV. Inferring intention with marginal distribution of messages

As we mentioned before, to maintain a natural conversation it is crucial to infer intentions. Empirically, LLMs are clearly excellent at this. Now, let’s go over a sketch of a theoretical proof of this fact.

Let $\Theta_i$ be the random variable describing the intention underlying the $i$-th message in a sequence randomly sampled from the ideal latent space model. By $L(x_1, \ldots, x_k, \Theta_i = \theta)$ we denote the probability of sampling the sequence $x_1, \ldots, x_k$ with the $i$-th intention in the underlying intention sequence (which is also sampled randomly!) equal to $\theta$.

We also set

$$L(x_k | x_1, \ldots, x_{k-1}, \Theta_1 = \theta) = \frac{L(x_1, \ldots, x_k, \Theta_1 = \theta)}{L(x_1, \ldots, x_{k-1}, \Theta_1 = \theta)}.$$

First, let’s focus on the case of unambiguous languages, in which the intention can be recovered from the message with certainty. Recall that this property holds for programming languages. This analysis serves as a direct justification of the applicability of LLMs to code generation.

Let $x$ be a message in an unambiguous language, and let $\theta_x$ be its unique underlying intention. We have

$$L(x, \Theta_1 = \theta) = \begin{cases} L(x) & \text{if } \theta = \theta_x \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

Consider a “perfect” LLM, whose distribution of messages $P(x_1, x_2)$ is equivalent to the marginal distribution in the ideal latent space model with $h = 2$. When prompted with a message $x$ that conveys the intention $\theta_x$, the probability of generating the continuation $y$ is

$$P(y | x) = \frac{L(x, y)}{L(x)}.$$

The probability of sampling the sequence $x, y$ can be rewritten as

$$L(x, y) = \sum_{\theta \in \Theta} L(x, y, \Theta_1 = \theta) = \sum_{\theta \in \Theta} L(x, \Theta_1 = \theta)\, L(y | x, \Theta_1 = \theta) = L(x)\, L(y | x, \Theta_1 = \theta_x).$$

The last transition follows from (1).

Altogether, we obtain

$$P(y | x) = L(y | x, \Theta_1 = \theta_x),$$

which means that the distribution of possible responses is the same as if the intention behind the prompt had been provided explicitly.
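
This identity is easy to verify numerically on a toy example. The sketch below builds a small unambiguous “language” (every message belongs to exactly one intention; all names and probabilities are invented for illustration) and checks that the “perfect” LLM’s conditional $P(y | x)$, computed from the marginal $L$ alone, coincides with the conditional where the intention behind the prompt is given explicitly.

```python
# Numeric sanity check of P(y | x) = L(y | x, Theta_1 = theta_x) for a toy
# unambiguous language. All intentions, messages and probabilities are made up.

# I(theta_1, theta_2): joint distribution over intention pairs (h = 2).
I = {
    ("ASK_TIME", "TELL_TIME"): 0.5,
    ("ASK_TIME", "REFUSE"):    0.1,
    ("GREET",    "GREET"):     0.4,
}

# Unambiguous M: the message sets of different intentions are disjoint.
M = {
    "ASK_TIME":  {"what_time()": 1.0},
    "TELL_TIME": {"time_is(5)": 0.7, "time_is_about(5)": 0.3},
    "REFUSE":    {"refuse()": 1.0},
    "GREET":     {"hello()": 1.0},
}

def L(xs, theta_1=None):
    """Probability of the message prefix xs; if theta_1 is given, restrict the
    sum to intention sequences whose first intention equals theta_1."""
    total = 0.0
    for thetas, p in I.items():
        if theta_1 is not None and thetas[0] != theta_1:
            continue
        for x, t in zip(xs, thetas):
            p *= M[t].get(x, 0.0)
        total += p
    return total

x, y, theta_x = "what_time()", "time_is(5)", "ASK_TIME"

p_perfect_llm = L([x, y]) / L([x])                    # uses only the marginal L
p_explicit    = L([x, y], theta_x) / L([x], theta_x)  # intention given explicitly

print(p_perfect_llm, p_explicit)  # both equal 0.35 / 0.6, roughly 0.583
```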

The notion of unambiguity isn’t applicable to natural languages. However, it is often assumed that natural languages are $\varepsilon$-ambiguous: in other words, there exists $\varepsilon \in [0, 1]$ such that the intention can be inferred from the message with probability of at least $1 - \varepsilon$ (for all messages!). This assumption is natural: while ambiguity is present in our communication, it is usually efficiently overcome with redundancy. There are no truly meaningless or gibberish phrases in spoken languages.

To accommodate ambiguity, we can generalize the result on intention inference to

$$|P(y | x) - L(y | x, \Theta_1 = \theta_x)| \leq \varepsilon(x),$$

where $\varepsilon(x)$ denotes the ambiguity of the prompt. For $\varepsilon$-ambiguous languages, it is always upper-bounded by $\varepsilon$.

It can also be proven that if we provide more messages conveying the same intention as inputs, then the ambiguity decreases exponentially. This formalizes the observation that redundancy in the prompt increases the reliability of the results.
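
A toy Bayesian calculation (not the proof from [1], just an illustration with made-up numbers) shows the same effect: with two candidate intentions and ambiguous messages that are merely more likely under the intended one, the residual probability of the wrong intention shrinks exponentially with the number of messages.

```python
# Toy illustration of redundancy reducing ambiguity. Two candidate intentions;
# each observed message is ambiguous, but more likely under the intended one.
# All probabilities are made up for illustration.
prior = {"theta_A": 0.5, "theta_B": 0.5}
message_likelihood = {"theta_A": 0.8, "theta_B": 0.2}  # P(message | intention)

for m in (1, 2, 4, 8):
    unnorm = {t: prior[t] * message_likelihood[t] ** m for t in prior}
    z = sum(unnorm.values())
    print(m, unnorm["theta_B"] / z)  # residual ambiguity, shrinks like (0.2 / 0.8) ** m
```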

On the other hand, in human communication, the ambiguity of a message is decreased by external factors, such as body language or shared experience and knowledge. There is clearly no way to replicate these circumstances in LLMs, so this might be considered a shortcoming.

V. In-context learning

Capitalizing on the results from the previous section, one can quickly derive a formula for the efficiency of few-shot prompting. Assume we are prompting the LLM for a completion of the input $i_{m+1}$ with instruction $x$, providing $m$ example input-output pairs $(i_1, o_1), \ldots, (i_m, o_m)$, which are all trying to convey the same intention $\theta_*$. For an $\varepsilon$-ambiguous language, it holds that

$$|P(y | x, i_1, o_1, \ldots, i_m, o_m, i_{m+1}) - L(y | x, \Theta_1 = \theta_*)| \leq \varepsilon^m \cdot \varepsilon(x) \cdot \varepsilon(i_{m+1}) \leq \varepsilon^{m+2},$$

which clearly underlines the significance of the examples, while demonstrating that the point of diminishing returns with respect to their number usually arrives relatively quickly.
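
Plugging in a concrete (made-up) ambiguity level makes the diminishing returns visible: the bound $\varepsilon^{m+2}$ collapses after just a handful of examples.

```python
# Evaluating the epsilon^(m + 2) bound for an arbitrary illustrative
# ambiguity level; epsilon = 0.3 is a made-up value.
epsilon = 0.3
for m in range(6):
    print(m, epsilon ** (m + 2))
# m = 0 -> 0.09, m = 1 -> 0.027, m = 2 -> 0.0081, m = 3 -> 0.00243, ...
```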

VI. Chain-of-thought prompting

Discussions surrounding the genuineness of the reasoning capabilities of LLMs are still ongoing, and the latent space model is somewhat orthogonal to these considerations (though it leans slightly towards the hypothesis of brute-force memorization of causal transitions). However, it does make it apparent how models are capable of capitalizing on their superior accuracy on simpler logical transitions.

By $I(\theta_k | \theta_1, \ldots, \theta_{k-1})$ we denote the probability of a logical transition, and we define it as

$$I(\theta_k | \theta_1, \ldots, \theta_{k-1}) = \frac{I(\theta_1, \ldots, \theta_k)}{I(\theta_1, \ldots, \theta_{k-1})}.$$

When $k < h$, $I(\theta_1, \ldots, \theta_k)$ is defined analogously to $L(x_1, \ldots, x_k)$.

For simplicity, consider the few-shot chain-of-thought prompting technique and assume we’re providing examples showcasing a coherent chain of thought $\theta_1 \rightarrow \ldots \rightarrow \theta_k$. Let $x_1$ be the input message, and let $x_k$ be an example message reaching the correct conclusion. When zero-shot prompting directly for the conclusion, the probability of the model returning the correct answer is

$$P(x_k | x_1) = \frac{L(x_1, x_k)}{L(x_1)} = \frac{I(\theta_1, \theta_k)\, M(x_1 | \theta_1)\, M(x_k | \theta_k)}{I(\theta_1)\, M(x_1 | \theta_1)} = I(\theta_k | \theta_1)\, M(x_k | \theta_k).$$

Most notably, if the transition $\theta_1 \rightarrow \theta_k$ is under-represented in the training corpus, this probability is low. This wouldn’t change even if we considered all correct $x_k$ simultaneously.

In the case of chain-of-thought, this probability changes to (at least)

$$P(x_k | x_1) \geq (1 - \varepsilon^{m+2})^{k-1} \cdot I(\theta_k | \theta_1, \ldots, \theta_{k-1}) \cdot M(x_k | \theta_k).$$

We control the number of examples, so we can easily assume that the first term is close to 1. The middle term captures the soundness of the showcased chain of thought with respect to the LLM’s training corpus. As empirical experiments show, for many different applications, finding such a chain of thought is perfectly viable.
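
The effect of the middle term is easy to see on a toy intention corpus (invented here purely for illustration, not taken from [1]): a direct jump from the problem statement to the answer is rare, while the same answer almost always follows once the intermediate step is present in the context.

```python
# Toy corpus of intention sequences, made up for illustration: the conclusion
# "ANSWER" rarely follows "PROBLEM" directly, but reliably follows "STEP".
corpus = (
    [("PROBLEM", "STEP", "ANSWER")] * 90   # worked solutions with a visible intermediate step
    + [("PROBLEM", "ANSWER")] * 5          # rare direct jumps to the answer
    + [("PROBLEM", "GIVE_UP")] * 5
)

def transition_prob(target, *context):
    """Empirical I(target | context): how often `target` immediately follows
    the given intention context in the toy corpus."""
    matches = [s for s in corpus if s[:len(context)] == context and len(s) > len(context)]
    hits = [s for s in matches if s[len(context)] == target]
    return len(hits) / len(matches)

print(transition_prob("ANSWER", "PROBLEM"))          # 0.05 (zero-shot jump)
print(transition_prob("ANSWER", "PROBLEM", "STEP"))  # 1.0  (after the chain of thought)
```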

VII. Conclusion

While explanations of the proficiency of LLMs are still speculative, mostly due to the insurmountable volume of parameters and training corpora, our understanding of their abilities and limitations is continuously increasing. A formal proof that transformer-based LLMs capture the marginal distributions of the ideal latent space model is a tangible confirmation of a breakthrough, as this is a decades-old problem with many documented, fruitless attempts. In addition, it provides a neat explanation of multiple advantages of LLMs, which might not be definitive, but is definitely intuitive.
