Intuition behind self-attention

In this post I want to share the intuition behind a core mechanism in modern AI models. This post doesn't require a technical background, but if you have one you might still find the intuition helpful!

This core mechanism ("algorithm") is called self-attention. The algorithm is only a tiny part of any model, but it is striking that most state-of-the-art AI models use it (as of 2024).

Architecture of transformers and attention blocks, from the "Attention is all you need" paper

The intuition: select important parts of the original content

You can think of self-attention as an algorithm that creates a summary of some information passed to a model ("input data", which can be anything: images, text, sounds, numbers). It's clever in that the summary is not created from scratch; rather, it's made up of passages of the original content it's summarising, selected by assigning weights to parts of that content according to how important they are.

Example: I could summarise the paragraph above by reusing some of its own words, rather than creating a summary that uses different words: "self-attention creates a summary of some information by assigning weights to parts of the original content according to how important they are"

The cleverness here is that the algorithm can assign these weights autonomously, without external supervision.
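If you happen to read a little Python, here is a toy sketch of that idea. It's an analogy rather than the real algorithm, and all the words and weights below are made up for illustration:

```python
# Toy sketch: a "summary" built only from pieces of the original input,
# weighted by how important each piece is. Numbers are invented.
pieces = ["self-attention", "creates", "a", "summary", "of", "some", "information"]
importance = [0.30, 0.25, 0.02, 0.25, 0.02, 0.02, 0.14]  # weights sum to 1

# Keep the pieces carrying most of the weight, in their original order:
summary = [p for p, w in zip(pieces, importance) if w > 0.1]
print(" ".join(summary))  # -> "self-attention creates summary information"
```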

How are important parts selected?

The way importance is assigned to certain parts of the information is quite simple: the algorithm just selects the parts that are most representative of the whole input.

If we are applying self-attention to text, the most representative words in a sentence can simply be the ones most similar to the others around them, since those are best suited to distilling the core meaning of the sentence.

Example: in the sentence

"Cats and dogs, like all pets, like cuddles"

the word pets represents both cats and dogs well, which makes them somewhat redundant.
We could then summarise the sentence above as "pets like cuddles".

This is clearly a very rough approximation of what a real summary would look like - but it works like a charm!
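For technically inclined readers, here is a minimal NumPy sketch of this similarity-based weighting. It's a bare-bones version that leaves out the learned query/key/value projections a real transformer adds on top, and the word vectors are invented purely for illustration:

```python
import numpy as np

def self_attention(x):
    """Bare-bones self-attention: each row of x is the vector for one word.
    Importance comes from similarity (dot products) between the words."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                  # pairwise similarity, scaled by sqrt(d)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax: each row sums to 1
    return weights @ x                             # each output is a mix of the inputs

# Made-up 2-d vectors standing in for "cats", "dogs" and "pets";
# "pets" sits between the other two, so it is similar to both.
x = np.array([[1.0, 0.1],    # "cats"
              [0.9, 0.3],    # "dogs"
              [0.95, 0.2]])  # "pets"
print(self_attention(x))
```

Each output row is a weighted mix of all the input rows, with the weights driven purely by similarity: exactly the "select what is most representative" intuition above.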

Why is a summary helpful?

The summary essentially highlights the key pieces of information and discards the rest, removing some noise. This allows a model that uses it to find important information in large amounts of data.

Interestingly, this is the same philosophy behind search algorithms and the PageRank algorithm, which has been an important part of Google Search.

Read more

There's plenty of information, videos and material on the web for those who want to dig deeper. Here I will just point you to the key original paper on attention: Attention is all you need

If you have any thoughts on this, I am keen to hear them - leave a comment!

Best,
Andrea