When DeepSeek announced the release of its chatbot in January, there was widespread bewilderment. How had a Chinese company been able to develop something that could compete with OpenAI’s ChatGPT or Google’s Gemini despite a US export ban on the latest Nvidia chips that almost all large language models rely on? DeepSeek said it had built its model at a cost of only $5.5 million, a fraction of the amount spent by its US competitors (OpenAI’s operational expenses in 2024 were around $9 billion). Some have suggested that the company cannibalised outputs from OpenAI’s GPT models, or somehow circumvented the export ban, but it seems that DeepSeek has been able to engineer an incredibly powerful system despite very limited resources. The chatbot has two modes: one works a little like ChatGPT; the other, DeepThink-R1, is optimised for problem-solving, but is also a little weird. Given this test question as a prompt:

In a population where red-green colour blindness (an X-linked recessive condition) affects 1 in 200 males, what is the likelihood that a phenotypically unaffected couple (male and female) will have a child with this condition? Answer choices: a) 1/50, b) 1/1000, c) 1/400, d) 1/100, e) 1/1600, f) 1/250, g) 1/600, h) 1/300, i) 1/800, j) 1/200

R1 generated a chain of thought almost two thousand words long. Here’s a bit of it:

Then, if she is a carrier, the chance of having an affected child is 1/2 (since each son has a 50% chance of getting the X^c and the chance the child is a son is 1/2). Wait, but actually, the chance of having a son is 1/2, and given that the child is a son, the chance he is affected is 1/2. Therefore, the probability is 1/2 (son) * 1/2 (X^c from mother) = 1/4. Wait, no, actually, no. Wait, if the mother is a carrier (X^C X^c), each gamete she produces has a 50% chance of X^C and 50% X^c. The father contributes Y or X^C. Since the father is unaffected, he must be X^C Y.

Then it thinks again:

But wait, actually, no. The mother’s contribution is independent of the child’s sex. The child’s sex is determined by the father’s gamete (X or Y). Wait, no, actually, the father’s gamete determines the sex. The father can contribute either X^C or Y. Wait, the father in this case is X^C Y, so he can contribute X^C or Y.

R1 gets to the right answer eventually, and provides a succinct explanation, but only after what seems an absurd display of self-doubt and overthinking. The repetition and rethinking are consequences of the training process developed by DeepSeek and give the model high scores on problem-solving benchmarks. But there is still something unsettling about it: if a student or colleague shared this stream of consciousness, you would worry that they didn’t really know what they were doing. You are seeing into R1’s ‘thinking’, a remarkable but alien process by which a statistical model of language constructs a solution to a problem that requires reasoning.
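
The calculation itself is short. Here is a sketch in Python of the standard reasoning, using the usual Hardy-Weinberg shortcut that the frequency of carrier women is roughly twice the allele frequency:

```python
# Sketch of the standard population-genetics reasoning behind the question above.
q = 1 / 200              # allele frequency: males carry one X, so male incidence = q
carrier_mother = 2 * q   # ~1/100: approximate frequency of unaffected carrier women
son = 1 / 2              # the child must be a boy to be affected
inherits_Xc = 1 / 2      # a carrier mother passes on the affected X half the time

print(carrier_mother * son * inherits_Xc)   # 0.0025 = 1/400, option (c)
```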

Liang Wenfeng, DeepSeek’s CEO, studied engineering at Zhejiang University in Hangzhou, a city of twelve million people and a centre of the Chinese tech sector. After several unsuccessful attempts at AI start-ups he and two former classmates launched High-Flyer, a hedge fund using AI to carry out algorithmic trading. The company, he said, had one hundred GPUs (the chips used in AI programming) in 2015, one thousand in 2019 and ten thousand by 2021. He launched DeepSeek in April 2023. It was set up as a research initiative unconstrained by commercial imperatives, with the aim of achieving artificial general intelligence – the ability to carry out any intellectual task as well as a human can. Liang aimed to build on the successes of companies like OpenAI, and set about recruiting the best young talent in China, apparently quoting Truffaut’s advice to young filmmakers, ‘be desperately ambitious and desperately sincere.’ All of the two hundred or so engineers named in the paper announcing R1 were trained at Chinese universities – this is striking, since a great many young Chinese people interested in AI leave to study in the US or UK.

The extraordinary improvements in large language models over the past few years have mostly been achieved by building ever larger models, trained on more data and better hardware. OpenAI released GPT-2 in 2019 and the larger and more powerful GPT-3 in May 2020. In November 2022 it launched ChatGPT, alongside an updated model, GPT-3.5. A significantly larger model, GPT-4, came out in March 2023. Scaling up again from GPT-4 has proved a significant challenge, however, and since 2023 improvements have come from tweaking the existing models to perform ‘thinking’, in an attempt to close the gap between what language models can do and what we might expect of something that exhibited genuine intelligence. OpenAI released its ‘reflective’ model, o1, in September last year and Google brought out Gemini 2.0 Flash Thinking Experimental on 21 January, the day after R1.

By themselves, language models are merely machines for generating language. The basic idea of a large language model is that you enter a ‘prompt’, like the question I asked about colour blindness, and it responds with a ‘completion’, an answer. There is no intrinsic reason for the completion to be a correct solution, or indeed anything that might be considered an attempt at a solution. One very simple way to address a language model’s deficiencies in problem-solving is to express problems not as exercises in reasoning but as exercises in the generation of language. You get much better results if, rather than asking ChatGPT to solve a problem, you ask it to describe a step-by-step solution to the problem. When you enter text into ChatGPT, what you write is appended to a ‘system prompt’ – the set of general instructions and guidelines used by the model – and the composite of the two generates the completion. A straightforward way of building a ‘thinking’ version of a language model is to make the system prompt ensure that the user’s prompt generates a chain of thought which can then be used to generate a final answer. OpenAI, which presumably does something like this, hides its models’ reasoning from users, perhaps because the content would reveal the extent to which it has infringed the intellectual property of the writers and publishers whose material its models are trained on, or perhaps because its competitors would find ways to use it to enhance their own models, infringing OpenAI’s own intellectual property. Google, like DeepSeek, offers users the option of inspecting its model’s reasoning. Gemini 2.0 Flash Thinking Experimental is just as long-winded as DeepThink-R1, but without the verbal tics that make DeepSeek seem so anxious.
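
The mechanics are no more elaborate than gluing strings together. A sketch (generate here is a hypothetical stand-in for a call to whichever language model you have access to, not any particular vendor’s API):

```python
# 'generate' is a hypothetical stand-in for a call to a language model:
# it takes a prompt string and returns the model's completion.
def generate(prompt: str) -> str:
    raise NotImplementedError("replace with a real model call")

SYSTEM_PROMPT = (
    "You are a careful problem-solver. Reason through the problem step by step "
    "between <think> and </think> tags, then give a final answer on its own line."
)

def thinking_completion(user_prompt: str) -> str:
    # The user's question is appended to the general instructions;
    # the composite of the two is what the model actually completes.
    composite = SYSTEM_PROMPT + "\n\n" + user_prompt
    full_output = generate(composite)
    # The visible answer is whatever follows the chain of thought.
    return full_output.split("</think>")[-1].strip()
```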

Since GPT-2, OpenAI has been altering the parameters of its language models to address other weaknesses. Hate speech or incitement to criminality, for example, can be prevented by what’s called supervised fine-tuning: humans assess numerous examples of the model’s output so that it can be trained to generate only acceptable responses. Sometimes these assessments are used to train a different network, known as the ‘critic’ or ‘value’ network, which takes over the human role of assessing the language model’s completions. The value network’s assessments are used, in a technique known as reinforcement learning, to alter the parameters of the language model so that it generates completions that the value network rates more highly. AI research has paid a huge amount of attention to language models and the techniques used in their creation; value networks have received much less attention, but these are huge artefacts, similar in scale to the language models themselves, and vastly expensive to develop, even if the initial human annotation is outsourced to countries where educated English speakers can be hired at the lowest possible wages. DeepSeek uses a different and much cheaper approach to reinforcement learning, one that doesn’t need a value network. The model’s responses are scored mechanically using two simple rules: does the response contain the right answer to the problem, and does it set out its reasoning? Outputs are scored in batches; the average score for the batch is calculated and the model’s parameters adjusted so that future responses are more likely to resemble those that scored above average.
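
A sketch of the scoring step shows how cheap it is compared with maintaining a second network. The two rules below, and the use of think tags, are illustrative rather than DeepSeek’s exact recipe, and only the scoring is shown, not the parameter update itself:

```python
import re
from statistics import mean, pstdev

def reward(response: str, correct_answer: str) -> float:
    # Rule 1: does the response set out its reasoning in the expected format?
    has_reasoning = bool(re.search(r"<think>.*</think>", response, re.DOTALL))
    # Rule 2: does it end with the known correct answer?
    has_answer = response.rstrip().endswith(correct_answer)
    return float(has_reasoning) + float(has_answer)

def batch_advantages(responses: list[str], correct_answer: str) -> list[float]:
    # Score a whole batch, then express each score relative to the batch average:
    # above-average responses get positive weight in the parameter update,
    # below-average ones get negative weight.
    scores = [reward(r, correct_answer) for r in responses]
    mu, sigma = mean(scores), pstdev(scores) or 1.0
    return [(s - mu) / sigma for s in scores]
```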

The paper that DeepSeek released along with R1 in January, ‘Incentivising Reasoning Capability in LLMs via Reinforcement Learning’, describes an earlier version of a thinking model which, as a result of extensive reinforcement learning, did exceptionally well at reasoning, but had the crippling weakness that its outputs were an unintelligible mix of Chinese and English. The model was still capable of being used as a foundation, and after supervised fine-tuning and further reinforcement learning, DeepSeek arrived at R1.

DeepSeek’s most impressive technical innovation is MLA or Multi-Head Latent Attention. A large language model is created, in essence, by training a program to predict missing words. Imagine removing a word from a sentence and feeding that sentence into a program that generates, for all the many words in its database, an estimate of the probability of each of them being the missing word. The parameters of the program are then updated to increase the estimated probability associated with the word that was removed, and lower those associated with the alternatives. With sufficient training, the program will be able, given a string of words, to select the most appropriate next word, add it to the string and select another word to follow it. It is utterly astonishing that such a simple process can create long sequences of text that are not only intelligible but seem to be the product of intelligence. It works, in part, because the learning mechanism that generates the probabilities measures the extent to which each word in a sentence alters the impact that every other word in the sentence has on the estimated probabilities for the candidate missing word. When this process, known as ‘attention’, is carried out not just at the level of the sentence but in much longer passages of text, it builds a remarkably powerful representation of the way words combine to convey messages.
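
A single attention ‘head’ reduces to a few lines of linear algebra. The sketch below is a toy version that leaves out the masking, the multiple heads and the stacked layers of a real transformer:

```python
import numpy as np

def attention_head(X, Wq, Wk, Wv):
    # X holds one row per word; each row is that word's vector representation.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # How strongly each word should attend to every other word in the passage.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    # Each word's new representation is a weighted blend of all the others.
    return weights @ V
```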

At the heart of this process, DeepSeek makes significant optimisations that hugely reduce its costs. The meaning of each word processed in the training of an LLM is represented as a vector, a list of numbers. You can think of it as the word’s co-ordinates in a many-dimensional space. The names of countries might be close together in one part of the space, words for food concentrated in another. A 2013 paper from Google showed that starting from the co-ordinates they generated for ‘Japan’ and then moving to those for ‘sushi’, and then repeating the same steps but starting from ‘Germany’, took you to ‘bratwurst’. The concept of national dish was encoded somewhere in the dimensions. GPT-3 uses 12,288 dimensions, a surprisingly small number when you reflect that it isn’t just a matter of needing, for example, a vector for the dictionary definition of ‘one’, but of having vectors to capture its particular meaning wherever it occurs in the data, including, say, in the phrase ‘the one less travelled’. Processing long sequences of text, however, involves keeping track of more numbers than can fit into a standard GPU’s memory. DeepSeek found an elegant way of deriving a compressed form of the vectors during training.
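
The flavour of the trick can be conveyed by a toy sketch of low-rank compression. The sizes and weight matrices below are illustrative only; Multi-Head Latent Attention as described in DeepSeek’s papers involves considerably more machinery, including how the compressed vectors interact with positional encoding:

```python
import numpy as np

d_model, d_latent = 12288, 512   # illustrative sizes, not DeepSeek's actual dimensions
rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)   # learned in training
W_up_k = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)

def compress(h):
    # Only this short latent vector needs to be kept in memory for each word...
    return h @ W_down

def reconstruct(latent):
    # ...full-size key and value vectors are rebuilt from it when attention needs them.
    return latent @ W_up_k, latent @ W_up_v
```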

The most disruptive part of the DeepSeek paper is its final section. A ChatGPT user usually interacts with it via a website: they type a prompt into their phone or laptop that is sent to OpenAI’s servers, which use the model to generate a response that is sent back and, nearly instantaneously, appears on the user’s screen. For those who want to do something more complicated, OpenAI provides an application programming interface, or API, so that developers can incorporate a GPT model’s responses into their own software. But the data is still sent to OpenAI’s servers for processing, and developers can’t alter the model. Meta, the parent company of Facebook, produces a smallish language model called Llama that is freely available in a version with a mere eight billion parameters (GPT-4 is thought to have 1.8 trillion), small enough that anyone can download and run it on their own machine. Although Llama doesn’t perform as well as better-known models, it is popular with researchers who want to retrain a model using their own data and don’t want to send data elsewhere. DeepSeek has succeeded in distilling its massive R1 model into a Llama model by training Llama to replicate R1’s responses. R1 has 671 billion parameters and it is mind-boggling that it could be refined into an eight billion parameter model. This reduced version of R1 is publicly available. Anyone can download a model which, in theory, is almost as good as the state of the art, and will, in theory, run on a laptop. Small businesses, hospitals, schools and local councils could use it to process data on standard PCs running inside their firewall.
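
Distillation in this sense is just supervised fine-tuning on the larger model’s outputs. A schematic sketch, in which teacher, student and fine_tune are hypothetical placeholders rather than anyone’s actual code:

```python
def distil(prompts, teacher, student, fine_tune):
    # 1. Collect the large model's responses to a big set of prompts,
    #    chains of thought included.
    training_pairs = [(p, teacher(p)) for p in prompts]
    # 2. Fine-tune the small model to reproduce those responses token by token;
    #    no access to the teacher's weights is needed, only its outputs.
    return fine_tune(student, training_pairs)
```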

It is hard to know exactly how innovative DeepSeek is. Demis Hassabis of Google DeepMind has said that he doesn’t think it is an outlier, and many of the techniques described as novel in DeepSeek’s reports are similar to those implemented by others. The much quoted figure of $5.5 million for the cost of training the DeepSeek model is taken from a 2024 paper and refers only to the cost of the computation required for the final training run for the model that provided the foundation for R1. Nvidia responded to the US export ban by releasing the H800, a chip that offered the company’s Chinese customers the highest possible performance within the terms of the ban (the export of these chips was banned in turn in October 2023). DeepSeek says it used a cluster of 2048 H800 chips, which would have cost around $50 million, still a tiny sum when compared with the budgets of leading tech companies, and it may well also have had access to the ten thousand chips High-Flyer had purchased before the ban. DeepSeek is committed to publishing details of its models and making the models themselves publicly available, but its openness doesn’t extend to details of the material the models were trained on. Nor has it, as far as I’ve seen, explained how the company complies with government regulations that prohibit Chinese models from generating criticism of the Communist Party.
