LLMs Aren't World Models

57 points by ingve a day ago

This essay could probably benefit from some engagement with the literature on “interpretability” in LLMs, including the empirical results about how knowledge (like addition) is represented inside the neural network. To be blunt, I’m not sure that being smart and reasoning from first principles, after asking the LLM a lot of questions and cherry-picking what it gets wrong, gets to any novel insights at this point. And it already feels a little out of date: with LLMs getting gold on the mathematical Olympiad, they clearly have a pretty good world model of mathematics. I don’t think cherry-picking a failure to prove 2 + 2 = 4 in the particular way the writer wanted to see disproves that at all.

LLMs have imperfect world models, sure. (So do humans.) That’s because they are trained to be generalists and because their internal representations of things are massively compressed, since they don’t have enough weights to encode everything. I don’t think this means there are some natural limits to what they can do.

AyyEye 18 hours ago

With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever. That addition (something which only takes a few gates in digital logic) happens to be overfit into a few nodes on multi-billion node networks is hardly a surprise to anyone except the most religious of AI believers.
- BobbyJo 15 hours ago
  
  The core issue there isn't that the LLM isn't building internal models to represent its world, it's that its world is limited to tokens. Anything not represented in tokens, or token relationships, can't be modeled by the LLM, by definition.
  It's like asking a blind person to count the number of colors on a car. They can give it a go and assume glass, tires, and metal are different colors as there is likely a correlation they can draw from feeling them or discussing them. That's the best they can do though as they can't actually perceive color.
  In this case, the LLM can't see letters, so asking it to count them causes it to try and draw from some proxy of that information. If it doesn't have an accurate one, then bam, strawberry has two r's.
  I think a good example of LLMs building models internally is this: https://rohinmanvi.github.io/GeoLLM/
  LLMs are able to encode geospatial relationships because they can be represented well by token relationships. Two countries that are close together will be talked about together much more often than two countries far from each other.
  
  vrighter 8 hours ago
  
  That is just not a solid argument. There are countless examples of LLMs splitting "blueberry" into "b l u e b e r r y", which would contain one token per letter. And then they still manage to get it wrong.
  Your argument is based on a flawed assumption, that they can't see letters. If they didn't they wouldn't be able to spell the word out. But they do. And when they do get one token per letter, they still miscount.
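For reference, the character-level task being argued about is trivial once the letters are available one by one. The snippet below is only a minimal sketch of that ground truth; the helper names `count_letter` and `spell_out` are made up for illustration, and it says nothing about how any particular LLM tokenizes or counts.

```python
def count_letter(word: str, letter: str) -> int:
    """Case-insensitive count of a single letter in a word."""
    return word.lower().count(letter.lower())


def spell_out(word: str) -> str:
    """Space-separate the characters, e.g. 'b l u e b e r r y'."""
    return " ".join(word)


# The words and letters mentioned in this thread.
for word, letter in [("blueberry", "b"), ("strawberry", "r"),
                     ("carrot", "r"), ("pineapple", "p")]:
    print(f"{spell_out(word)} -> {count_letter(word, letter)} x '{letter}'")
```

Given the spelled-out form, counting is a single pass over the characters, which is why a miscount after a correct spelling is hard to blame on tokenization alone.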
- libraryofbabel 15 hours ago
  
  > they clearly don't have any world model whatsoever
  Then how did an LLM get gold on the mathematical Olympiad, where it certainly hadn’t seen the questions before? How on earth is that possible without a decent working model of mathematics? Sure, LLMs might make weird errors sometimes (nobody is denying that), but clearly the story is rather more complicated than you suggest.
- andyjohnson0 18 hours ago
  
  > With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.
  Is this a real defect, or some historical thing?
  I just asked GPT-5:
  How many "B"s in "blueberry"?
  and it replied:
  There are 2 — the letter b appears twice in "blueberry".
  I also asked it how many Rs in Carrot, and how many Ps in Pineapple, and it answered both questions correctly too.
  
  libraryofbabel 17 hours ago
  
  It’s a historical thing that people still falsely claim is true, bizarrely without trying it on the latest models. As you found, leading LLMs don’t have a problem with it anymore.
  
  pydry 17 hours ago
  
  Depends how you define historical. If by historical you mean more than two days ago then, yeah, it's ancient history.
  
  ThrowawayR2 17 hours ago
  
  It was discussed and reproduced on GPT-5 on HN a couple of days ago: https://news.ycombinator.com/item?id=44832908
  Sibling poster is probably mistakenly thinking of the strawberry issue from 2024 on older LLM models.
  
  nosioptar 17 hours ago
  
  Shouldn't the correct answer be that there is not a "B" in "blueberry"?
  
  bgwalter 17 hours ago
  
  It is not historical:
  https://kieranhealy.org/blog/archives/2025/08/07/blueberry-h...
  Perhaps they have a hot fix that special cases HN complaints?
  
  AyyEye 13 hours ago
  
  They clearly RLHF out the embarrassing cases and make cheating on benchmarks into a sport.
- yosefk 18 hours ago
  
  Actually I forgive them those issues that stem from tokenization. I used to make fun of them for listing datum as a noun whose plural form ends with an i, but once I learned about how tokenization works, I no longer do it - it feels like mocking a person's intelligence because of a speech impediment or something... I am very kind to these things, I think
yosefk 18 hours ago

Your being blunt is actually very kind, if you're describing what I'm doing as "being smart and reasoning from first principles"; and I agree that I am not saying something very novel, at most it's slightly contrarian given the current sentiment.
My goal is not to cherry-pick failures for its own sake as much as to try to explain why I get pretty bad output from LLMs much of the time, which I do. They are also very useful to me at times.
Let's see how my predictions hold up; I have made enough to look very wrong if they don't.
Regarding "failure disproving success": it can't, but it can disprove a theory of how this success is achieved. And I have much better examples than the 2+2=4, which I am citing as something that sorta works these days.
- libraryofbabel 17 hours ago
  
  I mean yeah, it’s a good essay in that it made me think and try to articulate the gaps, and I’m always looking to read things that push back on AI hype. I usually just skip over the hype blogging.
  I think my biggest complaint is that the essay points out flaws in LLM’s world models (totally valid, they do confidently get things wrong and hallucinate in ways that are different, and often more frustrating, from how humans get things wrong) but then it jumps to claiming that there is some fundamental limitation about LLMs that prevents them from forming workable world models. In particular, it strays a bit towards the “they’re just stochastic parrots” critique, e.g. “that just shows the LLM knows to put the words explaining it after the words asking the question.” That just doesn’t seem to hold up in the face of e.g. LLMs getting gold on the Mathematical Olympiad, which features novel questions. If that isn’t a world model of mathematics - being able to apply learned techniques to challenging new questions - then I don’t know what is.
  A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the alpha zero chess engine was trained, in fact.
  I do think there’s a lot that the essay gets right. If I was to recast it, I’d put it something like this:
  * LLMs have imperfect models of the world, which are conditioned by how they’re trained on next token prediction.
  * We’ve shown we can drastically improve those world models for particular tasks by reinforcement learning. You kind of allude to this already by talking about how they’ve been “flogged” to be good at math.
  * I would claim that there’s no particular reason these RL techniques aren’t extensible in principle to beat all sorts of benchmarks that might look unrealistic now. (Two years ago it would have been an extreme optimist position to say an LLM could get gold on the mathematical Olympiad, and most LLM skeptics would probably have said it could never happen.)
  * Of course it’s very expensive, so most world models LLMs have won’t get the RL treatment and so will be full of gaps, especially for things that aren’t amenable to RL. It’s good to beware of this.
  I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. (If an LLM was given tons of RL training on that codebase, it could build a better world model, but that’s expensive and very challenging to set up.) This problem is hinted at in your essay, but the lack of on-the-job learning isn’t centered. But it’s the real elephant in the room with LLMs and the one the boosters don’t really have an answer to.
  Anyway thanks for writing this and responding!
  
  yosefk 17 hours ago
  
  I'm not saying that LLMs can't learn about the world - I even mention how they obviously do it, even at the learned embeddings level. I'm saying that they're not compelled by their training objective to learn about the world and in many cases they clearly don't, and I don't see how to characterize the opposite cases in a more useful way than "happy accidents."
  I don't really know how they are made "good at math," and I'm not that good at math myself. With code I have a better gut feeling of the limitations. I do think that you could throw them off terribly with unusual math questions to show that what they learned isn't math, but I'm not the guy to do it; my examples are about chess and programming where I am more qualified to do it. (You could say that my question about the associativity of blending and how caching works sort of shows that it can't use the concept of associativity in novel situations; not sure if this can be called an illustration of its weakness at math)
armchairhacker 18 hours ago

Any suggestions from this literature?
- libraryofbabel 14 hours ago
  
  The papers from Anthropic on interpretability are pretty good. They look at how certain concepts are encoded within the LLM.
lossolo 16 hours ago

https://arxiv.org/abs/2508.01191

jonplackett 17 hours ago

I just tried a few things that are simple and a world model would probably get right. Eg

Question to GPT5: I am looking straight on to some objects. Looking parallel to the ground.

In front of me I have a milk bottle, to the right of that is a Coca-Cola bottle. To the right of that is a glass of water. And to the right of that there’s a cherry. Behind the cherry there’s a cactus and to the left of that there’s a peanut. Everything is spaced evenly. Can I see the peanut?

Answer (after choosing thinking mode)

No. The cactus is directly behind the cherry (front row order: milk, Coke, water, cherry). “To the left of that” puts the peanut behind the glass of water. Since you’re looking straight on, the glass sits in front and occludes the peanut.

It doesn’t consider transparency until you mention it, then apologises and says it didn’t think of transparency

RugnirViking 17 hours ago

this seems like a strange riddle. In my mind I was thinking that regardless of the glass, all of the objects can be seen (due to perspective, and also the fact you mentioned the locations, meaning you're aware of them).
It seems to me it would only actually work in an orthographic perspective, which is not how our reality works
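To make the orthographic reading of the puzzle concrete, here is a minimal sketch of the scene as a two-row grid. Everything in it is an assumption for illustration: the `Item` class, the column/row encoding, and in particular the choice to treat only the glass of water as transparent and every other object (including both bottles) as opaque. Under orthographic projection an object is taken to be visible if everything directly in front of it is transparent.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Item:
    name: str
    column: int                # 0 = leftmost, increasing to the right
    row: int                   # 0 = front row (nearest the viewer), 1 = behind it
    transparent: bool = False  # assumption: only the glass of water is see-through


SCENE = [
    Item("milk bottle", 0, 0),
    Item("Coca-Cola bottle", 1, 0),
    Item("glass of water", 2, 0, transparent=True),
    Item("cherry", 3, 0),
    Item("cactus", 3, 1),   # behind the cherry
    Item("peanut", 2, 1),   # to the left of the cactus, so behind the glass
]


def visible_orthographic(target_name: str, scene: list[Item]) -> bool:
    """True if every item directly in front of the target is transparent."""
    target = next(item for item in scene if item.name == target_name)
    blockers = [item for item in scene
                if item.column == target.column and item.row < target.row]
    return all(blocker.transparent for blocker in blockers)


print("peanut visible:", visible_orthographic("peanut", SCENE))
print("cactus visible:", visible_orthographic("cactus", SCENE))

# Same scene, but treating the glass as opaque (the reading in GPT-5's answer).
opaque_scene = [replace(item, transparent=False) for item in SCENE]
print("peanut visible, opaque glass:", visible_orthographic("peanut", opaque_scene))
```

With a real perspective camera the check would have to compare lines of sight rather than columns, which is the objection raised above.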

keeda 15 hours ago

That whole bit about color blending and transparency and LLMs "not knowing colors" is hard to believe. I am literally using LLMs every day to write image-processing and computer vision code using OpenCV. It seamlessly reasons across a range of concepts like color spaces, resolution, compression artifacts, filtering, segmentation and human perception. I mean, removing the alpha from a PNG image was a preprocessing step it wrote by itself as part of a larger task I had given it, so it certainly understands transparency.

I even often describe the results, e.g. "this fails in X manner when the image has grainy regions", and it figures out what is going on and adapts the code accordingly. (It works with uploading actual images too, but those consume a lot of tokens!)

And all this in a rather niche domain that seems relatively less explored. The images I'm working with are rather small and low-resolution, which most literature does not seem to contemplate much. It uses standard techniques well known in the art, but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.

If it can reason about images and vision and write working code for niche problems I throw at it, whether it "knows" colors in the human sense is a purely philosophical question.

ej88 17 hours ago

This article is interesting but pretty shallow.

0(?): there’s no provided definition of what a ‘world model’ is. Is it playing chess? Is it remembering facts like how computers use math to blend colors? If so, then ChatGPT: https://chatgpt.com/s/t_6898fe6178b88191a138fba8824c1a2c has a world model, right?

1. The author seems to conflate context windows with failing to model the world in the chess example. I challenge them to give a SOTA model an image of a chess board, or the notation, and ask it about the position. It might not give you GM level analysis but it definitely has a model of what’s going on.

2. Without explaining which LLM they used or sharing the chats these examples are just not valuable. The larger and better the model, the better its internal representation of the world.

You can try it yourself. Come up with some question involving interacting with the world and / or physics and ask GPT-5 Thinking. It’s got a pretty good understanding of how things work!

https://chatgpt.com/s/t_689903b03e6c8191b7ce1b85b1698358

yosefk 17 hours ago

A "world model" depends on the context which defines which world the problem is in. For chess, which moves are legal and needing to know where the pieces are to make legal moves are parts of the world model. For alpha blending, it being a mathematical operation and the visibility of a background given the transparency of the foreground are parts of the world model.
The examples are from all the major commercial American LLMs as listed in a sister comment.
You seem to conflate context windows with tracking chess pieces. The context windows are more than large enough to remember 10 moves. The model should either track the pieces, or mention that it would be playing blindfold chess absent a board to look at and it isn't good at this, so could you please list the position after every move to make it fair, or it doesn't know what it's doing; it's demonstrably the latter.
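For readers who want the alpha-blending point in concrete terms, the following is a minimal sketch of the standard "over" operator on premultiplied-alpha pixels. It is not taken from the essay; the function names and the three sample layers are made up for illustration. It checks two of the properties mentioned in this thread: an opaque foreground hides the background, and blending is associative (so a stack of layers can be pre-composited and cached), though it is not commutative.

```python
import math

# A pixel is (r, g, b, a) with premultiplied alpha: color channels already scaled by a.
Pixel = tuple[float, float, float, float]


def premultiply(r: float, g: float, b: float, a: float) -> Pixel:
    """Convert straight-alpha RGBA to premultiplied-alpha RGBA."""
    return (r * a, g * a, b * a, a)


def over(fg: Pixel, bg: Pixel) -> Pixel:
    """Porter-Duff 'over': the foreground plus whatever it lets through of the background."""
    k = 1.0 - fg[3]
    return tuple(f + k * b for f, b in zip(fg, bg))


def close(p: Pixel, q: Pixel) -> bool:
    """Compare two pixels up to floating-point rounding."""
    return all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(p, q))


top = premultiply(1.0, 0.0, 0.0, 0.5)      # half-transparent red
middle = premultiply(0.0, 1.0, 0.0, 0.25)  # mostly transparent green
bottom = premultiply(0.0, 0.0, 1.0, 1.0)   # opaque blue

left = over(over(top, middle), bottom)     # blend the two upper layers first
right = over(top, over(middle, bottom))    # blend the two lower layers first

print("associative:", close(left, right))
print("commutative:", close(over(top, bottom), over(bottom, top)))
print("opaque foreground hides background:", over(bottom, top) == bottom)
```

Associativity is what makes caching a partial composite valid: `over(top, middle)` can be computed once and reused over any background.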

lordnacho 17 hours ago

Here's what LLMs remind me of.

When I went to uni, we had tutorials several times a week. Two students, one professor, going over whatever was being studied that week. The professor would ask insightful questions, and the students would try to answer.

Sometimes, I would answer a question correctly without actually understanding what I was saying. I would be spewing out something that I had read somewhere in the huge pile of books, and it would be a sentence, with certain special words in it, that the professor would accept as an answer.

But I would sometimes have this weird feeling of "hmm I actually don't get it" regardless. This is kinda what the tutorial is for, though. With a bit more prodding, the prof will ask something that you genuinely cannot produce a suitable word salad for, and you would be found out.

In math-type tutorials it would be things like realizing some equation was useful for finding an answer without having a clue about what the equation actually represented.

In economics tutorials it would be spewing out words about inflation or growth or some particular author but then having nothing to back up the intuition.

This is what I suspect LLMs do. They can often be very useful to someone who actually has the models in their minds, but not the data to hand. You may have forgotten the supporting evidence for some position, or you might have missed some piece of the argument due to imperfect memory. In these cases, an LLM is fantastic as it just glues together plausible related words for you to examine.

The wheels come off when you're not an expert. Everything it says will sound plausible. When you challenge it, it just apologizes and pretends to correct itself.

skeledrew 13 hours ago

Agree in general with most of the points, except

> but because I know you and I get by with less.

Actually we got far more data and training than any LLM. We've been gathering and processing sensory data every second at least since birth (more processing than gathering when asleep), and are only really considered fully intelligent in our late teens to mid-20s.

imenani 18 hours ago

As far as I can tell they don’t say which LLM they used which is kind of a shame as there is a huge range of capabilities even in newly released LLMs (e.g. reasoning vs not).

yosefk 18 hours ago

ChatGPT, Claude, Grok and Google AI Overviews, whatever powers the latter, were all used in one or more of these examples, in various configurations. I think they can perform differently, and I often try more than one when the 1st try doesn't work great. I don't think there's any fundamental difference in the principle of their operation, and I think there never will be - there will be another major breakthrough
- imenani 16 hours ago
  
  Each of these models has a thinking/reasoning variant and a default non-thinking variant. I would expect the reasoning variants (o3 or “GPT5 Thinking”, Gemini DeepThink, Claude with Extended Thinking, etc) to do better at this. I think there is also some chance that in their reasoning traces they may display something you might see as closer to world modelling. In particular, you might find them explicitly tracking positions of pieces and checking validity.
- red75prime 16 hours ago
  
  My hypothesis is that a model fails to switch into a deep thinking mode (if it has it) and blurts whatever it got from all the internet data during autoregressive training. I tested it with alpha-blending example. Gemini 2.5 flash - fails, Gemini 2.5 pro - succeeds.
  How does presence/absence of a world model, er, blend into all this? I guess "having a consistent world model at all times" is an incorrect description of humans, too. We seem to have it because we have mechanisms to notice errors, correct errors, remember the results, and use the results when similar situations arise, while slowly updating intuitions about the world to incorporate changes.
  The current models lack "remember/use/update" parts.
- red75prime 16 hours ago
  
  > I don't think there's any fundamental difference in the principle of their operation
  Yeah, they seem to be subject to the universal approximation theorem (it needs to be checked more thoroughly, but I think we can build a transformer that is equivalent to any given fully-connected multilayered network).
  That is, at a certain size they can do anything a human can do at a certain point in their life (that is, with no additional training), regardless of whether humans have world models and what those models are on the neuronal level.
  But there are additional nuances that are related to their architectures and training regimes. And practical questions of the required size.
lowsong 17 hours ago

It doesn't matter. These limitations are fundamental to LLMs, so all of them that will ever be made suffer from these problems.

t0md4n 19 hours ago

https://arxiv.org/abs/2501.17186

yosefk 19 hours ago

This is interesting. The "professional level" rating of <1800 isn't, but still.
However:
"A significant Elo rating jump occurs when the model’s Legal Move accuracy reaches 99.8%. This increase is due to the reduction in errors after the model learns to generate legal moves, reinforcing that continuous error correction and learning the correct moves significantly improve ELO"
You should be able to reach move legality of around 100% with few resources spent on it. Failing to do so means that it has not learned a model of what chess is, at some basic level. There is virtually no challenge in making legal moves.
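As a rough illustration of why the last fraction of a percent matters, here is a back-of-the-envelope sketch of how per-move legality compounds over a game. The 99.8% figure is the one quoted above; the 40-move game length and the assumption that each move is an independent trial are simplifications introduced here, not claims from the paper.

```python
def p_illegal_in_game(per_move_accuracy: float, moves: int) -> float:
    """Chance of at least one illegal move, assuming independent moves."""
    return 1.0 - per_move_accuracy ** moves


# 40 moves per game is a hypothetical round number, not a figure from the paper.
for accuracy in (0.99, 0.998, 0.9999):
    risk = p_illegal_in_game(accuracy, moves=40)
    print(f"legal-move accuracy {accuracy:.2%}: "
          f"{risk:.1%} of 40-move games contain an illegal move")
```

Under these assumptions, 99.8% per-move legality still leaves roughly 8% of 40-move games with at least one illegal move.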

og_kalu 18 hours ago

Yes LLMs can play chess and yes they can model it fine

https://arxiv.org/pdf/2403.15498v2

deadbabe 18 hours ago

Don’t: use LLMs to play chess against you

Do: use LLMs to talk shit to you while a real chess AI plays chess against you.

The above applies to a lot of things besides chess, and illustrates a proper application of LLMs.

rishi_devan 18 hours ago

Haha. I enjoyed that Soviet-era joke at the end.

svantana 18 hours ago

Yes, I hadn't heard that before. It's similar in spirit to this norwegian folk tale about a deaf man guessing what someone is saying to him:
https://en.wikipedia.org/wiki/%22Good_day,_fellow!%22_%22Axe...
- kgwgk 17 hours ago
  
  Another similar story:
  King Frederick the Great of Prussia had a very fine army, and none of the soldiers in it were finer than the Giant Guards, who were all extremely tall men. It was difficult to find enough soldiers for these Guards, as there were not many men who were tall enough.
  Frederick had made it a rule that no soldiers who did not speak German could be admitted to the Giant Guards, and this made the work of the officers who had to find men for them even more difficult. When they had to choose between accepting or refusing a really tall man who knew no German, the officers used to accept him, and then teach him enough German to be able to answer if the King questioned him.
  Frederick, sometimes, used to visit the men who were on guard around his castle at night to see that they were doing their job properly, and it was his habit to ask each new one that he saw three questions: “How old are you?” “How long have you been in my army?” and “Are you satisfied with your food and your conditions?”
  The officers of the Giant Guards therefore used to teach new soldiers who did not know German the answers to these three questions.
  One day, however, the King asked a new soldier the questions in a different order. He began with, “How long have you been in my army?” The young soldier immediately answered, “Twenty-two years, Your Majesty”. Frederick was very surprised. “How old are you then?”, he asked the soldier. “Six months, Your Majesty”, came the answer. At this Frederick became angry, “Am I a fool, or are you one?” he asked. “Both, Your Majesty”, the soldier answered politely.
  https://archive.org/details/advancedstoriesf0000hill

GaggiX 18 hours ago

https://www.youtube.com/watch?v=LtG0ACIbmHw

SOTA LLMs do play legal moves in chess; I don't know why the article seems to say otherwise.

tickettotranai 17 hours ago

Technically yes, but... it's moderately tricky to get an LLM to play good chess even though it can.
https://dynomight.net/more-chess/
This is significant in general because I personally would love to get these things to code-switch into "hackernews poster" or "writer for the Economist" or "academic philosopher", but I think the "chat" format makes it impossible. The inaccessibility of this makes me want to host my own LLM...

Razengan 16 hours ago

A slight tangent: I think/wonder if the one place where AIs could be really useful, might be in translating alien languages :)

As in, an alien could teach one of our AIs their language faster than an alien could teach a human, and vice versa..

..though the potential for catastrophic disasters is also great there lol