As much as I've agreed with the author's other posts/takes, I find myself resisting this one:
> I'll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people.
No, that does not follow.
1. Reviewing depends on what you know about the expertise (and trust) of the person writing it. Spending most of your day reviewing code written by familiar human co-workers is very different from the same time reviewing anonymous contributions.
2. Reviews are not just about the code's potential mechanics, but inferring and comparing the intent and approach of the writer. For LLMs, that ranges between non-existent and schizoid, and writing it yourself skips that cost.
3. Motivation is important; for some developers that means learning, understanding and creating. Not wanting to do code reviews all day doesn't mean you're bad at them. Also, reviewing an LLM's code has no social aspect.
However you do it, somebody else should still be reviewing the change afterwards.
> 2. Reviews are not just about the code's potential mechanics, but inferring and comparing the intent and approach of the writer. For LLMs, that ranges between non-existent and schizoid, and writing it yourself skips that cost.
With humans you can be reasonably sure they've followed through with a mostly consistent level of care and thought. LLMs will outright lie to make their job easier in one section while generating high-quality code in another.
I've had to do a 'git reset --hard' after trying out Claude Code and spending $20. It always seems great at first, but it just becomes nonsense on larger changes. Maybe chain-of-thought models do better, though.
You can see the patterns a.k.a. "code smells"[0] in code 20x faster than you can write code yourself.
I can browse through any Java/C#/Go code and without actually reading every keyword see how it flows and if there's something "off" about how it's structured. And if I smell something I can dig down further and see what's cooking.
If your chosen language is difficult/slow to read, then it's on you.
And stuff should have unit tests with decent coverage anyway, those should be even easier for a human to check, even if the LLM wrote them too.
My fear is that LLM generated code will look great to me, I won't understand it fully but it will work. But since I didn't author it, I wouldn't be great at finding bugs in it or logical flaws. Especially if you consider coding as piecing together things instead of implementing a well designed plan. Lots of pieces making up the whole picture but a lot of those pieces are now put there by an algorithm making educated guesses.
Perhaps I'm just not that great of a coder, but I do have lots of code where if someone took a look at it, it might look crazy but it really is the best solution I could find. I'm concerned LLMs won't do that, they won't take risks a human would or understand the implications of a block of code beyond its application in that specific context.
Other times, I feel like I'm pretty good at figuring out things and struggling in a time-efficient manner before arriving at a solution. LLM generated code is neat but I still have to spend similar amounts of time, except now I'm doing more QA and clean up work instead of debugging and figuring out new solutions, which isn't fun at all.
Do you not review code from your peers? Do you not search online and try to grok code from StackOverflow or documentation examples?
All of these can vary wildly in quality. Maybe it's because I mostly use coding LLMs as either a research tool, or to write reasonably small and easy to follow chunks of code, but I find it no different than all of the other types of reading and understanding other people's code I already have to do.
- keep the outline in my head: I don't give up the architect's seat. I decide which module does what and how it fits in the whole system, its contract with other modules, etc.
- review the code: this can be construed as negating the point of LLMs as this is time consuming but I think it is important to go through line by line and understand every line. You will absorb some of the LLM generated code in the process which will form an imperfect map in your head. That's essential for beginning troubleshooting next time things go wrong.
- last mile connectivity: several times the LLM takes you there but can't complete the last mile connectivity; instead of wasting time chasing it, do the final wiring yourself. This is a great shortcut to achieve the previous point.
In my experience you just don't keep as good a map of the codebase in your head when you have LLMs write a large part of your codebase as when you write everything yourself. Having a really good map of the codebase in your head is what brings you large productivity boosts when maintaining the code. So while LLMs do give me a 20-30% productivity boost for the initial implementation, they bring huge disadvantages after that, and that's why I still mostly write code myself and use LLMs only as a stackoverflow alternative.
I feel like “looks like it’s written by AI” might become a critique of writing that’s very template-like, neutral, corporate. I don’t usually dislike it though, as long as the information is there.
Yes, I prefer using lists myself too; that does not mean my writing is being influenced by AI. I have liked bullet points since long before AI was even a thing - they make for better organization and visual clarity.
I think this is a great line:
> My fear is that LLM generated code will look great to me, I won't understand it fully but it will work
This is a degree of humility that makes the scenario we are in much clearer.
Our information environment got polluted by the lack of such humility. Rhetoric that sounds ‘right’ gets used everywhere. If it looks like an Oxford Don and sounds like an Oxford Don, then it must be an academic. Thus it is believable, even if they are saying the Titanic isn’t sinking.
Verification is the heart of everything humanity does, our governance structures, our judicial systems, economic systems, academia, news, media - everything.
It’s a massive computational effort to figure out the best ways to allocate resources given current information, allowing humans to create surplus and survive.
This is why we dislike monopolies, or manipulations of these markets - they create bad goods, and screw up our ability to verify what is real.
Worst part is that the patterns of implementation won't be consistent across the pieces. So debugging a whole codebase authored with LLM-generated code is like having to debug a codebase where every function was written by a different developer and no one followed any standards. I guess you can specify the coding standards in the prompt and ask it to use FP-style programming only, but I'm not sure how well it can follow that.
It is supposed to follow that instruction, though. When it generates code, I can tell it to use tabs, 2 spaces, etc. and the generated code will use that. It works well with Claude, at least.
> But since I didn't author it, I wouldn't be great at finding bugs in it or logical flaws.
Alas, I don't share your optimism about code I wrote myself. In fact, it's often harder to find flaws in my own code than when reading someone else's.
Especially if 'this is too complicated for me to review, please simplify' is allowed as a valid outcome of my review.
When it comes to relying on code that you didn't write yourself, like an npm package, do you care if it's AI code or human code? Do you think your trust toward AI code may change over time?
Of course I care. Human-written code was written for a purpose, with a set of constraints in mind, and other related code will have been written for the same or a complementary purpose and set of constraints. There is intention in the code. It is predictable in a certain way, and divergences from the expected are either because I don't fully understand something about the context or requirements, or because there's a damn good reason. It is worthwhile to dig further until I do understand, since it will very probably have repercussions elsewhere and elsewhen.
For AI code, that's a waste of time. The generated code will be based on an arbitrary patchwork of purposes and constraints, glued together well enough to function. I'm not saying it lacks purpose or constraints, it's just that those are inherited from random sources. The parts flow together with robotic but not human concern for consistency. It may incorporate brilliant solutions, but trying to infer intent or style or design philosophy is about as useful as doing handwriting analysis on a ransom note made from pasted-together newspaper clippings.
Both sorts of code have value. AI code may be well-commented. It may use features effectively that a human might have missed. Just don't try to anthropomorphize an AI coder or a lawnmower, you'll end up inventing an intent that doesn't exist.
Then you'll get code that passes the tests you generate, where "tests" includes whatever you feed the fuzzer to detect problems. (Just crashes? Timeouts? Comparison with a gold standard?)
Sorry, I'm failing to see your point.
Are you implying that the above is good enough, for a useful definition of good enough? I'm not disagreeing, and in fact that was my starting assumption in the message you're replying to.
Crap code can pass tests. Slow code can pass tests. Weird code can pass tests. Sometimes it's fine for code to be crap, slow, and/or weird. If that's your situation, then go ahead and use the code.
To expand on why someone might not want such code, think of your overall codebase as having a time budget, a complexity budget, a debuggability budget, an incoherence budget, and a maintenance budget. Yes, those overlap a bunch. A pile of AI-written code has a higher chance of exceeding some of those budgets than a human-written codebase would. Yes, there will be counterexamples. But humans will at least attempt to optimize for such things. AIs mostly won't. The AI-and-AI-using-human system will optimize for making it through your lint-fuzz-test cycle successfully and little else.
Different constraints, different outputs. Only you can decide whether the difference matters to you.
> Then you'll get code that passes the tests you generate
Just recently I think here on HN there was a discussion about how neural networks optimize towards the goal they are given, which in this case means exactly what you wrote, including that the code will do stuff in wrong ways just to pass the given tests.
Where do the tests come from? Initially from a specification of what "that thing" is supposed to do and also not supposed to do. Everyone who had to deal with specifications in a serious way knows how insanely difficult it is to get these right, because there are often things unsaid, there are corner cases not covered and so on. So the problem of correctness is just shifted, and the assumption that this may require less time than actually coding ... I wouldn't bet on it.
To fight this I mostly do ping-pong pairing with LLMs. After we discuss the general goal and approach, I usually write the first test. The LLM then makes it pass and writes the next test, which I'll make pass, and so on. It forces me to stay 100% in the loop and understand everything. Maybe it's not as fast as having the LLM write as much as possible, but I think it's a worthwhile tradeoff.
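For example, a hypothetical first test in that ping-pong style (the module and function under test are made up):

```python
# A hypothetical first test I might write before handing the keyboard over;
# the LLM's job is then to make it pass and propose the next test.
def test_slugify_lowercases_and_replaces_spaces():
    from myproject.text import slugify  # hypothetical module and function
    assert slugify("Hello World") == "hello-world"
```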
The big argument against it is, at some point, there’s a chance, that you won’t really need to understand what the code does. LLMs writes code, LLMs write tests, you find bugs, LLM fixes code, LLM adds test cases for the found bug. Rinse and repeat.
> My fear is that LLM generated code will look great to me, I won't understand it fully but it will work.
If you don’t understand it, ask the LLM to explain it. If you fail to get an explanation that clarifies things, write the code yourself. Don’t blindly accept code you don’t understand.
This is part of what the author was getting at when they said that it’s surfacing existing problems not introducing new ones. Have you been approving PRs from human developers without understanding them? You shouldn’t be doing that. If an LLM subsequently comes along and you accept its code without understanding it too, that’s not a new problem the LLM introduced.
Code reviews with a human are a two-way street. When I find code that is ambiguous, I can ask the developer to clarify and either explain their justification or fix it before the code is approved. I don’t have to write it myself, and if the developer is simply talking in circles then I’d be able to escalate or reject; this is a far less likely failure case with a real, trusted human than with an LLM. “Write the code yourself” at that point is not viable for any non-trivial team project, as people have their own contexts to maintain and commitments/projects to deliver. It’s not the typing of the code that is the hard part (typing fast is the only real benefit LLMs offer); it’s fully understanding the problem space. Working with another trusted human is far, far different from working with an LLM.
No one takes the time to fully understand all the PRs they approve. And even when you do take the time to “fully understand” the code, it’s very easy for your brain to trick you into believing you understand it.
At least when a human wrote it, someone understood the reasoning.
It happens all the time. Way before LLM. There were countless times I implemented an algorithm from a paper or a book while not fully understanding it (in other words, I can't prove the correctness or time complexity without referencing the original paper).
> if you don't understand it fully, how can you say that it will look great to you, and that it will work?
Presumably, that simply reflects that a primary developer always has an advantage of having a more reliable understanding of a large code base - and the insights into the problem that come about during development challenges - than a reviewer of such code.
A lot of important but subtle insights into a problem, many of them sub-verbal, come from going through the large and small challenges of creating something that solves it. Reviewers just don't get those insights as reliably.
Reviewers can't see all the subtle or non-obvious alternate paths or choices. They are less likely to independently identify subtle traps.
> Just because code looks good and runs without errors doesn’t mean it’s actually doing the right thing. No amount of meticulous code review—or even comprehensive automated tests—will demonstrably prove that code actually does the right thing. You have to run it yourself!
I would have stated this a bit differently: No amount of running or testing can prove the code correct. You actually have to reason through it. Running/testing is merely a sanity/spot check of your reasoning.
I’m not sure it’s possible to have the full reasoning in your head without authoring the code yourself - or, spending a comparable amount of effort to mentally rewrite it.
I tend to agree, which is why I’m skeptical about large-scale LLM code generation, until AIs exhibit reliable diligence and more general attention and awareness, and probably also long-term memory about a code base and its application domain.
Which is why everyone is so keen on standards (conventions, formatting, architecture, ...): it is less of a burden when you're just comparing expected to actual than when you're learning unknowns.
100%. Case in point for case in point - I was just scratching my head over some Claude-produced lines for me, thinking if I should ask what this kind entity had in mind when using specific compiler builtins (vs. <stdatomic.h>), like, "is there logic to your madness..." :D
Human reason is fine; the problem is that human attention spans aren't great at checking for correctness. I want every corner case regression-tested automatically, because there's always going to be some weird configuration that a human is going to forget to regression test.
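As a hedged sketch of what that looks like in practice with pytest (the function under test is hypothetical):

```python
import pytest

from myproject.text import normalize_whitespace  # hypothetical function under test

# Each corner case gets pinned down once and re-checked on every run,
# so nobody has to remember the weird configurations by hand.
@pytest.mark.parametrize("raw, expected", [
    ("", ""),                           # empty input
    ("   ", ""),                        # whitespace only
    ("a\r\nb", "a b"),                  # Windows line endings
    ("héllo  wörld", "héllo wörld"),    # non-ASCII plus doubled spaces
])
def test_normalize_whitespace_corner_cases(raw, expected):
    assert normalize_whitespace(raw) == expected
```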
Seems to be a bit of a catch 22. No LLM can write perfect code, and no test suite can catch all bugs. Obviously, no human can write perfect code either.
If LLM-generated code has been "reasoned-through," tested, and it does the job, I think that's a net-benefit compared to human-only generated code.
>I think that's a net-benefit compared to human-only generated code.
Net-benefit in what terms though? More productive WRT raw code output? Lower error rate?
Because, something about the idea of generating tons of code via LLMs, which humans have to then verify, seems less productive to me and more error-prone.
I mean, when verifying code that you didn't write, you generally have to fully reason through it, just as you would to write it (if you really want to verify it). But, reasoning through someone else's code requires an extra step to latch on to the author's line of reasoning.
OTOH, if you just breeze through it because it looks correct, you're likely to miss errors.
The latter reminds me of the whole "Full self-driving, but keep your hands on the steering wheel, just in case" setup. It's going to lull you into overconfidence and passivity.
> "Full self-driving, but keep your hands on the steering wheel, just in case" setup
This is actually a trick, though. No one working on self-driving expects people to babysit it for long at all. Babysitting actually feels worse than driving. I just saw a video on self-driving trucks in which the human driver had his hands hovering over the wheel. The goal of the video is to make you think about how amazing self-driving rigs will be, but all I could think about was what an absolutely horrible job it will be to babysit these things.
Working full-time on AI code reviews sounds even worse. Maybe if it's more of a conversation and you're collaboratively iterating on small chunks of code then it wouldn't be so bad. In reality though, we'll just end up trusting the AI because it'll save us a ton of money and we'll find a way to externalize the screw ups.
Also, after a reasonable period, if you are stuck you can actually ask them what they were thinking, why it was written that way, and what constraints they had in mind.
And you can discuss these, with both of you hopefully having experience in the domain.
Exactly. And, if correction is required, then you either re-write it or you're stuck maintaining whatever odd way the LLM approached the problem, whether it's as optimal (or readable) as a human's or not.
If the complete test suite were enough, then SQLite, who famously has one of the largest and most comprehensive, would not encounter bugs. However, they still do.
If you employ AI, you're adding a remarkable amount of speed to a processing domain that is undecidable because the space of inputs is not finite. Eventually you will end up reconsidering the Gambler's Fallacy, because of the chances of things going wrong.
Last week, The Primeagen and Casey Muratori carefully reviewed the output of a state-of-the-art LLM code generator.
They provided a task well-represented in the LLM's training data, so development should have been easy. The task was presented as a cumulative series of modifications to a codebase.
This is the actual reality of LLM code generators in practice: iterative development converging on useless code, with the LLM increasingly unable to make progress.
In my own experience, I have all sorts of ways that I try to 'drag' the llm out of some line of 'thinking' by editing the conversation as a whole, or just restarting the whole prompt, and I've been kind of just doing this over time since GPT3.
While I still think all this code generation is super cool, I've found that the 'density' of the code makes it even more noticeable - and often annoying - when the model latches onto, say, some part of the conversation that should essentially be pruned from the whole thinking process, or pursues some part of earlier code that makes no sense to me, and I have to start 'coaxing' it again.
I don't agree. What if the LLM takes a two-step approach, where it first determines a global architecture, and then it fills in the code? (Where it hallucinates in the first step).
Hallucinations themselves are not even the greatest risk posed by LLMs. A much greater risk (in simple terms of probability times severity) I'd say is that chat bots can talk humans into harming themselves or others. Both of which have already happened, btw [0,1]. Still not sure if I'd call that the greatest overall risk, but my ideas for what could be even more dangerous I don't even want to share here.
I don't know if the model changed in the last six months, or maybe the wow factor has worn off a bit, but it also feels like ChatGPT has become a lot more "people-pleasy" than it was before.
I'll ask it opinionated questions, and it will just do stuff to reaffirm what I said, even when I give contrary opinions in the same chat.
I personally find it annoying (I don't really get along with human people pleasers either), but I could see someone using it as a tool to justify doing bad stuff, including self-harm; it doesn't really ever push back on what I say.
Yeah, I think it's coded to be super-conciliatory as some sort of apology for its hallucinations, but I find it annoying as well. Part of it is just like all automated prompts that try to be too human. When you know it's not human, it's almost patronizing and just annoying.
But, it's actually worse, because it's generally apologizing for something completely wrong that it told you just moments before with extreme confidence.
It's obvious, isn't it? The average Hacker News user, who has converged to the average Internet user, wants exactly that experience. LLMs are pretty good tools but perhaps they shouldn't be made available to others. People like me can use them but others seem to be killed when making contact. I think it's fine to restrict access to the elite. We don't let just anyone fly a fighter jet. Perhaps the average HN user should be protected from LLM interactions.
Is that really what you got from what I wrote? I wasn't suggesting that we restrict access to anyone, and I wasn't trying to imply that I'm somehow immune to the problems that were highlighted.
I mentioned that I don't like people-pleasers and I find it a bit obnoxious when ChatGPT does it. I'm sure that there might be other bits of subtle encouragement it gives me that I don't notice, but I can't elaborate on those parts because, you know, I didn't notice them.
I genuinely do not know how you got "we should restrict access" from my comment or the parent, you just extrapolated to make a pretty stupid joke.
It looked like you were being sarcastic, implying I was trying to suggest that I thought I was better than the average person in regards to handling AI. Particularly this line:
> People like me can use them but others seem to be killed when making contact.
Yeah, no, 100% sincere personal view. That guy who killed himself after using it is obviously not ready for this. Imagine killing yourself after typing in `print("Kill yourself")` at the Python REPL. The guy's an imbecile. We don't let just anyone drive a truck. I'm fine with nearly everyone being on the outside and unable to use these tools so long as I'm allowed to with as little trouble as possible.
I recognize that the view that others should not be permitted things that I should be allowed to use is generally a sarcastically expressed view, but I genuinely think it has merit. Everyone who believes these things are dangerous and everyone to whom this is obviously dangerous, like the aforementioned mentally deficient individual, shouldn't be permitted use.
More generally - AI that is good at convincing people is very powerful, and powerful things are dangerous.
I'm increasingly coming around to the notion that AI tooling should have safety features concerned with not directly exposing humans to asymptotically increasing levels of 'convincingness' in generated output. Something like a weaker model used as a buffer.
Projecting out to 5-10 years: what happens when LLMs are still producing hallucinatory semi-sense, but merely comprehending it makes the machine temporarily own you? A bit like getting hair caught in an angle grinder, that.
Like most safety regulations, it'll take blood for the inking. Exposing mass numbers of people to these models strikes me as wildly negligent if we expect continued improvement along this axis.
>Projecting out to 5-10 years: what happens when LLMs are still producing hallucinatory semi-sense, but merely comprehending it makes the machine temporarily own you? A bit like getting hair caught in an angle grinder, that.
Seriously? Do you suppose that it will pull this trick off through some sort of hypnotizing magic perhaps? I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.
The kinds of people who would be convinced by such "dangers" are likely to be mentally unstable or suggestible enough about it to in any case be convinced by any number of human beings anyhow.
Aside from demonstrating the persistent AI woo that permeates many comments on this site, the logic above reminds me of the harping nonsense around the supposed dangers of video games or certain violent movies "making kids do bad things", in years past. The prohibitionist nanny tendencies behind such fears are more dangerous than any silly chatbot AI.
If you believe current models exist at the limit of possible persuasiveness, there obviously isn't any cause for concern.
For various reasons, I don't believe that, which is why my argument is predicated on them improving over time. Obviously current models aren't overly hazardous in the sense I posit - it's a concern for future models that are stronger, or explicitly trained to be more engaging and/or convincing.
The load bearing element is the answer to: "are models becoming more convincing over time?" not "are they very convincing now?"
> [..] I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot [..]
Then you're not engaging with the premise at all, and are attacking a point I haven't made. The tautological assurance that non-convincing AI is not convincing is not relevant to a concern predicated on the eventual existence of highly convincing AI: that sufficiently convincing AI is hazardous due to induced loss of control, and that as capabilities increase the loss of control becomes more difficult to resist.
Yeah...this. I'm not so concerned that AI is going to put me out of a job or become Skynet. I'm concerned that people are offloading decision making and critical thinking to the AI, accepting its response at face value and responding to concerns with "the AI said so...must be right". Companies are already maliciously exploiting this (e.g. the AI has denied your medical claim, and we can't tell you how it decided that because our models are trade secrets), but it will soon become de rigueur and people will think you're weird for questioning the wisdom of the AI.
The combination of blind faith in AI on one end, and good-faith, highly informed understanding and agreement achieved with the help of AI on the other, covers the full spectrum of the problem.
In both of your linked examples, the people in question very likely had at least some sort of mental instability working in their minds.
I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.
The kinds of people who would be convinced by such "harm dangers" are likely to be mentally unstable or suggestible enough about it to in any case be convinced by any number of human beings, or by books, or movies or any other sort of excuse for a mind that had problems well before seeing X or Y.
By the logic of regulating AI for these supposed dangers, you could argue that literature, movie content, comic books, YouTube videos and that much loved boogeyman in previous years of violent video games should all be banned or regulated for the content they express.
Such notions have a strongly nannyish, prohibitionist streak that's much more dangerous than some algorithm and the bullshit it spews to a few suggestible individuals.
The media of course loves such narratives, because breathless hysteria and contrived fear-mongering play right into more eyeballs. Seeing people again take such nonsense seriously, after idiocies like the media frenzy around video games in the early 2000s and, before that, similar media fits about violent movies and even literature, is sort of sad.
We don't need our tools for expression, and sources of information "regulated for harm" because a small minority of others can't get an easy grip on their psychological state.
> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.
This seems like a very flawed assumption to me. My take is that people look at hallucinations and say "wow, if it can't even get the easiest things consistently right, no way am I going to trust it with harder things".
You'd be surprised. I know a few people who couldn't really code before LLMs, but now with LLMs they can just brute-force through problems. They seem pretty undeterred about 'trusting' the solution: if they ran it and it worked for them, it gets shipped.
The backlash will be enormous. In the near future, there will be less competent coders and a tsunami of bad code to fix. If 2020 was annoying to hiring managers they have no idea how bad it will become.
Of course this will be the case, but probably not for the reasons you are concerned about. It is because a lot of people have been enabled by these tools to realize they are able to do things they thought were beyond them.
The opaque wall that separates the solution from the problem in technology often comes from the very steep initial learning curve. The reason most people who are developers now learned to code is because they had free time when they were young, had access to the technology, and were motivated to do it.
But as an adult, very few people are able to get past the first obstacles which keep them from eventually becoming proficient, but now they have a cheat code. So you will see a lot more capable programmers in the future who will be able to help you fix this backlog of bad code -- we just have to wait for them to gain the experience and knowledge needed before that happens and deal with the mistakes along the way.
This is no different from any other enabling technology. The people who feel like they had to struggle through it and pay their dues when it 'wasn't easy' are going to be resentful and try and gatekeep; it is only human nature.
> you have to put a lot of work in to learn how to get good results out of these systems
That certainly punctures the hype. What are LLMs good for, if the best you can hope for is to spend years learning to prompt it for unreliable results?
Many tools that increase your productivity as a developer take a while to master. For example, it takes a while to become proficient with a debugger, but I'd still wager that it's worth it to learn to use a debugger over just relying on print debugging.
An anecdote: I was working for a medical centre, and had some code that was supposed to find the 'main' clinic of a patient.
The specification was to only look at clinical appointments, and find the most recent appointment. However if the patient didn't have a clinical appointment, it was supposed to find the most recent appointment of any sort.
I wrote the code by sorting the data (first by clinical-non-clinical and then by date). I asked chatgpt to document it. It misunderstood the code and got the sorting backwards.
I was pretty surprised, and after testing with foo-bar examples eventually realised that I had called the clinical-non-clinical column "Clinical", which confused the LLM.
This is the kind of mistake that is a lot worse than "code doesn't run" - being seemingly right but wrong is much worse than being obviously wrong.
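For concreteness, a minimal sketch of the sort described above (the field names and data shape are hypothetical, not the real schema):

```python
from datetime import date

appointments = [  # hypothetical shape of the data
    {"Clinical": False, "date": date(2024, 3, 1)},
    {"Clinical": True,  "date": date(2023, 11, 5)},
    {"Clinical": True,  "date": date(2024, 1, 20)},
]

# Prefer clinical appointments, then the most recent date; a patient with no
# clinical appointments falls back to their most recent appointment of any kind.
# Reading this ordering "backwards" is roughly the mistake the LLM made.
main_appointment = max(appointments, key=lambda a: (a["Clinical"], a["date"]))
assert main_appointment["date"] == date(2024, 1, 20)
```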
> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.
If you’re writing code in Python against well documented APIs, sure. But it’s an issue for less popular languages and frameworks, when you can’t immediately tell if the missing method is your fault due to a missing dependency, version issue, etc.
IMX, quite a few Python users - including ones who think they know what they're doing - run into that same confusion, because they haven't properly understood fundamentals e.g. about how virtual environments work, or how to read documentation effectively. Or sometimes just because they've been careless and don't have good habits for ensuring (or troubleshooting) the environment.
If the hallucinated code doesn't compile (or in an interpreted language, immediately throws exceptions), then yes, that isn't risky because that code won't be used. I'm more concerned about code that appears to work for some test cases but solves the wrong problem or inadequately solves the problem, and whether we have anyone on the team who can maintain that code long-term or document it well enough so others can.
I once submitted some code for review, in which the AI had inserted a recursive call to the same function being defined. The recursive call was completely unnecessary and completely nonsensical, but also not wrong per se - it just caused the function to repeat what it was doing. The code typechecked, the tests passed, and the line of code was easy to miss while doing a cursory read through the logic. I missed it, the code reviewer missed it, and eventually it merged to production.
Unfortunately there was one particular edge case which caused that recursive call to become an infinite loop, and I was extremely embarrassed seeing that "stack overflow" server error alert come through Slack afterward.
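To illustrate the pattern, here is a hypothetical reconstruction in Python (not the actual code from that PR):

```python
import random

def fetch(url: str):
    """Stand-in for a network call that occasionally fails (hypothetical)."""
    return b"ok" if random.random() < 0.9 else None

def retry_fetch(url: str, attempts: int = 3):
    for _ in range(attempts):
        data = fetch(url)
        if data is not None:
            return data
    # The kind of line described above: an unnecessary recursive call that just
    # repeats what the function was already doing. It looks harmless, the tests
    # pass, and it is easy to skim past in review. But for an input where
    # fetch() fails persistently, the recursion never terminates and the stack
    # eventually overflows.
    return retry_fetch(url, attempts)
```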
fwiw this problem already exists with my more junior co-workers. and also my own code that I write when exhausted!
if you have trusted processes for review and aren't always rushing out changes without triple checking your work (plus a review from another set of eyes), then I think you catch a lot of the subtler bugs that are emitted from an LLM.
I use ChatGPT to generate code a lot, and it's certainly useful, but it has given me issues that are not obvious.
For example, I had it generate some C code to be used with ZeroMQ a few months ago. The code looked absolutely fine, and it mostly worked fine, but it made a mistake with its memory allocation stuff that caused it to segfault sometimes, and corrupt memory other times.
Fortunately, this was such a small project and I already know how to write code, so it wasn't too hard for me to find and fix, though I am slightly concerned that some people are copypasting large swaths of code from ChatGPT that looks mostly fine but hides subtle bugs.
>though I am slightly concerned that some people are copypasting large swaths of code from ChatGPT that looks mostly fine but hides subtle bugs.
They used to do the same with Stack Overflow. But now it's more dangerous, because the code can be "subtly wrong in ways the user can't fathom" to order.
Sure, it's possible that the code it gave me was based on some incorrectly written code it scraped from Gitlab or something.
I'm not a luddite, I'm perfectly fine with people using AI for writing code. The only thing that really concerns me is that it has the potential to generate a ton of shitty code that doesn't look shitty, creating a lot of surface area for debugging.
Prior to AI, the quantity of crappy code that could be generated was basically limited by the speed in which a human could write it, but now there's really no limit.
Again, just to reiterate, this isn't "old man yells at cloud". I think AI is pretty cool, I use it all the time, I don't even have a problem with people generating large quantities of code, it's just something we have to be a bit more wary of.
I am not a programmer and I don't use Linux. I've been working on a Python script for a Raspberry Pi for a few months. ChatGPT has been really helpful in showing me how to do things or debug errors.
Now I am at the point that I am cleaning up the code and making it pretty. My script is less than 300 lines, and ChatGPT regularly just leaves out whole chunks of the script when it suggests improvements. The first couple of times this led to tons of head scratching over why some small change to make one thing more resilient would make something totally unrelated break.
Now I've learned to take ChatGPT's changes and diff them with the working version before I try to run anything.
That's not quite right. The models are pretty bad at generating a proper diff, so there are two common formats used. The main one is a search and replace, and the search is then done in quite a fuzzy manner.
To be clear, the diff they generate is something you or I could apply manually and wouldn't notice an issue with. It's things like very minor whitespace issues or, more commonly, the counts saying how large the sections are - nothing that affects the meat of the diff. They're fine with the hard part, but then there are small counting errors.
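As a rough sketch of what that fuzzy matching can look like (my own simplification, not any particular tool's implementation):

```python
import re

def apply_fuzzy_edit(source: str, search: str, replace: str) -> str:
    """Apply a model-proposed search/replace block, tolerating whitespace drift."""
    # Match on tokens rather than exact text, so minor indentation or
    # trailing-whitespace differences in the model's "search" block still
    # line up with the real file.
    pattern = re.compile(r"\s+".join(re.escape(tok) for tok in search.split()))
    match = pattern.search(source)
    if match is None:
        raise ValueError("search block not found, even with fuzzy matching")
    return source[: match.start()] + replace + source[match.end() :]
```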
> The moment you run LLM generated code, any hallucinated methods will be instantly obvious: you’ll get an error. You can fix that yourself or you can feed the error back into the LLM and watch it correct itself.
Interestingly though, this only works if there is an error. There are cases where you will not get an error; consider a loosely typed programming language like JS or Python, or simply any programming language when some of the API interface is unstructured, like using stringly-typed information (e.g. Go struct tags.) In some cases, this will just silently do nothing. In other cases, it might blow up at runtime, but that does still require you to hit the code path to trigger it, and maybe you don't have 100% test coverage.
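A tiny Python illustration of the "silently does nothing" case (all names here are made up):

```python
def connect(host: str, port: int = 5432, **extra):
    """Toy connection helper (hypothetical); unknown options are silently ignored."""
    timeout = extra.get("timeout", 30)
    print(f"connecting to {host}:{port} with timeout={timeout}s")

# A hallucinated option name raises no error and has no effect: the intent was
# `timeout=5`, the code "works", and nothing fails unless a test asserts on the
# effective timeout.
connect("db.example.internal", connect_timeout=5)
```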
So I'd argue hallucinations are not always safe, either. The scariest thing about LLMs in my mind is just the fact that they have completely different failure modes from humans, making it much harder to reason about exactly how "competent" they are: even humans are extremely difficult to compare with regards to competency, but when you throw in the alien behavior of LLMs, there's just no sense of it.
And btw, it is not true that feeding an error into an LLM will always result in it correcting the error. I've been using LLMs experimentally, even trying to guide them towards solving problems I know how to solve, and sometimes they simply can't, and will just make a bigger and bigger mess. Due to the way LLMs confidently pretend to know the exact answer ahead of time, presumably because of the way they're trained, they will confidently do things that would make more sense to try and then undo when they don't work, like messing with the linker order or adding dependencies to a target to fix undefined reference errors (which are actually caused by e.g. ABI issues).

I still think LLMs are a useful programming tool, but we could use a bit more reality. If LLMs were as good as people sometimes imply, I'd expect an explosion in quality software to show up. (There are exceptions, of course. I believe the first versions of Stirling PDF were GPT-generated, not so long ago.) I mean, machine-generated illustrations have flooded the Internet despite their shortcomings, but programming with AI assistance remains tricky and not yet the force multiplier it is often made out to be. I do not believe AI-assisted coding has hit its Stable Diffusion moment, if you will.
Now whether it will or not, is another story. Seems like the odds aren't that bad, but I do question if the architectures we have today are really the ones that'll take us there. Either way, if it happens, I'll see you all at the unemployment line.
> The moment you run LLM generated code, any hallucinated methods will be instantly obvious: you’ll get an error. You can fix that yourself or you can feed the error back into the LLM and watch it correct itself.
But that's for methods. For libraries, the scenario is different, and possibly a lot more dangerous. For example, the LLM generates code that imports a library that does not exist. An attacker notices this too while running tests against the LLM. The attacker decides to create these libraries on the public package registry and injects malware. A developer may think: "oh, this newly generated code relies on an external library, I will just install it," and gets owned, possibly without even knowing for a long time (as is the case with many supply chain attacks).
And no, I'm not looking for a way to dismiss the technology, I use LLMs all the time myself. But what I do think is that we might need something like a layer in between the code generation and the user that will catch things like this (or something like Copilot might integrate safety measures against this sort of thing).
Prompt injection means that unless people using LLMs to generate code are willing to hunt down and inspect all dependencies, it will become extremely easy to spread malware.
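As one hedged sketch of the kind of in-between layer mentioned above: a pre-install check that flags imports not already on the project's vetted list (the allowlist is hypothetical, and stdlib modules would still need filtering):

```python
import ast
import sys

ALLOWED = {"requests", "numpy", "flask"}  # hypothetical: the project's vetted dependencies

def unvetted_imports(path: str) -> set[str]:
    """Return top-level imports in a file that aren't on the allowlist."""
    with open(path) as f:
        tree = ast.parse(f.read())
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - ALLOWED

if __name__ == "__main__":
    for name in sorted(unvetted_imports(sys.argv[1])):
        print(f"unvetted import: {name}")
```

It doesn't solve the problem, but it at least forces a human decision before a hallucinated package name turns into an installed dependency.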
> Chose boring technology. I genuinely find myself picking libraries that have been around for a while partly because that way it’s much more likely that LLMs will be able to use them.
This is an appeal against innovation.
> I’ll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
As someone who has spent [an incredible amount of time reviewing other people's code](https://github.com/ziglang/zig/pulls?q=is%3Apr+is%3Aclosed), my perspective is that reviewing code is fundamentally slower than writing it oneself. The purpose of reviewing code is mentorship, investing in the community, and building trust, so that those reviewees can become autonomous and eventually help out with reviewing.
You get none of that from reviewing code generated by an LLM.
Even with boring tech that's been in the training set for ages (rails), you can get some pretty funny hallucinations: https://bengarcia.dev/making-o1-o3-and-sonnet-3-7-hallucinat... (fortunately this one was the very non-dangerous kind, making it very obvious; though I wonder how many non-obvious hallucinations entered the training set by the same process)
One thing I've found is that while I work with an LLM and it can do things way faster than me, the other side of it is that I'm quickly losing understanding of the deeper code.
If someone asks me a question about something I've worked on, I might be able to give an answer about some deep functionality.
At the moment I'm working with an LLM on a 3D game, and while it works, I would need to rebuild it to understand all the elements of it.
For me this is my biggest fear - not that LLMs can code, but that they do so at such a volume that in a generation or two no one will understand how the code works.
> With code you get a powerful form of fact checking for free. Run the code, see if it works.
Um. No.
This is an oversimplification that falls apart in anything beyond a minimally complex system.
Over my career I’ve encountered plenty of consequences caused by reliability issues: code that would run, but where the side effects of not processing something, processing it too slowly, or processing it twice had serious consequences - financial and personal ones.
And those weren’t „nuclear power plant management” kind of critical. I often reminisce about an educational game that was used at school, where losing a single save’s progress meant a couple thousand dollars of reimbursement.
This is a cheatsheet I made for my colleagues. It is the kind of thing we need to keep in mind when designing the system I’m working on. Rarely does any LLM think about it. It’s not a popular sort of engineering by any measure, but it’s there.
As of today, I’ve yet to see a single instance where ChatGPT-produced code actually would save me time. I’ve seen macro-generation code recommended for Go (Go doesn’t have macros), object mutations for Elixir (Elixir doesn’t have objects, only immutable structs), list splicing in Fennel (Fennel doesn’t have splicing), a language feature pragma ported from another language, and a pure byte representation of memory in Rust where the code used UTF-8 string parsing to do it. My trust toward any non-ephemeral generated code is sub-zero.
It’s exhausting and annoying. It feels like interacting with Calvin’s (of Calvin and Hobbes) dad but with all the humor taken away.
The idea is correct: a lot of people (including myself sometimes) just let an "agent" run and do some stuff and then check later if it finished. This is obviously more dangerous than just the LLM hallucinating functions, since at least you can catch the latter, while the former depends on the project's tests or your reviewing skills.
The real problem with hallucination is that we started using LLMs as search engines, so when it invents a function, you have to go and actually search the API on a real search engine.
>The real problem with hallucination is that we started using LLMs as search engines, so when it invents a function, you have to go and actually search the API on a real search engine.
That still seems useful when you don't already know enough to come up with good search terms.
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
Even if one is very good at code review, I'd assume the vast majority of people would still end up with pretty different kinds of bugs they are better at finding while writing vs reviewing. Writing code and having it reviewed by a human gets both classes, whereas reviewing LLM code gets just one half of that. (maybe this can be compensated-ish by LLM code review, maybe not)
And I'd be wary of equating reviewing human vs LLM code; sure, the explicit goal of LLMs is to produce human-like text, but they also have prompting to request being "correct" over being "average human" so they shouldn't actually "intentionally" reproduce human-like bugs from training data, resulting in the main source of bugs being model limitations, thus likely producing a bug type distribution potentially very different to that of humans.
Timely article. I really, really want AI to be better at writing code, and hundreds of reports suggest it works great if you're a web dev or a python dev. Great! But I'm a C/C++ systems guy(working at a company making money off AI!) and the times I've tried to get AI to write the simplest of test applications against a popular API it mostly failed. The code was incorrect, both using the API incorrectly and writing invalid C++. Attempts to reason with the LLMs(grokv3, deepseek-r1) led further and further away from valid code. Eventually both systems stopped responding.
I've also tried Cursor with similar mixed results.
But I'll say that we are getting tremendous pressure at work to use AI to write code. I've discussed it with fellow engineers and we're of the opinion that the managerial desire is so great that we are better off keeping our heads down and reporting success vs saying the emperor wears no clothes.
It really feels like the billionaire class has fully drunk the kool-aid and needs AI to live up to the hype.
However, this 'let's move past hallucinations' discourse is just disingenuous.
The OP is conflating hallucinations, which are a real, undisputed failure mode of LLMs that no one has any solution for...
...with people not spending enough time and effort learning to use the tools.
I don't like it. It feels bad. It feels like a rage bait piece, cast out of frustration that the OP doesn't have an answer for hallucinations, because there isn't one.
> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.
People aren't stupid.
If they use a tool and it sucks, they'll stop using it and say "this sucks".
If people are saying "this sucks" about AI, it's because the LLM tool they're using sucks, not because they're idiots, or there's a grand 'anti-AI' conspiracy.
People are lazy; if the tool is good (eg. cursor), people will use it.
If they use it, and the first thing it does is hallucinate some BS (eg. intellij full line completion), then you'll get people uninstalling it and leaving reviews like "blah blah hallucination blah blah. This sucks".
Which is literally what is happening. Right. Now.
To be fair 'blah blah hallucinations suck' is a common 'anti-AI' trope that gets rolled out.
...but that's because it is a real problem
Pretending 'hallucinations are fine, people are the problem' is... it's just disingenuous and embarrassing from someone of this caliber.
All the criticism of LLMs’ code-writing ability ignores the fact that it’s still true that the majority of programmers can’t write FizzBuzz: http://imranontech.com/2007/01/24/using-fizzbuzz-to-find-dev...
This is the reason why LLMs will have a large impact on software jobs. Those who can write FizzBuzz or more need not be worried, but they are a small minority.
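For reference, the whole of FizzBuzz is about this much code (a plain Python version):

```python
# Print 1..100, replacing multiples of 3 with "Fizz", of 5 with "Buzz",
# and of both with "FizzBuzz".
for i in range(1, 101):
    if i % 15 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
```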
I don't care about the programmers who can't write FizzBuzz. Why should I? If I employed them, they were costing me money. If I worked with them, they were costing me time and hair follicles. I need them about as much as I need a buggy whip.
The linked article makes the claim that the majority of comp sci majors cannot write FizzBuzz. That's a bold assertion; how did the author sample such people? I suspect the sample pool was people applying for a position. There is a major selection bias there. First, people who fail many interviews will do more interviews than those who do not fail, so you'll start with a built-in bias towards the less competent (or more nervous).
Second, there is a large pile of money being given to people who make it over a somewhat arbitrary bar. As a random person, why would I not try to jump over the bar, even if I'm not particularly good at jumping? There are a lot of such bars with a lot of such large piles of money behind them. If getting a chance at jumping over those bars requires me to get a particular piece of paper with a particular title printed at the top of it, I'll be motivated to get that piece of paper too.
> Second, there is a large pile of money being given to people who make it over a somewhat arbitrary bar. As a random person, why would I not try to jump over the bar, even if I'm not particularly good at jumping? There are a lot of such bars with a lot of such large piles of money behind them.
Why don't we see job positions for doctors and lawyers similarly flooded, then?
But in both cases, there just isn't some low bar that you can finagle your way over and get to the promised riches. Lawyers have a literal Bar, and it isn't low. Doctors have a ton of required training. Both have serious certification requirements that computer science professionals do not. Both professions support my point.
Furthermore, incompetent lawyers face real-world tests. If they lose their cases or otherwise screw things up, they are not going to be raking in the money. And people are trying their best to flood the doctor market, by inventing certifications that avoid the requirements to be a physician and setting themselves up as alternative medicine specialists or naturalists or generic "healers" or whatever. (I'm not saying they're all crap, but I am saying that unqualified people are flooding those positions.)
As a non-programmer, I only get little programs or scripts that do something from the LLM. If they do the thing, it means the code is tested, flawless, and done. I would never let them have to deal with other humans' input, of course.
I am not so sure. Code by one LLM can be reviewed by another. Puppeteer like solutions will exist pretty soon. "Given this change, can you confirm this spec".
Even better, this can carry on for a few iterations. And both LLMs can be:
1. Budgeted ("don't exceed X amount")
2. Improved (another LLM can improve their prompts)
and so on. I think we are fixating on how _we_ do things, not on how this new world will do its _own_ thing. That to me is the real danger.
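For concreteness, a hedged sketch of the budgeted review loop described above; `ask_llm` is a stand-in for whatever model API you'd use, and the budget here is a call count rather than a dollar amount:

```python
def review_loop(spec: str, code: str, ask_llm, budget: int = 4):
    """One LLM revises, another (the same callable here) reviews, within a budget."""
    verdict = ""
    for _ in range(budget):
        verdict = ask_llm(f"Given this change:\n{code}\n\nCan you confirm it satisfies: {spec}?")
        if verdict.strip().lower().startswith("yes"):
            break
        code = ask_llm(f"Revise the change to address this review:\n{verdict}\n\nOriginal:\n{code}")
    return code, verdict
```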
That's ok. Writing such a spec is writing the code, declaratively.
The only difference between that and writing SQL (as opposed to writing imperative code to query the database) is that the translation mechanism is much more sophisticated, much less energy efficient, much slower, and most significantly much more error-prone than a SQL interpreter.
But declarative coding is good! It has its issues, and LLMs in particular compound the problems, but it's a powerful technique when it works.
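To make the SQL analogy concrete, a toy example against an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 120.0), (3, 42.0)])

# Declarative: state *what* you want and let the engine decide *how*.
big_orders = conn.execute("SELECT id FROM orders WHERE total > 50").fetchall()

# Imperative equivalent: spell out the mechanism yourself.
big_orders_manual = [oid for oid, total in conn.execute("SELECT id, total FROM orders") if total > 50]

assert [oid for (oid,) in big_orders] == big_orders_manual == [2]
```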
the user who wants it? and a premature retort: if the feedback is "the user / PM / stakeholder could be wrong", then... that's where we are. A "refiner" LLM can be fronted (Replit is playing with this for instance).
To be clear: this is not something I do currently, but my point is that one needs to detach from how _we_ engineers do this for a more accurate evaluation of whether these things truly do not work.
I'm excited to see LLMs get much better at testing. They are already good at writing unit tests (as always, you have to review them carefully). But imagine an LLM that can see your code changes _and_ can generate and execute automated and manual tests based on the change.
Great article, but doesn't talk about the potentially _most_ dangerous form of mistakes: an adversarial LLM trying to inject vulnerabilities. I expect this to become a vector soon as people figure out ways to accomplish this
Software is the manifestation of a solution to a problem.
Any entity, human or otherwise, lacking understanding of the problem being solved will, by definition, produce systems which contain some combination of defects, logic errors, and inapplicable functionality for the problem at hand.
I'm not remotely convinced that LLMs are a chainsaw, unless they've been very thoroughly trained on the problem domain. LLMs are good for vibe coding, and some of them (Grok 3 is actually good at this) can speak passable Latin, but try getting them to compose Sotadean verse in Latin or put a penthemimeral caesura in an iambic trimeter in ancient Greek. They can define a penthemimeral caesura and an iambic trimeter, but they don't understand the concepts and can't apply one to the other. All they can do is spit out the next probable token. Worse, LLMs have lied to me on the definition of Sotadean verse, not even regurgitating what Wikipedia should have taught them.
Image-generating AIs are really good at producing passable human forms, but they'll fail at generating anything realistic for dice, even though dice are just cubes with marks on them. Ask them to illustrate the Platonic solids, which you can find well-illustrated with a Google image search, and you'll get a bunch of lumps, some of which might resemble shapes. They don't understand the concepts: they just work off probability. But, they look fairly good at those probabilities in domains like human forms, because they've been specially trained on them.
LLMs seem amazing in a relatively small number of problem domains over which they've been extensively trained, and they seem amazing because they have been well trained in them. When you ask for something outside those domains, their failure to work from inductions about reality (like "dice are a species of cubes, but differentiated from other cubes by having dots on them") or to be able to apply concepts become patent, and the chainsaw looks a lot like an adze that you spend more time correcting than getting correct results from.
I asked o3-mini-high (investor paying for Pro, I personally would not) to critique the Developer UX of D3's "join" concept (how when you select an empty set then when you update you enter/exit lol) and it literally said "I'm sorry. I can't help you with that." The only thing missing was calling me Dave.
> I asked Claude 3.7 Sonnet "extended thinking mode" to review an earlier draft of this post [snip] It was quite helpful, especially in providing tips to make that first draft a little less confrontational!
So he's also using LLMs to steer his writing style towards the lowest common denominator :)
Yep. LLMs can get all the unit tests to pass. But not the acceptance tests. The discouraging thing is you might have all green checks on the unit tests, but you can’t get the acceptance tests to pass without starting over.
> Compare this to hallucinations in regular prose, where you need a critical eye, strong intuitions and well developed fact checking skills to avoid sharing information that’s incorrect and directly harmful to your reputation
Ah so you mean... actually doing work. Yeah writing code has the same difficulty, you know. It's not enough to merely get something to compile and run without errors.
> With code you get a powerful form of fact checking for free. Run the code, see if it works.
No, this would be coding by coincidence. Even the most atrociously bad prose writers don't exactly go around just saying random words from a dictionary or vaguely (mis)quoting Shakespeare hoping to be understood.
Not just that, “it works” is a very, very low bar to have for your code. To illustrate, the other day I tested an LLM by having it create a REST API. I asked for an endpoint where I could update a particular field of the record (think liking a post).
Then I decided to add on more functionality and asked for the ability to update all the other fields…
As you can guess, it gave me one endpoint per field for that entity. Sure, “it works”…
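To make the contrast concrete, here's a minimal sketch of the two shapes; FastAPI and the field names are my own illustration, not the actual stack or schema from that test:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PostUpdate(BaseModel):
    # All fields optional, so a client can send only what changed.
    title: str | None = None
    body: str | None = None
    likes: int | None = None

# The shape the LLM converged on: one endpoint per field.
@app.patch("/posts/{post_id}/likes")
def update_likes(post_id: int, likes: int):
    return {"id": post_id, "likes": likes}

# The shape that was actually wanted: a single partial-update endpoint.
@app.patch("/posts/{post_id}")
def update_post(post_id: int, update: PostUpdate):
    changes = update.model_dump(exclude_unset=True)
    # ...apply `changes` to the stored record here...
    return {"id": post_id, "updated": sorted(changes)}
```

Both versions "work"; only reading the API as a whole reveals that one of them is the wrong design.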
> Even the most atrociously bad prose writers don't exactly go around just saying random words from a dictionary or vaguely (mis)quoting Shakespeare hoping to be understood.
I'm just here to whine, almost endlessly, that the word "hallucination" is a term of art chosen deliberately because it helps promote a sense that AGI exists, by using language which implies reasoning and consciousness. I personally dislike this. I think we were mistaken in allowing AI proponents to repurpose language in that way.
It's not hallucinating, Jim; it's statistical coding errors. It's floating point rounding mistakes. It's the wrong cell in the Excel table.
Errors are a category of well understood and explicit failures.
Slop is the best description. LLMs are sloppy tools and some people are not discerning enough to know that blindly running this slop is endangering themselves and others.
> My less cynical side assumes that nobody ever warned them that you have to put a lot of work in to learn how to get good results out of these systems
Why am I reminded of people who say you first have to become a biblical scholar before you can criticize the bible?
> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.
If I have to spend lots of time learning how to use something, fix its errors, review its output, etc., it may just be faster and easier to just write it myself from scratch.
The burden of proof is not on me to justify why I choose not to use something. It's on the vendor to explain why I should turn the software development process into perpetually reviewing a junior engineer's hit-or-miss code.
It is nice that the author uses the word "assume" -- there is mixed data on actual productivity outcomes of LLMs. That is all you are doing -- making assumptions without conclusive data.
This is not nearly as strong an argument as the author thinks it is.
> As a Python and JavaScript programmer my favorite models right now are Claude 3.7 Sonnet with thinking turned on, OpenAI’s o3-mini-high and GPT-4o with Code Interpreter (for Python).
This is similar to Neovim users who talk about "productivity" while ignoring all the time spent tweaking dotfiles that could be spent doing your actual job. Every second I spend toying with models is me doing something that does not directly accomplish my goals.
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
You have no idea how much code I read, so how can you make such claims? Anyone who reads plenty of code knows that reading other people's code is often harder than just writing it yourself.
The level of hostility towards just sitting down and thinking through something without having an LLM insert text into your editor is unwarranted and unreasonable. A better policy is: if you like using coding assistants, great. If you don't and you still get plenty of work done, great.
Also, the thing that people miss is compounded experience. Just starting with any language, you have to read a lot of documentation, books, and articles. After a year or so, you have enough skeleton projects, code samples, and knowledge that you could build a mini framework if the projects were repetitive. Even then, you could just copy-paste features that you've already implemented, like that test harness or the RabbitMQ integration, and be very productive that way.
1. I know that a problem requires a small amount of code, but I also know it's difficult to write (as I am not an expert in this particular subfield) and it will take me a long time, like maybe a day. Maybe it's not worth doing at all, as the effort is not worth the result.
2. So why not ask the LLM, right?
3. It gives me some code that doesn't do exactly what is needed, and I still don't understand the specifics, but now I have a false hope that it will work out relatively easily.
4. I spend a day until I finally manage to make it work the way it's supposed to work. Now I am also an expert in the subfield and I understand all the specifics.
5. After all I was correct in my initial assessment of the problem, the LLM didn't really help at all. I could have taken the initial version from Stack Overflow and it would have been the same experience and would have taken the same amount of time. I still wasted a whole day on a feature of questionable value.
Personally, I believe the worst thing about LLMs is their abysmal ability to architect code. It's why I use LLMs more like a Google than a so-called coding buddy: there were so many times I had to rewrite the entire file because the LLM had added so many extra unmanageable functions, even deciding to solve problems I hadn't asked it to solve.
Increasingly I see apologists for LLMs sounding like people justifying fortune tellers and astrologists. The confidence games are in force, where the trick involves surreptitiously eliciting all the information the con artist needs from the mark, then playing it back to them as if it involves some deep and subtle insights.
> I’ll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
Not only is this a massive bundle of assumptions, but it's also just wrong from multiple angles. Maybe if you're only doing basic CRUDware you can spend five seconds and give a thumbs up, but in any complex system you should be spending time deeply reading code. Which is naturally going to take longer than using what knowledge you already have to throw out a solution.
I don’t really understand what the point or tone of this article is.
It says that hallucinations are not a big deal, that there are great dangers that are hard to spot in LLM-generated code… and then presents tips on fixing hallucinations, with a general theme of positivity towards using LLMs to generate code and no more time dedicated to the other dangers.
It sure gives the impression that the article itself was written by an LLM and barely edited by a human.
> The real risk from using LLMs for code is that they’ll make mistakes that aren’t instantly caught by the language compiler or interpreter. And these happen all the time!
Humans can hallucinate up some API they want to call in the same way that LLMs can, but you don't call all human mistakes hallucinations; classifying everything LLMs do wrong as hallucinations would seem rather pointless to me.
I definitely wouldn't say I'm lying (...to... myself? what? or perhaps to others, for a quick untested response in a chatroom or something) whenever I write some code and it turns out that I misremembered the name of an API. "Hallucination" for that might be over-dramatic, but at least it's a somewhat sensible description.
Maybe we should stop referring to undesired output (confabulation? Bullshit? Making stuff up? Creativity?) as some kind of input delusion. Hallucination is already a meaningful word and this is just gibberish in that context.
As best I can tell, the only reason this term stuck is because early image generation looked super trippy.
I think of hallucinations as instances where an LLM invents something that is entirely untrue - like a class or method that doesn't exist, or a fact about the world that's simply not true.
I guess you could call bugs in LLM code "hallucinations", but they feel like a slightly different thing to me.
> My cynical side suspects they may have been looking for a reason to dismiss the technology and jumped at the first one they found.
MY cynical side suggests the author is an LLM fanboi who prefers not to think that hallucinating easy stuff strongly implies hallucinating harder stuff, and therefore jumps at the first reason to dismiss the criticism.
I find it a bit surprising that I'm being called an "LLM fanboy" for writing an article with the title "Hallucinations in code are the least dangerous form of LLM mistakes" where the bulk of the article is about how you can't trust LLMs not to make far more serious and hard-to-spot logic errors.
What do you mean by "harder stuff"? What about an experimental DSL written in C with a recursive descent parser and a web server runtime that includes Lua, jq, a Postgres connection pool, mustache templates, request-based memory arena, database migrations and much more? 11,000+ lines of code with ~90% written by Claude in Cursor Composer.
Frankly us "fanbois" are just a little sick and tired of being told that we must be terrible developers working on simple toys if we find any value from these tools!
I'm a strong believer that LLMs are tools, and when wielded by talented and experienced developers they are somewhere in the danger category of Stack Overflow and transitive dependencies. This is not a critique of your project, or really of the quality of LLMs, but when I see 90% of an 11,000+ LOC project written by Claude, it just feels sort of depressing in a way I haven't processed yet.
I love foss, I love browsing projects of all quality levels and vintages and seeing how things were built. I love learning new patterns and sometimes even bickering over their strengths and weaknesses. An LLM generated code base hardly makes me even want to engage with it...
Perhaps these feelings are somewhat analogous to hardcopies vs ebooks? My opinions have changed over time and I read and collect both. Have you had similar thoughts and gotten over them? Do you see tools like Claude in a way where this isn't an issue?
You're romanticizing software. To place more value in the code than the outcome. There's nothing wrong with that, but most people that use software don't think about it that way.
Some free code review: in the first file I clicked into - https://github.com/williamcotton/webdsl/blob/92762fb724a9035... - this spot, among other places, should probably be doing the conditional "lexer->line++" thing. Quite a weird decision to force all code paths to manually do that whenever a newline char is encountered. Could've at least made an "advance_maybe_newline(lexer);" helper or so. But I guess LLMs give you copy-paste garbage.
Even the article of this thread says:
> Just because code looks good and runs without errors doesn’t mean it’s actually doing the right thing.
Thanks for taking a look! The lexer and parser are probably close to 100% Claude and I definitely didn't review them completely. I spent most of the time trying out different grammars (normally something you want to do before you start writing code) and runtime features! "Build the web server runtime and framework into the language" was an idea kicking around in my head for a few years, but until Cursor I didn't have the energy to play around with the idea.
Okay so this is a personal opinion right? Like where is the objectivity in your review?
What are the hardline performance characteristics being violated? Or functional incorrectness. Is this just "it's against my sensibilities" because at the end of the day frankly no one agrees on how to develop anything.
The thing I see a lot of developers struggle with is this: just because it doesn't fit your mental model doesn't make it objectively bad.
So unless it's objectively wrong or worse in a measurable characteristic I don't know that it matters.
For the record I'm not asserting it is right, I'm just saying I've seen a lot of critiques of LLM code boil down to "it's not how I'd write it" and I wager that holds for every developer you'll ever interact with.
OP didn't put much effort into writing the code so I'm certainly not putting in much effort into a proper review of it, for no benefit to me no less. I just wanted to see what quality AI gets you, and made a comment about it.
I'm pretty sure the code not having the "if (…) lexer->line++" in places is just a plain simple repeated bug that'll result in wrong line numbers for certain inputs.
And human-wise I'd say the simple way to not have made that bug would've been to make/change abstractions upon the second or so time writing "if (…) lexer->line++" such that it takes effort to do it incorrectly, whereas the linked code allows getting it wrong by default with no indication that there's a thing to be gotten wrong. Point being that bad abstractions are not just a maintenance nightmare, but also makes doing code review (which is extra important with LLM code) harder.
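For what it's worth, the abstraction point is language-agnostic; here's a tiny analogous sketch in Python (not the actual C code under discussion) of keeping the counter in one place:

```python
class Lexer:
    """Toy lexer where the line counter lives in exactly one place."""

    def __init__(self, src: str):
        self.src = src
        self.pos = 0
        self.line = 1

    def advance(self) -> str:
        ch = self.src[self.pos]
        self.pos += 1
        # The only spot that touches the line number, so no call site
        # can forget the "if newline, bump the counter" step.
        if ch == "\n":
            self.line += 1
        return ch
```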
I’m always really sceptical of any “proof by example” that is essentially anecdotal.
If this is going to be your argument, you need a solid scientific approach. A study where N developers are given access to a tool vs N that are not, controls are in place etc.
Because the overwhelming majority of coders I speak to are saying exactly the same thing, which is LLMs are a small productivity boost. And the majority of cursor users, which is admittedly a much smaller number, are saying it just gets stuck playing whack a mole. And common sense says these are the expected outcomes, so we are going to need really rigorous work to convince people that LLMs can build 90% of most deeply technical projects. Exceptional results require exceptional evidence.
And when we do see anecdotal incidents that seem so divergent from the norm, well that then makes you wonder how that can be, is this really objective or are we in some kind of ideological debate?
Protip: when you block a user on GitHub, it lets you add a note as to why, which will show on their profile. It will also alert you when you see a repository to which that user has contributed.
Honest question: this looks like a library others can use to build websites. It contains features related to authentication and security. If it's 90% LLM generated, how do you sleep at night? I'd be dead scared someone would use this, hit a bug that leaks PII (or worse) and then sue me into oblivion.
"WebDSL is an experimental domain-specific language and server implementation for building web applications."
And it's MIT:
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Maybe so; after all, I don't write web servers (btw, the PQ and JQ libraries don't seem to use the arena allocator, which makes the whole proposition a bit dubious, but let's say that's me being picky).
What I meant was that, IMO, the code is not very robust when dealing with memory allocations.
"People took a cursory look at a codebase I published and found glaring mistakes they discussed publicly as examples of how bad it is" is not the flex you think it is.
Do you not review code from your peers? Do you not search online and try to grok code from StackOverflow or documentation examples?
All of these can vary wildly in quality. Maybe it's because I mostly use coding LLMs as either a research tool or to write reasonably small and easy-to-follow chunks of code, but I find it no different from all of the other types of reading and understanding other people's code I already have to do.
I do these things to handle this:
- keep the outline in my head: I don't give up the architect's seat. I decide which module does what and how it fits in the whole system, its contract with other modules etc.
- review the code: this can be construed as negating the point of LLMs since it is time consuming, but I think it is important to go through it line by line and understand every line. You will absorb some of the LLM-generated code in the process, which will form an imperfect map in your head. That's essential for beginning troubleshooting the next time things go wrong.
- last mile connectivity: several times the LLM takes you there but can't complete the last mile connectivity; instead of wasting time chasing it, do the final wiring yourself. This is a great shortcut to achieve the previous point.
In my experience you just don't keep as good a map of the codebase in your head when you have LLMs write a large part of your codebase as when you write everything yourself. Having a really good map of the codebase in your head is what brings you large productivity boosts when maintaining the code. So while LLMs do give me a 20-30% productivity boost for the initial implementation, they bring huge disadvantages after that, and that's why I still mostly write code myself and use LLMs only as a stackoverflow alternative.
> This is a great shortcut to achieve the previous point.
How does doing the hard part provide a shortcut for reviewing all the LLM code?
If anything it's a long cut, because now you have to understand the code and write it yourself. This isn't great, it's terrible.
Sure, whatever works for you; my approach works for me.
But you don't explain how doing the hard part shortcuts needing to understand the LLM code.
The way you've written this comes across like the AI is influencing your writing style....
thatistrue I us ed to write lik this b4 ai it has change my life
Three bullet points, each with three sentences (ok last one has a semicolon instead) is a dead giveaway
I feel like “looks like it’s written by AI” might become a critique of writing that’s very template-like, neutral, corporate. I don’t usually dislike it though, as long as the information is there.
Lots of people wrote like that before AI. AI writes like people; it's made to copy how people write. It wouldn't write like that if people didn't.
Yes, I prefer using lists myself too; that does not mean my writing is being influenced by AI. I have always liked bullet points, long before AI was even a thing: they make for better organization and visual clarity.
Three bullet points AND three sentences?!! Get outta here...
I think this is a great line:
> My fear is that LLM generated code will look great to me, I won't understand it fully but it will work
This is a degree of humility that makes the scenario we are in much clearer.
Our information environment got polluted by the lack of such humility. Rhetoric that sounded ‘right’ is used everywhere. If it looks like an Oxford Don, sounds like an Oxford Don, then it must be an academic. Thus it is believable, even if they are saying the Titanic isn’t sinking.
Verification is the heart of everything humanity does, our governance structures, our judicial systems, economic systems, academia, news, media - everything.
It’s a massive computational effort to figure out the best ways to allocate resources given current information, allowing humans to create surplus and survive.
This is why we dislike monopolies, or manipulations of these markets - they create bad goods, and screw up our ability to verify what is real.
Worst part is that the patterns of implementation won't be consistent across the pieces. So debugging a whole codebase that was authored with LLM-generated code is like having to debug a codebase where every function was written by a different developer and no one followed any standards. I guess you can specify the coding standards in the prompt and ask it to use FP-style programming only, but I'm not sure how well it can follow that.
Not well, at least for ChatGPT. It can't follow my custom instructions which can be summed up as "follow PEP-8 and don't leave trailing whitespace".
I don't think they meant formatting details.
It is supposed to follow that instruction though. When it generates code, I can tell it to use tabs, 2 spaces, etc. and the generated code will use that. It works well with Claude, at least.
> But since I didn't author it, I wouldn't be great at finding bugs in it or logical flaws.
Alas, I don't share your optimism about code I wrote myself. In fact, it's often harder to find flaws in my own code than when reading someone else's code.
Especially if 'this is too complicated for me to review, please simplify' is allowed as a valid outcome of my review.
When it comes to relying on code that you didn't write yourself, like an npm package, do you care if it's AI code or human code? Do you think your trust toward AI code may change over time?
Of course I care. Human-written code was written for a purpose, with a set of constraints in mind, and other related code will have been written for the same or a complementary purpose and set of constraints. There is intention in the code. It is predictable in a certain way, and divergences from the expected are either because I don't fully understand something about the context or requirements, or because there's a damn good reason. It is worthwhile to dig further until I do understand, since it will very probably have repercussions elsewhere and elsewhen.
For AI code, that's a waste of time. The generated code will be based on an arbitrary patchwork of purposes and constraints, glued together well enough to function. I'm not saying it lacks purpose or constraints, it's just that those are inherited from random sources. The parts flow together with robotic but not human concern for consistency. It may incorporate brilliant solutions, but trying to infer intent or style or design philosophy is about as useful as doing handwriting analysis on a ransom note made from pasted-together newspaper clippings.
Both sorts of code have value. AI code may be well-commented. It may use features effectively that a human might have missed. Just don't try to anthropomorphize an AI coder or a lawnmower, you'll end up inventing an intent that doesn't exist.
what if you
- generate
- lint
- format
- fuzz
- test
- update
infinitely?
Then you'll get code that passes the tests you generate, where "tests" includes whatever you feed the fuzzer to detect problems. (Just crashes? Timeouts? Comparison with a gold standard?)
Sorry, I'm failing to see your point.
Are you implying that the above is good enough, for a useful definition of good enough? I'm not disagreeing, and in fact that was my starting assumption in the message you're replying to.
Crap code can pass tests. Slow code can pass tests. Weird code can pass tests. Sometimes it's fine for code to be crap, slow, and/or weird. If that's your situation, then go ahead and use the code.
To expand on why someone might not want such code, think of your overall codebase as having a time budget, a complexity budget, a debuggability budget, an incoherence budget, and a maintenance budget. Yes, those overlap a bunch. A pile of AI-written code has a higher chance of exceeding some of those budgets than a human-written codebase would. Yes, there will be counterexamples. But humans will at least attempt to optimize for such things. AIs mostly won't. The AI-and-AI-using-human system will optimize for making it through your lint-fuzz-test cycle successfully and little else.
Different constraints, different outputs. Only you can decide whether the difference matters to you.
> Then you'll get code that passes the tests you generate
Just recently I think here on HN there was a discussion about how neural networks optimize towards the goal they are given, which in this case means exactly what you wrote, including that the code will do stuff in wrong ways just to pass the given tests.
Where do the tests come from? Initially from a specification of what "that thing" is supposed to do and also not supposed to do. Everyone who had to deal with specifications in a serious way knows how insanely difficult it is to get these right, because there are often things unsaid, there are corner cases not covered and so on. So the problem of correctness is just shifted, and the assumption that this may require less time than actually coding ... I wouldn't bet on it.
Conceptually the idea should work, though.
Who has that much time and money when your boss is breathing down your neck?
Publicly available code with lots of prior usage seems less likely to be buggy than LLM-generated code produced on-demand and for use only by me.
To fight this I mostly do ping-pong pairing with LLMs. After we discuss the general goal and approach, I usually write the first test. The LLM then makes it pass and writes the next test, which I'll make pass, and so on. It forces me to stay 100% in the loop and understand everything. Maybe it's not as fast as having the LLM write as much as possible, but I think it's a worthwhile tradeoff.
The big argument against it is that, at some point, there's a chance that you won't really need to understand what the code does. LLMs write code, LLMs write tests, you find bugs, LLM fixes code, LLM adds test cases for the found bug. Rinse and repeat.
For fairly simple projects built from scratch, we're already there.
Claude Code has been doing all of this for me on my latest project. It's remarkable.
It seems inevitable it'll get there for larger and more complex code bases, but who knows how far away that is.
What do you do when the LLM doesn't fix the code?
You tell it there's an error, and to fix the code (/s)
> My fear is that LLM generated code will look great to me, I won't understand it fully but it will work.
If you don’t understand it, ask the LLM to explain it. If you fail to get an explanation that clarifies things, write the code yourself. Don’t blindly accept code you don’t understand.
This is part of what the author was getting at when they said that it’s surfacing existing problems not introducing new ones. Have you been approving PRs from human developers without understanding them? You shouldn’t be doing that. If an LLM subsequently comes along and you accept its code without understanding it too, that’s not a new problem the LLM introduced.
Code reviews with a human are a two way street. When I find code that is ambiguous I can ask the developer to clarify and either explain their justification or ask them to fix it before the code is approved. I don’t have to write it myself, and if the developer is simply talking in circles then I’d be able to escalate or reject—and this is a far less likely failure case to happen with a real trusted human than an LLM. “Write the code yourself” at that point is not viable for any non-trivial team project, as people have their own contexts to maintain and commitments/projects to deliver. It’s not the typing of the code that is the hard part (which is the only real benefit of LLMs: they can type super fast); it’s fully understanding the problem space. Working with another trusted human is far far different from working with an LLM.
No one takes the time to fully understand all the PRs they approve. And even when you do take the time to “fully understand” the code, it’s very easy for your brain to trick you into believing you understand it.
At least when a human wrote it, someone understood the reasoning.
>My fear is that LLM generated code will look great to me, I won't understand it fully but it will work.
puzzled. if you don't understand it fully, how can you say that it will look great to you, and that it will work?
It happens all the time. Way before LLM. There were countless times I implemented an algorithm from a paper or a book while not fully understanding it (in other words, I can't prove the correctness or time complexity without referencing the original paper).
> if you don't understand it fully, how can you say that it will look great to you, and that it will work?
Presumably, that simply reflects that a primary developer always has an advantage of having a more reliable understanding of a large code base - and the insights into the problem that come about during development challenges - than a reviewer of such code.
A lot of important but subtle insights into a problem, many sub-verbal, come from going through the large and small challenges of creating something that solves it. Reviewers just don't get those insights as reliably.
Reviewers can't see all the subtle or non-obvious alternate paths or choices. They are less likely to independently identify subtle traps.
All of this. Could have saved me a comment [0] if I'd seen this earlier.
When people talk about 30% or 50% coding productivity gains with LLMs, I really want to know exactly what they're measuring.
[0] https://news.ycombinator.com/item?id=43236792
> ...but it will work
You don't know that though. There's no "it must work" criteria in the LLM training.
> I wouldn't be great at finding bugs in it or logical flaws
This is what tests are for.
You can't test quality into a product.
The tests are probably LLM generated as well lol
> Just because code looks good and runs without errors doesn’t mean it’s actually doing the right thing. No amount of meticulous code review—or even comprehensive automated tests—will demonstrably prove that code actually does the right thing. You have to run it yourself!
I would have stated this a bit differently: No amount of running or testing can prove the code correct. You actually have to reason through it. Running/testing is merely a sanity/spot check of your reasoning.
I’m not sure it’s possible to have the full reasoning in your head without authoring the code yourself - or, spending a comparable amount of effort to mentally rewrite it.
Spoken by someone who hasn't had to maintain Someone Else's Code on a budget.
You can't just rewrite everything to match your style. You take what's in there and adapt to the style, your personal preference doesn't matter.
They said “mentally rewrite”, not actually rewrite.
I tend to agree, which is why I’m skeptical about large-scale LLM code generation, until AIs exhibit reliable diligence and more general attention and awareness, and probably also long-term memory about a code base and its application domain.
Which is why everyone is so keen on standards (conventions, formatting, architecture, ...): it is less of a burden when you're just comparing expected to actual than when you're learning unknowns.
Agree - case in point - dealing with race conditions. You have to reason thru the code.
> case in point - dealing with race conditions.
100%. Case in point for case in point - I was just scratching my head over some Claude-produced lines for me, thinking if I should ask what this kind entity had in mind when using specific compiler builtins (vs. <stdatomic.h>), like, "is there logic to your madness..." :D
I think it just likes compiler builtins because I mentioned GCC at some point...
Not sure that human reasoning actually beats testing when checking for correctness
"Beware of bugs in the above code; I have only proved it correct, not tried it."
Donald E. Knuth
The production of such tests presumably requires an element of human reasoning.
The requirements have to come from somewhere, after all.
Both are necessary, they complement each other.
Human reason is fine, the problem is that human attention spans aren't great at checking for correctness. I want every corner case regression tested automatically because there's always going to be some weird configuration that a human's going to forget to regression test.
With any non trivial system you can’t actually test every corner case. You depend on human reason to identify the ones most likely to cause problems.
Well, what if you run a complete test suite?
There is no complete test suite, unless your code is purely functional and has a small-ish finite input domain.
And even then, your code could pass all tests but be a spaghetti mess that will be impossible to maintain and add features to.
Seems to be a bit of a catch 22. No LLM can write perfect code, and no test suite can catch all bugs. Obviously, no human can write perfect code either.
If LLM-generated code has been "reasoned-through," tested, and it does the job, I think that's a net-benefit compared to human-only generated code.
>I think that's a net-benefit compared to human-only generated code.
Net-benefit in what terms though? More productive WRT raw code output? Lower error rate?
Because, something about the idea of generating tons of code via LLMs, which humans have to then verify, seems less productive to me and more error-prone.
I mean, when verifying code that you didn't write, you generally have to fully reason through it, just as you would to write it (if you really want to verify it). But, reasoning through someone else's code requires an extra step to latch on to the author's line of reasoning.
OTOH, if you just breeze through it because it looks correct, you're likely to miss errors.
The latter reminds me of the whole "Full self-driving, but keep your hands on the steering wheel, just in case" setup. It's going to lull you into overconfidence and passivity.
> "Full self-driving, but keep your hands on the steering wheel, just in case" setup
This is actually a trick though. No one working on self-driving expects people to babysit it for long at all. Babysitting actually feels worse than driving. I just saw a video on self-driving trucks and how the human driver had his hands hovering over the wheel. The goal of the video is to make you think about how amazing self-driving rigs will be, but all I could think about was what an absolutely horrible job it will be to babysit these things.
Working full-time on AI code reviews sounds even worse. Maybe if it's more of a conversation and you're collaboratively iterating on small chunks of code then it wouldn't be so bad. In reality though, we'll just end up trusting the AI because it'll save us a ton of money and we'll find a way to externalize the screw ups.
> reasoning through someone else's code requires an extra step to latch on to the author's line of reasoning.
And, in my experience, it’s a lot easier to latch on to a real person’s real line of reasoning rather than a chatbot’s “line of reasoning”
Also after reasonable period if you are stuck you can actually ask them what were they thinking and why was it written that way and what are the constrains they thought of.
And you can discuss these, with both of you hopefully having experience in the domain.
Exactly. And, if correction is required, then you either re-write it or you're stuck maintaining whatever odd way the LLM approached the problem, whether it's as optimal (or readable) as a human's or not.
If the complete test suite were enough, then SQLite, who famously has one of the largest and most comprehensive, would not encounter bugs. However, they still do.
If you employ AI, you're adding a remarkable amount of speed to a processing domain that is undecidable because most inputs are not finite. Eventually, you will end up reconsidering the Gambler's Fallacy because of the chances of things going wrong.
You mean, for example test that your sieve finds all primes, and only primes that fit in 4096 bits?
Paging Dr. Turing. Dr. Turing, please report to the HN comment section.
Last week, The Primeagen and Casey Muratori carefully reviewed the output of a state-of-the-art LLM code generator.
They provided a task well-represented in the LLM's training data, so development should have been easy. The task was presented as a cumulative series of modifications to a codebase:
https://www.youtube.com/watch?v=NW6PhVdq9R8
This is the actual reality of LLM code generators in practice: iterative development converging on useless code, with the LLM increasingly unable to make progress.
In my own experience, I have all sorts of ways that I try to 'drag' the llm out of some line of 'thinking' by editing the conversation as a whole, or just restarting the whole prompt, and I've been kind of just doing this over time since GPT3.
While I still think all this code generation is super cool, I've found that the 'density' of the code makes it even more noticeable - and often annoying - to see the model latch on, say, some part of the conversation that should essentially be pruned from the whole thinking process, or pursue some part of earlier code that makes no sense to me, and then 'coaxing' it again.
I don't agree. What if the LLM takes a two-step approach, where it first determines a global architecture, and then it fills in the code? (Where it hallucinates in the first step).
Hallucinations themselves are not even the greatest risk posed by LLMs. A much greater risk (in simple terms of probability times severity) I'd say is that chat bots can talk humans into harming themselves or others. Both of which have already happened, btw [0,1]. Still not sure if I'd call that the greatest overall risk, but my ideas for what could be even more dangerous I don't even want to share here.
[0] https://www.qut.edu.au/news/realfocus/deaths-linked-to-chatb...
[1] https://www.theguardian.com/uk-news/2023/jul/06/ai-chatbot-e...
I don't know if the model changed in the last six months, or maybe the wow factor has worn off a bit, but it also feels like ChatGPT has become a lot more "people-pleasy" than it was before.
I'll ask it opinionated questions, and it will just do stuff to reaffirm what I said, even when I give contrary opinions in the same chat.
I personally find it annoying (I don't really get along with human people pleasers either), but I could see someone using it as a tool to justify doing bad stuff, including self-harm; it doesn't really ever push back on what I say.
I haven't played with it too much, and maybe it's changed recently or the paid version is different, but last week I found it irritatingly obtuse.
> Me: Hi! Could you please help me find the problem with some code?
> ChatGPT: Of course! Show me the code and I'll take a look!
> Me: [bunch o' code]
> ChatGPT: OK, it looks like you're trying to [do thing]. What did you want help with?
> Me: I'm trying to find a problem with this code.
> ChatGPT: Sure, just show me the code and I'll try to help!
> Me: I just pasted it.
> ChatGPT: I can't see it.
Yeah, I think it's coded to be super-conciliatory as some sort of apology for its hallucinations, but I find it annoying as well. Part of it is just like all automated prompts that try to be too human. When you know it's not human, it's almost patronizing and just annoying.
But, it's actually worse, because it's generally apologizing for something completely wrong that it told you just moments before with extreme confidence.
It's obvious, isn't it? The average Hacker News user, who has converged to the average Internet user, wants exactly that experience. LLMs are pretty good tools but perhaps they shouldn't be made available to others. People like me can use them but others seem to be killed when making contact. I think it's fine to restrict access to the elite. We don't let just anyone fly a fighter jet. Perhaps the average HN user should be protected from LLM interactions.
Is that really what you got from what I wrote? I wasn't suggesting that we restrict access to anyone, and I wasn't trying to imply that I'm somehow immune to the problems that were highlighted.
I mentioned that I don't like people-pleasers and I find it a bit obnoxious when ChatGPT does it. I'm sure that there might be other bits of subtle encouragement it gives me that I don't notice, but I can't elaborate on those parts because, you know, I didn't notice them.
I genuinely do not know how you got "we should restrict access" from my comment or the parent, you just extrapolated to make a pretty stupid joke.
Haha, I'm not claiming you're wanting that. I want that. So I'm saying it. What makes you think I was attempting to restate what you wrote?
It looked like you were being sarcastic, implying I was trying to suggest that I thought I was better than the average person in regards to handling AI. Particularly this line:
> People like me can use them but others seem to be killed when making contact.
If I misread that, fair enough.
Yeah, no, 100% sincere personal view. That guy who killed himself after using it is obviously not ready for this. Imagine killing yourself after typing in `print("Kill yourself")` at the Python REPL. The guy's an imbecile. We don't let just anyone drive a truck. I'm fine with nearly everyone being on the outside and unable to use these tools so long as I'm allowed to with as little trouble as possible.
I recognize that the view that others should not be permitted things that I should be allowed to use is generally a sarcastically expressed view, but I genuinely think it has merit. Everyone who believes these things are dangerous and everyone to whom this is obviously dangerous, like the aforementioned mentally deficient individual, shouldn't be permitted use.
More generally - AI that is good at convincing people is very powerful, and powerful things are dangerous.
I'm increasingly coming around to the notion that AI tooling should have safety features concerned with not directly exposing humans to asymptotically increasing levels of 'convincingness' in generated output. Something like a weaker model used as a buffer.
Projecting out to 5-10 years: what happens when LLMs are still producing hallucinatory semi-sense, but merely comprehending it makes the machine temporarily own you? A bit like getting hair caught in an angle grinder, that.
Like most safety regulations, it'll take blood for the inking. Exposing mass numbers of people to these models strikes me as wildly negligent if we expect continued improvement along this axis.
>Projecting out to 5-10 years: what happens when LLMs are still producing hallucinatory semi-sense, but merely comprehending it makes the machine temporarily own you? A bit like getting hair caught in an angle grinder, that.
Seriously? Do you suppose that it will pull this trick off through some sort of hypnotizing magic perhaps? I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.
The kinds of people who would be convinced by such "dangers" are likely to be mentally unstable or suggestible enough about it to in any case be convinced by any number of human beings anyhow.
Aside from demonstrating the persistent AI woo that permeates many comments on this site, the logic above reminds me of the harping nonsense around the supposed dangers of video games or certain violent movies "making kids do bad things" in years past. The prohibitionist nanny tendencies behind such fears are more dangerous than any silly chatbot AI.
If you believe current models exist at the limit of possible persuasiveness, there obviously isn't any cause for concern.
For various reasons, I don't believe that, which is why my argument is predicated on them improving over time. Obviously current models aren't overly hazardous in the sense I posit - it's a concern for future models that are stronger, or explicitly trained to be more engaging and/or convincing.
The load bearing element is the answer to: "are models becoming more convincing over time?" not "are they very convincing now?"
> [..] I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot [..]
Then you're not engaging with the premise at all, and are attacking a point I haven't made. The tautological assurance that non-convincing AI is not convincing is not relevant to a concern predicated on the eventual existence of highly convincing AI: that sufficiently convincing AI is hazardous due to induced loss of control, and that as capabilities increase the loss of control becomes more difficult to resist.
Yeah... this. I'm not so concerned that AI is going to put me out of a job or become Skynet. I'm concerned that people are offloading decision making and critical thinking to the AI, accepting its response at face value and responding to concerns with "the AI said so... must be right". Companies are already maliciously exploiting this (e.g. the AI has denied your medical claim, and we can't tell you how it decided that because our models are trade secrets), but it will soon become de rigueur and people will think you're weird for questioning the wisdom of the AI.
The combination of blind faith in AI, and good faith highly informed understanding and agreement, achieved with help of AI, covers the full spectrum of the problem.
In both of your linked examples, the people in question very likely had at least some sort of mental instability working in their minds.
I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.
The kinds of people who would be convinced by such "harm dangers" are likely to be mentally unstable or suggestible enough about it to in any case be convinced by any number of human beings, or by books, or movies or any other sort of excuse for a mind that had problems well before seeing X or Y.
By the logic of regulating AI for these supposed dangers, you could argue that literature, movie content, comic books, YouTube videos and that much loved boogeyman in previous years of violent video games should all be banned or regulated for the content they express.
Such notions have a strongly nannyish, prohibitionist streak that's much more dangerous than some algorithm and the bullshit it spews to a few suggestible individuals.
The media of course loves such narratives, because their breathless hysteria and contrived fear-mongering plays right into more eyeballs. Seeing people again take seriously such nonsense after idiocies like the media frenzy around video games in the early 2000s and prior to that, similar media fits about violent movies and even literature, is sort of sad.
We don't need our tools for expression, and sources of information "regulated for harm" because a small minority of others can't get an easy grip on their psychological state.
Is this somehow worse than humans talking each other into it?
> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.
This seems like a very flawed assumption to me. My take is that people look at hallucinations and say "wow, if it can't even get the easiest things consistently right, no way am I going to trust it with harder things".
You'd be surprised. I know a few people who couldn't really code before LLMs, but now with LLMs they can just brute-force through problems. They seem pretty undeterred about 'trusting' the solution; if they ran it and it worked for them, it gets shipped.
Well I hope this isn’t backend code because the amount of vulnerabilities that are going to come from these practices will be staggering
The backlash will be enormous. In the near future, there will be less competent coders and a tsunami of bad code to fix. If 2020 was annoying to hiring managers they have no idea how bad it will become.
Of course this will be the case, but probably not for the reasons you are concerned about. It is because a lot of people have been enabled by these tools to realize they are able to do things they thought were beyond them.
The opaque wall that separates the solution from the problem in technology often comes from the very steep initial learning curve. The reason most people who are developers now learned to code is because they had free time when they were young, had access to the technology, and were motivated to do it.
But as an adult, very few people are able to get past the first obstacles which keep them from eventually becoming proficient, but now they have a cheat code. So you will see a lot more capable programmers in the future who will be able to help you fix this backlog of bad code -- we just have to wait for them to gain the experience and knowledge needed before that happens and deal with the mistakes along the way.
This is no different from any other enabling technology. The people who feel like they had to struggle through it and pay their dues when it 'wasn't easy' are going to be resentful and try and gatekeep; it is only human nature.
> you have to put a lot of work in to learn how to get good results out of these systems
That certainly punctures the hype. What are LLMs good for, if the best you can hope for is to spend years learning to prompt it for unreliable results?
Many tools that increase your productivity as a developer take a while to master. For example, it takes a while to become proficient with a debugger, but I'd still wager that it's worth it to learn to use a debugger over just relying on print debugging.
LLM generated code is legacy code.
An anecdote: I was working for a medical centre, and had some code that was supposed to find the 'main' clinic of a patient.
The specification was to only look at clinical appointments, and find the most recent appointment. However if the patient didn't have a clinical appointment, it was supposed to find the most recent appointment of any sort.
I wrote the code by sorting the data (first by clinical-non-clinical and then by date). I asked chatgpt to document it. It misunderstood the code and got the sorting backwards.
I was pretty surprised, and after testing with foo-bar examples eventually realised that I had called the clinical-non-clinical column "Clinical", which confused the LLM.
This is the kind of mistake that is a lot worse than "code doesn't run" - being seemingly right but wrong is much worse than being obviously wrong.
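For illustration, the sort was presumably something of this shape (the field names are reconstructions, not the real schema); flipping either the `reverse` flag or the group key silently picks the wrong appointment, which is exactly the kind of inversion the LLM made when documenting it:

```python
from datetime import date

appointments = [
    {"Clinical": False, "date": date(2024, 5, 1)},
    {"Clinical": True,  "date": date(2024, 3, 10)},
    {"Clinical": True,  "date": date(2024, 4, 2)},
]

# Most recent first, then (stable sort) clinical appointments ahead of
# non-clinical ones: the head of the list is the latest clinical
# appointment if one exists, otherwise the latest appointment of any kind.
ranked = sorted(appointments, key=lambda a: a["date"], reverse=True)
ranked = sorted(ranked, key=lambda a: not a["Clinical"])
main_appointment = ranked[0]   # here: the clinical visit on 2024-04-02
```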
To be clear, by "clinical-non-clinical", you mean a boolean flag for whether the appointment is clinical?
> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.
If you’re writing code in Python against well documented APIs, sure. But it’s an issue for less popular languages and frameworks, when you can’t immediately tell if the missing method is your fault due to a missing dependency, version issue, etc.
IMX, quite a few Python users - including ones who think they know what they're doing - run into that same confusion, because they haven't properly understood fundamentals e.g. about how virtual environments work, or how to read documentation effectively. Or sometimes just because they've been careless and don't have good habits for ensuring (or troubleshooting) the environment.
If the hallucinated code doesn't compile (or in an interpreted language, immediately throws exceptions), then yes, that isn't risky because that code won't be used. I'm more concerned about code that appears to work for some test cases but solves the wrong problem or inadequately solves the problem, and whether we have anyone on the team who can maintain that code long-term or document it well enough so others can.
I once submitted some code for review, in which the AI had inserted a recursive call to the same function being defined. The recursive call was completely unnecessary and completely nonsensical, but also not wrong per se - it just caused the function to repeat what it was doing. The code typechecked, the tests passed, and the line of code was easy to miss while doing a cursory read through the logic. I missed it, the code reviewer missed it, and eventually it merged to production.
Unfortunately there was one particular edge case which caused that recursive call to become an infinite loop, and I was extremely embarrassed seeing that "stack overflow" server error alert come through Slack afterward.
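To give a flavour of the shape of that bug (a toy Python sketch, not the actual code or language from that PR):

    import json

    def parse_payload(raw: str) -> dict:
        """Toy example: a 'retry' that type-checks, passes the obvious tests,
        and is easy to skim past in review."""
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            repaired = raw.strip().rstrip(",")
            # For valid JSON or a trailing comma this terminates fine. For any
            # payload malformed in some other way, repaired == raw and the call
            # recurses with identical input until the stack blows up.
            return parse_payload(repaired)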
fwiw this problem already exists with my more junior co-workers. and also my own code that I write when exhausted!
if you have trusted processes for review and aren't always rushing out changes without triple checking your work (plus a review from another set of eyes), then I think you catch a lot of the subtler bugs that are emitted from an LLM.
I use ChatGPT to generate code a lot, and it's certainly useful, but it has given me issues that are not obvious.
For example, I had it generate some C code to be used with ZeroMQ a few months ago. The code looked absolutely fine, and it mostly worked fine, but it made a mistake with its memory allocation stuff that caused it to segfault sometimes, and corrupt memory other times.
Fortunately, this was such a small project and I already know how to write code, so it wasn't too hard for me to find and fix, though I am slightly concerned that some people are copypasting large swaths of code from ChatGPT that looks mostly fine but hides subtle bugs.
>though I am slightly concerned that some people are copypasting large swaths of code from ChatGPT that looks mostly fine but hides subtle bugs.
They used to do the same with Stack Overflow. But now it's more dangerous, because the code can be "subtly wrong in ways the user can't fathom" to order.
And subtle bugs existed pre-2022; judging by how often my apps are updated for "minor bug fixes", this is par for the course.
Sure, it's possible that the code it gave me was based on some incorrectly written code it scraped from Gitlab or something.
I'm not a luddite, I'm perfectly fine with people using AI for writing code. The only thing that really concerns me is that it has the potential to generate a ton of shitty code that doesn't look shitty, creating a lot of surface area for debugging.
Prior to AI, the quantity of crappy code that could be generated was basically limited by the speed in which a human could write it, but now there's really no limit.
Again, just to reiterate, this isn't "old man yells at cloud". I think AI is pretty cool, I use it all the time, I don't even have a problem with people generating large quantities of code, it's just something we have to be a bit more wary of.
I am not a programmer and I don't use Linux. I've been working on a Python script for a Raspberry Pi for a few months. ChatGPT has been really helpful in showing me how to do things or debug errors.
Now I am at the point that I am cleaning up the code and making it pretty. My script is less than 300 lines and Chatgpt regularly just leaves out whole chunks of the script when it suggests improvements. The first couple times this led to tons of head scratching over why some small change to make one thing more resilient would make something totally unrelated break.
Now I've learned to take Chatgpt's changes and diff it with the working version before I try to run it.
Chatgpt can output a straight diff, too, that you can use with patch.
That's how aider commands the models to reply, for example.
That's not quite right. The models are pretty bad at generating a proper diff, so there are two common formats used. The main one is a search and replace, and the search is then done in quite a fuzzy manner.
To be clear, the diff they generate is something you or I could apply manually and wouldn't notice an issue with. It's things like very minor whitespace issues or, more commonly, the counts saying how large the sections are: nothing that affects the meat of the diff. They're fine with the hard part but then make small counting errors.
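For anyone curious what that looks like in practice, here's the rough idea (a toy Python sketch, not aider's actual implementation):

    def apply_edit(source: str, search: str, replace: str) -> str:
        """Apply a model-emitted SEARCH/REPLACE block, with a whitespace-tolerant
        fallback for when the 'search' text doesn't match the file exactly."""
        if search in source:
            return source.replace(search, replace, 1)
        src_lines = source.splitlines()
        pat_lines = [line.strip() for line in search.splitlines()]
        for i in range(len(src_lines) - len(pat_lines) + 1):
            window = src_lines[i:i + len(pat_lines)]
            if [line.strip() for line in window] == pat_lines:
                new_lines = (src_lines[:i] + replace.splitlines()
                             + src_lines[i + len(pat_lines):])
                return "\n".join(new_lines)
        raise ValueError("search block not found in source")

Real tools are fuzzier than this (they tolerate more than just stray whitespace), but that's the gist: the model writes "old text / new text" and the harness does the actual patching.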
Version control inside an IDE helps with noticing these types of changes, even if you aren't a programmer
You can try asking ChatGPT to rewrite the original script to include the improvements.
Yeah, it's great at toy projects.
In my experience, a tool like Windsurf or Cursor (w/ Sonnet) is great at building a real project, as long as the guardrails are clearly defined.
For example, starting a SaaS project from something like Refine.dev + Ant Design, instead of just a blank slate.
Of course, none of what I build is even close to novel code, which helps.
> The moment you run LLM generated code, any hallucinated methods will be instantly obvious: you’ll get an error. You can fix that yourself or you can feed the error back into the LLM and watch it correct itself.
Interestingly though, this only works if there is an error. There are cases where you will not get an error; consider a loosely typed programming language like JS or Python, or simply any programming language when some of the API interface is unstructured, like using stringly-typed information (e.g. Go struct tags.) In some cases, this will just silently do nothing. In other cases, it might blow up at runtime, but that does still require you to hit the code path to trigger it, and maybe you don't have 100% test coverage.
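A tiny Python illustration of the "silently does nothing" case (names made up):

    class RetryPolicy:
        """Hypothetical config object; nothing here validates attribute names."""
        def __init__(self) -> None:
            self.max_retries = 3

    policy = RetryPolicy()
    policy.max_retrys = 10           # misspelled attribute: no error, just a new, unused attribute
    assert policy.max_retries == 3   # behaviour silently stays at the default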
So I'd argue hallucinations are not always safe, either. The scariest thing about LLMs in my mind is just the fact that they have completely different failure modes from humans, making it much harder to reason about exactly how "competent" they are: even humans are extremely difficult to compare with regards to competency, but when you throw in the alien behavior of LLMs, there's just no sense of it.
And btw, it is not true that feeding an error into an LLM will always result in it correcting the error. I've been using LLMs experimentally, even trying to guide them towards solving problems I know how to solve, and sometimes they simply can't, and just make a bigger and bigger mess. Because LLMs confidently pretend to know the exact answer ahead of time, presumably due to the way they're trained, they will confidently do things that would make more sense to try and then undo when they don't work, like messing with the linker order or adding dependencies to a target to fix undefined reference errors (which are actually caused by e.g. ABI issues).

I still think LLMs are a useful programming tool, but we could use a bit more realism. If LLMs were as good as people sometimes imply, I'd expect an explosion of quality software to show up. (There are exceptions, of course; I believe the first versions of Stirling PDF were GPT-generated, way back.) Machine-generated illustrations have flooded the Internet despite their shortcomings, but programming with AI assistance remains tricky and not yet the force multiplier it is often made out to be. I do not believe AI-assisted coding has hit its Stable Diffusion moment, if you will.
Now whether it will or not, is another story. Seems like the odds aren't that bad, but I do question if the architectures we have today are really the ones that'll take us there. Either way, if it happens, I'll see you all at the unemployment line.
> The moment you run LLM generated code, any hallucinated methods will be instantly obvious: you’ll get an error. You can fix that yourself or you can feed the error back into the LLM and watch it correct itself.
But that's for methods. For libraries, the scenario is different, and possibly a lot more dangerous. For example, the LLM generates code that imports a library that does not exist. An attacker notices this too while running tests against the LLM. The attacker decides to create these libraries on the public package registry and injects malware. A developer may think: "oh, this newly generated code relies on an external library, I will just install it," and gets owned, possibly without even knowing for a long time (as is the case with many supply chain attacks).
And no, I'm not looking for a way to dismiss the technology, I use LLMs all the time myself. But what I do think is that we might need something like a layer in between the code generation and the user that will catch things like this (or something like Copilot might integrate safety measures against this sort of thing).
Prompt injection means that unless people using LLMs to generate code are willing to hunt down and inspect all dependencies, it will become extremely easy to spread malware.
> Chose boring technology. I genuinely find myself picking libraries that have been around for a while partly because that way it’s much more likely that LLMs will be able to use them.
This is an appeal against innovation.
> I’ll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
As someone who has spent [an incredible amount of time reviewing other people's code](https://github.com/ziglang/zig/pulls?q=is%3Apr+is%3Aclosed), my perspective is that reviewing code is fundamentally slower than writing it oneself. The purpose of reviewing code is mentorship, investing in the community, and building trust, so that those reviewees can become autonomous and eventually help out with reviewing.
You get none of that from reviewing code generated by an LLM.
> This is an appeal against innovation.
No it is not. It is arguing for using more stable and better documented tooling.
so it's an appeal to not innovate on tooling and languages?
It's not appealing to anything.
Even with boring tech that's been in the training set for ages (rails), you can get some pretty funny hallucinations: https://bengarcia.dev/making-o1-o3-and-sonnet-3-7-hallucinat... (fortunately this one was the very non-dangerous kind, making it very obvious; though I wonder how many non-obvious hallucinations entered the training set by the same process)
One thing I've found is that while working with an LLM it can do things way faster than me, but the other side of it is that I'm quickly losing my understanding of the deeper code.
If someone asks me a question about something I've worked on myself, I can usually give an answer about some deep functionality.
At the moment I'm working with a LLM on a 3D game and while it works, I would need to rebuild it to understand all the elements of it.
For me this is my biggest fear - not that LLMs can code, but that they do so at such a volume that in a generation or two no one will understand how the code works.
> With code you get a powerful form of fact checking for free. Run the code, see if it works.
Um. No.
This is an oversimplification that falls apart in anything beyond a minimally complex system.
Over my career I've encountered plenty of consequences caused by reliability problems: code that would run, but where the side effects of not processing something, processing it too slowly, or processing it twice had serious consequences, both financial and personal.
And those weren't "nuclear power plant management" kinds of critical. I often think back to an educational game used at a school, where losing a single save's progress meant a couple thousand dollars of reimbursement.
https://xlii.space/blog/network-scenarios/
This is a cheatsheet I made for my colleagues. These are the things we need to keep in mind when designing the systems I work on. Rarely does any LLM think about them. It's not popular engineering by any measure, but it's there.
As of today, I've yet to name a single instance where ChatGPT-produced code actually saved me time. I've seen it recommend macro-generation code for Go (Go doesn't have macros), object mutations for Elixir (Elixir doesn't have objects, only immutable structs), list splicing in Fennel (Fennel doesn't have splicing), a language feature pragma ported from another language, and a pure byte representation of memory in Rust where the code used UTF-8 string parsing to do it. My trust toward any non-ephemeral generated code is sub-zero.
It’s exhausting and annoying. It feels like interacting with Calvin’s (of Calvin and Hobbes) dad but with all the humor taken away.
Such "hallucinations" can also be plausible & useful APIs that oughtta exist – de facto feature requests.
That's right, sometimes it's the children who are wrong.
The idea is correct: a lot of people (including myself sometimes) just let an "agent" run and do some stuff, then check later whether it finished. This is obviously more dangerous than the LLM merely hallucinating functions, since you can at least catch the latter, while the former depends on the project's tests or your skills as a reviewer.
The real problem with hallucination is that we started using LLMs as search engines, so when it invents a function, you have to go and actually search the API on a real search engine.
>The real problem with hallucination is that we started using LLMs as search engines, so when it invents a function, you have to go and actually search the API on a real search engine.
That still seems useful when you don't already know enough to come up with good search terms.
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
Even if one is very good at code review, I'd assume the vast majority of people would still end up with pretty different kinds of bugs they are better at finding while writing vs reviewing. Writing code and having it reviewed by a human gets both classes, whereas reviewing LLM code gets just one half of that. (maybe this can be compensated-ish by LLM code review, maybe not)
And I'd be wary of equating reviewing human code with reviewing LLM code. Sure, the explicit goal of LLMs is to produce human-like text, but they are also prompted to be "correct" rather than "average human", so they shouldn't "intentionally" reproduce human-like bugs from the training data. That leaves model limitations as the main source of bugs, which likely produces a distribution of bug types very different from that of humans.
Timely article. I really, really want AI to be better at writing code, and hundreds of reports suggest it works great if you're a web dev or a python dev. Great! But I'm a C/C++ systems guy(working at a company making money off AI!) and the times I've tried to get AI to write the simplest of test applications against a popular API it mostly failed. The code was incorrect, both using the API incorrectly and writing invalid C++. Attempts to reason with the LLMs(grokv3, deepseek-r1) led further and further away from valid code. Eventually both systems stopped responding.
I've also tried Cursor with similar mixed results.
But I'll say that we are getting tremendous pressure at work to use AI to write code. I've discussed it with fellow engineers and we're of the opinion that the managerial desire is so great that we are better off keeping our heads down and reporting success vs saying the emperor wears no clothes.
It really feels like the billionaire class has fully drunk the kool-aid and needs AI to live up to the hype.
They have also found a way to force every developer and company to get a $20/month subscription forever.
If you want to use LLMs for code, use them.
If you don't, don't.
However, this 'lets move past hallucinations' discourse is just disingenuous.
The OP is conflating hallucinations, which are a fact and an undisputed failure mode of LLMs that no one has any solution for...
...and people not spending enough time and effort learning to use the tools.
I don't like it. It feels bad. It feels like a rage bait piece, cast out of frustration that the OP doesn't have an answer for hallucinations, because there isn't one.
> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.
People aren't stupid.
If they use a tool and it sucks, they'll stop using it and say "this sucks".
If people are saying "this sucks" about AI, it's because the LLM tool they're using sucks, not because they're idiots, or there's a grand 'anti-AI' conspiracy.
People are lazy; if the tool is good (eg. cursor), people will use it.
If they use it, and the first thing it does is hallucinate some BS (eg. intellij full line completion), then you'll get people uninstalling it and leaving reviews like "blah blah hallucination blah blah. This sucks".
Which is literally what is happening. Right. Now.
To be fair 'blah blah hallucinations suck' is a common 'anti-AI' trope that gets rolled out.
...but that's because it is a real problem
Pretending 'hallucinations are fine, people are the problem' is... it's just disingenuous and embarrassing from someone of this caliber.
All the criticism of LLMs' code-writing ability ignores the fact that it's still true that the majority of programmers can't write FizzBuzz: http://imranontech.com/2007/01/24/using-fizzbuzz-to-find-dev... This is the reason why LLMs will have a large impact on software jobs. Those who can write FizzBuzz or more need not be worried, but they are a small minority.
I don't care about the programmers who can't write FizzBuzz. Why should I? If I employed them, they were costing me money. If I worked with them, they were costing me time and hair follicles. I need them about as much as I need a buggy whip.
The linked article makes the claim that the majority of comp sci majors cannot write FizzBuzz. That's a bold assertion; how did the author sample such people? I suspect the sample pool was people applying for a position. There is a major selection bias there. First, people who fail many interviews will do more interviews than those who do not fail, so you'll start with a built-in bias towards the less competent (or more nervous).
Second, there is a large pile of money being given to people who make it over a somewhat arbitrary bar. As a random person, why would I not try to jump over the bar, even if I'm not particularly good at jumping? There are a lot of such bars with a lot of such large piles of money behind them. If getting a chance at jumping over those bars requires me to get a particular piece of paper with a particular title printed at the top of it, I'll be motivated to get that piece of paper too.
> Second, there is a large pile of money being given to people who make it over a somewhat arbitrary bar. As a random person, why would I not try to jump over the bar, even if I'm not particularly good at jumping? There are a lot of such bars with a lot of such large piles of money behind them.
Why don't we see job positions for doctors and lawyers similarly flooded, then?
Because there is a high barrier to entry. In the US, at least, there was also an explicit policy and set of mechanisms to limit supply of doctors: https://www.advisory.com/daily-briefing/2022/02/16/physician...
For lawyers, there is an oversupply of the most lucrative segments, and an undersupply everywhere else: https://www.ajs.org/is-there-a-shortage-of-lawyers/
But in both cases, there just isn't some low bar that you can finagle your way over and get to the promised riches. Lawyers have a literal Bar, and it isn't low. Doctors have a ton of required training. Both have serious certification requirements that computer science professionals do not. Both professions support my point.
Furthermore, incompetent lawyers face real-world tests. If they lose their cases or otherwise screw things up, they are not going to be raking in the money. And people are trying their best to flood the doctor market, by inventing certifications that avoid the requirements to be a physician and setting themselves up as alternative medicine specialists or naturalists or generic "healers" or whatever. (I'm not saying they're all crap, but I am saying that unqualified people are flooding those positions.)
>the majority of programmers can’t write FizzBuzz
How did they get through the Leetcode-style interviews before LLMs and remote interviewing?
I agree with the author. But can't the risk be minimized somehow by asking LLM A to generate code and LLM B to write integration tests?
As a non-programmer, I only get little programs or scripts that do something from the LLM. If they do the thing, it means the code is tested, flawless, and done. I would never let them have to deal with other humans' input, of course.
I am not so sure. Code by one LLM can be reviewed by another. Puppeteer like solutions will exist pretty soon. "Given this change, can you confirm this spec".
Even better, this can carry on for a few iterations. And both LLMs can be:
1. Budgeted ("don't exceed X amount")
2. Improved (another LLM can improve their prompts)
and so on. I think we are fixating on how _we_ do things, not how this new world will do their _own_ thing. That to me is the real danger.
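Roughly the loop I mean, sketched in Python; generate() and review() here are stand-ins for calls to two different models, not a real API:

    from dataclasses import dataclass

    @dataclass
    class Review:
        approved: bool
        comments: str

    def generate(spec: str, feedback: str = "") -> str:
        raise NotImplementedError("call model A here")  # placeholder, not a real API

    def review(spec: str, code: str) -> Review:
        raise NotImplementedError("call model B here")  # placeholder, not a real API

    def build_with_review(spec: str, budget: int = 3) -> str:
        code = generate(spec)
        for _ in range(budget):                 # point 1: budgeted iterations
            verdict = review(spec, code)        # model B checks the draft against the spec
            if verdict.approved:
                break
            code = generate(spec, feedback=verdict.comments)  # revise using the review notes
        return code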
That's ok. Writing such a spec is writing the code, declaratively.
The only difference between that and writing SQL (as opposed to writing imperative code to query the database) is that the translation mechanism is much more sophisticated, much less energy efficient, much slower, and most significantly much more error-prone than a SQL interpreter.
But declarative coding is good! It has its issues, and LLMs in particular compound the problems, but it's a powerful technique when it works.
> Code by one LLM can be reviewed by another
Reviewed against what? Who is writing the specs?
the user who wants it? and a premature retort: if the feedback is "the user / PM / stakeholder could be wrong", then... that's where we are. A "refiner" LLM can be fronted (Replit is playing with this for instance).
To be clear: this is not something I do currently, but my point is that one needs to detach from how _we_ engineers do this for a more accurate evaluation of whether these things truly do not work.
I'm excited to see LLMs get much better at testing. They are already good at writing unit tests (as always, you have to review them carefully). But imagine an LLM that can see your code changes _and_ can generate and execute automated and manual tests based on the change.
Great article, but it doesn't talk about the potentially _most_ dangerous form of mistakes: an adversarial LLM trying to inject vulnerabilities. I expect this to become a real attack vector as soon as people figure out ways to accomplish it.
Software is the manifestation of a solution to a problem.
Any entity, human or otherwise, lacking understanding of the problem being solved will, by definition, produce systems which contain some combination of defects, logic errors, and inapplicable functionality for the problem at hand.
When you go from the adze to the chainsaw, be mindful that you still need to sharpen the chainsaw, top up the chain bar oil, and wear chaps.
Edit: oh and steel capped boots.
Edit 2: and a face shield and ear defenders. I'm all tuckered out like Grover in his own alphabet.
I'm not remotely convinced that LLMs are a chainsaw, unless they've been very thoroughly trained on the problem domain. LLMs are good for vibe coding, and some of them (Grok 3 is actually good at this) can speak passable Latin, but try getting them to compose Sotadean verse in Latin or put a penthemimeral caesura in an iambic trimeter in ancient Greek. They can define a penthemimeral caesura and an iambic trimeter, but they don't understand the concepts and can't apply one to the other. All they can do is spit out the next probable token. Worse, LLMs have lied to me on the definition of Sotadean verse, not even regurgitating what Wikipedia should have taught them.
Image-generating AIs are really good at producing passable human forms, but they'll fail at generating anything realistic for dice, even though dice are just cubes with marks on them. Ask them to illustrate the Platonic solids, which you can find well-illustrated with a Google image search, and you'll get a bunch of lumps, some of which might resemble shapes. They don't understand the concepts: they just work off probability. But, they look fairly good at those probabilities in domains like human forms, because they've been specially trained on them.
LLMs seem amazing in a relatively small number of problem domains over which they've been extensively trained, and they seem amazing because they have been well trained in them. When you ask for something outside those domains, their failure to work from inductions about reality (like "dice are a species of cubes, but differentiated from other cubes by having dots on them") or to be able to apply concepts become patent, and the chainsaw looks a lot like an adze that you spend more time correcting than getting correct results from.
Chainsaws are deterministic. Using LLMs is more akin to trying to do topiary by juggling axes.
If X, AWS, Meta, and Google would just dump their code into a ML training set we could really get on with disrupting things.
I asked o3-mini-high (investor paying for Pro, I personally would not) to critique the Developer UX of D3's "join" concept (how when you select an empty set then when you update you enter/exit lol) and it literally said "I'm sorry. I can't help you with that." The only thing missing was calling me Dave.
> I asked Claude 3.7 Sonnet "extended thinking mode" to review an earlier draft of this post [snip] It was quite helpful, especially in providing tips to make that first draft a little less confrontational!
So he's also using LLMs to steer his writing style towards the lowest common denominator :)
Yep. LLMs can get all the unit tests to pass. But not the acceptance tests. The discouraging thing is you might have all green checks on the unit tests, but you can’t get the acceptance tests to pass without starting over.
> Compare this to hallucinations in regular prose, where you need a critical eye, strong intuitions and well developed fact checking skills to avoid sharing information that’s incorrect and directly harmful to your reputation
Ah so you mean... actually doing work. Yeah writing code has the same difficulty, you know. It's not enough to merely get something to compile and run without errors.
> With code you get a powerful form of fact checking for free. Run the code, see if it works.
No, this would be coding by coincidence. Even the most atrociously bad prose writers don't exactly go around just saying random words from a dictionary or vaguely (mis)quoting Shakespeare hoping to be understood.
Not just that, “it works” is a very, very low bar to have for your code. To illustrate, the other day I tested an LLM by having it create a REST API. I asked for an end point where I could update a particular field of the record (think liking a post).
Then I decided to add on more functionality and asked for the ability to update all the other fields…
As you can guess, it gave me one endpoint per field for that entity. Sure, “it works”…
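To make it concrete, what came back had roughly this shape (a Flask-flavoured sketch from memory, not the literal output):

    from flask import Flask, request

    app = Flask(__name__)
    posts: dict[int, dict] = {1: {"title": "hi", "body": "...", "likes": 0}}

    # One endpoint per field, which is what the model kept producing:
    @app.patch("/posts/<int:post_id>/title")
    def update_title(post_id: int):
        posts[post_id]["title"] = request.json["title"]
        return posts[post_id]

    @app.patch("/posts/<int:post_id>/likes")
    def update_likes(post_id: int):
        posts[post_id]["likes"] = request.json["likes"]
        return posts[post_id]

    # ...and so on for every field, versus the single generic endpoint you'd expect:
    @app.patch("/posts/<int:post_id>")
    def update_post(post_id: int):
        allowed = {"title", "body", "likes"}
        posts[post_id].update({k: v for k, v in request.json.items() if k in allowed})
        return posts[post_id]

Both versions "work", but one of them is an API surface that grows with every field you add.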
There are human developers who do the same thing…
There are humans that do extreme sports just for the thrill. I still don't want my car to have a feature that can get it to throw itself off a cliff.
> Even the most atrociously bad prose writers don't exactly go around just saying random words from a dictionary or vaguely (mis)quoting Shakespeare hoping to be understood.
I actually do this (and I'm not proud of it)
I thought he was going to say the real danger is hallucination of facts, but no.
I'm just here to whine, almost endlessly, that the word "hallucination" is a term of art chosen deliberately because it helps promote a sense that AGI exists, by using language which implies reasoning and consciousness. I personally dislike this. I think we were mistaken in allowing AI proponents to repurpose language that way.
It's not hallucinating Jim, it's statistical coding errors. It's floating point rounding mistakes. It's the wrong cell in the excel table.
“Errors”?
Errors are a category of well understood and explicit failures.
Slop is the best description. LLMs are sloppy tools and some people are not discerning enough to know that blindly running this slop is endangering themselves and others.
> My less cynical side assumes that nobody ever warned them that you have to put a lot of work in to learn how to get good results out of these systems
Why am I reminded of people who say you first have to become a biblical scholar before you can criticize the bible?
> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.
If I have to spend lots of time learning how to use something, fix its errors, review its output, etc., it may just be faster and easier to just write it myself from scratch.
The burden of proof is not on me to justify why I choose not to use something. It's on the vendor to explain why I should turn the software development process into perpetually reviewing a junior engineer's hit-or-miss code.
It is nice that the author uses the word "assume" -- there is mixed data on actual productivity outcomes of LLMs. That is all you are doing -- making assumptions without conclusive data.
This is not nearly as strong an argument as the author thinks it is.
> As a Python and JavaScript programmer my favorite models right now are Claude 3.7 Sonnet with thinking turned on, OpenAI’s o3-mini-high and GPT-4o with Code Interpreter (for Python).
This is similar to Neovim users who talk about "productivity" while ignoring all the time spent tweaking dotfiles that could be spent doing your actual job. Every second I spend toying with models is me doing something that does not directly accomplish my goals.
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
You have no idea how much code I read, so how can you make such claims? Anyone who reads plenty of code knows that it often feels like reading other people's code is often harder than just writing it yourself.
The level of hostility towards just sitting down and thinking through something without having an LLM insert text into your editor is unwarranted and unreasonable. A better policy is: if you like using coding assistants, great. If you don't and you still get plenty of work done, great.
Also, the thing that people miss is compounded experience. Just starting out with any language, you have to read a lot of documentation, books, and articles. After a year or so, you have enough skeleton projects, code samples, and knowledge that you could build a mini framework if the projects were repetitive. Even then, you could just copy-paste features you've already implemented, like that test harness or the RabbitMQ integration, and be very productive that way.
Code testing is “human in the loop” for LLM generated code.
The worst for me so far has been the following:
1. I know that a problem requires a small amount of code, but I also know it's difficult to write (as I am not an expert in this particular subfield) and it will take me a long time, like maybe a day. Maybe it's not worth doing at all, as the effort is not worth the result.
2. So why not ask the LLM, right?
3. It gives me some code that doesn't do exactly what is needed, and I still don't understand the specifics, but now I have a false hope that it will work out relatively easily.
4. I spend a day until I finally manage to make it work the way it's supposed to work. Now I am also an expert in the subfield and I understand all the specifics.
5. After all I was correct in my initial assessment of the problem, the LLM didn't really help at all. I could have taken the initial version from Stack Overflow and it would have been the same experience and would have taken the same amount of time. I still wasted a whole day on a feature of questionable value.
Personally, I believe the worst thing about LLMs is their abysmal ability to architect code. It's why I use LLMs more like a Google than a so-called coding buddy: there were so many times I had to rewrite the entire file because the LLM had added so many extra unmanageable functions, even deciding to solve problems I hadn't asked it to.
Increasingly I see apologists for LLMs sounding like people justifying fortune tellers and astrologists. The confidence games are in force, where the trick involves surreptitiously eliciting all the information the con artist needs from the mark, then playing it back to them as if it involves some deep and subtle insights.
> I’ll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
Not only is this a massive bundle of assumptions but it's also just wrong on multiple angles. Maybe if you're only doing basic CRUDware you can spend five seconds and give a thumbs up but in any complex system you should be spending time deeply reading code. Which is naturally going to take longer than using what knowledge you already have to throw out a solution.
I don’t really understand what the point or tone of this article is.
It says that hallucinations are not a big deal, that there are great dangers that are hard to spot in LLM-generated code… and then presents tips on fixing hallucinations with a general theme of positivity towards using LLMs to generate code, with no more time dedicated to the other dangers.
It sure gives the impression that the article itself was written by an LLM and barely edited by a human.
Just ask another LLM to proof read?
Do you realize that giving LLMs 'instructions' is merely trying to blindly twist knobs by random amounts?
> The real risk from using LLMs for code is that they’ll make mistakes that aren’t instantly caught by the language compiler or interpreter. And these happen all the time!
Are these not considered hallucinations still?
Humans can hallucinate up some API they want to call in the same way that LLMs can, but you don't call all human mistakes hallucinations; classifying everything LLMs do wrong as hallucinations would seem rather pointless to me.
Analogizing this to human hallucination is silly. In the instance you're talking about, the human isn't hallucinating, they're lying.
I definitely wouldn't say I'm lying (...to... myself? what? or perhaps to others, for a quick untested response in a chatroom or something) whenever I write some code and it turns out that I misremembered the name of an API. "Hallucination" for that might be over-dramatic, but at least it's a somewhat sensible description.
Maybe we should stop referring to undesired output (confabulation? Bullshit? Making stuff up? Creativity?) as some kind of input delusion. Hallucination is already a meaningful word and this is just gibberish in that context.
As best I can tell, the only reason this term stuck is because early image generation looked super trippy.
I think of hallucinations as instances where an LLM invents something that is entirely untrue - like a class or method that doesn't exist, or a fact about the world that's simply not true.
I guess you could call bugs in LLM code "hallucinations", but they feel like a slightly different thing to me.
That's a great distinction actually. Thanks
I don't think it's necessarily a hallucination if models accurately reproduce the code quality of their training data.
I find it a bit surprising that I'm being called an "LLM fanboy" for writing an article with the title "Hallucinations in code are the least dangerous form of LLM mistakes" where the bulk of the article is about how you can't trust LLMs not to make far more serious and hard-to-spot logic errors.
What do you mean by "harder stuff"? What about an experimental DSL written in C with a recursive descent parser and a web server runtime that includes Lua, jq, a Postgres connection pool, mustache templates, request-based memory arena, database migrations and much more? 11,000+ lines of code with ~90% written by Claude in Cursor Composer.
https://github.com/williamcotton/webdsl
Frankly us "fanbois" are just a little sick and tired of being told that we must be terrible developers working on simple toys if we find any value from these tools!
I'm a strong believer that LLMs are tools and when wielded by talented and experienced developers they are somewhere in the danger category of Stack Overflow and transitive dependencies. This is not a critique of your project, or really the quality of LLMs, but when I see 90% of a 11,000+ loc project written in Claude, it just feels sort of depressing in a way I haven't processed yet.
I love foss, I love browsing projects of all quality levels and vintages and seeing how things were built. I love learning new patterns and sometimes even bickering over their strengths and weaknesses. An LLM generated code base hardly makes me even want to engage with it...
Perhaps these feelings are somewhat analogous to hardcopies vs ebooks? My opinions have changed over time and I read and collect both. Have you had similar thoughts and gotten over them? Do you see tools like Claude in a way where this isn't an issue?
You're romanticizing software. To place more value in the code than the outcome. There's nothing wrong with that, but most people that use software don't think about it that way.
I mean, when I'm working on something that I don't expect to be more than a throw-away experiment I'm not too worried about the code itself.
The grammar itself still seems a bit clunky and the next time I head down this path I imagine I'll go with a more hand-crafted approach.
I learned a lot about integrating Lua and jq into a project along the way (and how to make it performant), something I had no prior experience with.
Some free code review from the first file I clicked into: https://github.com/williamcotton/webdsl/blob/92762fb724a9035... - this spot, among other places, should probably be doing the conditional "lexer->line++;" thing. It's quite a weird decision to force all code paths to do that manually whenever a newline char is encountered. Could've at least made an "advance_maybe_newline(lexer);" helper or so. But I guess LLMs give you copy-paste garbage.
Even the article of this thread says:
> Just because code looks good and runs without errors doesn’t mean it’s actually doing the right thing.
Thanks for taking a look! The lexer and parser is probably close to 100% Claude and I definitely didn't review it completely. I spent most of the time trying out different grammars (normally something you want to do before you start writing code) and runtime features! "Build the web server runtime and framework into the language" was an idea kicking around in my head for a few years but until Cursor I didn't have the energy to play around with the idea.
Okay so this is a personal opinion right? Like where is the objectivity in your review?
What are the hardline performance characteristics being violated? Or functional incorrectness. Is this just "it's against my sensibilities" because at the end of the day frankly no one agrees on how to develop anything.
The thing I see a lot of developers struggle with is just because it doesn't fit your mental model doesn't make it objectively bad.
So unless it's objectively wrong or worse in a measurable characteristic I don't know that it matters.
For the record I'm not asserting it is right, I'm just saying I've seen a lot of critiques of LLM code boil down to "it's not how I'd write it" and I wager that holds for every developer you'll ever interact with.
OP didn't put much effort into writing the code so I'm certainly not putting in much effort into a proper review of it, for no benefit to me no less. I just wanted to see what quality AI gets you, and made a comment about it.
I'm pretty sure the code not having the "if (…) lexer->line++" in places is just a plain simple repeated bug that'll result in wrong line numbers for certain inputs.
And human-wise I'd say the simple way to not have made that bug would've been to make/change abstractions upon the second or so time writing "if (…) lexer->line++" such that it takes effort to do it incorrectly, whereas the linked code allows getting it wrong by default with no indication that there's a thing to be gotten wrong. Point being that bad abstractions are not just a maintenance nightmare, but also makes doing code review (which is extra important with LLM code) harder.
I agree, it seems a lot of the complaints boil down to academic reasons.
Fine it's not the best and perhaps may run into some longer term issues but most importantly it works at this point in time.
A snobby/academic equivalent would be someone using an obscure language such as COBOL.
The world continues to turn.
I’m always really sceptical of any “proof by example” that is essentially anecdotal.
If this is going to be your argument, you need a solid scientific approach. A study where N developers are given access to a tool vs N that are not, controls are in place etc.
Because the overwhelming majority of coders I speak to are saying exactly the same thing, which is LLMs are a small productivity boost. And the majority of cursor users, which is admittedly a much smaller number, are saying it just gets stuck playing whack a mole. And common sense says these are the expected outcomes, so we are going to need really rigorous work to convince people that LLMs can build 90% of most deeply technical projects. Exceptional results require exceptional evidence.
And when we do see anecdotal incidents that seem so divergent from the norm, well that then makes you wonder how that can be, is this really objective or are we in some kind of ideological debate?
Protip: when you block a user on GitHub, it lets you add a note as to why, which will show when you visit their profile. It will also alert you when you're looking at a repository to which that user has contributed.
Honest question: this looks like a library others can use to build websites. It contains features related to authentication and security. If it's 90% LLM generated, how do you sleep at night? I'd be dead scared someone would use this, hit a bug that leaks PII (or worse) and then sue me into oblivion.
"WebDSL is an experimental domain-specific language and server implementation for building web applications."
And it's MIT:
..."request-based memory arena"...
There are some very questionable things going on with the memory handling in this code. Just saying.
Request-based memory arenas are pretty standard for web servers!
Maybe so; after all, I don't write web servers (btw, the PQ and JQ libraries don't seem to use the arena allocator, which makes the whole proposition a bit dubious, but let's say that's me being picky).
What I meant was that, IMO, the code is not very robust when dealing with memory allocation:
1. The "string builder" for example silently ignores allocation failures and just happily returns - https://github.com/williamcotton/webdsl/blob/92762fb724a9035...
2. In what seems like most places, the code simply doesn't check for allocation failures, which leads to overruns (just a couple of examples):
https://github.com/williamcotton/webdsl/blob/92762fb724a9035...
https://github.com/williamcotton/webdsl/blob/92762fb724a9035...
Thanks for digging in. Yup, those two libs don’t support custom allocators. I raised an issue in the jq repo to ask if they thought about adding it.
Great points about happy path allocations. If I ever touch the project again I’ll check each location.
Note to self: free code reviews of projects if you mention LLMs!
"People took a cursory look at a codebase I published and found glaring mistakes they discussed publicly as examples of how bad it is" is not the flex you think it is.