I feel like we get one of these articles that addresses valid AI criticisms with...

mountainriver · 2025-06-02T21:33:43 1748900023

I feel the opposite, and pretty much every metric we have shows basically linear improvement of these models over time.

The criticisms I hear are almost always gotchas, and when confronted with the benchmarks they either don’t actually know how they are built or don’t want to contribute to them. They just want to complain or seem like a contrarian from what I can tell.

Are LLMs perfect? Absolutely not. Do we have metrics to tell us how good they are? Yes

I’ve found very few critics that actually understand ML on a deep level. For instance Gary Marcus didn’t know what a test train split was. Unfortunately, rage bait like this makes money

Night_Thastus · 2025-06-02T22:05:31 1748901931

Models are absolutely not improving linearly. They improve logarithmically with size, and we've already just about hit the limits of compute without becoming totally unreasonable from a space/money/power/etc standpoint.

We can use little tricks here and there to try to make them better, but fundamentally they're about as good as they're ever going to get. And none of their shortcomings are growing pains - they're fundamental to the way an LLM operates.

mountainriver · 2025-06-02T23:17:58 1748906278

Most of the benchmarks are in fact improving linearly, we often don't even know the size. You can find this out but just looking at the scores over time.

And yes, it often is small things that make models better. It always has been, bit by slow they get more powerful, this has been happening since the dawn of machine learning

_dain_ · 2025-06-02T23:04:28 1748905468

remember in 2022 when we "hit a wall"? everyone said that back then. turned out we didn't.

and in 2023 and 2024 and january 2025 and ...

all those "walls" collapsed like paper. they were phantoms; ppl literally thinking the gaps between releases were permanent flatlines.

money obviously isn't an issue here, VCs are pouring in billions upon billions. they're building whole new data centres and whole fucking power plants for these things; electricity and compute aren't limits. neither is data, since increasingly the models get better through self-play.

>fundamentally they're about as good as they're ever going to get

one trillion percent cope and denial

jhonof · 2025-06-03T02:13:32 1748916812

The difference in quality between model versions has slowed down imo, I know the benchmarks don't say that but as a person who uses LLMs everyday, the difference between Claude 3.5 and the cutting edge today is not very large at all, and that model came out a year ago. The jumps are getting smaller I think, unless the stuff in house is just way ahead of what is public at the moment.

yahoozoo · 2025-06-03T11:10:23 1748949023

Yet we are still at the “treat it like a junior” level

nickpsecurity · 2025-06-03T03:06:48 1748920008

"pretty much every metric we have shows basically linear improvement of these models over time."

They're also trained on random data scraped off the Internet which might include benchmarks, code that looks like them, and AI articles with things like chain of thought. There's been some effort to filter obvious benchmarks but is that enough? I cant know if the AI's are getting smarter on their own or more cheat sheets are in the training data.

Just brainstorming, one thing I came up with is training them on datasets from before the benchmarks or much AI-generated material existed. Keep testing algorithmic improvements on that in addition to models trained on up to date data. That might be a more accurate assessment.

mountainriver · 2025-06-03T17:06:30 1748970390

thats not a bad idea, very expensive though, and you end up with a pretty useless model in most regards.

A lot of the trusted benchmarks today are somewhat dynamic or have a hidden set.

nickpsecurity · 2025-06-03T23:59:14 1748995154

That could happen. One would need to risk it to take the approach. However, if it was trained on legal data, then there might be a market for it among those not risking copyright infringement. Think FairlyTrained.org.

"somewhat dynamic or have a hidden set"

Are there example inputs and outputs for the dynamic ones online? And are the hidden sets online? (I haven't looked at benchmark internals in a while.)

attemptone · 2025-06-02T22:05:11 1748901911

>I feel the opposite, and pretty much every metric we have shows basically linear improvement of these models over time.

Wait, what kind of metric are you talking about? When I did my masters in 2023 SOTA models where trying to push the boundaries by minuscule amounts. And sometimes blatantly changing the way they measure "success" to beat the previous SOTA

mountainriver · 2025-06-02T23:18:51 1748906331

Almost every single major benchmark, and yes progress is incremental but it adds up, this has always been the case

attemptone · 2025-06-03T06:35:10 1748932510

We were talking about linear improvements and I have yet to see it

mountainriver · 2025-06-03T17:04:48 1748970288

check the benchmarks or make one of your own

attemptone · 2025-06-03T20:20:04 1748982004

I checked the BlEU-Score and Perplexity of popular models and both have stagnated around 2021. As a disclaimer this was a cursory check and I didn't dive into the details of how individuals scores were evaluated.

mountainriver · 2025-06-04T16:37:03 1749055023

on what benchmarks? pretty much every major one is linear improvement

mrkurt · 2025-06-02T21:19:39 1748899179

PLEASE write your response. We'll publish it on the Fly.io blog. Unedited. If you want.

ofjcihen · 2025-06-02T21:28:28 1748899708

I’m uninterested in giving you content. In particular because of your past behavior.

Thanks for the offer though.

throwaway314155 · 2025-06-02T21:57:08 1748901428

> past behavior

Do go on.

tptacek · 2025-06-02T21:33:38 1748900018

Kurt, how dare you.

ofjcihen · 2025-06-02T22:25:32 1748903132

You wouldn’t happen to work for fly.io as well, would you?

Edit: Nm, thought I remembered your UN and see on your profile that you do.

grzm · 2025-06-02T22:25:54 1748903154

Yes. And the author of the submission.

Mofpofjis · 2025-06-02T23:50:31 1748908231

OMG I didn't notice. Way to burn down a huge amount of respect stemming from past cryptography work in just one blog post.

dogecoinbase · 2025-06-03T03:17:46 1748920666

it's irksome to see tptacek having his patio11 moment for sure.

tptacek · 2025-06-03T04:03:46 1748923426

What does this even mean? (I should be so lucky.)

kubb · 2025-06-02T21:26:59 1748899619

Maybe make a video of how you're vibecoding a valuable project in an existing codebase, and how agents are saving you time by running your tools in a loop.

metaltyphoon · 2025-06-02T21:52:19 1748901139

Seriously… thats the one thing I never see being posted? Is it because Agent mode will take 30-40 minutes to just bookstrap a project and create some file?

sensanaty · 2025-06-03T00:53:49 1748912029

Well no, the reality of this workflow is the farcica, abject failures unleashed on the Dotnet codebase a week ago.

https://news.ycombinator.com/item?id=44050152

csallen · 2025-06-02T22:12:29 1748902349

It takes like 2-6 minutes to do that, depending on the scope of the project

andrepd · 2025-06-02T22:14:48 1748902488

So they can cherry pick the 1 out of 10 times that it actually performs in an impressive manner? That's the essence of most AI demos/"benchmarks" I've seen.

Testing for myself has always yielded unimpressive results. Maybe I'm just unlucky?

kubb · 2025-06-03T06:18:01 1748931481

Livestream would be fair.

briandrupieski · 2025-06-02T21:26:08 1748899568

> with poor arguments every week

This roughly matches my experience too, but I don't think it applies to this one. It has a few novel things that were new ideas to me and I'm glad I read it.

> I’m ready to write a boilerplate response because I already know what they’re going to say

If you have one that addresses what this one talks about I'd be interested in reading it.

slg · 2025-06-02T22:01:54 1748901714

>> with poor arguments every week

>This roughly matches my experience too, but I don't think it applies to this one.

I'm not so sure. The argument that any good programming language would inherently eliminate the concern for hallucinations seems like a pretty weak argument to me.

ofjcihen · 2025-06-02T22:08:36 1748902116

It’s a confusing one for sure.

To be honest I’m not sure where the logic for that claim comes from. Maybe an abundance of documentation is the assumption?

Either way, being dismissive of one of LLMs major flaws and blaming it on the language doesn’t seem like the way to make that argument.

simonw · 2025-06-02T22:10:06 1748902206

Why does that seem weak to you?

It seems obviously true to me: code hallucinations are where the LLM outputs code with incorrect details - syntax errors, incorrect class methods, invalid imports etc.

If you have a strong linter in a loop those mistakes can be automatically detected and passed back into the LLM to get fixed.

Surely that's a solution to hallucinations?

It won't catch other types of logic error, but I would classify those as bugs, not hallucinations.

slg · 2025-06-02T22:23:21 1748903001

>It won't catch other types of logic error, but I would classify those as bugs, not hallucinations.

Let's go a step further, the LLM can produce bug free code too if we just call the bugs "glitches".

You are making a purely arbitrary decision on how to classify an LLM's mistakes based on how easy it is to catch them, regardless of their severity or cause. But simply categorizing the mistakes in a different bucket doesn't make them any less of a problem.

layer8 · 2025-06-02T22:52:14 1748904734

I don’t see why an LLM wouldn’t hallucinate project requirements or semantic interface contracts. The only way you could escape that is by full-blown formal verification and specification.

ofjcihen · 2025-06-02T22:16:51 1748902611

A good example of where a linter wouldn’t work is when the LLM has you import a package that doesn’t exist.

kubb · 2025-06-02T21:31:18 1748899878

There's also the reverse genre: valid criticism of absolutely strawman arguments that nobody makes.

tptacek · 2025-06-02T21:34:16 1748900056

Which of the arguments in this post hasn't occurred on HN in the past month or so?

sethev · 2025-06-02T23:56:15 1748908575

Or in this very thread, for that matter.

kubb · 2025-06-03T11:00:18 1748948418

"I tried copilot 2 years ago and I didn’t like it."

Great article BTW, it’s amazing that you’re now blaming developers smarter than you for lack of LLM adoption, as if it weren’t enough for the technology to be useful to become widespread.

Try to deal with „an agent takes 3 minutes to make a small transformation to my codebase and it takes me another 5 to figure out why it changed what it did only to realize that it was the wrong approach and redo it by hand, which took another 7 minutes” in your next one.

csallen · 2025-06-02T21:22:49 1748899369

Can you direct me somewhere with superior counterarguments? I'm quite curious

calf · 2025-06-02T22:09:33 1748902173

What valid AI criticisms? Most criticisms of AI are not very deep nor founded in complexity theoretic arguments, whereas Yann LeCun himself gave an excellent 1 slide explanation of the limits of LLMs. Most AI criticisms are low quality arguments.

therealpygon · 2025-06-02T22:59:19 1748905159

“Valid” criticism rarely come from the people barely capable of understanding the difference between AI and LLMs, and using them interchangeably.