> But now with reasoning systems and verifiers, we can create brand new legitimate data to train on. This can either be done offline where the developer pays to create the data or at inference time where the end user pays!
> This is a fascinating shift in economics and suggests there could be a runaway power concentrating moment for AI system developers who have the largest number of paying customers. Those customers are footing the bill to create new high quality data … which improves the model … which becomes better and more preferred by users … you get the idea.
While I think this is an interesting hypothesis, I'm skeptical. You might be lowering the cost of your training corpus by a few million dollars, but I highly doubt you are getting novel, high quality data.
We are currently in a world where the SOTA base model seems to be capped at around GPT-4o levels. I have no doubt that in 2-3 years our base models will compete with o1 or even o3... it just remains to be seen what innovations/optimizations get us there.
The most promising idea is to use reasoning models to generate data, and then train our non-reasoning models with the reasoning-embedded data. But... it remains to be seen how much of the chain-of-thought reasoning you can really capture in model weights. I'm guessing some, but I wonder if there is a cap to the multi-head attention architecture. If reasoning can be transferred from reasoning models to base models, OpenAI should have already trained a new model with o3 training data, right?
Another thought is maybe we don't need to improve our base models much. It's sufficient to have them be generalists, and to improve reasoning models (lowering price, improving quality) going forward.
Every time you respond to an AI model with "no, you got that wrong, do it this way," you provide a very valuable piece of data to train on. With reasoning tokens there is just a lot more of that data to train on now.
You don't need honest user feedback, because you can judge any message in a conversation with hindsight.
Just ask an LLM to judge whether a response was useful while also showing it the messages that came after it. The judge model has privileged information: maybe five messages later it turns out that what the LLM replied was not a good idea.
You can also use related conversations by the same user. The idea is to extend the context so you can judge better. Sometimes the user tests the LLM's ideas in the real world and comes back with feedback; that is real-world testing, something R1 can't do.
Tesla uses the same method to flag the seconds before a surprising event; it works because it has hindsight. It uses the environment to learn what was important.
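A rough sketch of what that hindsight-judging loop could look like. The prompt wording, the 0-1 scoring scale, and the `call_judge_llm` helper are all hypothetical stand-ins, not anyone's actual pipeline:

```python
# Sketch: score an assistant reply using the messages that came AFTER it
# (hindsight), rather than relying on explicit user feedback.
# `call_judge_llm` is a hypothetical stand-in for whatever LLM client you use.

def call_judge_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def hindsight_score(conversation: list[dict], reply_index: int, lookahead: int = 5) -> float:
    """Ask a judge model whether conversation[reply_index] held up,
    given the next `lookahead` messages as privileged information."""
    reply = conversation[reply_index]["content"]
    followup = conversation[reply_index + 1 : reply_index + 1 + lookahead]
    followup_text = "\n".join(f'{m["role"]}: {m["content"]}' for m in followup)

    prompt = (
        "An assistant gave this reply:\n"
        f"{reply}\n\n"
        "Here is what happened in the conversation afterwards:\n"
        f"{followup_text}\n\n"
        "With the benefit of hindsight, was the reply useful and correct? "
        "Answer with a single number from 0 (bad) to 1 (good)."
    )
    try:
        return float(call_judge_llm(prompt).strip())
    except ValueError:
        return 0.5  # unparseable judgement: treat as uninformative
```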
This assumes that the companies gathering the data don’t have silent ways of detecting bad actors and discarding their responses. If you’re trying to poison an AI, are you making all of your queries from the same IP? Via a VPN whose IP block is known? Are you using a tool to generate this bad data, one which might have word-frequency patterns detectable with something cheap like tf-idf?
There’s a lot of incentive to figure this out. And they have so much data coming in that they can likely afford to toss out some good data to ensure that they’re tossing out all of the bad.
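For what it's worth, the cheap kind of screen being gestured at here is only a few lines with scikit-learn; this is illustrative, not a claim about what any lab actually runs:

```python
# Sketch of a cheap word-frequency screen: vectorize feedback messages with
# TF-IDF and flag the ones that look nothing like the bulk of the corpus.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def flag_outliers(messages: list[str], threshold: float = 0.05) -> list[int]:
    """Return indices of messages whose TF-IDF vector is nearly orthogonal
    to the corpus centroid, i.e. candidates for closer inspection."""
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(messages)
    centroid = np.asarray(X.mean(axis=0))
    sims = cosine_similarity(X, centroid).ravel()
    return [i for i, s in enumerate(sims) if s < threshold]
```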
Not necessarily, not all tactics can be used symmetrically like that. Many of the sites they scrape feel the need to support search engine crawlers and RSS crawlers, but OpenAI feels no such need to grant automated anonymous access to ChatGPT users.
And at the end of the day, they can always look at the responses coming in and make decisions like “95% of users said these responses were wrong, 5% said they were right, let’s go with the 95%”. As long as the vast majority of their data is good (and it will be) they have a lot of statistical tools they can use to weed out the poison.
If you want to pick apart my hastily concocted examples, well, have fun I guess. My overall point is that ensuring data quality is something OpenAI is probably very good at. They likely have many clever techniques, some of which we could guess at, some of which would surprise us, all of which they’ve validated through extensive testing including with adversarial data.
If people want to keep playing pretend that their data poisoning efforts are causing real pain to OpenAI, they’re free to do so. I suppose it makes people feel good, and no one’s getting hurt here.
I'm interested in why you think OpenAI is probably very good at ensuring data quality. Also interested if you are trying to troll the resistance into revealing their working techniques.
What makes people think companies like OpenAI can't just pay experts for verified true data? Why do all these "gotcha" replies always revolve around the idea that everyone developing AI models is credulous and stupid?
It is absolutely fascinating to read the fantasy produced by people who (apparently) think they live in a sci-fi movie.
The companies whose datasets you're "poisoning" absolutely know about the attempts to poison data.
All the ideas I've seen linked on this site so far about how they're going to totally defeat the AI companies' models sound like a mixture of wishful thinking and narcissism.
Are you suggesting some kind of invulnerability? People iterate on their techniques; if big tech were so capable of avoiding poisoning/gaming attempts, there would be no decades-long tug-of-war between Google and black-hat SEO manipulators.
Also, I don't get the narcissism part. Would it be petty to poison a website only when it's viewed by a spider? Yes, but I would also be that petty if some big company didn't respect the boundaries I set with my robots.txt on my 1-viewer cat photo blog.
It's not complete invulnerability. It is merely accepting that these methods might increase costs a little, but they don't cause the whole thing to explode.
The idea that a couple of bad-faith actions can destroy a 100-billion-dollar company is the extraordinary claim that requires extraordinary evidence.
Sure, bad actors can do a little damage. Just like bad actors can do DDoS attempts against Google. And that will cause a little damage. But mostly Google wins. Same thing applies to these AI companies.
> Also I don't get the narcissism part
The narcissism is the idea that your tiny website is going to destroy a 100 billion dollar company. It won't. They'll figure it out.
The grandparent mentioned "we"; I guess they're referring to a whole class of "black hats" resisting bad-faith scraping, who could eventually amass a relatively effective volume of poisoned sites and/or feedback to the model.
Obviously a singular poisoned site will never make a difference in a dataset of billions and billions of tokens, much less destroy a 100bn company. That's a straw man, and I think people arguing about poisoning acknowledge that perfectly. But I'd argue they can eventually manage to at least do some little damage mostly for the lulz, while avoiding scraping.
Google is full of SEO manipulators, and even when they recognize the problem and try to fix it, searching today is a mess because of that. The main difference, and the challenge in poisoning LLMs, would be coordination between different actors, as there is no direct aligning incentive to poison except (arguably) globally justified pettiness, unlike black-hat SEO players who have the incentive to be the first result for a certain query.
As LLMs become commonplace eventually new incentives may appear (i.e. an LLM showing a brand before others), and then, it could become a much bigger problem akin to Google's.
tl;dr: I wouldn't be so dismissive of what adversaries can manage to do with enough motivation.
There are ways to analyze whether your contributions make sense in the context of the conversation; reasoning models detect nonsense pretty quickly. To attack, you would actually have to use another AI to generate something that isn't totally random, and even then it could be detected.
I would assume that to use this data they have to filter it heavily and correlate it across many users.
You can detect whether a user is genuine and trust their other chats "a bit more".
Probably it's something like "give feedback that's on average slightly more correct than incorrect," though you'd get more signal from perfect feedback.
That said, I suspect the signal is very weak even today and probably not too useful except for learning about human stylistic preferences.
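A toy illustration of why "slightly more correct than incorrect" can still be enough in aggregate: with per-label accuracy just above chance, a majority vote over enough independent labels is almost always right. The numbers here are purely illustrative.

```python
# Toy simulation: feedback that is right only 55% of the time still yields
# a reliable majority label once you aggregate enough independent votes.
import random

def majority_is_correct(p_correct: float, n_votes: int) -> bool:
    votes = sum(random.random() < p_correct for _ in range(n_votes))
    return votes > n_votes / 2

def estimate(p_correct: float, n_votes: int, trials: int = 10_000) -> float:
    return sum(majority_is_correct(p_correct, n_votes) for _ in range(trials)) / trials

if __name__ == "__main__":
    for n in (1, 11, 101, 1001):
        print(n, round(estimate(0.55, n), 3))
    # 1 vote is right ~55% of the time; 1001 votes, roughly 99.9%.
```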
AI models don't assume anything. AI models are just statistical tools. Their data is prepared by humans, who aren't morons. What is it with these super-ignorant AI critiques popping up everywhere?
There’s so much data required for training that it’d be surprising if humans looked at even a small subset of it at all. They need statistical tools of a different kind to clean it up. That’s naturally where attacks will be concentrated, and this is why synthetic data will overtake real human data, right after "there isn’t enough data even if it’s too much already".
I'm not in the space either but I think the answer is an emphatic yes. Three categories come to mind:
1. Online trolls and pranksters (who already taught several different AIs to be racist in a matter of hours - just for the LOLs).
2. Nation states like China who already require models to conform to state narratives.
3. More broadly, when training on "the internet" as a whole there is a huge amount of wrong, confused information mixed in.
There's also a meta-point to make here. On a lot of culture war topics, one person's "poisonous information" is another person's "reasonable conclusion."
I'm looking forward to protoscience/unconventional science, and perhaps even to what is worthy of the fringe or pseudoscience labels. The debunking there usually fails to address the topic, as it is incredibly hard to spend even a single day reading about something you "know" to be nonsense. Who has time for that?
If you take a hundred thousand such topics, the odds that they should all be dismissed without looking aren't very good.
Apparently, you haven't been on that Internet thingie in the last five years or so... :-)
But I do agree with your point. What's interesting is the increasing number of people who act like there's some clearly objective and knowable truth about a much larger percentage of topics than there actually is. Outside of mathematics, logic, physics and other hard sciences, the range of topics on which informed, reasonable people can disagree, at least on certain significant aspects, is vast.
That's why even the concept of having some army of "Fact Checkers" always struck me as bizarre and doomed at best, and at worst, a transparent attempt to censor and control public discourse. That more people didn't see even the idea of it as being obviously brittle is concerning.
Bad or not, depends on your POV. But certainly there are efforts to feed junk to AI web scrapers, including specialized tools: https://zadzmo.org/code/nepenthes/
And they are hilarious, because they ride on the assumption that multi-billion dollar companies are all just employing naive imbeciles who just push buttons and watch the lights on the server racks go, never checking the datasets.
I deliberately pick wrong answers in reCAPTCHA sometimes. I’ve found out that the audio version accepts basically any string slightly resembling the audio, so that’s the easiest way. (Images on the other hand punish you pretty hard at times – even if you solve it correctly!)
> Aaron clearly warns users that Nepenthes is aggressive malware. It's not to be deployed by site owners uncomfortable with trapping AI crawlers and sending them down an "infinite maze" of static files with no exit links, where they "get stuck" and "thrash around" for months, he tells users.
Because a website with lots of links is executable code. And the scrapers totally don't have any checks in them to see if they spent too much time on a single domain. And no data verification ever occurs.
Hell, why not go all the way? Just put a big warning telling everyone: "Warning, this is a cyber-nuclear weapon! Do not deploy unless you're a super rad bad dude who totally traps the evil AI robot and wins the day!"
not being snarky, but what is the point of using the model if you already know enough to correct it into giving the right answer?
an example that just occurred to me - if you asked it to generate an image of a mushroom that is safe to eat in your area, how would you tell it it was wrong? "oh, they never got back to me, I'll generate this image for others as well!"
A common use of these models is asking for code, and maybe you don't know the answer or would take a while to figure it out. For example, here's some html, make it blue and centered. You could give the model feedback on if its answer worked or not, without knowing the correct answer yourself ahead of time.
You constantly have to correct an AI when using it, because it either didn't get the question right or you're guiding it towards a narrower answer. There is only more to learn.
I feel like conventional image search would be more reliable to get a good picture of a mushroom variety that you know about. Ideally going out into the woods to get one I suppose.
If I say "no, you hallucinated basically the entire content of the response", then maybe a newer training set derived from that could train on the specific fact that that specific hallucinated response is hallucinated. This seems to be of dubious value in a training set.
> No, you're wrong, today's date is actually Wednesday the 29th.
>> My mistake. Yes, today's date is Wednesday, January 29th, 2025.
Three months later in April when this tagged data is used to train the next iteration, the AI can successfully learn that today's date is actually January 29th.
But that's exactly what you get when you ask questions that require shifting, specific contextual knowledge. The model weights, by their nature, cannot encode that information.
At best, you can only try to layer in contextual info like this as metadata during inference, akin to how other prompting layers exist.
Even then, what up-to-date information should present for every round-trip is a matter of opinion and use-case.
> Three months later in April when this tagged data is used to train the next iteration, the AI can successfully learn that today's date is actually January 29th.
Such an ingenious attack, surely none of these companies ever considered it.
the date is in the "system prompt", so the cron job that updates the prompts to the current date may be in a different time zone than you.
They're not actually processing the entire system prompt (which is rather long) on every query, but continuing from a model state saved after processing the system prompt once.
That makes it a bit harder, but still, spitting out the wrong date just seems like a plain old time-zone bug.
ChatGPT came out and its interface was a chatbox and a thumbs up / thumbs down icon (or whichever) to rate the responses; surely that created a feedback loop of learning, like all machine learning has done for years now?
That is still inference: it is using a model produced by the RL process. The RL process is what used the cost function to further update the model, and you can think of RL as a revision, but it still happens offline. Any online/continual learning would have to be performed by a different algorithm than a classical LLM or RL setup; online/continual learning is still a very difficult problem in ML.
Our collective Netflix thumbs-up indicators gave investors and Netflix the confidence to deploy a series of Adam Sandler movies that cost 60 to 80 million US dollars to "make". So depending on who you are, the system might be working great.
Through analytics Netflix should know exactly when people stop watching a series, or even when in a movie they exit out. They no doubt know this by user.
They know exactly what makes you stay, and what makes you leave.
I would not be surprised if, in the near future, movies and series are modified on the fly to ensure users stay glued to their screens.
In the distant future this might be done on a per user level.
> I wonder if there is a cap to multi head attention architecture
I don't think there is a cap other than having good data. The model learns all the languages in the world; it has the capacity. A simple model like AlphaZero beats humans at board games. As long as you have data, the model is not the obstacle. AlphaProof, an LLM-based system, reached silver-medal level at the IMO.
You're not getting new high-quality textual data for pre-training from your chat service. But you are potentially getting a lot of RL feedback on ambiguous problems.
I think we will have to move with pre-training and post-training efforts in parallel. What DeepSeek showed is that you first need to have a strong enough pretrained model. For that, we have to continue the acquisition of high quality, multilingual datasets. Then, when we have a stronger pretrained model, we can apply pure RL to get a reasoning model that we use only to generate synthetic reasoning data. We then use those synthetic reasoning data to fine-tune the original pretrained model and make it even stronger. https://transitions.substack.com/p/the-laymans-introduction-...
> I highly doubt you are getting novel, high quality data.
Why wouldn't you? Presumably the end user would try their use case on the existing model, and if it performs well, wouldn't bother with the expense of setting up an RL environment specific to their task.
If it doesn't perform well, they do bother, and they have all the incentive in the world to get the verifier right -- which is not an extraordinarily sophisticated task if you're only using rules-based outcome rewards (as R1 and R1-Zero do)
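For concreteness, a rules-based outcome reward can be tiny. The sketch below is loosely in the spirit of what the R1 paper describes (a format reward plus an accuracy reward), but the tag names and reward weights here are illustrative assumptions, not DeepSeek's actual code:

```python
# Minimal sketch of a rules-based outcome reward for a math-style task:
# a small format reward for using the expected tags, plus an accuracy
# reward for an exact match against the reference answer.
import re

def outcome_reward(completion: str, reference_answer: str) -> float:
    reward = 0.0
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if "<think>" in completion and m is not None:
        reward += 0.1  # format reward: the model used the expected structure
    if m is not None and m.group(1).strip() == reference_answer.strip():
        reward += 1.0  # accuracy reward: verifiably correct final answer
    return reward

# e.g. outcome_reward("<think>3*4=12, minus 5 is 7</think><answer>7</answer>", "7") -> 1.1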
>You might be lowering the cost of your training corpus by a few million dollars, but I highly doubt you are getting novel, high quality data.
The large foundational models don't really need more empirical data about the world. ChatGPT already 'knows' way more than I do, probably by many orders of magnitude. Yet it's still spewing nonsense at me regularly because it doesn't know how to think like a human or interact with me in a human-like way. To that end, the ability for a company like OpenAI to collect novel data from interacting with real humans is a material advantage over their competition.
> the ability for a company like OpenAI to collect novel data from interacting with real humans is a material advantage over their competition
It's a different kind of data from the R1 reasoning chains. When LLMs have a human in the loop, the human provides help based on their personal experience and real-world validation. Sometimes users take an idea from the LLM and try it in real life, then come back later and discuss the outcomes. This is a real-world testing loop.
In order to judge if an AI response was useful, you can look at the following messages with a judge LLM. Using hindsight helps a lot here. Maybe it doesn't pan out and the user tries another approach, or maybe some innocuous idea was key to success later. It's hard to tell in the moment, but easy when you see what followed after that.
This scales well: OpenAI has 300M users, which I estimate at up to a trillion interactive tokens per day. The user base is very diverse, the problems are diverse, and feedback comes from user experience and actual testing. They form an experience flywheel: the more problem solving they do, the smarter it gets, attracting more users.
It doesn't need much. One good, lucky answer in a thousand or maybe ten thousand queries gives you the little exponential kick you need to improve. This is what the hockey-stick takeoff looks like, and we're already there: OpenAI has it, and now DeepSeek has it too. You can be sure others also have it, Anthropic at the very least; they just never announced it officially, but go read what their CEO has been speaking and writing about.
It seems to work and seems very scalable. "Reasoning" helps to counter biases: answers become longer, i.e. the system uses more tokens, which means more time to answer a question; longer answers likely allow better differentiation of answers from each other in the "answer space".
"The o3 system demonstrates the first practical, general implementation of a computer adapting to novel unseen problems"
Yet, they said when it was announced:
"OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data."
These two statements are completely opposed. I can't take seriously anything this article says about o3.
No, they aren't. Every ARC problem is novel; that's why it resisted deep learning for so long (and still does, to a degree).
We just don't know how much the model seeing what an ARC problem even is in the first place boosts its ability to solve them; that limited statement is all the author is making.
They were talking about training on the public dataset -- OpenAI tuned the o3 model with 75% of the public dataset. There was some idea/hope that these LLMs would be able to gain enough knowledge in the latent space that they would automatically do well on the ARC-AGI problems. But using 75% of the public training set for tuning puts them at the about same challenge level as all other competitors (who use 100% of training).
In the post they were saying they didn't have a chance to test the o3 model's performance on ARC-AGI "out of-the-box", which is how the 14% scoring R1-zero was tested (no SFT, no search). They have been testing the LLMs out of the box like this to see if they are "smart" wrt the problem set by default.
The claim is that this removes the human bottleneck (aka SFT or supervised fine tuning) on domains with a verifiable reward. Critically, this verifiable reward is extremely hard to pin down in nearly all domains besides mathematics and computer science.
It's also extremely hard to nail down in much of mathematics or computer science!
- is such-and-such theorem deep or shallow?
- is this definition/axiom useful? (there's a big difference between doing compass-straightedge proofs vs. wondering about the parallel postulate)
- more generally, discovering theorems is generally not amenable to verifiable rewards, except in domains where simpler deterministic tools exist (in which case LLMs can likely help reduce the amount of brute forcing)
- is this a good mathematical / software model of a given real-world system?
- is the flexibility of dynamic/gradual typing worth the risk of type errors? is static typing more or less confusing for developers?
- what features should be part of a programming language's syntax? should we opt for lean-and-extensible or batteries-included?
- are we prematurely optimizing this function?
- will this program's memory needs play nicely with Rust's memory model? What architectural decisions do we need to make now to avoid headaches 6 months down the line?
It's not clear to me that theorem discovery is not amenable to verifiable rewards. I think most important theorems can probably be recovered automatically by asking AI systems to prove increasingly complicated human conjectures. Along the way I expect emergent behaviors of creating conjectures and recognizing important self-breakthroughs, much like the emergence of regret.
Theorem discovery is amenable to verifiable rewards. But is meaningful theorem discovery too? Is the ability to discern between meaningful theorems and bad ones an emergent behaviour? You can check for yourself examples of automatic proofs, and the huge number of intermediate theorems they can generate which are not very meaningful.
IMHO, there are strategies that could extend this approach to many other domains.
I was discussing this idea (along with a small prototype) with a prominent symbolic AI researcher who also agrees, and thinks that with the emergence of RL as a viable training method for LLMs, it might be possible to pursue neuro-symbolic learning at a large scale.
Current systems are impressive, but reasoning is too fragile to trust them. They fall into obvious logical and statistical fallacies that are evident to a layperson.
In the case of DeepSeek-R1, they used a series of heuristic reward functions that were built for different data types. The paper mentions the use of sandboxed environments to execute generated code against a suite of tests, for example, to evaluate it for correctness. The reward functions also evaluated syntax and formatting.
In general, the use of externally verifiable sources of truth (like simulators) is referred to as "grounding" and there has been quite a bit of research around it over the years, if you're interested in digging deeper. I've always found it super compelling as a research direction.
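A stripped-down sketch of the "run the generated code against tests" idea, for illustration only: a real pipeline would use an actual sandbox (containers, seccomp, resource and network limits), whereas this just runs a subprocess with a timeout.

```python
# Sketch of grounding a reward in execution: run generated code together
# with a test snippet in a subprocess and reward it only if the tests pass.
import subprocess, sys, tempfile

def execution_reward(generated_code: str, test_code: str, timeout_s: int = 5) -> float:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# e.g. execution_reward("def add(a, b):\n    return a + b",
#                       "assert add(2, 3) == 5") -> 1.0
```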
I think it just means that you can objectively score an answer as being correct or not. (e.g. if the generated program passes some tests; a discovered proof is valid, etc).
The other replies have said what was meant, but I don’t think they’ve explicitly addressed whether or not that is the sense used in the idea of NP.
I would say… it is at least somewhat similar.
A problem in NP might be of the form “For this value of X, does there exist a Y such that q(X,Y)?” for some predicate q and value X, and where when the answer is “yes”, the answer of “yes” can be verified by being given a value Y, and evaluating q(X,Y).
(Specifically in the case of 3SAT, X would be a 3CNF formula, Y would be an assignment of values to the variables in the formula, and q(X,Y) would be “the formula X when evaluated with variable assignments Y, results in 'true’.”.)
This is sort of like the task of “Given requirements X that can be checked automatically, produce code Y which satisfies those requirements”, except that in this case the question is specifically asking for Y, not just asking whether such a Y exists, but.. well, often in practice when one wants a solution to a problem in NP, one actually wants the witness, not just whether there exists such a Y, right?
So, I would say there is a substantial similarity, but also a difference.
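To make the analogy concrete, here is the verification step for 3SAT in a few lines: given a formula X and a candidate assignment Y, evaluating q(X, Y) is cheap even though finding a satisfying Y may not be.

```python
# Verifying a 3SAT witness: the formula is a list of clauses, each clause a
# list of literals (positive int = variable, negative int = negated variable).
# Checking q(X, Y) is linear in the size of the formula, even though
# *finding* a satisfying Y may be hard.

def satisfies(formula: list[list[int]], assignment: dict[int, bool]) -> bool:
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )

# (x1 or x2 or not x3) and (not x1 or x2 or x3)
formula = [[1, 2, -3], [-1, 2, 3]]
assert satisfies(formula, {1: True, 2: False, 3: True})
```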
For some reasoning data (e.g. you talking out loud as you figure something out, mistakes and all) to be useful for RL training, the conclusion to your reasoning needs to be correct/verified, else that's not the kind of reasoning you want to learn!
Some types of reasoning output, such as solving a math problem or writing a computer program can be automatically verified (e.g. respectively by a symbolic solver, or by compiling and running the program), but in the general case it's hard for a computer to verify whether a chain of reasoning is correct and arrived at a valid answer or not, although LLM-as-judge should work some of the time.
There's a big difference. Membership of these complexity classes is determined in the worst case, so if there is no polynomial-time solution in the worst case, the problem is not in P.
For this problem we don't care if it's possible that sometimes there are things that aren't verifiable, or the answers aren't exact, we just need training signal.
As in there's an objective truth that can be determined by a computer. E.g. whether code compiles, whether a unit test passes, whether the answer given to a mathematical question like 3+5 is correct. Many other fields have no objective truth (like art or creative writing), or objective truth requires measurement of the physical world (although if the world can be simulated accurately enough for the problem class at hand, then sufficient training data can still be generated by a computer).
Not if the problem as written is "does this code compile", which is still a useful stepping stone for some workflows. Yours is certainly a more useful query in most cases but repositioning or re-scoping the original question can still lead to a net win.
It's not a sufficient criterion by itself, but where no better criterion is possible it would still produce better results in reinforcement learning than if the model had no reward for producing correctly compiling code versus code that fails to compile.
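A minimal instance of that "does it compile" signal, using Python's built-in compile() as the checker; it is obviously weak on its own, which is the point being made.

```python
# "Does it compile" as a weak but verifiable reward signal (Python version).
# Not sufficient on its own, but strictly better than no signal when
# nothing stronger is available.
def compiles_reward(source: str) -> float:
    try:
        compile(source, "<generated>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0

assert compiles_reward("def f(x): return x + 1") == 1.0
assert compiles_reward("def f(x) return x + 1") == 0.0
```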
I disagree. It only removes the bottleneck for collecting math and code reasoning chains, not in general. The general case requires physical testing, not just calculations; otherwise scientists would not need experimental labs. Discovery comes from searching the real world; it's where interesting things happen. The best interface between AI and the world is still humans; the code and math domains are just lucky to work without real-world interaction.
The idea that a lot of compute is moving towards inference has huge consequences for the current "AI investments". This is particularly bad news for NVDA: inference-focused solutions have better economics than paying NVDA those huge margins (e.g. Groq).
Nvidia can actually charge larger margins if inference compute goes down. It would enable them to manufacture more units of smaller GPUs using inferior and cheaper silicon, all of which would increase the profits per unit sold as well as the number of units they can manufacture.
The industry has to find a way to separate itself from Nvidia's GPGPU technology if they want to stop being gouged. The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat.
>The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat
Google has, and they've built a much more cost-efficient (for them) system: the TPU. They even rent them out, and in terms of cost per unit of compute, TPUs are significantly cheaper than renting GPUs from the big cloud providers. Amazon's also tried to do something similar with Trainium chips; however, their usefulness is more limited due to software issues (Amazon is much weaker at compiler development than Google, so Trainium software is quite slow and buggy).
People talk about Groq and Cerebras as competitors, but it seems to me their manufacturing process makes the availability of those chips extremely limited. You can call up Nvidia and order $10B worth of GPUs and have them delivered the next week. You can't say the same for these specialty competitors.
> You can call up Nvidia and order $10B worth of GPUs and have them delivered the next week
Nvidia sold $14.5 billion of datacenter hardware in the third quarter of their fiscal 2024, and that led to severe supply constraints, with estimated lead times for H100s of up to 52 weeks in some places. So no, you can't, as that $14.5 billion was clearly capped by their ability to supply, not by demand.
You're right, though, that Groq etc. can't deliver anywhere near the same volume now, but there's little reason to believe that will continue. There's no need for full GPUs for inference-only workloads, so competitors can enter the space with a tiny proportion of the functionality.
Their architecture means you buy them by the rack. Individual chips are useless, the magic happens when you set them up so each chip handles a subset of the model.
IOW, do you think Groq's 70B models run on 230MB of SRAM?
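Rough back-of-the-envelope behind that rhetorical question, assuming roughly 230 MB of SRAM per LPU chip and 8-bit weights; the exact figures vary by model and quantization.

```python
# Back-of-the-envelope: a 70B-parameter model does not fit in one chip's
# SRAM, so Groq shards it across a rack. Assumes ~230 MB SRAM per chip and
# 1 byte per weight (8-bit); real deployments differ in quantization and
# also need room for activations.
params = 70e9
bytes_per_param = 1          # 8-bit weights (assumption)
sram_per_chip = 230e6        # ~230 MB per chip

chips_needed = params * bytes_per_param / sram_per_chip
print(f"~{chips_needed:.0f} chips just to hold the weights")  # ~304
```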
No idea about Groq, but Cerebras might give you a similar timeline to Nvidia. Each of their wafers is worth 50x-100x H100s, so they need to make fewer of them, in absolute units.
But cooling, power, etc.: Nvidia might have an advantage, as their ecosystem is huge and more "liquid" in a sense.
Nvidia (NVDA) generates revenue with hardware, but digs moats with software.
The CUDA moat is widely unappreciated and misunderstood. Dethroning Nvidia demands more than SOTA hardware.
OpenAI, Meta, Google, AWS, AMD, and others have long failed to eliminate the Nvidia tax.
Without diving into the gory details, the simple proof is that billions were spent on inference last year by some of the most sophisticated technology companies in the world.
They had the talent and the incentive to migrate, but didn't.
In particular, OpenAI spent $4 billion, 33% more than on training, yet still ran on NVDA. Google owns leading chips and leading models, and could offer the tech talent to facilitate migrations, yet still cannot cross the CUDA moat and convince many inference customers to switch.
People are desperate to quit their NVDA-tine addiction, but they can't for now.
[Edited to include Google, even though Google owns the chips and the models; h/t @onlyrealcuzzo]
The CUDA moat is largely irrelevant for inference. The code needed for inference is small enough that there are e.g. bare-metal CPU only implementations. That isn't what's limiting people from moving fully off Nvidia for inference. And you'll note almost "everyone" in this game are in the process of developing their own chips.
Google was omitted because they own the hardware and the models, but in retrospect, they represent a proof point nearly as compelling as OpenAI. Thanks for the comment.
Google has leading models operating on leading hardware, backed by sophisticated tech talent who could facilitate migrations, yet Google still cannot leap over the CUDA moat and capture meaningful inference market share.
Yes, training plays a crucial role. This is where companies get shoehorned into the CUDA ecosystem, but if CUDA were not so intertwined with performance and reliability, customers could theoretically switch after training.
Both matter quite a bit. The first-mover advantage obviously rewards OEMs in a first-come, first-serve order, but CUDA itself isn't some light switch that OEMs can flick and get working overnight. Everyone would do it if it was easy, and even Google is struggling to find buy-in for their TPU pods and frameworks.
Short-term value has been dependent on how well Nvidia has responded to burgeoning demands. Long-term value is going to be predicated on the number of Nvidia alternatives that exist, and right now the number is still zero.
My company recently switched from A100s to MI300s. I can confidently say that in my line of work, there is no CUDA moat. Onboarding took about a month, but afterwards everything was fine.
Alternatives exist, especially for mature and simple models. The point isn't that Nvidia has 100% market share, but rather that they command the most lucrative segment and none of these big spenders have found a way to quit their Nvidia addiction, despite concerted efforts to do so.
For instance, we experimented with AWS Inferentia briefly, but the value prop wasn't sufficient even for ~2022 computer vision models.
The calculus is even worse for SOTA LLMs.
The more you need to eke out performance gains and ship quickly, the more you depend on CUDA and the deeper the moat becomes.
It's unclear why this drew downvotes, but to reiterate, the comment merely highlights historical facts about the CUDA moat and deliberately refrains from assertions about NVDA's long-term prospects or that the CUDA moat is unbreachable.
With mature models and minimal CUDA dependencies, migration can be justified, but this does not describe most of the LLM inference market today nor in the past.
You can do inference on almost any hardware; I do not see any edge for NVIDIA here.
I can download a ~30B DeepSeek model and run inference at good speed on AMD GPUs and even on CPU. Apple silicon works fine too. I get >50 tokens/s on £300 AMD GPUs.
The main bottleneck appears to be memory, not processing power.
1. The future of inference for ChatGPT-style direct consumer usage is on-device. Cloud-based inference is too gaping of a privacy hole in a world where some level of E2EE is rapidly becoming the default expectation for chat. It's not hard to imagine that the iPhone 50 may be able to comfortably run models that firmly surpass GPT-4o and o1. Similarly, for things like coding and any other creation of novel IP, there are obvious security benefits to keeping the inference local.
2. Going forward, the vast majority of inference will be performed by agents for process automation (both personal and business), rather than direct user interaction. For these use cases, centralized infrastructure will be the natural architecture. Even for cases where an end client device technically exists (e.g. Tesla-Optimus-style machines), there may be economy of scale advantages to offloading compute to the cloud.
In fact, I'm not sure how the "we will need tons of centralized inference infrastructure" argument works when Apple with +50% smartphone market share in the USA has a totally opposite strategy focused on privacy: on-device inference.
Fundamentally it is more efficient to process a batch of tokens from multiple users/requests than processing them from a single user's request on device.
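A toy roofline-style calculation of why that is: at small batch sizes decoding is memory-bandwidth bound, so one pass over the weights can serve many users' tokens almost for free. The parameter count, bit-width, and bandwidth figure below are illustrative assumptions.

```python
# Toy view of why batched serving beats on-device batch=1: each decode step
# must stream the (active) weights from memory, and that cost is shared
# across every request in the batch. Numbers are illustrative.
active_weight_bytes = 37e9 * 1      # e.g. ~37B active params at 8-bit (assumption)
mem_bandwidth = 3.35e12             # ~3.35 TB/s, HBM-class accelerator (assumption)

step_time_s = active_weight_bytes / mem_bandwidth   # time to read the weights once
for batch in (1, 8, 64):
    tokens_per_s = batch / step_time_s
    print(f"batch {batch:>2}: ~{tokens_per_s:,.0f} tokens/s (weight-read bound)")
```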
Apple's strategy already failed. Their big bet on NPU hardware did not pay off at all, and right now it's effectively wasted silicon on every iDevice while the GPU does all the heavy inference work. Now they partner with OpenAI to handle their inference (and even that's not good enough in many cases[0]). The "centralized compute" lobby is being paid by Apple to do the work their devices cannot.
Until Apple or AMD unifies their GPU architectures and implements complex streaming multiprocessors, Nvidia will remain in a class of their own. Apple used to lead the charge on the foremost CUDA alternative too, but then they abandoned it to focus on proprietary standards instead. It's pretty easy to argue that Apple shot themselves in the foot with every opportunity they had to compete on good faith. And make no mistake: Apple could have competed with Nvidia if they weren't so stubborn about Linux support and putting smartphone GPUs in laptops and desktops.
What's interesting is that you can already see the "AI race" dynamics in play -- OpenAI must be under immense market pressure to push o3 out to the public to reclaim "king of the hill" status.
I suppose they're under some pressure to release o3-mini, since R1 is roughly a peer for that, but R1 itself is still quite rough. The o1 series has seen significantly more QA time to smooth out the rough edges and the idiosyncrasies of what a "production" model should be optimized for, versus just being a top scorer on benchmarks.
We'll likely only see o3 once there is a true polished peer for it. It's a race, and companies are keeping their best models close to their chest, as they're used internally to train smaller models.
e.g., Claude 3.5 Opus has been around for quite a while, but it's unreleased. Instead, it was just used to refine Claude Sonnet 3.5 into Claude Sonnet 3.6 (3.6 is for lack of a better name, since it's still called 3.5).
We also might see a new GPT-4o refresh trained up using o3 via DeepSeek's distillation technique and other tricks.
There are a lot of new directions to go in now for OpenAI, but unfortunately, we won't likely see them until their API dominance comes under threat.
When I saw these numbers back in the initial o3-ARC post, I immediately converted them into "$ per ARC-AGI-1 %" and concluded we may be at a point where each additional increment of "real human-like novel reasoning" gets exponentially more costly in compute.
If Mike Knoop is correct, maybe R1 is pointing the way toward more efficient approaches. That would certainly be a good thing. This whole DeepSeek release and the reactions to it have shown that by limiting the export of high-end GPUs to China, the US incentivized China to figure out how to make low-end GPUs work really well. The more subtle meta-lesson here is that the massive flood of investment capital being shoved toward leading-edge AI companies has fostered a drag-race mentality which prioritized winning top-line performance far above efficiency, costs, etc.
$3.4K is about what you might pay a magic circle lawyer for an opinion on a matter. Not saying o3 is an efficient use of resources, just saying that it’s not outlandish that a sufficiently good AI could be worth that kind of money.
You pay that price to a law firm to get good service and to get a "guarantee" of correctness. You get neither from an LLM. Not saying it is not worth anything, but you can't compare it to a top law firm.
1. That two distant topics or ideas are actually much more closely related. The creative sees one example of an idea and applies it to a discipline that nobody expects. In theory, reduction of the maximally distant can probably be measured with a tangible metric.
2. Discovery of ideas that are even more maximally distant. Pushing the edge, and this can be done by pure search and randomness actually. But it's no good if it's garbage. The trick is, what is garbage? That is very context dependent.
(Also, a creative might be measured on the efficiency of these metrics rather than absolute output)
Awesome - we'd love to have our CEO/CTO chat with you and your team if you're interested. Shoot me a note at mike.bilodeau @ baseten.co and I'll make it happen!
Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size so you have to think about running the models as a whole.
There are two ways we can run it:
- 8xH200 GPU == 8x141GB == 1128 GB VRAM
- 16xH100 GPU == 16x80GB == 1280 GB VRAM
Within a single node (up to 8 GPUs) you don't see any meaningful hit from GPU-to-GPU communication.
More than that (e.g. 16xH100) requires multi-node inference which very few places have solved at a production-ready level, but it's massive because there are way more H100s out there than H200s.
> Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size
In their V3 paper DeepSeek talk about having redundant copies of some "experts" when deploying with expert parallelism in order to account for the different amounts of load they get. I imagine it only makes a difference at very high loads, but I thought it was a pretty interesting technique.
Earlier today I read a reddit comment[1] from someone who tried running the quantized version from unsloth[2] on 4xH100, and the results were underwhelming (it ended up costing $137 per 1 million tokens).
They're using Llama.cpp which is an amazing tool for local inference but doesn't match fast inference frameworks like TensorRT-LLM/SGLang for production speeds and throughputs on Hopper GPUs.
The Unsloth quantizations are really cool, but if you want to experiment with the R1 models in a smaller form factor the R1 Distills like Llama 70B are great and should run a lot faster as they take advantage of existing optimizations around inferencing llama-architecture models.
> They're using Llama.cpp which is an amazing tool for local inference but doesn't match fast inference frameworks like TensorRT-LLM/SGLang for production speeds and throughputs on Hopper GPUs.
That's something I thought about, but it wouldn't explain much, as they are roughly two orders of magnitude off in terms of cost, and only a small fraction of that could be explained by the performance of the inference engine.
> The Unsloth quantizations are really cool, but if you want to experiment with the R1 models in a smaller form factor the R1 Distills like Llama 70B are great and should run a lot faster as they take advantage of existing optimizations around inferencing llama-architecture models.
What kind of optimization do you have in mind? DeepSeek having only 37B active parameters, which means ~12GB at this level of quantization, suggests inference ought to be much faster than for a dense 70B model, especially an unquantized one, no? The Llama 70B distill would benefit from speculative decoding, but that shouldn't be enough to compensate. So I'm really curious what kind of llama-specific optimizations you have in mind, and how much speedup you think they'd bring.
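Rough arithmetic behind the ~12GB figure and the comparison, assuming a ~2.6-bit dynamic quant for the active experts and fp16 or 4-bit for the dense model; the bit-widths are assumptions, not measured numbers.

```python
# Per decode step you stream the *active* weights, so fewer active
# parameters (and aggressive quantization) means less memory traffic
# per token. Bit-widths here are assumptions for illustration.
def gigabytes(params: float, bits: float) -> float:
    return params * bits / 8 / 1e9

print(gigabytes(37e9, 2.6))   # ~37B active params, ~2.6-bit dynamic quant -> ~12 GB
print(gigabytes(70e9, 16))    # dense 70B distill, unquantized fp16        -> ~140 GB
print(gigabytes(70e9, 4))     # dense 70B distill, 4-bit quant             -> ~35 GB
```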
I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
As the comments on reddit said, those numbers don’t make sense.
> I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
That was my first though as well, but from a quick search it looks like Llama.cpp has a default batch size that's quite high (like 256 or 512 I don't remember exactly, which I find surprising for something that's mostly used by local users) so it shouldn't be the issue.
> As the comments on reddit said, those numbers don’t make sense.
Sure, but that default batch size would only matter if the person in question was actually generating and measuring parallel requests, not just measuring the straight line performance of sequential requests... and I have no confidence they were.
> There are two major shifts happening in AI, economically speaking:
> 1. You can now spend more $ to get higher accuracy and reliability
> 2. Training $ is moving to inference $
> Both are going to drive a massive amount of demand for inference and neither will curtail the demand for more compute. In fact, they will increase the demand for compute.
Nvidia has much less of a moat on the inference side of things. Of course they still dominate the market right now for inference (in datacenters), but it's much easier for companies to move onto AMD or other solutions like Groq or whatever compared to trying to use non-Nvidia for training.
Overall good post, but feels like he has an axe to grind with LLMs to the point it is misleading:
> Last week, DeepSeek published their new R1-Zero and R1 “reasoner” systems that is competitive with OpenAI’s o1 system on ARC-AGI-1. R1-Zero, R1, and o1 (low compute) all score around 15-20% – in contrast to GPT-4o’s 5%, the pinnacle of years of pure LLM scaling
R1-Zero gets 14% on the private set, which is the exact same score June Sonnet got; Sonnet, not 4o, is the pinnacle of pure LLM scaling.
I predict that the future of LLM's when it comes to coding and software creation is in "custom individually tailored apps".
Imagine telling an AI agent what app you want, the requirements and all that and it just builds everything needed from backend to frontend, asks for your input on how things should work, clarifying questions etc.
It tests the software by compiling and running it, reading errors and failed tests, and fixing the code.
Then, it deploys the software in production for you. It compiles your app to an APK file and publishes it on the Google play store for example.
Sure, an LLM now may still not be able to get everything perfect as far as its outputs go. But surely there are already systems and workflows in place that will auto-run your code, compile it, feed errors back to the LLM, with some API to interact with cloud providers for hosting, etc.?
I have been trying to imagine something similar, but without all the middleware/distribution layer. You need to do a thing? The LLM just does it and presents the user with the desired experience. Kind of upending the notion that we need "apps" in the first place. It's all materialized, just-in-time style.
What's it called when you describe an app with sufficient detail that a computer can carry out the processes you want? Where will the record of those clarifying questions and updates be kept? What if one developer asks the AI to surreptitiously round off pennies and put those pennies into their bank account? Where will that change be recorded, will humans be able to recognize it? What if two developers give it conflicting instructions? Who's reviewing this stream of instructions to the LLM?
"AI" driven programming has a long way to go before it is just a better code completion.
Plus coding (producing a working program that fits some requirement) is the least interesting part of software development. It adds complexity, bugs and maintenance.
> What's it called when you describe an app with sufficient detail that a computer can carry out the processes you want?
You're wrong here. The entire point is that these are not computers as we used to think of them. These things have common sense; they can analyse a problem including all the implicit aspects, suggest and evaluate different implementation methods, architectures, interfaces.
So the right question is: "what's it called when you describe an app to a development team and they ask back questions and come back with designs and discuss them with you, and finally present you with an mvp, and then you iterate on that?"
Bold of you to imply that GPT asks questions instead of making baseless assumptions every 5 words, even when you explicitly instruct it to ask questions if it doesn't know. When it constantly hallucinates command line arguments and library methods instead of reading the fucking manual.
It's like outsourcing your project to [country where programmers are cheap]. You can't expect quality. Deep down you're actually amazed that the project builds at all. But it doesn't take much to reveal that it's just a facade for a generous serving of spaghetti and bugs.
And refactoring the project into something that won't crumble in 6 months requires more time than just redoing the project from scratch, because the technical debt is obscenely high, because those programmers were awful, and because no one, not even them, understands the code or wants to be the one who has to reverse engineer it.
Of course, but who's talking about today's tools? They're definitely not able to act like an independent, competent development team. Yet. But if we limit ourselves to the here-and-now, we might be like people talking about GPT3 five years ago: "yes it does spit out a few lines of code, which sometimes even compiles. When it doesn't forget half way and starts talking about unicorns".
We're talking about the tools of tomorrow, which, judging by the extremely rapid progress, I think is only a few (3-5) years away.
Anyway, I had great experiences with Claude and DeepSeek.
Most software is useful because a large number of people can interact with it or with each other over it. I'm not so certain that one-off software would be very useful for anyone beyond very simple functionality.
Aider jams the backend on my PC; from time to time I have to kill the TCP connection or the Python process to stop it from running on the GPU. I can't imagine paying for tokens and not knowing if it's working or wasting money.
That's going to be much slower and more expensive than writing tests because image/video processing is slower and more expensive than writing tests. And because of lag in using the UI (and re-building the whole application from scratch after every change to test again).
Hm, what if instead of using video of the application…
Ok, so if one can have one program snoop on all the rendering calls made by another program, maybe there could be a way of training a common representation of “an image of an application” and “the rendering calls that are made when producing a frame of the display for the application”? Hopefully in a way that would be significantly smaller than the full image data.
If so, maybe rather than feeding in the video of the application, said representation could be applied to the rendering calls the application makes each frame, and this representation would be given as input as the model interacts with the application, rather than giving it the actual graphics?
But maybe this idea wouldn’t work at all, idk.
Like, I guess the rendering calls often involve image data in their arguments, and you wouldn't want to include the same images many times as input to the encoding thing, as that would probably (I imagine) make it slower than just using the overall image of the application. I guess the calls probably point to the images in memory, though, rather than putting an entire image on the stack.
I don’t know enough about low-level graphics programming to know if this idea of mine makes any sense.
>Ultimately, R1-Zero demonstrates the prototype of a potential scaling regime with zero human bottlenecks – even in the training data acquisition itself.
I would like this to be true, but doesn't the way they're doing RL also require tons of human data?
I think yes. But hopefully in math with compute advances we can lower the human data input by increasing the gap that is bridged by raw model capabilities vs search augmentation (either with tree search or full rollouts)
From what I read elsewhere (random reddit comment), the visible reasoning is just "for show" and isn't the process deepseek used to arrive at the result. But if the reasoning has value, I guess it doesn't matter even if it's fake.
Bad reddit comment though, try pair programming with it.
The reasoning usually comments on your request, extends it, figures out which solution is the best and usable, backtracks if it finds issues implementing it, proposes a new solution, and verifies that it roughly makes sense.
The result after that could actually look different for usual questions (i.e., summarised the way ChatGPT answers to questions would look).
But it is usually very coherent with the code part, so if for example it has to choose from two libraries - it will use the one from the reasoning part, of course.
R1's technical report (https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSee...) says the prompt used for training is "<think> reasoning process here </think> <answer> answer here </answer>. User: prompt. Assistant:" This prompt format strongly suggests that the text between <think> is made the "reasoning" and the text between <answer> is made the "answer" in the web app and API (https://api-docs.deepseek.com/guides/reasoning_model). I see no reason why deepseek should not do it this way, if not considering post-generation filtering.
Plus, if you read table 3 of the R1 technical report, which contains an example of R1's chain of thought, its style (going back to re-evaluating the problem) resembles what I actually got in the COT in the web app.
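If that template is what the serving layer uses, splitting the visible "reasoning" from the "answer" is just string parsing, something like the sketch below; how DeepSeek actually does this server-side is an assumption, not something documented.

```python
# Sketch of how a serving layer could split a completion generated with the
# "<think> ... </think> <answer> ... </answer>" template from the R1 report
# into displayed reasoning and a final answer.
import re

def split_completion(text: str) -> tuple[str, str]:
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else text.strip()
    return reasoning, final

reasoning, final = split_completion(
    "<think>The user wants 2+2; that's 4.</think> <answer>4</answer>"
)
assert final == "4"
```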
Fascinating. R1 really punches above its weight with respect to cost-per-token.
As the article alluded to at the end, my thoughts immediately go to using R1 as a data generator for complex problems, since we have many examples of successful distillation into smaller models on well-defined tasks.
That is a slight exaggeration, extrapolation on the author's part. What happened was that RL training led to some emergent behavior in R1-Zero (chain-of-thought, and reflection) without being prompted or trained for explicitly. Don't see what is so domain specific about that though.
Yeah, if I understand correctly, the AI will create its own internal reasoning language through RL. In R1-Zero it was already a strange mix of languages. They corrected that for R1 to make the thinking useful for humans.
> The most promising idea is to use reasoning models to generate data, and then train our non-reasoning models with the reasoning-embedded data.
DeepSeek did precisely this with their LLama fine-tunes. You can try the 70B one here (might have to sign up): https://groq.com/groqcloud-makes-deepseek-r1-distill-llama-7...
Yes, but I meant it slightly differently than the distills.
The idea is to create the next gen SOTA non reasoning model with synthetic reasoning training data.
every time you respond to an AI model "no, you got that wrong, do it this way" you provide a very valuable piece of data to train on. With reasoning tokens there is just a lot more of that data to train on now
This assumes that you give honest feedback.
Efforts to feed deployed AI models various epistemic poisons abound in the wild.
> This assumes that you give honest feedback.
You don't need honest user feedback because you could judge any message part of a conversation using hindsight.
Just ask a LLM to judge if a response is useful, while seeing what messages come after it. The judge model has privileged information. Maybe 5 messages later it turns out what the LLM replied was not a good idea.
You can also use related conversations by the same user. The idea is to extend the context so you can judge better. Sometimes the user tests the LLM's ideas in the real world and comes back with feedback; that is real-world testing, something R1 can't do.
Tesla uses the same method to flag the seconds before a surprising event, it works because it has hindsight. It uses the environment to learn what was important.
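A rough sketch of how that hindsight judging could be wired up; the OpenAI-style client, model name, and prompt wording are illustrative assumptions, not anyone's actual pipeline:

    # Rough sketch: score an assistant reply using hindsight, i.e. let a judge
    # model see the messages that came *after* the reply being scored.
    # The OpenAI-style client, model name, and prompt wording are assumptions.
    from openai import OpenAI

    client = OpenAI()

    def hindsight_score(conversation: list[dict], reply_index: int,
                        judge_model: str = "gpt-4o-mini") -> str:
        """Ask a judge model whether conversation[reply_index] (an assistant turn)
        turned out to be useful, given everything that was said afterwards."""
        target = conversation[reply_index]["content"]
        followup = "\n".join(
            f"{m['role']}: {m['content']}" for m in conversation[reply_index + 1:]
        )
        prompt = (
            "An assistant gave this reply:\n"
            f"{target}\n\n"
            "Here is what happened in the conversation afterwards:\n"
            f"{followup}\n\n"
            "With the benefit of hindsight, rate how useful the reply was, 1-10."
        )
        judgment = client.chat.completions.create(
            model=judge_model,
            messages=[{"role": "user", "content": prompt}],
        )
        return judgment.choices[0].message.content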
This assumes that the companies gathering the data don’t have silent ways of detecting bad actors and discarding their responses. If you’re trying to poison an AI, are you making all of your queries from the same IP? Via a VPN whose IP block is known? Are you using a tool to generate this bad data, which might have detectable word frequency patterns that can be detected with something cheap like tf-idf?
There’s a lot of incentive to figure this out. And they have so much data coming in that they can likely afford to toss out some good data to ensure that they’re tossing out all of the bad.
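As a toy illustration of the tf-idf idea (entirely hypothetical; a real abuse-detection pipeline would be far more involved), you could flag feedback whose word-frequency profile sits unusually far from the rest:

    # Toy sketch: flag feedback whose tf-idf profile is unusually far from the
    # centroid of all submissions. Features and threshold are made up; a real
    # abuse-detection pipeline would be far more sophisticated than this.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def flag_outliers(feedback_texts: list[str], z_threshold: float = 3.0) -> list[str]:
        vectors = TfidfVectorizer().fit_transform(feedback_texts).toarray()
        centroid = vectors.mean(axis=0)
        distances = np.linalg.norm(vectors - centroid, axis=1)
        z = (distances - distances.mean()) / (distances.std() + 1e-9)
        return [t for t, score in zip(feedback_texts, z) if score > z_threshold]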
> If you’re trying to poison an AI, are you making all of your queries from the same IP? Via a VPN whose IP block is known?
We can use the same tactics they are using to crawl the web and scrape pages and bypass anti-scraping mechanisms.
Not necessarily, not all tactics can be used symmetrically like that. Many of the sites they scrape feel the need to support search engine crawlers and RSS crawlers, but OpenAI feels no such need to grant automated anonymous access to ChatGPT users.
And at the end of the day, they can always look at the responses coming in and make decisions like “95% of users said these responses were wrong, 5% said these responses were right, let’s go with the 95%”. As long as the vast majority of their data is good (and it will be) they have a lot of statistical tools they can use to weed out the poison.
> As long as the vast majority of their data is good (and it will be)
So expert answers are out of scope? Nice, looking forward to those quality data!
If you want to pick apart my hastily concocted examples, well, have fun I guess. My overall point is that ensuring data quality is something OpenAI is probably very good at. They likely have many clever techniques, some of which we could guess at, some of which would surprise us, all of which they’ve validated through extensive testing including with adversarial data.
If people want to keep playing pretend that their data poisoning efforts are causing real pain to OpenAI, they’re free to do so. I suppose it makes people feel good, and no one’s getting hurt here.
I'm interested in why you think OpenAI is probably very good at ensuring data quality. Also interested if you are trying to troll the resistance into revealing their working techniques.
What makes people think companies like OpenAI can't just pay experts for verified true data? Why do all these "gotcha" replies always revolve around the idea that everyone developing AI models is credulous and stupid?
Because paying experts for verified true data in the quantities they need isn't possible. Ilya himself said we've reached peak data (https://www.theverge.com/2024/12/13/24320811/what-ilya-sutsk...).
Why do you think we are stupid? We work at places developing these models and have a peek into how they're built...
Sure not necessarily the same tactics, but as with any hacking exercise, there are ways. We can become the 95% :)
It is absolutely fascinating to read the fantasy produced by people who (apparently) think they live in a sci-fi movie.
The companies whose datasets you're "poisoning" absolutely know about the attempts to poison data. All the ideas I've seen linked on this site so far about how they're going to totally defeat the AI companies' models sound like a mixture of wishful thinking and narcissism.
Are you suggesting some kind of invulnerability? People iterate their techniques; if big tech were so capable of defeating poisoning/gaming attempts, there would be no decades-long tug-of-war between Google and black-hat SEO manipulators.
Also, I don't get the narcissism part. Would it be petty to poison a website only when it's fetched by a spider? Yes, but I would also be that petty if some big company doesn't respect the boundaries I'm setting with my robots.txt on my 1-viewer cat photo blog.
It's not complete invulnerability. Instead, it is merely accepting that these methods might increase costs a little bit, but they don't cause the whole thing to explode.
The idea that a couple of bad-faith actions can destroy a 100-billion-dollar company is the extraordinary claim that requires extraordinary evidence.
Sure, bad actors can do a little damage. Just like bad actors can do DDoS attempts against Google. And that will cause a little damage. But mostly Google wins. Same thing applies to these AI companies.
> Also I don't get the narcissism part
The narcissism is the idea that your tiny website is going to destroy a 100 billion dollar company. It won't. They'll figure it out.
Grandparent mentioned "we"; I guess they refer to a whole class of "black hats" fending off bad-faith scraping, which could eventually amount to a relatively effective volume of poisoned sites and/or feedback to the model.
Obviously a singular poisoned site will never make a difference in a dataset of billions and billions of tokens, much less destroy a 100bn company. That's a straw man, and I think people arguing about poisoning acknowledge that perfectly. But I'd argue they can eventually manage to at least do some little damage mostly for the lulz, while avoiding scraping.
Google is full of SEO manipulators, and even when they recognize the problem and try to fix it, searching today is a mess because of that. The main difference and challenge in poisoning LLMs would be coordination between different actors, as there is no direct aligning incentive to poison except (arguably) globally justified pettiness, unlike black-hat SEO players, who have the incentive to be the first result for a certain query.
As LLMs become commonplace eventually new incentives may appear (i.e. an LLM showing a brand before others), and then, it could become a much bigger problem akin to Google's.
tl;dr: I wouldn't be so dismissive of what adversaries can manage to do with enough motivation.
[dead]
Who said they don't know? The same way companies know about hackers, it doesn't mean nothing ever gets hacked
There are ways to analyze whether your contributions make sense from the conversation's point of view. Reasoning detects that pretty quickly. To attack, you would actually have to use another AI to generate something that isn't totally random, and even that could still be detected.
I would assume that to use the data they would have to filter it heavily and correlate it across many users.
You can detect whether the user is a real one and trust their other chats "a bit more".
Probably it's something like "give feedback that's on average slightly more correct than incorrect," though you'd get more signal from perfect feedback.
That said, I suspect the signal is very weak even today and probably not too useful except for learning about human stylistic preferences.
The AI models to begin with assume that a significant majority of the training material is honest/in good faith. So that is not new?
AI models don't assume anything. AI models are just statistical tools. Their data is prepared by humans, who aren't morons. What is it with these super-ignorant AI critiques popping up everywhere?
There’s so much data required for training that it’d be surprising if humans looked at even a small subset of it at all. They need different statistical tools to clean it up. That’s where attacks will be concentrated, naturally, and this is why synthetic data will overtake real human data, right after “there isn’t enough data, even though it’s already too much”.
I am not in this space, question: are there "bad actors" that are known to feed AI models with poisonous information?
I'm not in the space either but I think the answer is an emphatic yes. Three categories come to mind:
1. Online trolls and pranksters (who already taught several different AIs to be racist in a matter of hours - just for the LOLs).
2. Nation states like China who already require models to conform to state narratives.
3. More broadly, when training on "the internet" as a whole there is a huge amount of wrong, confused information mixed in.
There's also a meta-point to make here. On a lot of culture war topics, one person's "poisonous information" is another person's "reasonable conclusion."
The part where people disagree seems fun.
I'm looking forward to protoscience/unconventional science, and perhaps even what is worthy of the fringe or pseudoscience labels. The debunking there usually fails to address the topic, as it is incredibly hard to spend even a single day reading about something you "know" to be nonsense. Who has time for that?
If you take a hundred thousand such topics, the odds that they should all be dismissed without looking aren't very good.
> The part where people disagree seems fun.
Apparently, you haven't been on that Internet thingie in the last five years or so... :-)
But I do agree with your point. What's interesting is the increasing number of people who act like there's some clearly objective and knowable truth about a much larger percentage of topics than there actually is. Outside of mathematics, logic, physics and other hard sciences, the range of topics on which informed, reasonable people can disagree, at least on certain significant aspects, is vast.
That's why even the concept of having some army of "Fact Checkers" always struck me as bizarre and doomed at best, and at worst, a transparent attempt to censor and control public discourse. That more people didn't see even the idea of it as being obviously brittle is concerning.
On Wikipedia you are supposed to quote the different perspectives. No one has ever accomplished this.
We can trust Altman and Elon to weed out the "fake news". Finally we will get the answer to which is the greatest Linux distro.
> Outside of mathematics, logic, physics
No need to go outside. There are plenty of Grigori Perelmans with various levels of credibility.
Bad or not, depends on your POV. But certainly there are efforts to feed junk to AI web scrapers, including specialized tools: https://zadzmo.org/code/nepenthes/
And they are hilarious, because they ride on the assumption that multi-billion dollar companies are all just employing naive imbeciles who just push buttons and watch the lights on the server racks go, never checking the datasets.
yes, example: me
I more often than not use the thumbs up on bad Google AI answers
(but not always! can't find me that easily!)
I deliberately pick wrong answers in reCAPTCHA sometimes. I’ve found out that the audio version accepts basically any string slightly resembling the audio, so that’s the easiest way. (Images on the other hand punish you pretty hard at times – even if you solve it correctly!)
Great article from today: https://arstechnica.com/tech-policy/2025/01/ai-haters-build-...
Yeah, it's great comedy.
> Aaron clearly warns users that Nepenthes is aggressive malware. It's not to be deployed by site owners uncomfortable with trapping AI crawlers and sending them down an "infinite maze" of static files with no exit links, where they "get stuck" and "thrash around" for months, he tells users.
Because a website with lots of links is executable code. And the scrapers totally don't have any checks in them to see if they spent too much time on a single domain. And no data verification ever occurs. Hell, why not go all the way? Just put a big warning telling everyone: "Warning, this is a cyber-nuclear weapon! Do not deploy unless you're a super rad bad dude who totally traps the evil AI robot and wins the day!"
If the AI already has a larger knowledge domain space than the user then all users are bad actors. They are just too stupid to know it.
Creators who use Nightshade on their published works.
not being snarky, but what is the point of using the model if you already know enough to correct it into giving the right answer?
an example that just occurred to me - if you asked it to generate an image of a mushroom that is safe to eat in your area, how would you tell it it was wrong? "oh, they never got back to me, I'll generate this image for others as well!"
A common use of these models is asking for code, and maybe you don't know the answer or would take a while to figure it out. For example, here's some html, make it blue and centered. You could give the model feedback on if its answer worked or not, without knowing the correct answer yourself ahead of time.
You constantly have to correct an AI when using it, because it either didn't get the question right or you are guiding it towards a narrower answer. There is only more to learn.
>not being snarky, but what is the point of using the model if you already know enough to correct it into giving the right answer?
For your example, what if you want to show what such a mushroom looks like to a friend? What if you want to use it on a website?
I feel like conventional image search would be more reliable to get a good picture of a mushroom variety that you know about. Ideally going out into the woods to get one I suppose.
Does it?
If I say "no, you hallucinated basically the entire content of the response", then maybe a newer training set derived from that could train on the specific fact that that specific hallucinated response is hallucinated. This seems to be of dubious value in a training set.
Nah I just insult it and tell it that it costs me 20 dollars a month and it's a huge disappointment
> What is today's date?
>> Today's date is Tuesday, January 28, 2025.
> No, you're wrong, today's date is actually Wednesday the 29th.
>> My mistake. Yes, today's date is Wednesday, January 29th, 2025.
Three months later in April when this tagged data is used to train the next iteration, the AI can successfully learn that today's date is actually January 29th.
But that's exactly what you get when you ask questions that require shifting, specific contextual knowledge. The model weights, by their nature, cannot encode that information.
At best, you can only try to layer in contextual info like this as metadata during inference, akin to how other prompting layers exist.
Even then, what up-to-date information should be present for every round-trip is a matter of opinion and use-case.
> Three months later in April when this tagged data is used to train the next iteration, the AI can successfully learn that today's date is actually January 29th.
Such an ingenious attack, surely none of these companies ever considered it.
the date is in the "system prompt", so the cron job that updates the prompts to the current date may be in a different time zone than you.
why can't they feed in user data like time zone and locale?
They're not actually processing the entire system prompt (which is rather long) on every query, but continuing from a model state saved after processing the system prompt once.
That makes it a bit harder, but still, spitting out the wrong date just seems like a plain old time-zone bug.
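For what it's worth, rendering the date in the user's timezone right before it goes into the prompt is a one-liner on the serving side; a minimal sketch (the prompt wording and timezone are made up):

    # Sketch: render "today's date" in the user's timezone right before it goes
    # into the system prompt, instead of whatever timezone the refresh job uses.
    # The prompt wording and default timezone are made up for illustration.
    from datetime import datetime
    from zoneinfo import ZoneInfo

    def dated_system_prompt(user_timezone: str = "America/New_York") -> str:
        now = datetime.now(ZoneInfo(user_timezone))
        return f"You are a helpful assistant. Today's date is {now:%A, %B %d, %Y}."

    print(dated_system_prompt("Asia/Tokyo"))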
If such labels are collected and used to retrain the model then yes. But these models are not learning online.
ChatGPT came out and its interface was a chatbox and a thumbs up / thumbs down icon (or whichever) to rate the responses; surely that created a feedback loop of learning, like all machine learning has done for years now?
Really? Isn't that the point of RL used in the way R1 did?
Provide a cost function (vs labels) and have it argue itself to greatness as measured by that cost function?
I believe that's what GP meant by "respond", not telling GPT they were wrong.
That is still inference. It is using a model generated from the RL process. The RL process is what used the cost function to add another model layer. Any online/continual learning would have to be performed by a different algorithm than classical LLM or RL. You can think of RL as a revision, but it still happens offline. Online/continual learning is still a very difficult problem in ML.
Yes, that makes sense. We're both talking about offline learning.
So if I just pay OpenAI $200/mo, and randomly tell the AI, no that's wrong.
I can stop the AI takeover?
You would need a lot of pro accounts! I would be surprised if they didn't use any algorithms for detecting well poisoning.
You can have our thank-you cards forwarded to your cell at Guantanamo Bay.
> you provide a very valuable piece of data to train on
We've been saying this "we get valuable data" thing since the 2010s [1].
When will our collective Netflix thumbs ups give us artificial super-intelligence?
[1] Especially to investors. They love that line.
our collective netflix thumbs up indicators gave investors and netflix the confidence to deploy a series of adam sandler movies that cost 60 to 80 million US dollars to "make". So depending on who you are, the system might be working great.
Through analytics Netflix should know exactly when people stop watching a series, or even when in a movie they exit out. They no doubt know this by user.
They know exactly what makes you stay, and what makes you leave.
I would not be surprised if in the near future movies and series are modified on the fly to ensure users stay glued to their screens.
In the distant future this might be done on a per user level.
> I wonder if there is a cap to multi head attention architecture
I don't think there is a cap other than having good data. The model learns all the languages in the world; it has capacity. A simple model like AlphaZero beats humans at board games. As long as you have data, the model is not the obstacle. An LLM-based system like AlphaProof reached silver-medal level at the IMO.
You're not getting new high-quality textual data for pre-training from your chat service. But you are potentially getting a lot of RL feedback on ambiguous problems.
And how would that work at inference time?
I think we will have to move with pre-training and post-training efforts in parallel. What DeepSeek showed is that you first need to have a strong enough pretrained model. For that, we have to continue the acquisition of high quality, multilingual datasets. Then, when we have a stronger pretrained model, we can apply pure RL to get a reasoning model that we use only to generate synthetic reasoning data. We then use those synthetic reasoning data to fine-tune the original pretrained model and make it even stronger. https://transitions.substack.com/p/the-laymans-introduction-...
> I highly doubt you are getting novel, high quality data.
Why wouldn't you? Presumably the end user would try their use case on the existing model, and if it performs well, wouldn't bother with the expense of setting up an RL environment specific to their task.
If it doesn't perform well, they do bother, and they have all the incentive in the world to get the verifier right -- which is not an extraordinarily sophisticated task if you're only using rules-based outcome rewards (as R1 and R1-Zero do)
>You might be lowering the cost of your training corpus by a few million dollars, but I highly doubt you are getting novel, high quality data.
The large foundational models don't really need more empirical data about the world. ChatGPT already 'knows' way more than I do, probably by many orders of magnitude. Yet it's still spewing nonsense at me regularly because it doesn't know how to think like a human or interact with me in a human-like way. To that end, the ability for a company like OpenAI to collect novel data from interacting with real humans is a material advantage over their competition.
> the ability for a company like OpenAI to collect novel data from interacting with real humans is a material advantage over their competition
It's different kind of data from the R1 reasoning chains. When LLMs have human in the loop, the human provides help based off their personal experience and real world validation. Sometimes users take an idea from the LLM and try it in real life. Then come back later and discuss the outcomes. This is a real world testing loop.
In order to judge if an AI response was useful, you can look at the following messages with a judge LLM. Using hindsight helps a lot here. Maybe it doesn't pan out and the user tries another approach, or maybe some innocuous idea was key to success later. It's hard to tell in the moment, but easy when you see what followed after that.
This scales well - OpenAI has 300M users, I estimate up to 1 Trillion interactive tokens/day. The user base is very diverse, problems are diverse, and feedback comes from user experience and actual testing. They form an experience flywheel, the more problem solving they do, the smarter it gets, attracting more users.
It doesn't need much. One good lucky answer in 1,000 or maybe 10,000 queries gives you the little exponential kick you need to improve. This is what the hockey-stick takeoff looks like, and we're already there - OpenAI has it, now DeepSeek has it, too. You can be sure others also have it; Anthropic at the very least, they just never announced it officially, but go read what their CEO has been speaking and writing about.
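Back-of-the-envelope, that token estimate works out to only a few thousand tokens per user per day, which seems plausible:

    # Back-of-the-envelope check on the "1 trillion interactive tokens/day" guess.
    users = 300_000_000
    tokens_per_day = 1_000_000_000_000
    print(tokens_per_day / users)  # ~3,333 tokens per user per day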
> The most promising idea is to use reasoning models to generate data, and then train our non-reasoning models with the reasoning-embedded data.
Why is it promising, aren’t you potentially amplifying AI biases and errors?
It seems to work and seems very scalable. "Reasoning" helps to counter biases: answers become longer, i.e. the system uses more tokens, which means more time to answer a question -- and longer answers likely allow better differentiation of answers from each other in the "answer space".
https://newsletter.languagemodels.co/i/155812052/large-scale...
also from the posted article
"""
The R1-Zero training process is capable of creating its own internal domain specific language (“DSL”) in token space via RL optimization.
This makes intuitive sense, as language itself is effectively a reasoning DSL.
"""
"The o3 system demonstrates the first practical, general implementation of a computer adapting to novel unseen problems"
Yet, they said when it was announced:
"OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data."
These two statements are completely opposed. I can't take seriously anything this article says about o3.
No, they aren't. Every ARC problem is novel - that's why the benchmark resisted deep learning for so long (and still does to a degree).
We just don't know how much the model seeing what an ARC problem is in the first place boosts its ability to solve them - that limited statement is all the author is making.
They are testing with a different dataset. The authors are saying that they have not yet tested the version of o3 that has not seen the training set.
Your quote is accurate from here:
https://arcprize.org/blog/oai-o3-pub-breakthrough
They were talking about training on the public dataset -- OpenAI tuned the o3 model with 75% of the public dataset. There was some idea/hope that these LLMs would be able to gain enough knowledge in the latent space that they would automatically do well on the ARC-AGI problems. But using 75% of the public training set for tuning puts them at about the same challenge level as all other competitors (who use 100% of training).
In the post they were saying they didn't have a chance to test the o3 model's performance on ARC-AGI "out-of-the-box", which is how the 14%-scoring R1-Zero was tested (no SFT, no search). They have been testing the LLMs out of the box like this to see if they are "smart" wrt the problem set by default.
The claim is that this removes the human bottleneck (aka SFT or supervised fine tuning) on domains with a verifiable reward. Critically, this verifiable reward is extremely hard to pin down in nearly all domains besides mathematics and computer science.
It's also extremely hard to nail down in much of mathematics or computer science!
- is such-and-such theorem deep or shallow?
- is this definition/axiom useful? (there's a big difference between doing compass-straightedge proofs vs. wondering about the parallel postulate)
- more generally, discovering theorems is generally not amenable to verifiable rewards, except in domains where simpler deterministic tools exist (in which case LLMs can likely help reduce the amount of brute forcing)
- is this a good mathematical / software model of a given real-world system?
- is the flexibility of dynamic/gradual typing worth the risk of type errors? is static typing more or less confusing for developers?
- what features should be part of a programming language's syntax? should we opt for lean-and-extensible or batteries-included?
- are we prematurely optimizing this function?
- will this program's memory needs play nicely with Rust's memory model? What architectural decisions do we need to make now to avoid headaches 6 months down the line?
Not clear to me that theorem discovery is not amenable to verifiable rewards. I think most important theorems probably are recovered automatically by asking AI systems to prove increasingly complicated human conjectures. Along the way I expect emergent behaviors of creating conjectures and recognizing important self-breakthroughs, much like regret emergence.
Theorem discovery is amenable to verifiable rewards. But is meaningful theorem discovery? Is the ability to discern between meaningful theorems and bad ones an emergent behaviour? You can check for yourself examples of automatic proofs, and the huge number of intermediate theorems they can generate which are not very meaningful.
IMHO, there are strategies that could extend this approach to many other domains.
I was discussing this idea (along with a small prototype) with a prominent symbolic AI researcher who also agrees, and thinks that with the emergence of RL as a viable training method for LLMs, it might be possible to pursue neuro-symbolic learning at a large scale.
Current systems are impressive, but reasoning is too fragile to trust them. They fall into obvious logical and statistical fallacies that are evident to a layperson.
Reasoning transfers across domains.
See https://www.interconnects.ai/p/why-reasoning-models-will-gen... for more information.
By verifiable do they mean it in the complexity theory P/NP sense of the word?
In the case of DeepSeek-R1, they used a series of heuristic reward functions that were built for different data types. The paper mentions the use of sandboxed environments to execute generated code against a suite of tests, for example, to evaluate it for correctness. The reward functions also evaluated syntax and formatting.
In general, the use of externally verifiable sources of truth (like simulators) is referred to as "grounding" and there has been quite a bit of research around it over the years, if you're interested in digging deeper. I've always found it super compelling as a research direction.
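For intuition, a toy version of such a reward function might look like this; it's illustrative only (the tag format, timeout, and weighting are my assumptions, not DeepSeek's actual code, and a real setup would use a locked-down sandbox rather than a bare subprocess):

    # Illustrative rules-based outcome reward for generated code, in the spirit
    # of what the paper describes. Not DeepSeek's implementation: the tag
    # format, timeout, and weights are assumptions.
    import re
    import subprocess
    import tempfile

    def reward(completion: str, test_code: str) -> float:
        # Format reward: completion should contain <think>...</think><answer>...</answer>.
        fmt_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                                completion, flags=re.S))
        answer = re.search(r"<answer>(.*)</answer>", completion, flags=re.S)
        if answer is None:
            return 0.0
        # Accuracy reward: run the extracted program together with its tests.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(answer.group(1) + "\n" + test_code)
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True, timeout=10)
            passed = result.returncode == 0
        except subprocess.TimeoutExpired:
            passed = False
        return (1.0 if passed else 0.0) + (0.1 if fmt_ok else 0.0)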
I think it just means that you can objectively score an answer as being correct or not. (e.g. if the generated program passes some tests; a discovered proof is valid, etc).
The other replies have said what was meant, but I don’t think they’ve explicitly addressed whether or not that is the sense used in the idea of NP.
I would say… it is at least somewhat similar.
A problem in NP might be of the form “For this value of X, does there exist a Y such that q(X,Y)?” for some predicate q and value X, and where when the answer is “yes”, the answer of “yes” can be verified by being given a value Y, and evaluating q(X,Y). (Specifically in the case of 3SAT, X would be a 3CNF formula, Y would be an assignment of values to the variables in the formula, and q(X,Y) would be “the formula X when evaluated with variable assignments Y, results in 'true’.”.)
This is sort of like the task of “Given requirements X that can be checked automatically, produce code Y which satisfies those requirements”, except that in this case the question is specifically asking for Y, not just asking whether such a Y exists, but.. well, often in practice when one wants a solution to a problem in NP, one actually wants the witness, not just whether there exists such a Y, right?
So, I would say there is a substantial similarity, but also a difference.
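To make the 3SAT example concrete, here is a minimal verifier for q(X, Y); the encoding of clauses as signed integers is my own choice, just to show that checking a witness is trivial even when finding one may not be:

    # A 3SAT verifier: q(X, Y) where X is a CNF formula and Y a variable assignment.
    # Each clause is a list of signed integers; -2 means "NOT x2".
    def verify(formula: list[list[int]], assignment: dict[int, bool]) -> bool:
        return all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in formula
        )

    # X = (x1 OR NOT x2 OR x3) AND (NOT x1 OR x2 OR x3), Y = a candidate witness.
    X = [[1, -2, 3], [-1, 2, 3]]
    Y = {1: True, 2: True, 3: False}
    print(verify(X, Y))  # True: Y is a witness that X is satisfiable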
For some reasoning data (e.g. you talking out loud as you figure something out, mistakes and all) to be useful for RL training, the conclusion to your reasoning needs to be correct/verified, else that's not the kind of reasoning you want to learn!
Some types of reasoning output, such as solving a math problem or writing a computer program can be automatically verified (e.g. respectively by a symbolic solver, or by compiling and running the program), but in the general case it's hard for a computer to verify whether a chain of reasoning is correct and arrived at a valid answer or not, although LLM-as-judge should work some of the time.
There's a big difference. The membership of these classes is determined in the worst case - so if there is no polynomial time solution in the worst case then it's NP.
For this problem we don't care if it's possible that sometimes there are things that aren't verifiable, or the answers aren't exact, we just need training signal.
As in there's an objective truth that can be determined by a computer. E.g. whether code compiles, whether a unit test passes, whether the answer given to a mathematical question like 3+5 is correct. Many other fields have no objective truth (like art or creative writing), or objective truth requires measurement of the physical world (although if the world can be simulated accurately enough for the problem class at hand, then sufficient training data can still be generated by a computer).
Isn't "code compiles" an insufficient criteria?
e.g you would need to prove that for all inputs the code produces the correct output which would in turn make the problem way more complex
Not if the problem as written is "does this code compile", which is still a useful stepping stone for some workflows. Yours is certainly a more useful query in most cases but repositioning or re-scoping the original question can still lead to a net win.
It's not a sufficient criterion by itself, but where no better criterion is possible it would still produce better results in reinforcement learning than if the model has no reward for producing correctly compiling code vs code that fails to compile.
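A minimal sketch of that kind of partial-credit reward, with weights I picked arbitrarily:

    # Minimal sketch of a partial-credit reward: "compiles" earns something,
    # nothing earns zero. The 0.3 weight is arbitrary; passing tests would earn 1.0.
    import py_compile
    import tempfile

    def compile_reward(source: str) -> float:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        try:
            py_compile.compile(path, doraise=True)  # syntax/compile check only
            return 0.3
        except py_compile.PyCompileError:
            return 0.0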
They mean that the solutions can be verified to be correct in a binary sense. E.g. a coding solution passes all the unit tests vs writing poetry.
> R1-Zero removes the human bottleneck
I disagree. It only removes the bottleneck to collecting math and code reasoning chains, not in general. The general case requires physical testing, not just calculations; otherwise scientists would not need experimental labs. Discovery comes from searching the real world, which is where interesting things happen. The best interface between AI and the world is still humans; the code and math domains are just lucky to work without real-world interaction.
The idea that a lot of compute is moving towards inference has a huge consequence for the current "AI investments". This is bad news for NVDA in particular. The inference-focused solutions have better economics than paying NVDA those huge margins (e.g. Groq).
Nvidia can actually charge larger margins if inference compute goes down. It would enable them to manufacture more units of smaller GPUs using inferior and cheaper silicon, all of which would increase the profits per unit sold as well as the number of units they can manufacture.
The industry has to find a way to separate itself from Nvidia's GPGPU technology if they want to stop being gouged. The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat.
>The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat
Google has and they've built a much more cost efficient (for them) system: the TPU. They even rent them out, and in terms of cost per unit compute TPUs are significantly cheaper than renting GPUs from the big cloud providers. Amazon's also tried to do something similar with Trainium chips, however their usefulness is more limited due to software issues (Amazon's much weaker at compiler development than Google, so Trainium software is quite slow and buggy).
For inference Nvidia has more significant competition than for training. See Groq, Google's TPU's etc.
People talk about Groq and Cerberus as competitors but it seems to me their manufacturing process makes the availability of those chips extremely limited. You can call up Nvidia and order $10B worth of GPUs and have them delivered the next week. Can't say the same for these specialty competitors.
> You can call up Nvidia and order $10B worth of GPUs and have them delivered the next week
Nvidia sold $14.5 billion of datacenter hardware in the third quarter of their fiscal 2024, and that led to severe supply constraints, with estimated lead times for H100s of up to 52 weeks in some places, so no, you can't, as that $14.5 billion was clearly capped by their ability to supply, not demand.
You're right, though, that Groq etc. can't deliver anywhere near the same volume now, but there's little reason to believe that will continue. There's no need for full GPUs for inference-only workloads, so competitors can enter the space with a tiny proportion of the functionality.
Groq chips have 230 MB of SRAM. Good luck running a 670B model on those chips, even without supply constraints.
Their architecture means you buy them by the rack. Individual chips are useless, the magic happens when you set them up so each chip handles a subset of the model.
IOW, do you think Groq's 70B models run on 230 MB of SRAM?
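Rough arithmetic, assuming ~1 byte per parameter (8-bit weights; exact Groq deployment details aren't public), shows why it takes racks of chips:

    # Rough arithmetic: chips needed just to hold 671B parameters in 230 MB of
    # on-chip SRAM each, assuming ~1 byte per parameter (8-bit weights). Real
    # deployments also need activations, KV cache, and redundancy.
    params = 671e9
    bytes_per_param = 1
    sram_per_chip = 230e6
    print(params * bytes_per_param / sram_per_chip)  # ~2,900 chips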
Nvidia's H100 has 80GB. As long as the interconnect is fast enough you don't need everything to fit on one chip.
You mean Cerebras.
>call up Nvidia and order $10B worth of GPUs
Doubt it.
No idea about Groq, but Cerebras might give you a similar timeline to Nvidia. Each of their wafers is 50x-100x an H100, so they need to make fewer of them, in absolute units.
But cooling, power, etc... Nvidia might have an advantage as their ecosystem is huge and more "liquid" in a sense.
Nvidia (NVDA) generates revenue with hardware, but digs moats with software.
The CUDA moat is widely unappreciated and misunderstood. Dethroning Nvidia demands more than SOTA hardware.
OpenAI, Meta, Google, AWS, AMD, and others have long failed to eliminate the Nvidia tax.
Without diving into the gory details, the simple proof is that billions were spent on inference last year by some of the most sophisticated technology companies in the world.
They had the talent and the incentive to migrate, but didn't.
In particular, OpenAI spent $4 billion, 33% more than on training, yet still ran on NVDA. Google owns leading chips and leading models, and could offer the tech talent to facilitate migrations, yet still cannot cross the CUDA moat and convince many inference customers to switch.
People are desperate to quit their NVDA-tine addiction, but they can't for now.
[Edited to include Google, even though Google owns the chips and the models; h/t @onlyrealcuzzo]
The CUDA moat is largely irrelevant for inference. The code needed for inference is small enough that there are e.g. bare-metal CPU only implementations. That isn't what's limiting people from moving fully off Nvidia for inference. And you'll note almost "everyone" in this game are in the process of developing their own chips.
> OpenAI, Meta, AWS, AMD, and others have long attempted to eliminate the Nvidia tax, yet failed.
Gemini / Google runs and trains on TPUs.
You have no incentive to infer on AMD if you need to buy a massive Nvidia cluster to train.
Google was omitted because they own the hardware and the models, but in retrospect, they represent a proof point nearly as compelling as OpenAI. Thanks for the comment.
Google has leading models operating on leading hardware, backed by sophisticated tech talent who could facilitate migrations, yet Google still cannot leap over the CUDA moat and capture meaningful inference market share.
Yes, training plays a crucial role. This is where companies get shoehorned into the CUDA ecosystem, but if CUDA were not so intertwined with performance and reliability, customers could theoretically switch after training.
Google has a self-inflicted wound in how long it takes to get an API key.
> yet Google still cannot leap over the CUDA moat and capture meaningful inference market share.
It's almost as if being a first-mover is more important than whether or not you use CUDA.
Both matter quite a bit. The first-mover advantage obviously rewards OEMs in a first-come, first-serve order, but CUDA itself isn't some light switch that OEMs can flick and get working overnight. Everyone would do it if it was easy, and even Google is struggling to find buy-in for their TPU pods and frameworks.
Short-term value has been dependent on how well Nvidia has responded to burgeoning demands. Long-term value is going to be predicated on the number of Nvidia alternatives that exist, and right now the number is still zero.
Meta trains on Nvidia and infers on AMD. There is incentive if your inference costs are high.
Meta also has a second generation of their own AI accelerator chips designed.
My company recently switched from A100s to MI300s. I can confidently say that in my line of work, there is no CUDA moat. Onboarding took about a month, but afterwards everything was fine.
Alternatives exist, especially for mature and simple models. The point isn't that Nvidia has 100% market share, but rather that they command the most lucrative segment and none of these big spenders have found a way to quit their Nvidia addiction, despite concerted efforts to do so.
For instance, we experimented with AWS Inferentia briefly, but the value prop wasn't sufficient even for ~2022 computer vision models.
The calculus is even worse for SOTA LLMs.
The more you need to eke out performance gains and ship quickly, the more you depend on CUDA and the deeper the moat becomes.
llm inference is fine on rocm. llama.cpp and vllm both have very good rocm support.
llm training is also mostly fine. I have not encountered any issues yet.
most of the cuda moat comes from people who are repeating what they heard 5-10 years ago.
It's unclear why this drew downvotes, but to reiterate, the comment merely highlights historical facts about the CUDA moat and deliberately refrains from assertions about NVDA's long-term prospects or that the CUDA moat is unbreachable.
With mature models and minimal CUDA dependencies, migration can be justified, but this does not describe most of the LLM inference market today nor in the past.
I think future of inference is on the client side
You can do inference on almost any hardware, I do not see any edge for NVIDIA here
I can download a DeepSeek 30B model and run inference at good speed on AMD GPUs and even on CPU. Apple silicon works fine too. I get >50 tokens/s on £300 AMD GPUs.
The main bottleneck appears to be memory, not processing power.
I would argue that both things are true:
1. The future of inference for ChatGPT-style direct consumer usage is on-device. Cloud-based inference is too gaping of a privacy hole in a world where some level of E2EE is rapidly becoming the default expectation for chat. It's not hard to imagine that the iPhone 50 may be able to comfortably run models that firmly surpass GPT-4o and o1. Similarly, for things like coding and any other creation of novel IP, there are obvious security benefits to keeping the inference local.
2. Going forward, the vast majority of inference will be performed by agents for process automation (both personal and business), rather than direct user interaction. For these use cases, centralized infrastructure will be the natural architecture. Even for cases where an end client device technically exists (e.g. Tesla-Optimus-style machines), there may be economy of scale advantages to offloading compute to the cloud.
In fact, I'm not sure how the "we will need tons of centralized inference infrastructure" argument works when Apple with +50% smartphone market share in the USA has a totally opposite strategy focused on privacy: on-device inference.
This is much more nuanced now. See Apple "Private Cloud Compute": https://security.apple.com/blog/private-cloud-compute/ ; they run a lot of the larger models on their own servers.
Fundamentally it is more efficient to process a batch of tokens from multiple users/requests than processing them from a single user's request on device.
Apple's strategy already failed. Their big bet on NPU hardware did not pay off at all, and right now it's effectively wasted silicon on every iDevice while the GPU does all the heavy inference work. Now they partner with OpenAI to handle their inference (and even that's not good enough in many cases[0]). The "centralized compute" lobby is being paid by Apple to do the work their devices cannot.
Until Apple or AMD unifies their GPU architectures and implements complex streaming multiprocessors, Nvidia will remain in a class of their own. Apple used to lead the charge on the foremost CUDA alternative too, but then they abandoned it to focus on proprietary standards instead. It's pretty easy to argue that Apple shot themselves in the foot with every opportunity they had to compete on good faith. And make no mistake: Apple could have competed with Nvidia if they weren't so stubborn about Linux support and putting smartphone GPUs in laptops and desktops.
[0] https://apnews.com/article/apple-ai-news-hallucinations-ipho...
Which AMD GPU gives you 50 tok/s on a 30b model? My 3090 does 30 tok/s with a 4 bit quant.
I don't mean at the same time.
For a simple question, with an RX 6800, I am observing ~50 tok/s on 8B models. DeepSeek 16B gives ~40 tok/s. 32B doesn't fit in memory.
So far it's moving towards test-time compute, true, but reasoning models are still far too large to run on the edge.
Well, o3 scored 75% on ARC-AGI-1, R1 and o1 only 25%.... watch this space though....
What's interesting is that you can already see the "AI race" dynamics in play -- OpenAI must be under immense market pressure to push o3 out to the public to reclaim "king of the hill" status.
I suppose they're under some pressure to release o3-mini, since R1 is roughly a peer for that, but R1 itself is still quite rough. The o1 series had seen significantly more QA time to smooth out the rough edges and the idiosyncrasies of what a "production" model should be optimized for, vs. just being a top scorer on benchmarks.
We'll likely only see o3 once there is a true polished peer for it. It's a race, and companies are keeping their best models close to their chest, as they're used internally to train smaller models.
e.g., Claude 3.5 Opus has been around for quite a while, but it's unreleased. Instead, it was just used to refine Claude Sonnet 3.5 into Claude Sonnet 3.6 (3.6 is for lack of a better name, since it's still called 3.5).
We also might see a new GPT-4o refresh trained up using GPT-o3 via deepseek's distillation technique and other tricks.
There are a lot of new directions to go in now for OpenAI, but unfortunately, we won't likely see them until their API dominance comes under threat.
That could also definitely make sense if the SOTA models are too slow and expensive to be popular with a general audience.
Yeah, but they can use DeepSeek's new algorithm too.
with 57 million(!!) tokens
From the article :
o3 (low): 75.7%, 335K tokens, $20
o3 (high): 87.5%, 57M tokens, $3.4K
When I saw these numbers back in the initial o3-ARC post, I immediately converted them into "$ per ARC-AGI-1 %" and concluded we may be at a point where each additional increment of 'real human-like novel reasoning' gets exponentially more costly in compute.
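Doing that conversion on the figures quoted above (taking the dollar amounts at face value), the cost per percentage point jumps by roughly 150x between the low- and high-compute settings:

    # Cost per ARC-AGI-1 percentage point, using the figures quoted above.
    low_cost, low_score = 20, 75.7        # o3 (low): $20, 75.7%
    high_cost, high_score = 3_400, 87.5   # o3 (high): $3.4K, 87.5%
    print(low_cost / low_score)           # ~$0.26 per point
    print(high_cost / high_score)         # ~$38.9 per point, roughly 150x more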
If Mike Knoop is correct, maybe R1 is pointing the way toward more efficient approaches. That would certainly be a good thing. This whole DeepSeek release and the reactions have shown that by limiting the export of high-end GPUs to China, the US incentivized China to figure out how to make low-end GPUs work really well. The more subtle meta-lesson here is that the massive flood of investment capital being shoved toward leading-edge AI companies has fostered a drag-race mentality which prioritized winning top-line performance far above efficiency, costs, etc.
$3.4K is about what you might pay a magic circle lawyer for an opinion on a matter. Not saying o3 is an efficient use of resources, just saying that it’s not outlandish that a sufficiently good AI could be worth that kind of money.
You pay that price to a law firm to get good service and to get a "guarantee" of correctness. You get neither from an LLM. Not saying it is not worth anything but you cant compare it to a top law firm.
You absolutely do not get a "guarantee" of correctness (event with the airquotes) from any lawyer.
What’s the liability insurance of the AI like
Refer to IBM’s 1979 slide for details on that
I view it as a positive that the methodology can take in more compute (bitter lesson style)
But can o3 write a symphony?
Seriously though, I'd like to hear suggestions on how to automatically evaluate an AI model's creativity, no humans in the loop.
In my view there's two modes of creativity:
1. That two distant topics or ideas are actually much more closely related. The creative sees one example of an idea and applies it to a discipline that nobody expects. In theory, reduction of the maximally distant can probably be measured with a tangible metric.
2. Discovery of ideas that are even more maximally distant. Pushing the edge, and this can be done by pure search and randomness actually. But it's no good if it's garbage. The trick is, what is garbage? That is very context dependent.
(Also, a creative might be measured on the efficiency of these metrics rather than absolute output)
Have you tried suno.ai?
Have _you_? It lost its novelty after a couple of days.
LLMs have read everything humans made so just ask one if there’s anything truly new in that freshly confabulated slop-phony.
we'd have to create a numerical scale for creativity, from boring to Dali, with milliEschers and MegaGeigers somewhere in there as well
It's essential that we quantify everything so that we can put a price on it. I'd go with Kahlograms though.
Mike from Baseten here
We're super proud to support this work. If you're thinking of running deepseek in production, give us a shout!
We are currently evaluating DeepSeek-R1 for our production system. We aren't done yet, but I think it's a match.
Awesome - we'd love to have our CEO/CTO chat with you and your team if you're interested. Shoot me a note at mike.bilodeau @ baseten.co and I'll make it happen!
Can you share at a high level how you run this model?
We know it's 671B params, with 37B active per token across the MoE experts…
If the GPUs have say, 140GB for an H200, then do you just load up as many nodes as will fit into a GPU?
How much do interconnects hurt performance vs being able to load the model into a single GPU?
Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size so you have to think about running the models as a whole.
There are two ways we can run it:
- 8xH200 GPU == 8x141GB == 1128 GB VRAM
- 16xH100 GPU == 16x80GB == 1280 GB VRAM
Within a single node (up to 8 GPUs) you don't see any meaningful hit from GPU-to-GPU communication.
More than that (e.g. 16xH100) requires multi-node inference which very few places have solved at a production-ready level, but it's massive because there are way more H100s out there than H200s.
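For rough intuition on why those are the right ballpark (assuming roughly FP8, i.e. ~1 byte per parameter, for the weights; KV cache and activations come on top):

    # Why 8xH200 or 16xH100 is the right ballpark: 671B params at ~1 byte each
    # (FP8) is ~671 GB of weights alone, before KV cache and activations.
    params = 671e9
    weight_gb = params * 1 / 1e9
    print(weight_gb)            # ~671 GB of weights
    print(8 * 141, 16 * 80)     # 1128 GB (8xH200) and 1280 GB (16xH100) of VRAM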
> Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size
In their V3 paper DeepSeek talk about having redundant copies of some "experts" when deploying with expert parallelism in order to account for the different amounts of load they get. I imagine it only makes a difference at very high loads, but I thought it was a pretty interesting technique.
Earlier today I read a reddit comment[1] about a guy who tried running the quantized version from unsloth[2] on 4xH100, and the results were underwhelming (it ended up costing $137 per 1 million tokens).
Any idea of what they're doing wrong?
[1]: https://www.reddit.com/r/LocalLLaMA/comments/1icphqa/how_to_...
[2]: https://unsloth.ai/blog/deepseekr1-dynamic
They're using Llama.cpp which is an amazing tool for local inference but doesn't match fast inference frameworks like TensorRT-LLM/SGLang for production speeds and throughputs on Hopper GPUs.
The Unsloth quantizations are really cool, but if you want to experiment with the R1 models in a smaller form factor the R1 Distills like Llama 70B are great and should run a lot faster as they take advantage of existing optimizations around inferencing llama-architecture models.
> They're using Llama.cpp which is an amazing tool for local inference but doesn't match fast inference frameworks like TensorRT-LLM/SGLang for production speeds and throughputs on Hopper GPUs.
That's something I thought about, but it wouldn't explain much, as they are roughly two orders of magnitude off in terms of cost, and only a small fraction of that could be explained by the performance of the inference engine.
> The Unsloth quantizations are really cool, but if you want to experiment with the R1 models in a smaller form factor the R1 Distills like Llama 70B are great and should run a lot faster as they take advantage of existing optimizations around inferencing llama-architecture models.
What kind of optimization do you have in mind? Because DeepSeek having only 37B active parameters, which means ~12GB at this level of quantization, means inference ought to be much faster than a dense 70B model, especially unquantized, no? The Llama 70B distill would benefit from speculative decoding, though, but it shouldn't be enough to compensate. So I'm really curious about what kind of llama-specific optimizations you mean, and how much speed-up you think they'd bring.
I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
As the comments on reddit said, those numbers don’t make sense.
> I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
That was my first thought as well, but from a quick search it looks like Llama.cpp has a default batch size that's quite high (like 256 or 512, I don't remember exactly, which I find surprising for something that's mostly used by local users), so that shouldn't be the issue.
> As the comments on reddit said, those numbers don’t make sense.
Absolutely, hence my question!
Sure, but that default batch size would only matter if the person in question was actually generating and measuring parallel requests, not just measuring the straight line performance of sequential requests... and I have no confidence they were.
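For reference, rough arithmetic on what $137 per 1M tokens implies, assuming ~$2.50/hr per H100 on demand (an assumed price; it varies a lot): about 20 tokens/s of aggregate throughput from four GPUs, i.e. effectively no batching.

    # What "$137 per 1M tokens" implies on 4xH100, assuming ~$2.50/hr per H100
    # on-demand (an assumption; prices vary a lot).
    gpu_cost_per_hour = 4 * 2.50
    tokens_per_hour = 1_000_000 * gpu_cost_per_hour / 137
    print(tokens_per_hour / 3600)  # ~20 tokens/s aggregate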
>There are two major shifts happening in AI, economically speaking:
>Both are going to drive a massive amount of demand for inference and neither will curtail the demand for more compute. In fact, they will increase the demand for compute.
Is this Nvidia compute or something else?
Nvidia has much less of a moat on the inference side of things. Of course they still dominate the market right now for inference (in datacenters), but it's much easier for companies to move onto AMD or other solutions like Groq or whatever compared to trying to use non-Nvidia for training.
Overall good post, but feels like he has an axe to grind with LLMs to the point it is misleading:
> Last week, DeepSeek published their new R1-Zero and R1 “reasoner” systems that is competitive with OpenAI’s o1 system on ARC-AGI-1. R1-Zero, R1, and o1 (low compute) all score around 15-20% – in contrast to GPT-4o’s 5%, the pinnacle of years of pure LLM scaling
R1-zero gets 14% on private set which is the exact same score June Sonnet got; Sonnet, not 4o, is the pinnacle of pure LLM scaling
I predict that the future of LLM's when it comes to coding and software creation is in "custom individually tailored apps". Imagine telling an AI agent what app you want, the requirements and all that and it just builds everything needed from backend to frontend, asks for your input on how things should work, clarifying questions etc.
It tests the software by compiling and running it reading errors and failed tests and fixing the code.
Then, it deploys the software in production for you. It compiles your app to an APK file and publishes it on the Google play store for example.
Sure, an LLM now may still not be able to get everything perfect as far as its outputs go. But surely there are already systems and workflows in place that will auto-run your code, compile it, feed errors back to the LLM, some API to interact with cloud providers for hosting, etc.?
A little further out from that could be the LLM acting as the runtime environment. No code. It's just data in (user inputs etc) -> GUI out.
Most people really do not know what they want at any level of detail.
It's ok, they'll know it when they see it. Keep trying.
The future is bespoke software.
In some sense, this is how computers were always supposed to work!
I have been trying to imagine something similar, but without all the middleware/distribution layer. You need to do a thing? The LLM just does it and presents the user with the desired experience. Kind of upending the notion that we need "apps" in the first place. It's all materialized, just-in-time style.
What's it called when you describe an app with sufficient detail that a computer can carry out the processes you want? Where will the record of those clarifying questions and updates be kept? What if one developer asks the AI to surreptitiously round off pennies and put those pennies into their bank account? Where will that change be recorded, will humans be able to recognize it? What if two developers give it conflicting instructions? Who's reviewing this stream of instructions to the LLM?
"AI" driven programming has a long way to go before it is just a better code completion.
That.
Plus coding (producing a working program that fits some requirement) is the least interesting part of software development. It adds complexity, bugs and maintenance.
> What's it called when you describe an app with sufficient detail that a computer can carry out the processes you want?
You're wrong here. The entire point is that these are not computers as we used to think of them. These things have common sense; they can analyse a problem including all the implicit aspects, suggest and evaluate different implementation methods, architectures, interfaces.
So the right question is: "what's it called when you describe an app to a development team and they ask back questions and come back with designs and discuss them with you, and finally present you with an mvp, and then you iterate on that?"
Bold of you to imply that GPT asks questions instead of making baseless assumptions every 5 words, even when you explicitly instruct it to ask questions if it doesn't know. When it constantly hallucinates command line arguments and library methods instead of reading the fucking manual.
It's like outsourcing your project to [country where programmers are cheap]. You can't expect quality. Deep down you're actually amazed that the project builds at all. But it doesn't take much to reveal that it's just a facade for a generous serving of spaghetti and bugs.
And refactoring the project into something that won't crumble in 6 months requires more time than just redoing the project from scratch, because the technical debt is obscenely high, because those programmers were awful, and because no one, not even them, understands the code or wants to be the one who has to reverse engineer it.
Except that AI is actually MUCH more expensive!
Of course, but who's talking about today's tools? They're definitely not able to act like an independent, competent development team. Yet. But if we limit ourselves to the here and now, we might be like people talking about GPT-3 five years ago: "Yes, it does spit out a few lines of code, which sometimes even compiles. When it doesn't forget halfway through and start talking about unicorns."
We're talking about the tools of tomorrow, which, judging by the extremely rapid progress, I think is only a few (3-5) years away.
Anyway, I had great experiences with Claude and DeepSeek.
Most software is useful because a large number of people can interact with it or with each other over it. I'm not so certain that one-off software would be very useful for anyone beyond very simple functionality.
This will almost certainly never materialize, and the reasons are not just technical.
Have you tried https://bolt.diy ?
It does what you describe
It claims to do what he describes.
> Imagine telling an AI agent … requirements… asks for your input on how things should work, clarifying questions etc.
That’s hard work. I watch people do that every day, and they always get something wrong.
Also, what about deploying the application, paying for the database or cloud resources that will run it, etc.?
> auto run your code, compile it, feed errors back to the LLM,
Can't wait for companies to juice profits by having the LLM run excessive cycles or get stuck in a loop and run up my bill
Aider jams the backend on my PC from time to time; I have to kill the TCP connection or the Python process to stop it running the GPU on the backend. I can't imagine paying for tokens and not knowing if it's working or just wasting money.
The loops and constant useless changes drive me nuts haha
It doesn't need to write tests: it can just use the application and figure out if it works.
That's going to be much slower and more expensive than writing tests, because processing images/video of the UI costs far more than running a test suite, and because of the lag in driving the UI (and rebuilding the whole application from scratch after every change before testing again).
Hm, what if instead of using video of the application…
Ok, so if one can have one program snoop on all the rendering calls made by another program, maybe there could be a way of training a common representation of “an image of an application” and “the rendering calls that are made when producing a frame of the display for the application”? Hopefully in a way that would be significantly smaller than the full image data.
If so, maybe rather than feeding in the video of the application, said representation could be applied to the rendering calls the application makes each frame, and this representation would be given as input as the model interacts with the application, rather than giving it the actual graphics?
But maybe this idea wouldn’t work at all, idk.
Like, I guess the rendering calls often involve image data in their arguments, and you wouldn’t want to include the same images many times as input to the encoder, as that would probably (I imagine) make it slower than just using the overall image of the application. Though I guess the calls probably just point to images in memory rather than putting an entire image on the stack.
I don’t know enough about low-level graphics programming to know if this idea of mine makes any sense.
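For what it’s worth, the core of the idea can be sketched in toy form: log each draw call as a small record that references image data by a content hash instead of embedding the pixels, so repeated textures cost almost nothing. Everything below is made up, and hooking real GL/Vulkan/DirectX calls is entirely hand-waved:

```python
# Toy sketch of the "compact render-call representation" idea above. A frame
# becomes a short list of draw-call records whose image arguments are replaced
# by content hashes, so the log stays far smaller than a raw screenshot.
# Intercepting real graphics API calls is not shown; this is purely illustrative.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class DrawCall:
    op: str          # e.g. "fill_rect", "draw_text", "blit"
    x: int
    y: int
    w: int
    h: int
    payload_id: str  # hash of the texture/glyph bytes, not the bytes themselves

def intern_image(pixels: bytes, cache: dict[str, int]) -> str:
    """Replace raw pixel data with a short content hash and count reuse."""
    digest = hashlib.sha1(pixels).hexdigest()[:12]
    cache[digest] = cache.get(digest, 0) + 1
    return digest

cache: dict[str, int] = {}
frame = [
    DrawCall("fill_rect", 0, 0, 1920, 1080, intern_image(b"\x20" * 16, cache)),
    DrawCall("draw_text", 40, 20, 300, 24, intern_image(b"File Edit View", cache)),
]
# The model would then see a few dozen records plus a table of unique payloads,
# rather than megabytes of pixels per frame.
print(len(frame), "calls,", len(cache), "unique payloads")
```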
But it’s actually correct from a usability perspective
I mean, we're halfway there with aider and open-interpreter; just give it a couple of years.
>Ultimately, R1-Zero demonstrates the prototype of a potential scaling regime with zero human bottlenecks – even in the training data acquisition itself.
I would like this to be true, but doesn't the way they're doing RL also require tons of human data?
I think yes. But hopefully, in math at least, compute advances will let us lower the human data input by increasing how much of the gap is bridged by raw model capability versus search augmentation (either with tree search or full rollouts).
It's a bit deceptive that o3 conveniently had access to ARC-prize-specific training material while r1 probably didn't. [0]
[0] https://news.ycombinator.com/item?id=42763231
I think DeepSeek accidentally also killed Google for me, not just ChatGPT, because of the visible reasoning part.
From what I read elsewhere (a random Reddit comment), the visible reasoning is just "for show" and isn't the process DeepSeek used to arrive at the result. But if the reasoning has value, I guess it doesn't matter even if it's fake.
Bad Reddit comment, then; try pair programming with it. The reasoning usually comments on your request, extends it, figures out which solution is the best and most usable, backtracks if it finds issues implementing it, proposes a new solution, and verifies that it kinda makes sense.
The final result can look different from the reasoning for ordinary questions (i.e., summarised in the way a ChatGPT answer would look). But it is usually very consistent with the code part, so if, for example, it has to choose between two libraries, it will of course use the one from the reasoning part.
Can you provide a link to the comment?
R1's technical report (https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSee...) says the prompt used for training is "<think> reasoning process here </think> <answer> answer here </answer>. User: prompt. Assistant:". This prompt format strongly suggests that the text between the <think> tags becomes the "reasoning" and the text between the <answer> tags becomes the "answer" in the web app and API (https://api-docs.deepseek.com/guides/reasoning_model). I see no reason why DeepSeek would not do it this way, post-generation filtering aside.
Plus, if you read Table 3 of the R1 technical report, which contains an example of R1's chain of thought, its style (going back and re-evaluating the problem) resembles what I actually got in the CoT in the web app.
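If that's how it works, splitting a raw completion into reasoning and answer is just string parsing on those tags. A naive sketch in Python (the tag names come from the report; everything else is an assumption, and real deployments may post-process further):

```python
# Naive split of an R1-style completion into its <think> and <answer> parts.
import re

def split_reasoning(completion: str) -> tuple[str, str]:
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else completion.strip()
    return reasoning, final

reasoning, final = split_reasoning(
    "<think>Compare the two libraries first...</think> <answer>Use library A.</answer>"
)
print(reasoning)  # Compare the two libraries first...
print(final)      # Use library A.
```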
Fascinating. R1 really punches above its weight with respect to cost-per-token.
As the article alluded to at the end, my thoughts immediately go to using R1 as a data generator for complex problems, since we have many examples of successful distillation into smaller models on well-defined tasks.
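A hedged sketch of what that data-generation loop could look like, where `teacher_generate` and `verify` are placeholders rather than any real API, and only verifiable answers are kept:

```python
# Sketch: use a reasoning model as a teacher to build a distillation set.
# Sample (reasoning, answer) pairs, keep only rows that pass a cheap verifier
# (e.g. numeric check for math, unit tests for code), and write JSONL suitable
# for supervised fine-tuning of a smaller model. All names are placeholders.
import json

def teacher_generate(prompt: str) -> tuple[str, str]:
    """Placeholder: return (reasoning, answer) from the reasoning model."""
    raise NotImplementedError

def verify(prompt: str, answer: str) -> bool:
    """Placeholder: cheap correctness check for the final answer."""
    raise NotImplementedError

def build_distillation_set(prompts: list[str], out_path: str) -> int:
    kept = 0
    with open(out_path, "w") as f:
        for prompt in prompts:
            reasoning, answer = teacher_generate(prompt)
            if not verify(prompt, answer):
                continue  # drop unverified samples to keep the set clean
            row = {"prompt": prompt, "reasoning": reasoning, "answer": answer}
            f.write(json.dumps(row) + "\n")
            kept += 1
    return kept
```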
Just because I'm an incurable cynic: has anybody run Wireshark on it and checked that it actually does process entirely offline?
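Short of a full packet capture, a rough sanity check is to list any internet sockets held by the local inference process while it generates. A sketch using psutil, where the process name is just an assumed example (ollama) and loopback connections from your own client will still show up:

```python
# Rough offline-ness check: list internet sockets belonging to the local
# inference process. "ollama" is only an example name; substitute whatever
# serves the model locally. Listing other processes' sockets may require
# elevated privileges, and loopback connections from your own client are
# expected; what you care about is anything with a public remote address.
import psutil

TARGET = "ollama"  # assumption: name of the local inference server process

pids = {p.pid for p in psutil.process_iter(["name"])
        if TARGET in (p.info["name"] or "").lower()}

conns = [c for c in psutil.net_connections(kind="inet")
         if c.pid in pids and c.raddr]  # only sockets with a remote endpoint

for c in conns:
    print(c.pid, c.laddr, "->", c.raddr, c.status)

print(f"{len(conns)} connection(s) with a remote endpoint found")
```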
> The R1-Zero training process is capable of creating its own internal domain specific language (“DSL”) in token space via RL optimization.
Um, what’s that now? Really?
That is a slight exaggeration/extrapolation on the author's part. What happened is that RL training led to some emergent behavior in R1-Zero (chain-of-thought and reflection) without it being prompted or trained for explicitly. I don't see what is so domain-specific about that, though.
Yeah, if I understand correctly, the AI will create its own internal reasoning language through RL. In R1-Zero it was already a strange mix of languages. They corrected that for R1 to make the thinking useful for humans.
Funnily enough, they didn't exclude anything forbidden from the training dataset. It will gladly tell you about the Tiananmen Square massacre and whatnot.