> But now with reasoning systems and verifiers, we can create brand new legitimate data to train on. This can either be done offline where the developer pays to create the data or at inference time where the end user pays!
> This is a fascinating shift in economics and suggests there could be a runaway power concentrating moment for AI system developers who have the largest number of paying customers. Those customers are footing the bill to create new high quality data … which improves the model … which becomes better and more preferred by users … you get the idea.
While I think this is an interesting hypothesis, I'm skeptical. You might be lowering the cost of your training corpus by a few million dollars, but I highly doubt you are getting novel, high quality data.
We are currently in a world where the SOTA base model seems to be capped at around GPT-4o levels. I have no doubt that in 2-3 years our base models will compete with o1 or even o3... it just remains to be seen what innovations/optimizations get us there.
The most promising idea is to use reasoning models to generate data, and then train our non-reasoning models with the reasoning-embedded data. But... it remains to be seen how much of the chain-of-thought reasoning you can really capture in model weights. I'm guessing some, but I wonder if there is a cap to the multi-head attention architecture. If reasoning can be transferred from reasoning models to base models, OpenAI should have already trained a new model with o3 training data, right?
Another thought is maybe we don't need to improve our base models much. It's sufficient to have them be generalists, and to improve reasoning models (lowering price, improving quality) going forward.
Every time you respond to an AI model with "no, you got that wrong, do it this way," you provide a very valuable piece of data to train on. With reasoning tokens there is just a lot more of that data to train on now.
You don't need honest user feedback, because you can judge any message in a conversation with hindsight.
Just ask an LLM to judge whether a response was useful while also showing it the messages that came after it. The judge model has privileged information: maybe five messages later it turns out that what the LLM replied was not a good idea.
You can also use related conversations by the same user. The idea is to extend the context so you can judge better. Sometimes the user tests the LLM's ideas in the real world and comes back with feedback; that is real-world testing, something R1 can't do.
Tesla uses the same method to flag the seconds before a surprising event; it works because it has hindsight. It uses the environment to learn what was important.
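A rough sketch of what that hindsight-judging loop could look like. The prompt wording, the 0-1 scoring scale, and the `call_judge_llm` helper are all hypothetical stand-ins, not anyone's actual pipeline:

```python
# Sketch: score an assistant reply using the messages that came AFTER it
# (hindsight), rather than relying on explicit user feedback.
# `call_judge_llm` is a hypothetical stand-in for whatever LLM client you use.

def call_judge_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def hindsight_score(conversation: list[dict], reply_index: int, lookahead: int = 5) -> float:
    """Ask a judge model whether conversation[reply_index] held up,
    given the next `lookahead` messages as privileged information."""
    reply = conversation[reply_index]["content"]
    followup = conversation[reply_index + 1 : reply_index + 1 + lookahead]
    followup_text = "\n".join(f'{m["role"]}: {m["content"]}' for m in followup)

    prompt = (
        "An assistant gave this reply:\n"
        f"{reply}\n\n"
        "Here is what happened in the conversation afterwards:\n"
        f"{followup_text}\n\n"
        "With the benefit of hindsight, was the reply useful and correct? "
        "Answer with a single number from 0 (bad) to 1 (good)."
    )
    try:
        return float(call_judge_llm(prompt).strip())
    except ValueError:
        return 0.5  # unparseable judgement: treat as uninformative
```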
This assumes that the companies gathering the data don’t have silent ways of detecting bad actors and discarding their responses. If you’re trying to poison an AI, are you making all of your queries from the same IP? Via a VPN whose IP block is known? Are you using a tool to generate this bad data, one which might have word-frequency patterns detectable with something cheap like tf-idf?
There’s a lot of incentive to figure this out. And they have so much data coming in that they can likely afford to toss out some good data to ensure that they’re tossing out all of the bad.
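For what it's worth, the cheap kind of screen being gestured at here is only a few lines with scikit-learn; this is illustrative, not a claim about what any lab actually runs:

```python
# Sketch of a cheap word-frequency screen: vectorize feedback messages with
# TF-IDF and flag the ones that look nothing like the bulk of the corpus.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def flag_outliers(messages: list[str], threshold: float = 0.05) -> list[int]:
    """Return indices of messages whose TF-IDF vector is nearly orthogonal
    to the corpus centroid, i.e. candidates for closer inspection."""
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(messages)
    centroid = np.asarray(X.mean(axis=0))
    sims = cosine_similarity(X, centroid).ravel()
    return [i for i, s in enumerate(sims) if s < threshold]
```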
Not necessarily, not all tactics can be used symmetrically like that. Many of the sites they scrape feel the need to support search engine crawlers and RSS crawlers, but OpenAI feels no such need to grant automated anonymous access to ChatGPT users.
And at the end of the day, they can always look at the responses coming in and make decisions like “95% of users said these responses were wrong, 5% said they were right, let’s go with the 95%”. As long as the vast majority of their data is good (and it will be) they have a lot of statistical tools they can use to weed out the poison.
If you want to pick apart my hastily concocted examples, well, have fun I guess. My overall point is that ensuring data quality is something OpenAI is probably very good at. They likely have many clever techniques, some of which we could guess at, some of which would surprise us, all of which they’ve validated through extensive testing including with adversarial data.
If people want to keep playing pretend that their data poisoning efforts are causing real pain to OpenAI, they’re free to do so. I suppose it makes people feel good, and no one’s getting hurt here.
I'm interested in why you think OpenAI is probably very good at ensuring data quality. Also interested if you are trying to troll the resistance into revealing their working techniques.
What makes people think companies like OpenAI can't just pay experts for verified true data? Why do all these "gotcha" replies always revolve around the idea that everyone developing AI models is credulous and stupid?
It is absolutely fascinating to read the fantasy produced by people who (apparently) think they live in a sci-fi movie.
The companies whose datasets you're "poisoning" absolutely know about the attempts to poison data.
All the ideas I've seen linked on this site so far about how they're going to totally defeat the AI companies' models sound like a mixture of wishful thinking and narcissism.
Are you suggesting some kind of invulnerability? People iterate on their techniques; if big tech were so capable of avoiding poisoning/gaming attempts, there would be no decades-long tug-of-war between Google and black-hat SEO manipulators.
Also, I don't get the narcissism part. Would it be petty to poison a website only when it's viewed by a spider? Yes, but I would also be that petty if some big company didn't respect the boundaries I set with my robots.txt on my 1-viewer cat photo blog.
It's not complete invulnerability. It is merely accepting that these methods might increase costs a little, but they don't cause the whole thing to explode.
The idea that a couple of bad-faith actions can destroy a 100-billion-dollar company is the extraordinary claim that requires extraordinary evidence.
Sure, bad actors can do a little damage. Just like bad actors can do DDoS attempts against Google. And that will cause a little damage. But mostly Google wins. Same thing applies to these AI companies.
> Also I don't get the narcissism part
The narcissism is the idea that your tiny website is going to destroy a 100 billion dollar company. It won't. They'll figure it out.
The grandparent mentioned "we"; I guess they're referring to a whole class of "black hats" resisting bad-faith scraping, who could eventually amass a relatively effective volume of poisoned sites and/or feedback to the model.
Obviously a singular poisoned site will never make a difference in a dataset of billions and billions of tokens, much less destroy a 100bn company. That's a straw man, and I think people arguing about poisoning acknowledge that perfectly. But I'd argue they can eventually manage to at least do some little damage mostly for the lulz, while avoiding scraping.
Google is full of SEO manipulators, and even when they recognize the problem and try to fix it, searching today is a mess because of that. The main difference, and the challenge in poisoning LLMs, would be coordination between different actors, as there is no direct aligning incentive to poison except (arguably) globally justified pettiness, unlike black-hat SEO players who have the incentive to be the first result for a certain query.
As LLMs become commonplace eventually new incentives may appear (i.e. an LLM showing a brand before others), and then, it could become a much bigger problem akin to Google's.
tl;dr: I wouldn't be so dismissive of what adversaries can manage to do with enough motivation.
There are ways to analyze whether your contributions make sense in the context of the conversation; reasoning models detect nonsense pretty quickly. To attack, you would actually have to use another AI to generate something that isn't totally random, and even then it could be detected.
I would assume that to use this data they have to filter it heavily and correlate it across many users.
You can detect whether a user is genuine and trust their other chats "a bit more".
Probably it's something like "give feedback that's on average slightly more correct than incorrect," though you'd get more signal from perfect feedback.
That said, I suspect the signal is very weak even today and probably not too useful except for learning about human stylistic preferences.
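A toy illustration of why "slightly more correct than incorrect" can still be enough in aggregate: with per-label accuracy just above chance, a majority vote over enough independent labels is almost always right. The numbers here are purely illustrative.

```python
# Toy simulation: feedback that is right only 55% of the time still yields
# a reliable majority label once you aggregate enough independent votes.
import random

def majority_is_correct(p_correct: float, n_votes: int) -> bool:
    votes = sum(random.random() < p_correct for _ in range(n_votes))
    return votes > n_votes / 2

def estimate(p_correct: float, n_votes: int, trials: int = 10_000) -> float:
    return sum(majority_is_correct(p_correct, n_votes) for _ in range(trials)) / trials

if __name__ == "__main__":
    for n in (1, 11, 101, 1001):
        print(n, round(estimate(0.55, n), 3))
    # 1 vote is right ~55% of the time; 1001 votes, roughly 99.9%.
```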
AI models don't assume anything. AI models are just statistical tools. Their data is prepared by humans, who aren't morons. What is it with these super-ignorant AI critiques popping up everywhere?
There’s so much data required for training that it’d be surprising if humans looked at even a small subset of it at all. They need statistical tools of a different kind to clean it up. That’s naturally where attacks will be concentrated, and this is why synthetic data will overtake real human data, right after "there isn’t enough data even if it’s too much already".
I'm not in the space either but I think the answer is an emphatic yes. Three categories come to mind:
1. Online trolls and pranksters (who already taught several different AIs to be racist in a matter of hours - just for the LOLs).
2. Nation states like China who already require models to conform to state narratives.
3. More broadly, when training on "the internet" as a whole there is a huge amount of wrong, confused information mixed in.
There's also a meta-point to make here. On a lot of culture war topics, one person's "poisonous information" is another person's "reasonable conclusion."
I'm looking forward to protoscience/unconventional science, and perhaps even to what is worthy of the fringe or pseudoscience labels. The debunking there usually fails to address the topic, as it is incredibly hard to spend even a single day reading about something you "know" to be nonsense. Who has time for that?
If you take a hundred thousand such topics, the odds that they should all be dismissed without looking aren't very good.
Apparently, you haven't been on that Internet thingie in the last five years or so... :-)
But I do agree with your point. What's interesting is the increasing number of people who act like there's some clearly objective and knowable truth about a much larger percentage of topics than there actually is. Outside of mathematics, logic, physics and other hard sciences, the range of topics on which informed, reasonable people can disagree, at least on certain significant aspects, is vast.
That's why even the concept of having some army of "Fact Checkers" always struck me as bizarre and doomed at best, and at worst, a transparent attempt to censor and control public discourse. That more people didn't see even the idea of it as being obviously brittle is concerning.
Bad or not, depends on your POV. But certainly there are efforts to feed junk to AI web scrapers, including specialized tools: https://zadzmo.org/code/nepenthes/
And they are hilarious, because they ride on the assumption that multi-billion dollar companies are all just employing naive imbeciles who just push buttons and watch the lights on the server racks go, never checking the datasets.
I deliberately pick wrong answers in reCAPTCHA sometimes. I’ve found out that the audio version accepts basically any string slightly resembling the audio, so that’s the easiest way. (Images on the other hand punish you pretty hard at times – even if you solve it correctly!)
> Aaron clearly warns users that Nepenthes is aggressive malware. It's not to be deployed by site owners uncomfortable with trapping AI crawlers and sending them down an "infinite maze" of static files with no exit links, where they "get stuck" and "thrash around" for months, he tells users.
Because a website with lots of links is executable code. And the scrapers totally don't have any checks in them to see if they spent too much time on a single domain. And no data verification ever occurs.
Hell, why not go all the way? Just put a big warning telling everyone: "Warning, this is a cyber-nuclear weapon! Do not deploy unless you're a super rad bad dude who totally traps the evil AI robot and wins the day!"
not being snarky, but what is the point of using the model if you already know enough to correct it into giving the right answer?
an example that just occurred to me - if you asked it to generate an image of a mushroom that is safe to eat in your area, how would you tell it it was wrong? "oh, they never got back to me, I'll generate this image for others as well!"
A common use of these models is asking for code, and maybe you don't know the answer or would take a while to figure it out. For example, here's some html, make it blue and centered. You could give the model feedback on if its answer worked or not, without knowing the correct answer yourself ahead of time.
You constantly have to correct an AI when using it, because it either didn't get the question right or you're guiding it towards a narrower answer. There is only more to learn.
I feel like conventional image search would be more reliable to get a good picture of a mushroom variety that you know about. Ideally going out into the woods to get one I suppose.
If I say "no, you hallucinated basically the entire content of the response", then maybe a newer training set derived from that could train on the specific fact that that specific hallucinated response is hallucinated. This seems to be of dubious value in a training set.
> No, you're wrong, today's date is actually Wednesday the 29th.
>> My mistake. Yes, today's date is Wednesday, January 29th, 2025.
Three months later in April when this tagged data is used to train the next iteration, the AI can successfully learn that today's date is actually January 29th.
But that's exactly what you get when you ask questions that require shifting, specific contextual knowledge. The model weights, by their nature, cannot encode that information.
At best, you can only try to layer in contextual info like this as metadata during inference, akin to how other prompting layers exist.
Even then, what up-to-date information should present for every round-trip is a matter of opinion and use-case.
> Three months later in April when this tagged data is used to train the next iteration, the AI can successfully learn that today's date is actually January 29th.
Such an ingenious attack, surely none of these companies ever considered it.
the date is in the "system prompt", so the cron job that updates the prompts to the current date may be in a different time zone than you.
They're not actually processing the entire system prompt (which is rather long) on every query, but continuing from a model state saved after processing the system prompt once.
That makes it a bit harder, but still, spitting out the wrong date just seems like a plain old time-zone bug.
ChatGPT came out and its interface was a chatbox and a thumbs up / thumbs down icon (or whichever) to rate the responses; surely that created a feedback loop of learning, like all machine learning has done for years now?
That is still inference: it is using a model produced by the RL process. The RL process is what used the cost function to further update the model, and you can think of RL as a revision, but it still happens offline. Any online/continual learning would have to be performed by a different algorithm than a classical LLM or RL setup; online/continual learning is still a very difficult problem in ML.
Our collective Netflix thumbs-up indicators gave investors and Netflix the confidence to deploy a series of Adam Sandler movies that cost 60 to 80 million US dollars to "make". So depending on who you are, the system might be working great.
Through analytics Netflix should know exactly when people stop watching a series, or even when in a movie they exit out. They no doubt know this by user.
They know exactly what makes you stay, and what makes you leave.
I would not be surprised if, in the near future, movies and series are modified on the fly to ensure users stay glued to their screens.
In the distant future this might be done on a per user level.
> I wonder if there is a cap to multi head attention architecture
I don't think there is a cap other than having good data. The model learns all the languages in the world; it has the capacity. A simple model like AlphaZero beats humans at board games. As long as you have data, the model is not the obstacle. AlphaProof, an LLM-based system, reached silver-medal level at the IMO.
You're not getting new high-quality textual data for pre-training from your chat service. But you are potentially getting a lot of RL feedback on ambiguous problems.
I think we will have to move with pre-training and post-training efforts in parallel. What DeepSeek showed is that you first need to have a strong enough pretrained model. For that, we have to continue the acquisition of high quality, multilingual datasets. Then, when we have a stronger pretrained model, we can apply pure RL to get a reasoning model that we use only to generate synthetic reasoning data. We then use those synthetic reasoning data to fine-tune the original pretrained model and make it even stronger. https://transitions.substack.com/p/the-laymans-introduction-...
> I highly doubt you are getting novel, high quality data.
Why wouldn't you? Presumably the end user would try their use case on the existing model, and if it performs well, wouldn't bother with the expense of setting up an RL environment specific to their task.
If it doesn't perform well, they do bother, and they have all the incentive in the world to get the verifier right -- which is not an extraordinarily sophisticated task if you're only using rules-based outcome rewards (as R1 and R1-Zero do)
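For concreteness, a rules-based outcome reward can be tiny. The sketch below is loosely in the spirit of what the R1 paper describes (a format reward plus an accuracy reward), but the tag names and reward weights here are illustrative assumptions, not DeepSeek's actual code:

```python
# Minimal sketch of a rules-based outcome reward for a math-style task:
# a small format reward for using the expected tags, plus an accuracy
# reward for an exact match against the reference answer.
import re

def outcome_reward(completion: str, reference_answer: str) -> float:
    reward = 0.0
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if "<think>" in completion and m is not None:
        reward += 0.1  # format reward: the model used the expected structure
    if m is not None and m.group(1).strip() == reference_answer.strip():
        reward += 1.0  # accuracy reward: verifiably correct final answer
    return reward

# e.g. outcome_reward("<think>3*4=12, minus 5 is 7</think><answer>7</answer>", "7") -> 1.1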
>You might be lowering the cost of your training corpus by a few million dollars, but I highly doubt you are getting novel, high quality data.
The large foundational models don't really need more empirical data about the world. ChatGPT already 'knows' way more than I do, probably by many orders of magnitude. Yet it's still spewing nonsense at me regularly because it doesn't know how to think like a human or interact with me in a human-like way. To that end, the ability for a company like OpenAI to collect novel data from interacting with real humans is a material advantage over their competition.
> the ability for a company like OpenAI to collect novel data from interacting with real humans is a material advantage over their competition
It's a different kind of data from the R1 reasoning chains. When LLMs have a human in the loop, the human provides help based on their personal experience and real-world validation. Sometimes users take an idea from the LLM and try it in real life, then come back later and discuss the outcomes. This is a real-world testing loop.
In order to judge if an AI response was useful, you can look at the following messages with a judge LLM. Using hindsight helps a lot here. Maybe it doesn't pan out and the user tries another approach, or maybe some innocuous idea was key to success later. It's hard to tell in the moment, but easy when you see what followed after that.
This scales well: OpenAI has 300M users, which I estimate at up to a trillion interactive tokens per day. The user base is very diverse, the problems are diverse, and feedback comes from user experience and actual testing. They form an experience flywheel: the more problem solving they do, the smarter it gets, attracting more users.
It doesn't need much. One good, lucky answer in a thousand or maybe ten thousand queries gives you the little exponential kick you need to improve. This is what the hockey-stick takeoff looks like, and we're already there: OpenAI has it, and now DeepSeek has it too. You can be sure others also have it, Anthropic at the very least; they just never announced it officially, but go read what their CEO has been speaking and writing about.
It seems to work and seems very scalable. "Reasoning" helps to counter biases: answers become longer, i.e. the system uses more tokens, which means more time to answer a question; longer answers likely allow better differentiation of answers from each other in the "answer space".
"The o3 system demonstrates the first practical, general implementation of a computer adapting to novel unseen problems"
Yet, they said when it was announced:
"OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data."
These two statements are completely opposed. I can't take seriously anything this article says about o3.
No, they aren't. Every ARC problem is novel; that's why it resisted deep learning for so long (and still does, to a degree).
We just don't know how much the model seeing what an ARC problem even is in the first place boosts its ability to solve them; that limited statement is all the author is making.
They were talking about training on the public dataset -- OpenAI tuned the o3 model with 75% of the public dataset. There was some idea/hope that these LLMs would be able to gain enough knowledge in the latent space that they would automatically do well on the ARC-AGI problems. But using 75% of the public training set for tuning puts them at the about same challenge level as all other competitors (who use 100% of training).
In the post they were saying they didn't have a chance to test the o3 model's performance on ARC-AGI "out of-the-box", which is how the 14% scoring R1-zero was tested (no SFT, no search). They have been testing the LLMs out of the box like this to see if they are "smart" wrt the problem set by default.
The claim is that this removes the human bottleneck (aka SFT or supervised fine tuning) on domains with a verifiable reward. Critically, this verifiable reward is extremely hard to pin down in nearly all domains besides mathematics and computer science.
It's also extremely hard to nail down in much of mathematics or computer science!
- is such-and-such theorem deep or shallow?
- is this definition/axiom useful? (there's a big difference between doing compass-straightedge proofs vs. wondering about the parallel postulate)
- more generally, discovering theorems is generally not amenable to verifiable rewards, except in domains where simpler deterministic tools exist (in which case LLMs can likely help reduce the amount of brute forcing)
- is this a good mathematical / software model of a given real-world system?
- is the flexibility of dynamic/gradual typing worth the risk of type errors? is static typing more or less confusing for developers?
- what features should be part of a programming language's syntax? should we opt for lean-and-extensible or batteries-included?
- are we prematurely optimizing this function?
- will this program's memory needs play nicely with Rust's memory model? What architectural decisions do we need to make now to avoid headaches 6 months down the line?
It's not clear to me that theorem discovery is not amenable to verifiable rewards. I think most important theorems can probably be recovered automatically by asking AI systems to prove increasingly complicated human conjectures. Along the way I expect emergent behaviors of creating conjectures and recognizing important self-breakthroughs, much like the emergence of regret.
Theorem discovery is amenable to verifiable rewards. But is meaningful theorem discovery too? Is the ability to discern between meaningful theorems and bad ones an emergent behaviour? You can check for yourself examples of automatic proofs, and the huge number of intermediate theorems they can generate which are not very meaningful.
IMHO, there are strategies that could extend this approach to many other domains.
I was discussing this idea (along with a small prototype) with a prominent symbolic AI researcher who also agrees, and thinks that with the emergence of RL as a viable training method for LLMs, it might be possible to pursue neuro-symbolic learning at a large scale.
Current systems are impressive, but reasoning is too fragile to trust them. They fall into obvious logical and statistical fallacies that are evident to a layperson.
In the case of DeepSeek-R1, they used a series of heuristic reward functions that were built for different data types. The paper mentions the use of sandboxed environments to execute generated code against a suite of tests, for example, to evaluate it for correctness. The reward functions also evaluated syntax and formatting.
In general, the use of externally verifiable sources of truth (like simulators) is referred to as "grounding" and there has been quite a bit of research around it over the years, if you're interested in digging deeper. I've always found it super compelling as a research direction.
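A stripped-down sketch of the "run the generated code against tests" idea, for illustration only: a real pipeline would use an actual sandbox (containers, seccomp, resource and network limits), whereas this just runs a subprocess with a timeout.

```python
# Sketch of grounding a reward in execution: run generated code together
# with a test snippet in a subprocess and reward it only if the tests pass.
import subprocess, sys, tempfile

def execution_reward(generated_code: str, test_code: str, timeout_s: int = 5) -> float:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# e.g. execution_reward("def add(a, b):\n    return a + b",
#                       "assert add(2, 3) == 5") -> 1.0
```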
I think it just means that you can objectively score an answer as being correct or not. (e.g. if the generated program passes some tests; a discovered proof is valid, etc).
The other replies have said what was meant, but I don’t think they’ve explicitly addressed whether or not that is the sense used in the idea of NP.
I would say… it is at least somewhat similar.
A problem in NP might be of the form “For this value of X, does there exist a Y such that q(X,Y)?” for some predicate q and value X, and where when the answer is “yes”, the answer of “yes” can be verified by being given a value Y, and evaluating q(X,Y).
(Specifically in the case of 3SAT, X would be a 3CNF formula, Y would be an assignment of values to the variables in the formula, and q(X,Y) would be “the formula X when evaluated with variable assignments Y, results in 'true’.”.)
This is sort of like the task of “Given requirements X that can be checked automatically, produce code Y which satisfies those requirements”, except that in this case the question is specifically asking for Y, not just asking whether such a Y exists, but.. well, often in practice when one wants a solution to a problem in NP, one actually wants the witness, not just whether there exists such a Y, right?
So, I would say there is a substantial similarity, but also a difference.
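To make the analogy concrete, here is the verification step for 3SAT in a few lines: given a formula X and a candidate assignment Y, evaluating q(X, Y) is cheap even though finding a satisfying Y may not be.

```python
# Verifying a 3SAT witness: the formula is a list of clauses, each clause a
# list of literals (positive int = variable, negative int = negated variable).
# Checking q(X, Y) is linear in the size of the formula, even though
# *finding* a satisfying Y may be hard.

def satisfies(formula: list[list[int]], assignment: dict[int, bool]) -> bool:
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )

# (x1 or x2 or not x3) and (not x1 or x2 or x3)
formula = [[1, 2, -3], [-1, 2, 3]]
assert satisfies(formula, {1: True, 2: False, 3: True})
```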
For some reasoning data (e.g. you talking out loud as you figure something out, mistakes and all) to be useful for RL training, the conclusion to your reasoning needs to be correct/verified, else that's not the kind of reasoning you want to learn!
Some types of reasoning output, such as solving a math problem or writing a computer program can be automatically verified (e.g. respectively by a symbolic solver, or by compiling and running the program), but in the general case it's hard for a computer to verify whether a chain of reasoning is correct and arrived at a valid answer or not, although LLM-as-judge should work some of the time.
There's a big difference. Membership of these complexity classes is determined in the worst case, so if there is no polynomial-time solution in the worst case, the problem is not in P.
For this problem we don't care if it's possible that sometimes there are things that aren't verifiable, or the answers aren't exact, we just need training signal.
As in there's an objective truth that can be determined by a computer. E.g. whether code compiles, whether a unit test passes, whether the answer given to a mathematical question like 3+5 is correct. Many other fields have no objective truth (like art or creative writing), or objective truth requires measurement of the physical world (although if the world can be simulated accurately enough for the problem class at hand, then sufficient training data can still be generated by a computer).
Not if the problem as written is "does this code compile", which is still a useful stepping stone for some workflows. Yours is certainly a more useful query in most cases but repositioning or re-scoping the original question can still lead to a net win.
It's not a sufficient criterion by itself, but where no better criterion is possible it would still produce better results in reinforcement learning than if the model had no reward for producing correctly compiling code versus code that fails to compile.
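A minimal instance of that "does it compile" signal, using Python's built-in compile() as the checker; it is obviously weak on its own, which is the point being made.

```python
# "Does it compile" as a weak but verifiable reward signal (Python version).
# Not sufficient on its own, but strictly better than no signal when
# nothing stronger is available.
def compiles_reward(source: str) -> float:
    try:
        compile(source, "<generated>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0

assert compiles_reward("def f(x): return x + 1") == 1.0
assert compiles_reward("def f(x) return x + 1") == 0.0
```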
I disagree. It only removes the bottleneck for collecting math and code reasoning chains, not in general. The general case requires physical testing, not just calculations; otherwise scientists would not need experimental labs. Discovery comes from searching the real world; it's where interesting things happen. The best interface between AI and the world is still humans; the code and math domains are just lucky to work without real-world interaction.
The idea that a lot of compute is moving towards inference has huge consequences for the current "AI investments". This is particularly bad news for NVDA: inference-focused solutions have better economics than paying NVDA those huge margins (e.g. Groq).
Nvidia can actually charge larger margins if inference compute goes down. It would enable them to manufacture more units of smaller GPUs using inferior and cheaper silicon, all of which would increase the profits per unit sold as well as the number of units they can manufacture.
The industry has to find a way to separate itself from Nvidia's GPGPU technology if they want to stop being gouged. The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat.
>The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat
Google has, and they've built a much more cost-efficient (for them) system: the TPU. They even rent them out, and in terms of cost per unit of compute, TPUs are significantly cheaper than renting GPUs from the big cloud providers. Amazon's also tried to do something similar with Trainium chips; however, their usefulness is more limited due to software issues (Amazon is much weaker at compiler development than Google, so Trainium software is quite slow and buggy).
People talk about Groq and Cerebras as competitors, but it seems to me their manufacturing process makes the availability of those chips extremely limited. You can call up Nvidia and order $10B worth of GPUs and have them delivered the next week. You can't say the same for these specialty competitors.
> You can call up Nvidia and order $10B worth of GPUs and have them delivered the next week
Nvidia sold $14.5 billion of datacenter hardware in the third quarter of their fiscal 2024, and that led to severe supply constraints, with estimated lead times for H100s of up to 52 weeks in some places. So no, you can't, as that $14.5 billion was clearly capped by their ability to supply, not by demand.
You're right, though, that Groq etc. can't deliver anywhere near the same volume now, but there's little reason to believe that will continue. There's no need for full GPUs for inference-only workloads, so competitors can enter the space with a tiny proportion of the functionality.
Their architecture means you buy them by the rack. Individual chips are useless, the magic happens when you set them up so each chip handles a subset of the model.
IOW, do you think Groq's 70B models run on 230MB of SRAM?
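Rough back-of-the-envelope behind that rhetorical question, assuming roughly 230 MB of SRAM per LPU chip and 8-bit weights; the exact figures vary by model and quantization.

```python
# Back-of-the-envelope: a 70B-parameter model does not fit in one chip's
# SRAM, so Groq shards it across a rack. Assumes ~230 MB SRAM per chip and
# 1 byte per weight (8-bit); real deployments differ in quantization and
# also need room for activations.
params = 70e9
bytes_per_param = 1          # 8-bit weights (assumption)
sram_per_chip = 230e6        # ~230 MB per chip

chips_needed = params * bytes_per_param / sram_per_chip
print(f"~{chips_needed:.0f} chips just to hold the weights")  # ~304
```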
No idea about Groq, but Cerebras might give you a similar timeline to Nvidia. Each of their wafers is worth 50x-100x H100s, so they need to make fewer of them, in absolute units.
But cooling, power, etc.: Nvidia might have an advantage, as their ecosystem is huge and more "liquid" in a sense.
Nvidia (NVDA) generates revenue with hardware, but digs moats with software.
The CUDA moat is widely unappreciated and misunderstood. Dethroning Nvidia demands more than SOTA hardware.
OpenAI, Meta, Google, AWS, AMD, and others have long failed to eliminate the Nvidia tax.
Without diving into the gory details, the simple proof is that billions were spent on inference last year by some of the most sophisticated technology companies in the world.
They had the talent and the incentive to migrate, but didn't.
In particular, OpenAI spent $4 billion, 33% more than on training, yet still ran on NVDA. Google owns leading chips and leading models, and could offer the tech talent to facilitate migrations, yet still cannot cross the CUDA moat and convince many inference customers to switch.
People are desperate to quit their NVDA-tine addiction, but they can't for now.
[Edited to include Google, even though Google owns the chips and the models; h/t @onlyrealcuzzo]
The CUDA moat is largely irrelevant for inference. The code needed for inference is small enough that there are e.g. bare-metal CPU only implementations. That isn't what's limiting people from moving fully off Nvidia for inference. And you'll note almost "everyone" in this game are in the process of developing their own chips.
Google was omitted because they own the hardware and the models, but in retrospect, they represent a proof point nearly as compelling as OpenAI. Thanks for the comment.
Google has leading models operating on leading hardware, backed by sophisticated tech talent who could facilitate migrations, yet Google still cannot leap over the CUDA moat and capture meaningful inference market share.
Yes, training plays a crucial role. This is where companies get shoehorned into the CUDA ecosystem, but if CUDA were not so intertwined with performance and reliability, customers could theoretically switch after training.
Both matter quite a bit. The first-mover advantage obviously rewards OEMs in a first-come, first-serve order, but CUDA itself isn't some light switch that OEMs can flick and get working overnight. Everyone would do it if it was easy, and even Google is struggling to find buy-in for their TPU pods and frameworks.
Short-term value has been dependent on how well Nvidia has responded to burgeoning demands. Long-term value is going to be predicated on the number of Nvidia alternatives that exist, and right now the number is still zero.
My company recently switched from A100s to MI300s. I can confidently say that in my line of work, there is no CUDA moat. Onboarding took about a month, but afterwards everything was fine.
Alternatives exist, especially for mature and simple models. The point isn't that Nvidia has 100% market share, but rather that they command the most lucrative segment and none of these big spenders have found a way to quit their Nvidia addiction, despite concerted efforts to do so.
For instance, we experimented with AWS Inferentia briefly, but the value prop wasn't sufficient even for ~2022 computer vision models.
The calculus is even worse for SOTA LLMs.
The more you need to eke out performance gains and ship quickly, the more you depend on CUDA and the deeper the moat becomes.
It's unclear why this drew downvotes, but to reiterate, the comment merely highlights historical facts about the CUDA moat and deliberately refrains from assertions about NVDA's long-term prospects or that the CUDA moat is unbreachable.
With mature models and minimal CUDA dependencies, migration can be justified, but this does not describe most of the LLM inference market today nor in the past.
You can do inference on almost any hardware; I do not see any edge for NVIDIA here.
I can download a ~30B DeepSeek model and run inference at good speed on AMD GPUs and even on CPU. Apple silicon works fine too. I get >50 tokens/s on £300 AMD GPUs.
The main bottleneck appears to be memory, not processing power.
1. The future of inference for ChatGPT-style direct consumer usage is on-device. Cloud-based inference is too gaping of a privacy hole in a world where some level of E2EE is rapidly becoming the default expectation for chat. It's not hard to imagine that the iPhone 50 may be able to comfortably run models that firmly surpass GPT-4o and o1. Similarly, for things like coding and any other creation of novel IP, there are obvious security benefits to keeping the inference local.
2. Going forward, the vast majority of inference will be performed by agents for process automation (both personal and business), rather than direct user interaction. For these use cases, centralized infrastructure will be the natural architecture. Even for cases where an end client device technically exists (e.g. Tesla-Optimus-style machines), there may be economy of scale advantages to offloading compute to the cloud.
In fact, I'm not sure how the "we will need tons of centralized inference infrastructure" argument works when Apple with +50% smartphone market share in the USA has a totally opposite strategy focused on privacy: on-device inference.
Fundamentally it is more efficient to process a batch of tokens from multiple users/requests than processing them from a single user's request on device.
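A toy roofline-style calculation of why that is: at small batch sizes decoding is memory-bandwidth bound, so one pass over the weights can serve many users' tokens almost for free. The parameter count, bit-width, and bandwidth figure below are illustrative assumptions.

```python
# Toy view of why batched serving beats on-device batch=1: each decode step
# must stream the (active) weights from memory, and that cost is shared
# across every request in the batch. Numbers are illustrative.
active_weight_bytes = 37e9 * 1      # e.g. ~37B active params at 8-bit (assumption)
mem_bandwidth = 3.35e12             # ~3.35 TB/s, HBM-class accelerator (assumption)

step_time_s = active_weight_bytes / mem_bandwidth   # time to read the weights once
for batch in (1, 8, 64):
    tokens_per_s = batch / step_time_s
    print(f"batch {batch:>2}: ~{tokens_per_s:,.0f} tokens/s (weight-read bound)")
```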
Apple's strategy already failed. Their big bet on NPU hardware did not pay off at all, and right now it's effectively wasted silicon on every iDevice while the GPU does all the heavy inference work. Now they partner with OpenAI to handle their inference (and even that's not good enough in many cases[0]). The "centralized compute" lobby is being paid by Apple to do the work their devices cannot.
Until Apple or AMD unifies their GPU architectures and implements complex streaming multiprocessors, Nvidia will remain in a class of their own. Apple used to lead the charge on the foremost CUDA alternative too, but then they abandoned it to focus on proprietary standards instead. It's pretty easy to argue that Apple shot themselves in the foot with every opportunity they had to compete on good faith. And make no mistake: Apple could have competed with Nvidia if they weren't so stubborn about Linux support and putting smartphone GPUs in laptops and desktops.
What's interesting is that you can already see the "AI race" dynamics in play -- OpenAI must be under immense market pressure to push o3 out to the public to reclaim "king of the hill" status.
I suppose they're under some pressure to release o3-mini, since R1 is roughly a peer for that, but R1 itself is still quite rough. The o1 series has seen significantly more QA time to smooth out the rough edges and the idiosyncrasies of what a "production" model should be optimized for, versus just being a top scorer on benchmarks.
We'll likely only see o3 once there is a true polished peer for it. It's a race, and companies are keeping their best models close to their chest, as they're used internally to train smaller models.
e.g., Claude 3.5 Opus has been around for quite a while, but it's unreleased. Instead, it was just used to refine Claude Sonnet 3.5 into Claude Sonnet 3.6 (3.6 is for lack of a better name, since it's still called 3.5).
We also might see a new GPT-4o refresh trained up using o3 via DeepSeek's distillation technique and other tricks.
There are a lot of new directions to go in now for OpenAI, but unfortunately, we won't likely see them until their API dominance comes under threat.
When I saw these numbers back in the initial o3-ARC post, I immediately converted them into "$ per ARC-AGI-1 %" and concluded we may be at a point where each additional increment of "real human-like novel reasoning" gets exponentially more costly in compute.
If Mike Knoop is correct, maybe R1 is pointing the way toward more efficient approaches. That would certainly be a good thing. This whole DeepSeek release and the reactions to it have shown that by limiting the export of high-end GPUs to China, the US incentivized China to figure out how to make low-end GPUs work really well. The more subtle meta-lesson here is that the massive flood of investment capital being shoved toward leading-edge AI companies has fostered a drag-race mentality which prioritized winning top-line performance far above efficiency, costs, etc.
$3.4K is about what you might pay a magic circle lawyer for an opinion on a matter. Not saying o3 is an efficient use of resources, just saying that it’s not outlandish that a sufficiently good AI could be worth that kind of money.
You pay that price to a law firm to get good service and to get a "guarantee" of correctness. You get neither from an LLM. Not saying it is not worth anything, but you can't compare it to a top law firm.
1. That two distant topics or ideas are actually much more closely related. The creative sees one example of an idea and applies it to a discipline that nobody expects. In theory, reduction of the maximally distant can probably be measured with a tangible metric.
2. Discovery of ideas that are even more maximally distant. Pushing the edge, and this can be done by pure search and randomness actually. But it's no good if it's garbage. The trick is, what is garbage? That is very context dependent.
(Also, a creative might be measured on the efficiency of these metrics rather than absolute output)
Awesome - we'd love to have our CEO/CTO chat with you and your team if you're interested. Shoot me a note at mike.bilodeau @ baseten.co and I'll make it happen!
Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size so you have to think about running the models as a whole.
There are two ways we can run it:
- 8xH200 GPU == 8x141GB == 1128 GB VRAM
- 16xH100 GPU == 16x80GB == 1280 GB VRAM
Within a single node (up to 8 GPUs) you don't see any meaningful hit from GPU-to-GPU communication.
More than that (e.g. 16xH100) requires multi-node inference which very few places have solved at a production-ready level, but it's massive because there are way more H100s out there than H200s.
> Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size
In their V3 paper DeepSeek talk about having redundant copies of some "experts" when deploying with expert parallelism in order to account for the different amounts of load they get. I imagine it only makes a difference at very high loads, but I thought it was a pretty interesting technique.
Earlier today I read a reddit comment[1] from someone who tried running the quantized version from unsloth[2] on 4xH100, and the results were underwhelming (it ended up costing $137 per 1 million tokens).
They're using Llama.cpp which is an amazing tool for local inference but doesn't match fast inference frameworks like TensorRT-LLM/SGLang for production speeds and throughputs on Hopper GPUs.
The Unsloth quantizations are really cool, but if you want to experiment with the R1 models in a smaller form factor the R1 Distills like Llama 70B are great and should run a lot faster as they take advantage of existing optimizations around inferencing llama-architecture models.
> They're using Llama.cpp which is an amazing tool for local inference but doesn't match fast inference frameworks like TensorRT-LLM/SGLang for production speeds and throughputs on Hopper GPUs.
That's something I thought about, but it wouldn't explain much, as they are roughly two orders of magnitude off in terms of cost, and only a small fraction of that could be explained by the performance of the inference engine.
> The Unsloth quantizations are really cool, but if you want to experiment with the R1 models in a smaller form factor the R1 Distills like Llama 70B are great and should run a lot faster as they take advantage of existing optimizations around inferencing llama-architecture models.
What kind of optimization do you have in mind? DeepSeek having only 37B active parameters, which means ~12GB at this level of quantization, suggests inference ought to be much faster than for a dense 70B model, especially an unquantized one, no? The Llama 70B distill would benefit from speculative decoding, but that shouldn't be enough to compensate. So I'm really curious what kind of llama-specific optimizations you have in mind, and how much speedup you think they'd bring.
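Rough arithmetic behind the ~12GB figure and the comparison, assuming a ~2.6-bit dynamic quant for the active experts and fp16 or 4-bit for the dense model; the bit-widths are assumptions, not measured numbers.

```python
# Per decode step you stream the *active* weights, so fewer active
# parameters (and aggressive quantization) means less memory traffic
# per token. Bit-widths here are assumptions for illustration.
def gigabytes(params: float, bits: float) -> float:
    return params * bits / 8 / 1e9

print(gigabytes(37e9, 2.6))   # ~37B active params, ~2.6-bit dynamic quant -> ~12 GB
print(gigabytes(70e9, 16))    # dense 70B distill, unquantized fp16        -> ~140 GB
print(gigabytes(70e9, 4))     # dense 70B distill, 4-bit quant             -> ~35 GB
```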
I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
As the comments on reddit said, those numbers don’t make sense.
> I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
That was my first though as well, but from a quick search it looks like Llama.cpp has a default batch size that's quite high (like 256 or 512 I don't remember exactly, which I find surprising for something that's mostly used by local users) so it shouldn't be the issue.
> As the comments on reddit said, those numbers don’t make sense.
Sure, but that default batch size would only matter if the person in question was actually generating and measuring parallel requests, not just measuring the straight line performance of sequential requests... and I have no confidence they were.
> There are two major shifts happening in AI, economically speaking:
> 1. You can now spend more $ to get higher accuracy and reliability
> 2. Training $ is moving to inference $
> Both are going to drive a massive amount of demand for inference and neither will curtail the demand for more compute. In fact, they will increase the demand for compute.
Nvidia has much less of a moat on the inference side of things. Of course they still dominate the market right now for inference (in datacenters), but it's much easier for companies to move onto AMD or other solutions like Groq or whatever compared to trying to use non-Nvidia for training.
Overall good post, but feels like he has an axe to grind with LLMs to the point it is misleading:
> Last week, DeepSeek published their new R1-Zero and R1 “reasoner” systems that is competitive with OpenAI’s o1 system on ARC-AGI-1. R1-Zero, R1, and o1 (low compute) all score around 15-20% – in contrast to GPT-4o’s 5%, the pinnacle of years of pure LLM scaling
R1-Zero gets 14% on the private set, which is the exact same score June Sonnet got; Sonnet, not 4o, is the pinnacle of pure LLM scaling.
I predict that the future of LLM's when it comes to coding and software creation is in "custom individually tailored apps".
Imagine telling an AI agent what app you want, the requirements and all that and it just builds everything needed from backend to frontend, asks for your input on how things should work, clarifying questions etc.
It tests the software by compiling and running it, reading errors and failed tests, and fixing the code.
Then, it deploys the software in production for you. It compiles your app to an APK file and publishes it on the Google play store for example.
Sure, an LLM now may still not be able to get everything perfect as far as its outputs go. But surely there are already systems and workflows in place that will auto-run your code, compile it, feed errors back to the LLM, with some API to interact with cloud providers for hosting, etc.?
I have been trying to imagine something similar, but without all the middleware/distribution layer. You need to do a thing? The LLM just does it and presents the user with the desired experience. Kind of upending the notion that we need "apps" in the first place. It's all materialized, just-in-time style.
What's it called when you describe an app with sufficient detail that a computer can carry out the processes you want? Where will the record of those clarifying questions and updates be kept? What if one developer asks the AI to surreptitiously round off pennies and put those pennies into their bank account? Where will that change be recorded, will humans be able to recognize it? What if two developers give it conflicting instructions? Who's reviewing this stream of instructions to the LLM?
"AI" driven programming has a long way to go before it is just a better code completion.
Plus coding (producing a working program that fits some requirement) is the least interesting part of software development. It adds complexity, bugs and maintenance.
> What's it called when you describe an app with sufficient detail that a computer can carry out the processes you want?
You're wrong here. The entire point is that these are not computers as we used to think of them. These things have common sense; they can analyse a problem including all the implicit aspects, suggest and evaluate different implementation methods, architectures, interfaces.
So the right question is: "what's it called when you describe an app to a development team and they ask back questions and come back with designs and discuss them with you, and finally present you with an mvp, and then you iterate on that?"
Bold of you to imply that GPT asks questions instead of making baseless assumptions every 5 words, even when you explicitly instruct it to ask questions if it doesn't know. When it constantly hallucinates command line arguments and library methods instead of reading the fucking manual.
It's like outsourcing your project to [country where programmers are cheap]. You can't expect quality. Deep down you're actually amazed that the project builds at all. But it doesn't take much to reveal that it's just a facade for a generous serving of spaghetti and bugs.
And refactoring the project into something that won't crumble in 6 months requires more time than just redoing the project from scratch, because the technical debt is obscenely high, because those programmers were awful, and because no one, not even them, understands the code or wants to be the one who has to reverse engineer it.
Of course, but who's talking about today's tools? They're definitely not able to act like an independent, competent development team. Yet. But if we limit ourselves to the here-and-now, we might be like people talking about GPT3 five years ago: "yes it does spit out a few lines of code, which sometimes even compiles. When it doesn't forget half way and starts talking about unicorns".
We're talking about the tools of tomorrow, which, judging by the extremely rapid progress, I think is only a few (3-5) years away.
Anyway, I had great experiences with Claude and DeepSeek.
Most software is useful because a large number of people can interact with it or with each other over it. I'm not so certain that one-off software would be very useful for anyone beyond very simple functionality.
Aider jams the backend on my PC; from time to time I have to kill the TCP connection or the Python process to stop it from running on the GPU. I can't imagine paying for tokens and not knowing if it's working or wasting money.
That's going to be much slower and more expensive than writing tests because image/video processing is slower and more expensive than writing tests. And because of lag in using the UI (and re-building the whole application from scratch after every change to test again).
Hm, what if instead of using video of the application…
Ok, so if one can have one program snoop on all the rendering calls made by another program, maybe there could be a way of training a common representation of “an image of an application” and “the rendering calls that are made when producing a frame of the display for the application”? Hopefully in a way that would be significantly smaller than the full image data.
If so, maybe rather than feeding in the video of the application, said representation could be applied to the rendering calls the application makes each frame, and this representation would be given as input as the model interacts with the application, rather than giving it the actual graphics?
But maybe this idea wouldn’t work at all, idk.
Like, I guess the rendering calls often involve image data in their arguments, and you wouldn't want to include the same images many times as input to the encoding thing, as that would probably (I imagine) make it slower than just using the overall image of the application. I guess the calls probably point to the images in memory, though, rather than putting an entire image on the stack.
I don’t know enough about low-level graphics programming to know if this idea of mine makes any sense.
>Ultimately, R1-Zero demonstrates the prototype of a potential scaling regime with zero human bottlenecks – even in the training data acquisition itself.
I would like this to be true, but doesn't the way they're doing RL also require tons of human data?
I think yes. But hopefully in math with compute advances we can lower the human data input by increasing the gap that is bridged by raw model capabilities vs search augmentation (either with tree search or full rollouts)
From what I read elsewhere (random reddit comment), the visible reasoning is just "for show" and isn't the process deepseek used to arrive at the result. But if the reasoning has value, I guess it doesn't matter even if it's fake.
Bad reddit comment though, try pair programming with it.
The reasoning usually comments on your request, extends it, figures out which solution is the best and usable, backtracks if it finds issues implementing it, proposes a new solution, and verifies that it roughly makes sense.
The result after that could actually look different for usual questions (i.e., summarised the way ChatGPT answers to questions would look).
But it is usually very coherent with the code part, so if for example it has to choose from two libraries - it will use the one from the reasoning part, of course.
R1's technical report (https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSee...) says the prompt used for training is "<think> reasoning process here </think> <answer> answer here </answer>. User: prompt. Assistant:" This prompt format strongly suggests that the text between <think> is made the "reasoning" and the text between <answer> is made the "answer" in the web app and API (https://api-docs.deepseek.com/guides/reasoning_model). I see no reason why deepseek should not do it this way, if not considering post-generation filtering.
Plus, if you read table 3 of the R1 technical report, which contains an example of R1's chain of thought, its style (going back to re-evaluating the problem) resembles what I actually got in the COT in the web app.
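If that template is what the serving layer uses, splitting the visible "reasoning" from the "answer" is just string parsing, something like the sketch below; how DeepSeek actually does this server-side is an assumption, not something documented.

```python
# Sketch of how a serving layer could split a completion generated with the
# "<think> ... </think> <answer> ... </answer>" template from the R1 report
# into displayed reasoning and a final answer.
import re

def split_completion(text: str) -> tuple[str, str]:
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else text.strip()
    return reasoning, final

reasoning, final = split_completion(
    "<think>The user wants 2+2; that's 4.</think> <answer>4</answer>"
)
assert final == "4"
```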
Fascinating. R1 really punches above its weight with respect to cost-per-token.
As the article alluded to at the end, my thoughts immediately go to using R1 as a data generator for complex problems, since we have many examples of successful distillation into smaller models on well-defined tasks.
That is a slight exaggeration, extrapolation on the author's part. What happened was that RL training led to some emergent behavior in R1-Zero (chain-of-thought, and reflection) without being prompted or trained for explicitly. Don't see what is so domain specific about that though.
Yeah, if I understand correctly, the AI will create its own internal reasoning language through RL. In R1-Zero it was already a strange mix of languages. They corrected that for R1 to make the thinking useful for humans.
> The most promising idea is to use reasoning models to generate data, and then train our non-reasoning models with the reasoning-embedded data.
DeepSeek did precisely this with their LLama fine-tunes. You can try the 70B one here (might have to sign up): https://groq.com/groqcloud-makes-deepseek-r1-distill-llama-7...
Yes, but I meant it slightly differently than the distills.
The idea is to create the next gen SOTA non reasoning model with synthetic reasoning training data.
every time you respond to an AI model "no, you got that wrong, do it this way" you provide a very valuable piece of data to train on. With reasoning tokens there is just a lot more of that data to train on now
This assumes that you give honest feedback.
Efforts to feed deployed AI models various epistemic poisons abound in the wild.
> This assumes that you give honest feedback.
You don't need honest user feedback because you could judge any message part of a conversation using hindsight.
Just ask a LLM to judge if a response is useful, while seeing what messages come after it. The judge model has privileged information. Maybe 5 messages later it turns out what the LLM replied was not a good idea.
You can also use related conversations by the same user. The idea is to extend the context so you can judge better. Sometimes the user tests the LLM's ideas in the real world and comes back with feedback; that is real-world testing, something R1 can't do.
Tesla uses the same method to flag the seconds before a surprising event, it works because it has hindsight. It uses the environment to learn what was important.
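A rough sketch of how that hindsight judging could be wired up; the OpenAI-style client, model name, and prompt wording are illustrative assumptions, not anyone's actual pipeline:

    # Rough sketch: score an assistant reply using hindsight, i.e. let a judge
    # model see the messages that came *after* the reply being scored.
    # The OpenAI-style client, model name, and prompt wording are assumptions.
    from openai import OpenAI

    client = OpenAI()

    def hindsight_score(conversation: list[dict], reply_index: int,
                        judge_model: str = "gpt-4o-mini") -> str:
        """Ask a judge model whether conversation[reply_index] (an assistant turn)
        turned out to be useful, given everything that was said afterwards."""
        target = conversation[reply_index]["content"]
        followup = "\n".join(
            f"{m['role']}: {m['content']}" for m in conversation[reply_index + 1:]
        )
        prompt = (
            "An assistant gave this reply:\n"
            f"{target}\n\n"
            "Here is what happened in the conversation afterwards:\n"
            f"{followup}\n\n"
            "With the benefit of hindsight, rate how useful the reply was, 1-10."
        )
        judgment = client.chat.completions.create(
            model=judge_model,
            messages=[{"role": "user", "content": prompt}],
        )
        return judgment.choices[0].message.content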
This assumes that the companies gathering the data don’t have silent ways of detecting bad actors and discarding their responses. If you’re trying to poison an AI, are you making all of your queries from the same IP? Via a VPN whose IP block is known? Are you using a tool to generate this bad data, which might have detectable word frequency patterns that can be detected with something cheap like tf-idf?
There’s a lot of incentive to figure this out. And they have so much data coming in that they can likely afford to toss out some good data to ensure that they’re tossing out all of the bad.
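As a toy illustration of the tf-idf idea (entirely hypothetical; a real abuse-detection pipeline would be far more involved), you could flag feedback whose word-frequency profile sits unusually far from the rest:

    # Toy sketch: flag feedback whose tf-idf profile is unusually far from the
    # centroid of all submissions. Features and threshold are made up; a real
    # abuse-detection pipeline would be far more sophisticated than this.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def flag_outliers(feedback_texts: list[str], z_threshold: float = 3.0) -> list[str]:
        vectors = TfidfVectorizer().fit_transform(feedback_texts).toarray()
        centroid = vectors.mean(axis=0)
        distances = np.linalg.norm(vectors - centroid, axis=1)
        z = (distances - distances.mean()) / (distances.std() + 1e-9)
        return [t for t, score in zip(feedback_texts, z) if score > z_threshold]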
> If you’re trying to poison an AI, are you making all of your queries from the same IP? Via a VPN whose IP block is known?
We can use the same tactics they are using to crawl the web and scrape pages and bypass anti-scraping mechanisms.
Not necessarily, not all tactics can be used symmetrically like that. Many of the sites they scrape feel the need to support search engine crawlers and RSS crawlers, but OpenAI feels no such need to grant automated anonymous access to ChatGPT users.
And at the end of the day, they can always look at the responses coming in and make decisions like “95% of users said these responses were wrong, 5% said these responses were right, let’s go with the 95%”. As long as the vast majority of their data is good (and it will be) they have a lot of statistical tools they can use to weed out the poison.
> As long as the vast majority of their data is good (and it will be)
So expert answers are out of scope? Nice, looking forward to those quality data!
If you want to pick apart my hastily concocted examples, well, have fun I guess. My overall point is that ensuring data quality is something OpenAI is probably very good at. They likely have many clever techniques, some of which we could guess at, some of which would surprise us, all of which they’ve validated through extensive testing including with adversarial data.
If people want to keep playing pretend that their data poisoning efforts are causing real pain to OpenAI, they’re free to do so. I suppose it makes people feel good, and no one’s getting hurt here.
I'm interested in why you think OpenAI is probably very good at ensuring data quality. Also interested if you are trying to troll the resistance into revealing their working techniques.
What makes people think companies like OpenAI can't just pay experts for verified true data? Why do all these "gotcha" replies always revolve around the idea that everyone developing AI models is credulous and stupid?
Because paying experts for verified true data in the quantities they need isn't possible. Ilya himself said we've reached peak data (https://www.theverge.com/2024/12/13/24320811/what-ilya-sutsk...).
Why do you think we are stupid? We work at places developing these models and have a peek into how they're built...
Sure not necessarily the same tactics, but as with any hacking exercise, there are ways. We can become the 95% :)
It is absolutely fascinating to read the fantasy produced by people who (apparently) think they live in a sci-fi movie.
The companies whose datasets you're "poisoning" absolutely know about the attempts to poison data. All the ideas I've seen linked on this site so far about how they're going to totally defeat the AI companies' models sound like a mixture of wishful thinking and narcissism.
Are you suggesting some kind of invulnerability? People iterate their techniques; if big tech were so capable of defeating poisoning/gaming attempts, there would be no decades-long tug-of-war between Google and black-hat SEO manipulators.
Also, I don't get the narcissism part. Would it be petty to poison a website only when it's fetched by a spider? Yes, but I would also be that petty if some big company doesn't respect the boundaries I'm setting with my robots.txt on my 1-viewer cat photo blog.
It's not complete invulnerability. Instead, it is merely accepting that these methods might increase costs a little bit, but they don't cause the whole thing to explode.
The idea that a couple of bad-faith actions can destroy a 100-billion-dollar company is the extraordinary claim that requires extraordinary evidence.
Sure, bad actors can do a little damage. Just like bad actors can do DDoS attempts against Google. And that will cause a little damage. But mostly Google wins. Same thing applies to these AI companies.
> Also I don't get the narcissism part
The narcissism is the idea that your tiny website is going to destroy a 100 billion dollar company. It won't. They'll figure it out.
Grandparent mentioned "we"; I guess they refer to a whole class of "black hats" fending off bad-faith scraping, which could eventually amount to a relatively effective volume of poisoned sites and/or feedback to the model.
Obviously a singular poisoned site will never make a difference in a dataset of billions and billions of tokens, much less destroy a 100bn company. That's a straw man, and I think people arguing about poisoning acknowledge that perfectly. But I'd argue they can eventually manage to at least do some little damage mostly for the lulz, while avoiding scraping.
Google is full of SEO manipulators, and even when they recognize the problem and try to fix it, searching today is a mess because of that. The main difference and challenge in poisoning LLMs would be coordination between different actors, as there is no direct aligning incentive to poison except (arguably) globally justified pettiness, unlike black-hat SEO players, who have the incentive to be the first result for a certain query.
As LLMs become commonplace eventually new incentives may appear (i.e. an LLM showing a brand before others), and then, it could become a much bigger problem akin to Google's.
tl;dr: I wouldn't be so dismissive of what adversaries can manage to do with enough motivation.
[dead]
Who said they don't know? The same way companies know about hackers, it doesn't mean nothing ever gets hacked
There are ways to analyze whether your contributions make sense from the conversation's point of view. Reasoning detects that pretty quickly. To attack, you would actually have to use another AI to generate something that isn't totally random, and even that could still be detected.
I would assume that to use the data they would have to filter it heavily and correlate it across many users.
You can detect whether the user is a real one and trust their other chats "a bit more".
Probably it's something like "give feedback that's on average slightly more correct than incorrect," though you'd get more signal from perfect feedback.
That said, I suspect the signal is very weak even today and probably not too useful except for learning about human stylistic preferences.
The AI models to begin with assume that a significant majority of the training material is honest/in good faith. So that is not new?
AI models don't assume anything. AI models are just statistical tools. Their data is prepared by humans, who aren't morons. What is it with these super-ignorant AI critiques popping up everywhere?
There’s so much data required for training that it’d be surprising if humans looked at even a small subset of it at all. They need different statistical tools to clean it up. That’s where attacks will be concentrated, naturally, and this is why synthetic data will overtake real human data, right after “there isn’t enough data, even though it’s already too much”.
I am not in this space, question: are there "bad actors" that are known to feed AI models with poisonous information?
I'm not in the space either but I think the answer is an emphatic yes. Three categories come to mind:
1. Online trolls and pranksters (who already taught several different AIs to be racist in a matter of hours - just for the LOLs).
2. Nation states like China who already require models to conform to state narratives.
3. More broadly, when training on "the internet" as a whole there is a huge amount of wrong, confused information mixed in.
There's also a meta-point to make here. On a lot of culture war topics, one person's "poisonous information" is another person's "reasonable conclusion."
The part where people disagree seems fun.
I'm looking forward to protoscience/unconventional science, and perhaps even what is worthy of the fringe or pseudoscience labels. The debunking there usually fails to address the topic, as it is incredibly hard to spend even a single day reading about something you "know" to be nonsense. Who has time for that?
If you take a hundred thousand such topics, the odds that they should all be dismissed without looking aren't very good.
> The part where people disagree seems fun.
Apparently, you haven't been on that Internet thingie in the last five years or so... :-)
But I do agree with your point. What's interesting is the increasing number of people who act like there's some clearly objective and knowable truth about a much larger percentage of topics than there actually is. Outside of mathematics, logic, physics and other hard sciences, the range of topics on which informed, reasonable people can disagree, at least on certain significant aspects, is vast.
That's why even the concept of having some army of "Fact Checkers" always struck me as bizarre and doomed at best, and at worst, a transparent attempt to censor and control public discourse. That more people didn't see even the idea of it as being obviously brittle is concerning.
On Wikipedia you are supposed to quote the different perspectives. No one has ever accomplished this.
We can trust Altman and Elon to weed out the "fake news". Finally we will get the answer to which is the greatest Linux distro.
> Outside of mathematics, logic, physics
No need to go outside. There are plenty of Grigori Perelmans with various levels of credibility.
Bad or not, depends on your POV. But certainly there are efforts to feed junk to AI web scrapers, including specialized tools: https://zadzmo.org/code/nepenthes/
And they are hilarious, because they ride on the assumption that multi-billion dollar companies are all just employing naive imbeciles who just push buttons and watch the lights on the server racks go, never checking the datasets.
yes, example: me
I more often than not use the thumbs up on bad Google AI answers
(but not always! can't find me that easily!)
I deliberately pick wrong answers in reCAPTCHA sometimes. I’ve found out that the audio version accepts basically any string slightly resembling the audio, so that’s the easiest way. (Images on the other hand punish you pretty hard at times – even if you solve it correctly!)
Great article from today: https://arstechnica.com/tech-policy/2025/01/ai-haters-build-...
Yeah, it's great comedy.
> Aaron clearly warns users that Nepenthes is aggressive malware. It's not to be deployed by site owners uncomfortable with trapping AI crawlers and sending them down an "infinite maze" of static files with no exit links, where they "get stuck" and "thrash around" for months, he tells users.
Because a website with lots of links is executable code. And the scrapers totally don't have any checks in them to see if they spent too much time on a single domain. And no data verification ever occurs. Hell, why not go all the way? Just put a big warning telling everyone: "Warning, this is a cyber-nuclear weapon! Do not deploy unless you're a super rad bad dude who totally traps the evil AI robot and wins the day!"
If the AI already has a larger knowledge domain space than the user then all users are bad actors. They are just too stupid to know it.
Creators who use Nightshade on their published works.
not being snarky, but what is the point of using the model if you already know enough to correct it into giving the right answer?
an example that just occurred to me - if you asked it to generate an image of a mushroom that is safe to eat in your area, how would you tell it it was wrong? "oh, they never got back to me, I'll generate this image for others as well!"
A common use of these models is asking for code, and maybe you don't know the answer or would take a while to figure it out. For example, here's some html, make it blue and centered. You could give the model feedback on if its answer worked or not, without knowing the correct answer yourself ahead of time.
You constantly have to correct an AI when using it, because it either didn't get the question right or you are guiding it towards a narrower answer. There is only more to learn.
>not being snarky, but what is the point of using the model if you already know enough to correct it into giving the right answer?
For your example, what if you want to show what such a mushroom looks like to a friend? What if you want to use it on a website?
I feel like conventional image search would be more reliable to get a good picture of a mushroom variety that you know about. Ideally going out into the woods to get one I suppose.
Does it?
If I say "no, you hallucinated basically the entire content of the response", then maybe a newer training set derived from that could train on the specific fact that that specific hallucinated response is hallucinated. This seems to be of dubious value in a training set.
Nah I just insult it and tell it that it costs me 20 dollars a month and it's a huge disappointment
> What is today's date?
>> Today's date is Tuesday, January 28, 2025.
> No, you're wrong, today's date is actually Wednesday the 29th.
>> My mistake. Yes, today's date is Wednesday, January 29th, 2025.
Three months later in April when this tagged data is used to train the next iteration, the AI can successfully learn that today's date is actually January 29th.
But that's exactly what you get when you ask questions that require shifting, specific contextual knowledge. The model weights, by their nature, cannot encode that information.
At best, you can only try to layer in contextual info like this as metadata during inference, akin to how other prompting layers exist.
Even then, what up-to-date information should be present for every round-trip is a matter of opinion and use-case.
> Three months later in April when this tagged data is used to train the next iteration, the AI can successfully learn that today's date is actually January 29th.
Such an ingenious attack, surely none of these companies ever considered it.
the date is in the "system prompt", so the cron job that updates the prompts to the current date may be in a different time zone than you.
why can't they feed in user data like time zone and locale?
They're not actually processing the entire system prompt (which is rather long) on every query, but continuing from a model state saved after processing the system prompt once.
That makes it a bit harder, but still, spitting out the wrong date just seems like a plain old time-zone bug.
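For what it's worth, rendering the date in the user's timezone right before it goes into the prompt is a one-liner on the serving side; a minimal sketch (the prompt wording and timezone are made up):

    # Sketch: render "today's date" in the user's timezone right before it goes
    # into the system prompt, instead of whatever timezone the refresh job uses.
    # The prompt wording and default timezone are made up for illustration.
    from datetime import datetime
    from zoneinfo import ZoneInfo

    def dated_system_prompt(user_timezone: str = "America/New_York") -> str:
        now = datetime.now(ZoneInfo(user_timezone))
        return f"You are a helpful assistant. Today's date is {now:%A, %B %d, %Y}."

    print(dated_system_prompt("Asia/Tokyo"))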
If such labels are collected and used to retrain the model then yes. But these models are not learning online.
ChatGPT came out and its interface was a chatbox and a thumbs up / thumbs down icon (or whichever) to rate the responses; surely that created a feedback loop of learning, like all machine learning has done for years now?
Really? Isn't that the point of RL used in the way R1 did?
Provide a cost function (vs labels) and have it argue itself to greatness as measured by that cost function?
I believe that's what GP meant by "respond", not telling GPT they were wrong.
That is still inference. It is using a model generated from the RL process. The RL process is what used the cost function to add another model layer. Any online/continual learning would have to be performed by a different algorithm than classical LLM or RL. You can think of RL as a revision, but it still happens offline. Online/continual learning is still a very difficult problem in ML.
Yes, that makes sense. We're both talking about offline learning.
So if I just pay OpenAI $200/mo, and randomly tell the AI, no that's wrong.
I can stop the AI takeover?
You would need a lot of pro accounts! I would be surprised if they didn't use any algorithms for detecting well poisoning.
You can have our thank-you cards forwarded to your cell at Guantanamo Bay.
> you provide a very valuable piece of data to train on
We've been saying this "we get valuable data" thing since the 2010s [1].
When will our collective Netflix thumbs ups give us artificial super-intelligence?
[1] Especially to investors. They love that line.
our collective netflix thumbs up indicators gave investors and netflix the confidence to deploy a series of adam sandler movies that cost 60 to 80 million US dollars to "make". So depending on who you are, the system might be working great.
Through analytics Netflix should know exactly when people stop watching a series, or even when in a movie they exit out. They no doubt know this by user.
They know exactly what makes you stay, and what makes you leave.
I would not be surprised if in the near future movies and series are modified on the fly to ensure users stay glued to their screens.
In the distant future this might be done on a per user level.
> I wonder if there is a cap to multi head attention architecture
I don't think there is a cap other than having good data. The model learns all the languages in the world; it has capacity. A simple model like AlphaZero beats humans at board games. As long as you have data, the model is not the obstacle. An LLM-based system like AlphaProof reached silver-medal level at the IMO.
You're not getting new high-quality textual data for pre-training from your chat service. But you are potentially getting a lot of RL feedback on ambiguous problems.
And how would that work at inference time?
I think we will have to move with pre-training and post-training efforts in parallel. What DeepSeek showed is that you first need to have a strong enough pretrained model. For that, we have to continue the acquisition of high quality, multilingual datasets. Then, when we have a stronger pretrained model, we can apply pure RL to get a reasoning model that we use only to generate synthetic reasoning data. We then use those synthetic reasoning data to fine-tune the original pretrained model and make it even stronger. https://transitions.substack.com/p/the-laymans-introduction-...
> I highly doubt you are getting novel, high quality data.
Why wouldn't you? Presumably the end user would try their use case on the existing model, and if it performs well, wouldn't bother with the expense of setting up an RL environment specific to their task.
If it doesn't perform well, they do bother, and they have all the incentive in the world to get the verifier right -- which is not an extraordinarily sophisticated task if you're only using rules-based outcome rewards (as R1 and R1-Zero do)
>You might be lowering the cost of your training corpus by a few million dollars, but I highly doubt you are getting novel, high quality data.
The large foundational models don't really need more empirical data about the world. ChatGPT already 'knows' way more than I do, probably by many orders of magnitude. Yet it's still spewing nonsense at me regularly because it doesn't know how to think like a human or interact with me in a human-like way. To that end, the ability for a company like OpenAI to collect novel data from interacting with real humans is a material advantage over their competition.
> the ability for a company like OpenAI to collect novel data from interacting with real humans is a material advantage over their competition
It's different kind of data from the R1 reasoning chains. When LLMs have human in the loop, the human provides help based off their personal experience and real world validation. Sometimes users take an idea from the LLM and try it in real life. Then come back later and discuss the outcomes. This is a real world testing loop.
In order to judge if an AI response was useful, you can look at the following messages with a judge LLM. Using hindsight helps a lot here. Maybe it doesn't pan out and the user tries another approach, or maybe some innocuous idea was key to success later. It's hard to tell in the moment, but easy when you see what followed after that.
This scales well - OpenAI has 300M users, I estimate up to 1 Trillion interactive tokens/day. The user base is very diverse, problems are diverse, and feedback comes from user experience and actual testing. They form an experience flywheel, the more problem solving they do, the smarter it gets, attracting more users.
It doesn't need much. One good lucky answer in 1,000 or maybe 10,000 queries gives you the little exponential kick you need to improve. This is what the hockey-stick takeoff looks like, and we're already there - OpenAI has it, now DeepSeek has it, too. You can be sure others also have it; Anthropic at the very least, they just never announced it officially, but go read what their CEO has been speaking and writing about.
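Back-of-the-envelope, that token estimate works out to only a few thousand tokens per user per day, which seems plausible:

    # Back-of-the-envelope check on the "1 trillion interactive tokens/day" guess.
    users = 300_000_000
    tokens_per_day = 1_000_000_000_000
    print(tokens_per_day / users)  # ~3,333 tokens per user per day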
> The most promising idea is to use reasoning models to generate data, and then train our non-reasoning models with the reasoning-embedded data.
Why is it promising, aren’t you potentially amplifying AI biases and errors?
It seems to work and seems very scalable. "Reasoning" helps to counter biases: answers become longer, i.e. the system uses more tokens, which means more time to answer a question -- and longer answers likely allow better differentiation of answers from each other in the "answer space".
https://newsletter.languagemodels.co/i/155812052/large-scale...
also from the posted article
"""
The R1-Zero training process is capable of creating its own internal domain specific language (“DSL”) in token space via RL optimization.
This makes intuitive sense, as language itself is effectively a reasoning DSL.
"""
"The o3 system demonstrates the first practical, general implementation of a computer adapting to novel unseen problems"
Yet, they said when it was announced:
"OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data."
These two statements are completely opposed. I can't take seriously anything this article says about o3.
No, they aren't. Every ARC problem is novel - that's why the benchmark resisted deep learning for so long (and still does to a degree).
We just don't know how much the model seeing what an ARC problem is in the first place boosts its ability to solve them - that limited statement is all the author is making.
They are testing with a different dataset. The authors are saying that they have not yet tested the version of o3 that has not seen the training set.
Your quote is accurate from here:
https://arcprize.org/blog/oai-o3-pub-breakthrough
They were talking about training on the public dataset -- OpenAI tuned the o3 model with 75% of the public dataset. There was some idea/hope that these LLMs would be able to gain enough knowledge in the latent space that they would automatically do well on the ARC-AGI problems. But using 75% of the public training set for tuning puts them at about the same challenge level as all other competitors (who use 100% of training).
In the post they were saying they didn't have a chance to test the o3 model's performance on ARC-AGI "out-of-the-box", which is how the 14%-scoring R1-Zero was tested (no SFT, no search). They have been testing the LLMs out of the box like this to see if they are "smart" wrt the problem set by default.
The claim is that this removes the human bottleneck (aka SFT or supervised fine tuning) on domains with a verifiable reward. Critically, this verifiable reward is extremely hard to pin down in nearly all domains besides mathematics and computer science.
It's also extremely hard to nail down in much of mathematics or computer science!
- is such-and-such theorem deep or shallow?
- is this definition/axiom useful? (there's a big difference between doing compass-straightedge proofs vs. wondering about the parallel postulate)
- more generally, discovering theorems is generally not amenable to verifiable rewards, except in domains where simpler deterministic tools exist (in which case LLMs can likely help reduce the amount of brute forcing)
- is this a good mathematical / software model of a given real-world system?
- is the flexibility of dynamic/gradual typing worth the risk of type errors? is static typing more or less confusing for developers?
- what features should be part of a programming language's syntax? should we opt for lean-and-extensible or batteries-included?
- are we prematurely optimizing this function?
- will this program's memory needs play nicely with Rust's memory model? What architectural decisions do we need to make now to avoid headaches 6 months down the line?
Not clear to me that theorem discovery is not amenable to verifiable rewards. I think most important theorems probably are recovered automatically by asking AI systems to prove increasingly complicated human conjectures. Along the way I expect emergent behaviors of creating conjectures and recognizing important self-breakthroughs, much like regret emergence.
Theorem discovery is amenable to verifiable rewards. But is meaningful theorem discovery? Is the ability to discern between meaningful theorems and bad ones an emergent behaviour? You can check for yourself examples of automatic proofs, and the huge number of intermediate theorems they can generate which are not very meaningful.
IMHO, there are strategies that could extend this approach to many other domains.
I was discussing this idea (along with a small prototype) with a prominent symbolic AI researcher who also agrees, and thinks that with the emergence of RL as a viable training method for LLMs, it might be possible to pursue neuro-symbolic learning at a large scale.
Current systems are impressive, but reasoning is too fragile to trust them. They fall into obvious logical and statistical fallacies that are evident to a layperson.
Reasoning transfers across domains.
See https://www.interconnects.ai/p/why-reasoning-models-will-gen... for more information.
By verifiable do they mean it in the complexity theory P/NP sense of the word?
In the case of DeepSeek-R1, they used a series of heuristic reward functions that were built for different data types. The paper mentions the use of sandboxed environments to execute generated code against a suite of tests, for example, to evaluate it for correctness. The reward functions also evaluated syntax and formatting.
In general, the use of externally verifiable sources of truth (like simulators) is referred to as "grounding" and there has been quite a bit of research around it over the years, if you're interested in digging deeper. I've always found it super compelling as a research direction.
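For intuition, a toy version of such a reward function might look like this; it's illustrative only (the tag format, timeout, and weighting are my assumptions, not DeepSeek's actual code, and a real setup would use a locked-down sandbox rather than a bare subprocess):

    # Illustrative rules-based outcome reward for generated code, in the spirit
    # of what the paper describes. Not DeepSeek's implementation: the tag
    # format, timeout, and weights are assumptions.
    import re
    import subprocess
    import tempfile

    def reward(completion: str, test_code: str) -> float:
        # Format reward: completion should contain <think>...</think><answer>...</answer>.
        fmt_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                                completion, flags=re.S))
        answer = re.search(r"<answer>(.*)</answer>", completion, flags=re.S)
        if answer is None:
            return 0.0
        # Accuracy reward: run the extracted program together with its tests.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(answer.group(1) + "\n" + test_code)
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True, timeout=10)
            passed = result.returncode == 0
        except subprocess.TimeoutExpired:
            passed = False
        return (1.0 if passed else 0.0) + (0.1 if fmt_ok else 0.0)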
I think it just means that you can objectively score an answer as being correct or not. (e.g. if the generated program passes some tests; a discovered proof is valid, etc).
The other replies have said what was meant, but I don’t think they’ve explicitly addressed whether or not that is the sense used in the idea of NP.
I would say… it is at least somewhat similar.
A problem in NP might be of the form “For this value of X, does there exist a Y such that q(X,Y)?” for some predicate q and value X, and where when the answer is “yes”, the answer of “yes” can be verified by being given a value Y, and evaluating q(X,Y). (Specifically in the case of 3SAT, X would be a 3CNF formula, Y would be an assignment of values to the variables in the formula, and q(X,Y) would be “the formula X when evaluated with variable assignments Y, results in 'true’.”.)
This is sort of like the task of “Given requirements X that can be checked automatically, produce code Y which satisfies those requirements”, except that in this case the question is specifically asking for Y, not just asking whether such a Y exists, but.. well, often in practice when one wants a solution to a problem in NP, one actually wants the witness, not just whether there exists such a Y, right?
So, I would say there is a substantial similarity, but also a difference.
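To make the 3SAT example concrete, here is a minimal verifier for q(X, Y); the encoding of clauses as signed integers is my own choice, just to show that checking a witness is trivial even when finding one may not be:

    # A 3SAT verifier: q(X, Y) where X is a CNF formula and Y a variable assignment.
    # Each clause is a list of signed integers; -2 means "NOT x2".
    def verify(formula: list[list[int]], assignment: dict[int, bool]) -> bool:
        return all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in formula
        )

    # X = (x1 OR NOT x2 OR x3) AND (NOT x1 OR x2 OR x3), Y = a candidate witness.
    X = [[1, -2, 3], [-1, 2, 3]]
    Y = {1: True, 2: True, 3: False}
    print(verify(X, Y))  # True: Y is a witness that X is satisfiable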
For some reasoning data (e.g. you talking out loud as you figure something out, mistakes and all) to be useful for RL training, the conclusion to your reasoning needs to be correct/verified, else that's not the kind of reasoning you want to learn!
Some types of reasoning output, such as solving a math problem or writing a computer program can be automatically verified (e.g. respectively by a symbolic solver, or by compiling and running the program), but in the general case it's hard for a computer to verify whether a chain of reasoning is correct and arrived at a valid answer or not, although LLM-as-judge should work some of the time.
There's a big difference. The membership of these classes is determined in the worst case - so if there is no polynomial time solution in the worst case then it's NP.
For this problem we don't care if it's possible that sometimes there are things that aren't verifiable, or the answers aren't exact, we just need training signal.
As in there's an objective truth that can be determined by a computer. E.g. whether code compiles, whether a unit test passes, whether the answer given to a mathematical question like 3+5 is correct. Many other fields have no objective truth (like art or creative writing), or objective truth requires measurement of the physical world (although if the world can be simulated accurately enough for the problem class at hand, then sufficient training data can still be generated by a computer).
Isn't "code compiles" an insufficient criteria?
e.g you would need to prove that for all inputs the code produces the correct output which would in turn make the problem way more complex
Not if the problem as written is "does this code compile", which is still a useful stepping stone for some workflows. Yours is certainly a more useful query in most cases but repositioning or re-scoping the original question can still lead to a net win.
It's not a sufficient criterion by itself, but where no better criterion is possible it would still produce better results in reinforcement learning than if the model has no reward for producing correctly compiling code vs code that fails to compile.
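A minimal sketch of that kind of partial-credit reward, with weights I picked arbitrarily:

    # Minimal sketch of a partial-credit reward: "compiles" earns something,
    # nothing earns zero. The 0.3 weight is arbitrary; passing tests would earn 1.0.
    import py_compile
    import tempfile

    def compile_reward(source: str) -> float:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        try:
            py_compile.compile(path, doraise=True)  # syntax/compile check only
            return 0.3
        except py_compile.PyCompileError:
            return 0.0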
They mean that the solutions can be verified to be correct in a binary sense. E.g. a coding solution passes all the unit tests vs writing poetry.
> R1-Zero removes the human bottleneck
I disagree. It only removes the bottleneck to collecting math and code reasoning chains, not in general. The general case requires physical testing, not just calculations; otherwise scientists would not need experimental labs. Discovery comes from searching the real world, which is where interesting things happen. The best interface between AI and the world is still humans; the code and math domains are just lucky to work without real-world interaction.
The idea that a lot of compute is moving towards inference has a huge consequence for the current "AI investments". This is bad news for NVDA in particular. The inference-focused solutions have better economics than paying NVDA those huge margins (e.g. Groq).
Nvidia can actually charge larger margins if inference compute goes down. It would enable them to manufacture more units of smaller GPUs using inferior and cheaper silicon, all of which would increase the profits per unit sold as well as the number of units they can manufacture.
The industry has to find a way to separate itself from Nvidia's GPGPU technology if they want to stop being gouged. The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat.
>The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat
Google has and they've built a much more cost efficient (for them) system: the TPU. They even rent them out, and in terms of cost per unit compute TPUs are significantly cheaper than renting GPUs from the big cloud providers. Amazon's also tried to do something similar with Trainium chips, however their usefulness is more limited due to software issues (Amazon's much weaker at compiler development than Google, so Trainium software is quite slow and buggy).
For inference Nvidia has more significant competition than for training. See Groq, Google's TPU's etc.
People talk about Groq and Cerberus as competitors but it seems to me their manufacturing process makes the availability of those chips extremely limited. You can call up Nvidia and order $10B worth of GPUs and have them delivered the next week. Can't say the same for these specialty competitors.
> You can call up Nvidia and order $10B worth of GPUs and have them delivered the next week
Nvidia sold $14.5 billion of datacenter hardware in the third quarter of their fiscal 2024, and that led to severe supply constraints, with estimated lead times for H100s of up to 52 weeks in some places, so no, you can't, as that $14.5 billion was clearly capped by their ability to supply, not demand.
You're right, though, that Groq etc. can't deliver anywhere near the same volume now, but there's little reason to believe that will continue. There's no need for full GPUs for inference-only workloads, so competitors can enter the space with a tiny proportion of the functionality.
Groq chips have 230 MB of SRAM. Good luck running a 670B model on those chips, even without supply constraints.
Their architecture means you buy them by the rack. Individual chips are useless, the magic happens when you set them up so each chip handles a subset of the model.
IOW, do you think Groq's 70B models run on 230 MB of SRAM?
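Rough arithmetic, assuming ~1 byte per parameter (8-bit weights; exact Groq deployment details aren't public), shows why it takes racks of chips:

    # Rough arithmetic: chips needed just to hold 671B parameters in 230 MB of
    # on-chip SRAM each, assuming ~1 byte per parameter (8-bit weights). Real
    # deployments also need activations, KV cache, and redundancy.
    params = 671e9
    bytes_per_param = 1
    sram_per_chip = 230e6
    print(params * bytes_per_param / sram_per_chip)  # ~2,900 chips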
Nvidia's H100 has 80GB. As long as the interconnect is fast enough you don't need everything to fit on one chip.
You mean Cerebras.
>call up Nvidia and order $10B worth of GPUs
Doubt it.
No idea about Groq, but Cerebras might give you a similar timeline to Nvidia. Each of their wafers is 50x-100x an H100, so they need to make fewer of them, in absolute units.
But cooling, power, etc... Nvidia might have an advantage as their ecosystem is huge and more "liquid" in a sense.
Nvidia (NVDA) generates revenue with hardware, but digs moats with software.
The CUDA moat is widely unappreciated and misunderstood. Dethroning Nvidia demands more than SOTA hardware.
OpenAI, Meta, Google, AWS, AMD, and others have long failed to eliminate the Nvidia tax.
Without diving into the gory details, the simple proof is that billions were spent on inference last year by some of the most sophisticated technology companies in the world.
They had the talent and the incentive to migrate, but didn't.
In particular, OpenAI spent $4 billion, 33% more than on training, yet still ran on NVDA. Google owns leading chips and leading models, and could offer the tech talent to facilitate migrations, yet still cannot cross the CUDA moat and convince many inference customers to switch.
People are desperate to quit their NVDA-tine addiction, but they can't for now.
[Edited to include Google, even though Google owns the chips and the models; h/t @onlyrealcuzzo]
The CUDA moat is largely irrelevant for inference. The code needed for inference is small enough that there are e.g. bare-metal CPU only implementations. That isn't what's limiting people from moving fully off Nvidia for inference. And you'll note almost "everyone" in this game are in the process of developing their own chips.
> OpenAI, Meta, AWS, AMD, and others have long attempted to eliminate the Nvidia tax, yet failed.
Gemini / Google runs and trains on TPUs.
You have no incentive to infer on AMD if you need to buy a massive Nvidia cluster to train.
Google was omitted because they own the hardware and the models, but in retrospect, they represent a proof point nearly as compelling as OpenAI. Thanks for the comment.
Google has leading models operating on leading hardware, backed by sophisticated tech talent who could facilitate migrations, yet Google still cannot leap over the CUDA moat and capture meaningful inference market share.
Yes, training plays a crucial role. This is where companies get shoehorned into the CUDA ecosystem, but if CUDA were not so intertwined with performance and reliability, customers could theoretically switch after training.
Google has a self-inflicted wound in how long it takes to get an API key.
> yet Google still cannot leap over the CUDA moat and capture meaningful inference market share.
It's almost as if being a first-mover is more important than whether or not you use CUDA.
Both matter quite a bit. The first-mover advantage obviously rewards OEMs in a first-come, first-serve order, but CUDA itself isn't some light switch that OEMs can flick and get working overnight. Everyone would do it if it was easy, and even Google is struggling to find buy-in for their TPU pods and frameworks.
Short-term value has been dependent on how well Nvidia has responded to burgeoning demands. Long-term value is going to be predicated on the number of Nvidia alternatives that exist, and right now the number is still zero.
Meta trains on Nvidia and infers on AMD. There is incentive if your inference costs are high.
Meta also has a second generation of their own AI accelerator chips designed.
My company recently switched from A100s to MI300s. I can confidently say that in my line of work, there is no CUDA moat. Onboarding took about a month, but afterwards everything was fine.
Alternatives exist, especially for mature and simple models. The point isn't that Nvidia has 100% market share, but rather that they command the most lucrative segment and none of these big spenders have found a way to quit their Nvidia addiction, despite concerted efforts to do so.
For instance, we experimented with AWS Inferentia briefly, but the value prop wasn't sufficient even for ~2022 computer vision models.
The calculus is even worse for SOTA LLMs.
The more you need to eke out performance gains and ship quickly, the more you depend on CUDA and the deeper the moat becomes.
llm inference is fine on rocm. llama.cpp and vllm both have very good rocm support.
llm training is also mostly fine. I have not encountered any issues yet.
most of the cuda moat comes from people who are repeating what they heard 5-10 years ago.
It's unclear why this drew downvotes, but to reiterate, the comment merely highlights historical facts about the CUDA moat and deliberately refrains from assertions about NVDA's long-term prospects or that the CUDA moat is unbreachable.
With mature models and minimal CUDA dependencies, migration can be justified, but this does not describe most of the LLM inference market today nor in the past.
I think future of inference is on the client side
You can do inference on almost any hardware, I do not see any edge for NVIDIA here
I can download a DeepSeek 30B model and run inference at good speed on AMD GPUs and even on CPU. Apple silicon works fine too. I get >50 tokens/s on £300 AMD GPUs.
The main bottleneck appears to be memory, not processing power.
I would argue that both things are true:
1. The future of inference for ChatGPT-style direct consumer usage is on-device. Cloud-based inference is too gaping of a privacy hole in a world where some level of E2EE is rapidly becoming the default expectation for chat. It's not hard to imagine that the iPhone 50 may be able to comfortably run models that firmly surpass GPT-4o and o1. Similarly, for things like coding and any other creation of novel IP, there are obvious security benefits to keeping the inference local.
2. Going forward, the vast majority of inference will be performed by agents for process automation (both personal and business), rather than direct user interaction. For these use cases, centralized infrastructure will be the natural architecture. Even for cases where an end client device technically exists (e.g. Tesla-Optimus-style machines), there may be economy of scale advantages to offloading compute to the cloud.
In fact, I'm not sure how the "we will need tons of centralized inference infrastructure" argument works when Apple with +50% smartphone market share in the USA has a totally opposite strategy focused on privacy: on-device inference.
This is much more nuanced now. See Apple "Private Cloud Compute": https://security.apple.com/blog/private-cloud-compute/ ; they run a lot of the larger models on their own servers.
Fundamentally it is more efficient to process a batch of tokens from multiple users/requests than processing them from a single user's request on device.
Apple's strategy already failed. Their big bet on NPU hardware did not pay off at all, and right now it's effectively wasted silicon on every iDevice while the GPU does all the heavy inference work. Now they partner with OpenAI to handle their inference (and even that's not good enough in many cases[0]). The "centralized compute" lobby is being paid by Apple to do the work their devices cannot.
Until Apple or AMD unifies their GPU architectures and implements complex streaming multiprocessors, Nvidia will remain in a class of their own. Apple used to lead the charge on the foremost CUDA alternative too, but then they abandoned it to focus on proprietary standards instead. It's pretty easy to argue that Apple shot themselves in the foot with every opportunity they had to compete on good faith. And make no mistake: Apple could have competed with Nvidia if they weren't so stubborn about Linux support and putting smartphone GPUs in laptops and desktops.
[0] https://apnews.com/article/apple-ai-news-hallucinations-ipho...
Which AMD GPU gives you 50 tok/s on a 30b model? My 3090 does 30 tok/s with a 4 bit quant.
I don't mean at the same time.
For a simple question, with an RX 6800, I am observing ~50 tok/s on 8B models. DeepSeek 16B gives ~40 tok/s. 32B doesn't fit in memory.
So far it's moving towards test-time compute, true, but reasoning models are still far too large to run on the edge.
Well, o3 scored 75% on ARC-AGI-1, R1 and o1 only 25%.... watch this space though....
What's interesting is that you can already see the "AI race" dynamics in play -- OpenAI must be under immense market pressure to push o3 out to the public to reclaim "king of the hill" status.
I suppose they're under some pressure to release o3-mini, since R1 is roughly a peer for that, but R1 itself is still quite rough. The o1 series had seen significantly more QA time to smooth out the rough edges and the idiosyncrasies of what a "production" model should be optimized for, vs. just being a top scorer on benchmarks.
We'll likely only see o3 once there is a true polished peer for it. It's a race, and companies are keeping their best models close to their chest, as they're used internally to train smaller models.
e.g., Claude 3.5 Opus has been around for quite a while, but it's unreleased. Instead, it was just used to refine Claude Sonnet 3.5 into Claude Sonnet 3.6 (3.6 is for lack of a better name, since it's still called 3.5).
We also might see a new GPT-4o refresh trained up using GPT-o3 via deepseek's distillation technique and other tricks.
There are a lot of new directions to go in now for OpenAI, but unfortunately, we won't likely see them until their API dominance comes under threat.
That could also definitely make sense if the SOTA models are too slow and expensive to be popular with a general audience.
Yeah, but they can use DeepSeek's new algorithm too.
with 57 million(!!) tokens
From the article :
o3 (low): 75.7%, 335K tokens, $20
o3 (high): 87.5%, 57M tokens, $3.4K
When I saw these numbers back in the initial o3-ARC post, I immediately converted them into "$ per ARC-AGI-1 %" and concluded we may be at a point where each additional increment of 'real human-like novel reasoning' gets exponentially more costly in compute.
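Doing that conversion on the figures quoted above (taking the dollar amounts at face value), the cost per percentage point jumps by roughly 150x between the low- and high-compute settings:

    # Cost per ARC-AGI-1 percentage point, using the figures quoted above.
    low_cost, low_score = 20, 75.7        # o3 (low): $20, 75.7%
    high_cost, high_score = 3_400, 87.5   # o3 (high): $3.4K, 87.5%
    print(low_cost / low_score)           # ~$0.26 per point
    print(high_cost / high_score)         # ~$38.9 per point, roughly 150x more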
If Mike Knoop is correct, maybe R1 is pointing the way toward more efficient approaches. That would certainly be a good thing. This whole DeepSeek release and the reactions have shown that by limiting the export of high-end GPUs to China, the US incentivized China to figure out how to make low-end GPUs work really well. The more subtle meta-lesson here is that the massive flood of investment capital being shoved toward leading-edge AI companies has fostered a drag-race mentality which prioritized winning top-line performance far above efficiency, costs, etc.
$3.4K is about what you might pay a magic circle lawyer for an opinion on a matter. Not saying o3 is an efficient use of resources, just saying that it’s not outlandish that a sufficiently good AI could be worth that kind of money.
You pay that price to a law firm to get good service and to get a "guarantee" of correctness. You get neither from an LLM. Not saying it is not worth anything but you cant compare it to a top law firm.
You absolutely do not get a "guarantee" of correctness (event with the airquotes) from any lawyer.
What’s the liability insurance of the AI like
Refer to IBM’s 1979 slide for details on that
I view it as a positive that the methodology can take in more compute (bitter lesson style)
But can o3 write a symphony?
Seriously though, I'd like to hear suggestions on how to automatically evaluate an AI model's creativity, no humans in the loop.
In my view there's two modes of creativity:
1. That two distant topics or ideas are actually much more closely related. The creative sees one example of an idea and applies it to a discipline that nobody expects. In theory, reduction of the maximally distant can probably be measured with a tangible metric.
2. Discovery of ideas that are even more maximally distant. Pushing the edge, and this can be done by pure search and randomness actually. But it's no good if it's garbage. The trick is, what is garbage? That is very context dependent.
(Also, a creative might be measured on the efficiency of these metrics rather than absolute output)
Have you tried suno.ai?
Have _you_? It lost its novelty after a couple of days.
LLMs have read everything humans made so just ask one if there’s anything truly new in that freshly confabulated slop-phony.
we'd have to create a numerical scale for creativity, from boring to Dali, with milliEschers and MegaGeigers somewhere in there as well
It's essential that we quantify everything so that we can put a price on it. I'd go with Kahlograms though.
Mike from Baseten here
We're super proud to support this work. If you're thinking of running deepseek in production, give us a shout!
We are currently evaluating DeepSeek-R1 for our production system. We aren't done yet, but I think it's a match.
Awesome - we'd love to have our CEO/CTO chat with you and your team if you're interested. Shoot me a note at mike.bilodeau @ baseten.co and I'll make it happen!
Can you share at a high level how you run this model?
We know it's 671B params, with 37B active per token across the MoE experts…
If the GPUs have say, 140GB for an H200, then do you just load up as many nodes as will fit into a GPU?
How much do interconnects hurt performance vs being able to load the model into a single GPU?
Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size so you have to think about running the models as a whole.
There are two ways we can run it:
- 8xH200 GPU == 8x141GB == 1128 GB VRAM
- 16xH100 GPU == 16x80GB == 1280 GB VRAM
Within a single node (up to 8 GPUs) you don't see any meaningful hit from GPU-to-GPU communication.
More than that (e.g. 16xH100) requires multi-node inference which very few places have solved at a production-ready level, but it's massive because there are way more H100s out there than H200s.
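For rough intuition on why those are the right ballpark (assuming roughly FP8, i.e. ~1 byte per parameter, for the weights; KV cache and activations come on top):

    # Why 8xH200 or 16xH100 is the right ballpark: 671B params at ~1 byte each
    # (FP8) is ~671 GB of weights alone, before KV cache and activations.
    params = 671e9
    weight_gb = params * 1 / 1e9
    print(weight_gb)            # ~671 GB of weights
    print(8 * 141, 16 * 80)     # 1128 GB (8xH200) and 1280 GB (16xH100) of VRAM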
> Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size
In their V3 paper DeepSeek talk about having redundant copies of some "experts" when deploying with expert parallelism in order to account for the different amounts of load they get. I imagine it only makes a difference at very high loads, but I thought it was a pretty interesting technique.
Earlier today I read a reddit comment[1] about a guy who tried running the quantized version from unsloth[2] on 4xH100, and the results were underwhelming (it ended up costing $137 per 1 million tokens).
Any idea of what they're doing wrong?
[1]: https://www.reddit.com/r/LocalLLaMA/comments/1icphqa/how_to_...
[2]: https://unsloth.ai/blog/deepseekr1-dynamic
They're using Llama.cpp which is an amazing tool for local inference but doesn't match fast inference frameworks like TensorRT-LLM/SGLang for production speeds and throughputs on Hopper GPUs.
The Unsloth quantizations are really cool, but if you want to experiment with the R1 models in a smaller form factor the R1 Distills like Llama 70B are great and should run a lot faster as they take advantage of existing optimizations around inferencing llama-architecture models.
> They're using Llama.cpp which is an amazing tool for local inference but doesn't match fast inference frameworks like TensorRT-LLM/SGLang for production speeds and throughputs on Hopper GPUs.
That's something I thought about, but it wouldn't explain much, as they are roughly two orders of magnitude off in terms of cost, and only a small fraction of that could be explained by the performance of the inference engine.
> The Unsloth quantizations are really cool, but if you want to experiment with the R1 models in a smaller form factor the R1 Distills like Llama 70B are great and should run a lot faster as they take advantage of existing optimizations around inferencing llama-architecture models.
What kind of optimization do you have in mind? Because DeepSeek having only 37B active parameters, which means ~12GB at this level of quantization, means inference ought to be much faster than a dense 70B model, especially unquantized, no? The Llama 70B distill would benefit from speculative decoding, though, but it shouldn't be enough to compensate. So I'm really curious about what kind of llama-specific optimizations you mean, and how much speed-up you think they'd bring.
I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
As the comments on reddit said, those numbers don’t make sense.
> I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
That was my first thought as well, but from a quick search it looks like Llama.cpp has a default batch size that's quite high (like 256 or 512, I don't remember exactly, which I find surprising for something that's mostly used by local users), so that shouldn't be the issue.
> As the comments on reddit said, those numbers don’t make sense.
Absolutely, hence my question!
Sure, but that default batch size would only matter if the person in question was actually generating and measuring parallel requests, not just measuring the straight line performance of sequential requests... and I have no confidence they were.
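For reference, rough arithmetic on what $137 per 1M tokens implies, assuming ~$2.50/hr per H100 on demand (an assumed price; it varies a lot): about 20 tokens/s of aggregate throughput from four GPUs, i.e. effectively no batching.

    # What "$137 per 1M tokens" implies on 4xH100, assuming ~$2.50/hr per H100
    # on-demand (an assumption; prices vary a lot).
    gpu_cost_per_hour = 4 * 2.50
    tokens_per_hour = 1_000_000 * gpu_cost_per_hour / 137
    print(tokens_per_hour / 3600)  # ~20 tokens/s aggregate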
>There are two major shifts happening in AI, economically speaking:
>Both are going to drive a massive amount of demand for inference and neither will curtail the demand for more compute. In fact, they will increase the demand for compute.
Is this Nvidia compute or something else?
Nvidia has much less of a moat on the inference side of things. Of course they still dominate the market right now for inference (in datacenters), but it's much easier for companies to move onto AMD or other solutions like Groq or whatever compared to trying to use non-Nvidia for training.
Overall good post, but feels like he has an axe to grind with LLMs to the point it is misleading:
> Last week, DeepSeek published their new R1-Zero and R1 “reasoner” systems that is competitive with OpenAI’s o1 system on ARC-AGI-1. R1-Zero, R1, and o1 (low compute) all score around 15-20% – in contrast to GPT-4o’s 5%, the pinnacle of years of pure LLM scaling
R1-zero gets 14% on private set which is the exact same score June Sonnet got; Sonnet, not 4o, is the pinnacle of pure LLM scaling
I predict that the future of LLM's when it comes to coding and software creation is in "custom individually tailored apps". Imagine telling an AI agent what app you want, the requirements and all that and it just builds everything needed from backend to frontend, asks for your input on how things should work, clarifying questions etc.
It tests the software by compiling and running it reading errors and failed tests and fixing the code.
Then, it deploys the software in production for you. It compiles your app to an APK file and publishes it on the Google play store for example.
Sure, an LLM now may still not be able to get everything perfect as far as its outputs go. But surely there are already systems and workflows in place that will auto-run your code, compile it, feed errors back to the LLM, some API to interact with cloud providers for hosting, etc.?
A little further out from that could be the LLM acting as the runtime environment. No code. It's just data in (user inputs etc) -> GUI out.
Most people really do not know what they want at any level of detail.
It's ok, they'll know it when they see it. Keep trying.
The future is bespoke software.
In some sense, this is how computers were always supposed to work!
I have been trying to imagine something similar, but without all the middleware/distribution layer. You need to do a thing? The LLM just does it and presents the user with the desired experience. Kind of upending the notion that we need "apps" in the first place. It's all materialized, just-in-time style.
What's it called when you describe an app with sufficient detail that a computer can carry out the processes you want? Where will the record of those clarifying questions and updates be kept? What if one developer asks the AI to surreptitiously round off pennies and put those pennies into their bank account? Where will that change be recorded, will humans be able to recognize it? What if two developers give it conflicting instructions? Who's reviewing this stream of instructions to the LLM?
"AI" driven programming has a long way to go before it is just a better code completion.
That.
Plus coding (producing a working program that fits some requirement) is the least interesting part of software development. It adds complexity, bugs and maintenance.
> What's it called when you describe an app with sufficient detail that a computer can carry out the processes you want?
You're wrong here. The entire point is that these are not computers as we used to think of them. These things have common sense; they can analyse a problem including all the implicit aspects, suggest and evaluate different implementation methods, architectures, interfaces.
So the right question is: "what's it called when you describe an app to a development team and they ask back questions and come back with designs and discuss them with you, and finally present you with an mvp, and then you iterate on that?"
Bold of you to imply that GPT asks questions instead of making baseless assumptions every 5 words, even when you explicitly instruct it to ask questions if it doesn't know. When it constantly hallucinates command line arguments and library methods instead of reading the fucking manual.
It's like outsourcing your project to [country where programmers are cheap]. You can't expect quality. Deep down you're actually amazed that the project builds at all. But it doesn't take much to reveal that it's just a facade for a generous serving of spaghetti and bugs.
And refactoring the project into something that won't crumble in 6 months requires more time than just redoing the project from scratch, because the technical debt is obscenely high, because those programmers were awful, and because no one, not even them, understands the code or wants to be the one who has to reverse engineer it.
Except that AI is actually MUCH more expensive!
Of course, but who's talking about today's tools? They're definitely not able to act like an independent, competent development team. Yet. But if we limit ourselves to the here and now, we might be like people talking about GPT-3 five years ago: "Yes, it does spit out a few lines of code, which sometimes even compiles. When it doesn't forget halfway through and start talking about unicorns."
We're talking about the tools of tomorrow, which, judging by the extremely rapid progress, I think is only a few (3-5) years away.
Anyway, I had great experiences with Claude and DeepSeek.
Most software is useful because a large number of people can interact with it or with each other over it. I'm not so certain that one-off software would be very useful for anyone beyond very simple functionality.
This will almost certainly never materialize, and the reasons are not just technical.
Have you tried https://bolt.diy ?
It does what you describe
It claims to do what he describes.
> Imagine telling an AI agent … requirements… asks for your input on how things should work, clarifying questions etc.
That’s hard work. I watch people do that every day, and they always get something wrong.
Also, what about deploying the application, paying for the database or cloud resources that will run it, etc.?
> auto run your code, compile it, feed errors back to the LLM,
Can't wait for companies to juice profits by having the LLM run excessive cycles or get stuck in a loop and run up my bill
Aider jams the backend on my PC from time to time; I have to kill the TCP connection or the Python process to stop it running the GPU on the backend. I can't imagine paying for tokens and not knowing if it's working or just wasting money.
The loops and constant useless changes drive me nuts haha
It doesn't need to write tests: it can just use the application and figure out if it works.
That's going to be much slower and more expensive than writing tests, because processing images/video of the UI costs far more than running a test suite, and because of the lag in driving the UI (and rebuilding the whole application from scratch after every change before testing again).
Hm, what if instead of using video of the application…
Ok, so if one can have one program snoop on all the rendering calls made by another program, maybe there could be a way of training a common representation of “an image of an application” and “the rendering calls that are made when producing a frame of the display for the application”? Hopefully in a way that would be significantly smaller than the full image data.
If so, maybe rather than feeding in the video of the application, said representation could be applied to the rendering calls the application makes each frame, and this representation would be given as input as the model interacts with the application, rather than giving it the actual graphics?
But maybe this idea wouldn’t work at all, idk.
Like, I guess the rendering calls often involve image data in their arguments, and you wouldn’t want to include the same images many times as input to the encoder, as that would probably (I imagine) make it slower than just using the overall image of the application. Though I guess the calls probably just point to images in memory rather than putting an entire image on the stack.
I don’t know enough about low-level graphics programming to know if this idea of mine makes any sense.
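For what it’s worth, the core of the idea can be sketched in toy form: log each draw call as a small record that references image data by a content hash instead of embedding the pixels, so repeated textures cost almost nothing. Everything below is made up, and hooking real GL/Vulkan/DirectX calls is entirely hand-waved:

```python
# Toy sketch of the "compact render-call representation" idea above. A frame
# becomes a short list of draw-call records whose image arguments are replaced
# by content hashes, so the log stays far smaller than a raw screenshot.
# Intercepting real graphics API calls is not shown; this is purely illustrative.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class DrawCall:
    op: str          # e.g. "fill_rect", "draw_text", "blit"
    x: int
    y: int
    w: int
    h: int
    payload_id: str  # hash of the texture/glyph bytes, not the bytes themselves

def intern_image(pixels: bytes, cache: dict[str, int]) -> str:
    """Replace raw pixel data with a short content hash and count reuse."""
    digest = hashlib.sha1(pixels).hexdigest()[:12]
    cache[digest] = cache.get(digest, 0) + 1
    return digest

cache: dict[str, int] = {}
frame = [
    DrawCall("fill_rect", 0, 0, 1920, 1080, intern_image(b"\x20" * 16, cache)),
    DrawCall("draw_text", 40, 20, 300, 24, intern_image(b"File Edit View", cache)),
]
# The model would then see a few dozen records plus a table of unique payloads,
# rather than megabytes of pixels per frame.
print(len(frame), "calls,", len(cache), "unique payloads")
```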
But it’s actually correct from a usability perspective
I mean, we're halfway there with aider and open-interpreter; just give it a couple of years.
>Ultimately, R1-Zero demonstrates the prototype of a potential scaling regime with zero human bottlenecks – even in the training data acquisition itself.
I would like this to be true, but doesn't the way they're doing RL also require tons of human data?
I think yes. But hopefully, in math at least, compute advances will let us lower the human data input by increasing how much of the gap is bridged by raw model capability versus search augmentation (either with tree search or full rollouts).
It's a bit deceptive that o3 conveniently had access to ARC-prize-specific training material while r1 probably didn't. [0]
[0] https://news.ycombinator.com/item?id=42763231
I think DeepSeek accidentally also killed Google for me, not just ChatGPT, because of the visible reasoning part.
From what I read elsewhere (a random Reddit comment), the visible reasoning is just "for show" and isn't the process DeepSeek used to arrive at the result. But if the reasoning has value, I guess it doesn't matter even if it's fake.
Bad Reddit comment, then; try pair programming with it. The reasoning usually comments on your request, extends it, figures out which solution is the best and most usable, backtracks if it finds issues implementing it, proposes a new solution, and verifies that it kinda makes sense.
The final result can look different from the reasoning for ordinary questions (i.e., summarised in the way a ChatGPT answer would look). But it is usually very consistent with the code part, so if, for example, it has to choose between two libraries, it will of course use the one from the reasoning part.
Can you provide a link to the comment?
R1's technical report (https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSee...) says the prompt used for training is "<think> reasoning process here </think> <answer> answer here </answer>. User: prompt. Assistant:". This prompt format strongly suggests that the text between the <think> tags becomes the "reasoning" and the text between the <answer> tags becomes the "answer" in the web app and API (https://api-docs.deepseek.com/guides/reasoning_model). I see no reason why DeepSeek would not do it this way, post-generation filtering aside.
Plus, if you read Table 3 of the R1 technical report, which contains an example of R1's chain of thought, its style (going back and re-evaluating the problem) resembles what I actually got in the CoT in the web app.
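If that's how it works, splitting a raw completion into reasoning and answer is just string parsing on those tags. A naive sketch in Python (the tag names come from the report; everything else is an assumption, and real deployments may post-process further):

```python
# Naive split of an R1-style completion into its <think> and <answer> parts.
import re

def split_reasoning(completion: str) -> tuple[str, str]:
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else completion.strip()
    return reasoning, final

reasoning, final = split_reasoning(
    "<think>Compare the two libraries first...</think> <answer>Use library A.</answer>"
)
print(reasoning)  # Compare the two libraries first...
print(final)      # Use library A.
```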
Fascinating. R1 really punches above its weight with respect to cost-per-token.
As the article alluded to at the end, my thoughts immediately go to using R1 as a data generator for complex problems, since we have many examples of successful distillation into smaller models on well-defined tasks.
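A hedged sketch of what that data-generation loop could look like, where `teacher_generate` and `verify` are placeholders rather than any real API, and only verifiable answers are kept:

```python
# Sketch: use a reasoning model as a teacher to build a distillation set.
# Sample (reasoning, answer) pairs, keep only rows that pass a cheap verifier
# (e.g. numeric check for math, unit tests for code), and write JSONL suitable
# for supervised fine-tuning of a smaller model. All names are placeholders.
import json

def teacher_generate(prompt: str) -> tuple[str, str]:
    """Placeholder: return (reasoning, answer) from the reasoning model."""
    raise NotImplementedError

def verify(prompt: str, answer: str) -> bool:
    """Placeholder: cheap correctness check for the final answer."""
    raise NotImplementedError

def build_distillation_set(prompts: list[str], out_path: str) -> int:
    kept = 0
    with open(out_path, "w") as f:
        for prompt in prompts:
            reasoning, answer = teacher_generate(prompt)
            if not verify(prompt, answer):
                continue  # drop unverified samples to keep the set clean
            row = {"prompt": prompt, "reasoning": reasoning, "answer": answer}
            f.write(json.dumps(row) + "\n")
            kept += 1
    return kept
```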
Just because I'm an incurable cynic: has anybody run Wireshark on it and checked that it actually does process entirely offline?
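Short of a full packet capture, a rough sanity check is to list any internet sockets held by the local inference process while it generates. A sketch using psutil, where the process name is just an assumed example (ollama) and loopback connections from your own client will still show up:

```python
# Rough offline-ness check: list internet sockets belonging to the local
# inference process. "ollama" is only an example name; substitute whatever
# serves the model locally. Listing other processes' sockets may require
# elevated privileges, and loopback connections from your own client are
# expected; what you care about is anything with a public remote address.
import psutil

TARGET = "ollama"  # assumption: name of the local inference server process

pids = {p.pid for p in psutil.process_iter(["name"])
        if TARGET in (p.info["name"] or "").lower()}

conns = [c for c in psutil.net_connections(kind="inet")
         if c.pid in pids and c.raddr]  # only sockets with a remote endpoint

for c in conns:
    print(c.pid, c.laddr, "->", c.raddr, c.status)

print(f"{len(conns)} connection(s) with a remote endpoint found")
```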
> The R1-Zero training process is capable of creating its own internal domain specific language (“DSL”) in token space via RL optimization.
Um, what’s that now? Really?
That is a slight exaggeration/extrapolation on the author's part. What happened is that RL training led to some emergent behavior in R1-Zero (chain-of-thought and reflection) without it being prompted or trained for explicitly. I don't see what is so domain-specific about that, though.
Yeah, if I understand correctly, the AI will create its own internal reasoning language through RL. In R1-Zero it was already a strange mix of languages. They corrected that for R1 to make the thinking useful for humans.
Funnily enough, they didn't exclude anything forbidden from the training dataset. It will gladly tell you about the Tiananmen Square massacre and whatnot.