Well props to them for continuing to improve, winning on cost-effectiveness, and continuing to publicly share their improvements. Hard not to root for them as a force to prevent an AI corporate monopoly/duopoly.
How could we judge whether anyone is "winning" on cost-effectiveness, when we don't know what everyone's profits/losses are?
As much as I agree with your sentiment, I doubt the intention is singular.
Worth noting this is not only good on benchmarks, but significantly more efficient at inference https://x.com/_thomasip/status/1995489087386771851
It's awesome that stuff like this is open source, but even if you have a basement rig with 4 NVIDIA GeForce RTX 5090 graphics cards (a $15-20k machine), can it even run with any reasonable context window at anything better than a crawling 10 tps?
Frontier models are far exceeding even the most hardcore consumer hobbyist's hardware, and this one is even further out of reach.
Home rigs like that are no longer cost effective. You're better off buying an RTX Pro 6000 outright. This holds for the sticker price, the supporting hardware, the electricity to run it, and the cooling for the room you put it in.
I was just watching this video about a Chinese piece of industrial equipment, designed for replacing BGA chips such as flash or RAM with a good deal of precision:
https://www.youtube.com/watch?v=zwHqO1mnMsA
I wonder how well the aftermarket memory surgery business on consumer GPUs is doing.
Or perhaps a 512GB Mac Studio; a Q4 quant of the 671B R1 runs on it.
People with basement rigs generally aren't the target audience for these gigantic models. You'd get much better results out of an MoE model like Qwen3's A3B/A22B weights, if you're running a homelab setup.
Yeah I think the advantage of OSS models is that you can get your pick of providers and aren't locked into just Anthropic or just OpenAI.
I genuinely do not understand the valuations of the US AI industry. The Chinese models are so close, and far cheaper.
Two aspects to consider:
1. Chinese models typically focus on text. US and EU models also bear the cross of handling images, and often voice and video. Supporting all of those means additional training cost not spent on further reasoning; it ties one hand behind your back in exchange for being more generally useful.
2. The gap seems small because so many benchmarks get saturated so fast. But towards the top, each additional 1% on a benchmark represents a significantly bigger capability difference.
On the second point, I worked on a leaderboard that both normalizes scores, and predicts unknown scores to help improve comparisons between models on various criteria: https://metabench.organisons.com/
You can notice that, while Chinese models are quite good, the gap to the top is still significant.
However, the US models are typically much more expensive for inference, and Chinese models do have a niche on the Pareto frontier on cheaper but serviceable models (even though US models also eat up the frontier there).
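For a rough idea of the mechanics (a toy sketch, not the site's actual method): normalize each benchmark column to z-scores so different scales become comparable, then impute the missing scores with a simple iterative low-rank reconstruction.

```python
import numpy as np

# Toy score matrix: rows = models, cols = benchmarks, NaN = score not reported.
scores = np.array([
    [88.0, 71.0, np.nan],
    [85.0, np.nan, 64.0],
    [79.0, 62.0, 55.0],
    [np.nan, 58.0, 51.0],
])

# 1. Normalize each benchmark to z-scores so different scales are comparable.
mean = np.nanmean(scores, axis=0)
std = np.nanstd(scores, axis=0)
z = (scores - mean) / std

# 2. Predict missing entries via iterative rank-1 SVD: fill NaNs with 0,
# factor, rebuild, and only overwrite the missing cells each round.
mask = np.isnan(z)
filled = np.where(mask, 0.0, z)
for _ in range(100):
    u, s, vt = np.linalg.svd(filled, full_matrices=False)
    approx = (u[:, :1] * s[:1]) @ vt[:1, :]  # rank-1 reconstruction
    filled = np.where(mask, approx, z)       # keep observed scores fixed

print(np.round(filled, 2))  # normalized scores with predicted gaps filled in
```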
Thanks for sharing that!
The scales are a bit murky here, but if we look at the 'Coding' metric, we see that Kimi K2 outperforms Sonnet 4.5, which I think is still considered the price-perf darling even today?
I haven't tried these models, but in general there have been lots of cases where a model performs much worse IRL than the benchmarks would suggest (certain Chinese models and GPT-OSS have been guilty of this in the past).
1. Have you seen the Qwen offerings? They have great multi-modality, some even SOTA.
Qwen Image and Image Edit were among the best image models until Nano Banana Pro came along. I have tried some open image models and can confirm: the Chinese models are easily the best or very close to the best, but right now the Google model is even better... we'll see if the Chinese catch up again.
It's all about the hardware and infrastructure. If you check OpenRouter, no provider offers a SOTA Chinese model matching the speed of Claude, GPT or Gemini. The Chinese models may benchmark close on paper, but real-world deployment is different. So you either buy your own hardware in order to run a Chinese model at 150-200 tps, or give up and use one of the Big 3.
The US labs aren't just selling models, they're selling globally distributed, low-latency infrastructure at massive scale. That's what justifies the valuation gap.
The network effects of using consistently behaving models and maintaining API coverage between updates are valuable, too. Presumably the big labs include their own domains of competence in the training, so Claude is likely to remain very good at coding and to behave in similar ways, informed and constrained by their prompt frameworks, so that interactions keep working in predictable ways even after major new releases, and upgrades can be clean.
It'll probably be a few years before all that stuff becomes as smooth as people need, but OAI and Anthropic are already doing a good job on that front.
Each new Chinese model requires a lot of testing and bespoke conformance to every task you want to use it for. There's a lot of activity and shared prompt engineering, and some really competent people doing things out in the open, but it's generally going to take a lot more expert work getting the new Chinese models up to snuff than working with the big US labs. Their product and testing teams do a lot of valuable work.
Assuming your hardware premise is right (and let's be honest, nobody really wants to send their data to Chinese providers), you can use a provider like Cerebras or Groq?
Cerebras offers models at 50x the speed of Sonnet?
According to OpenRouter, z.ai is 50% faster than Anthropic, which matches my experience. z.ai does have frequent downtimes, but so does Claude.
Valuation is not based on what they have done but on what they might do. I agree, though, that it's investment made with very little insight into Chinese research. I guess it's counting on DeepSeek being banned and all computers in America refusing to run open software by the year 2030 /snark
> Valuation is not based on what they have done but what they might do
Exactly what I'm thinking. Chinese models are catching up rapidly. Soon to be on par with the big dogs.
Even if they do continue to lag behind, they are a good bet against monopolisation by proprietary vendors.
> I guess it's counting on DeepSeek being banned
And the people making the bets are in a position to make sure the banning happens. The US government system being what it is.
Not that our leaders need any incentive to ban Chinese tech in this space. Just pointing out that it's not necessarily a "bet".
"Bet" imply you don't know the outcome and you have no influence over the outcome. Even "investment" implies you don't know the outcome. I'm not sure that's the case with these people?
Yet tbh, if the US industry had not moved ahead and created the race with FOMO, it would not have been as easy for the Chinese strategy to work either.
The nature of the race may yet change, though, and I am unsure whether the devil is in the details, as in very specific edge cases that will only work with frontier models?
There is a great deal of orientalism --- it is genuinely unthinkable to a lot of American tech dullards that the Chinese could be better at anything requiring what they think of as "intelligence." Aren't they Communist? Backward? Don't they eat weird stuff at wet markets?
It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; their Bolshevism dooms them; we have the will to power; we will succeed. Even now, when you ask questions like what you ask of that era, the answers you get are genuinely not better than "yes, this should have been obvious at the time if you were not completely blinded by ethnic and especially ideological prejudice."
Back when DeepSeek came out and people were tripping over themselves shouting that it was so much better than what was out there, it just wasn't good.
It might be that this model is super good, I haven't tried it, but to say the Chinese models are better is just not true.
What I really love though is that I can run them (open models) on my own machine. The other day I categorised images locally using Qwen, what a time to be alive.
Further even than local hardware, open models make it possible to run on providers of choice, such as European ones. Which is great!
So I love everything about the competitive nature of this.
If you thought DeepSeek "just wasn't good," there's a good chance you were running it wrong.
For instance, a lot of people thought they were running "DeepSeek" when they were really running some random distillation on ollama.
Not sure how the entire Nazi comparison plays out, but at the time there were good reasons to imagine the Soviets would fall apart (as they initially did).
Stalin had just finished purging his entire officer corps, which is not a good omen for war, and the USSR had failed miserably against the Finns, who were not the strongest of nations, while Germany had just steamrolled France, a country that was much more impressive in WW1 than the Russians (who collapsed against Germany).
But didn't the Chinese already surpass the rest of the world in solar, batteries, and EVs, among other things?
They did, but the goalposts keep moving, so to speak. We're approximately here: advanced semiconductors, artificial intelligence, reusable rockets, quantum computing, etc. The Chinese will never catch up. /s
"It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; ..."
Ideology played a role, but the data they worked with was the Finnish war, which was disastrous for the Soviet side. Hitler later famously said it was all an intentional distraction to make them believe the Soviet army was worth nothing. (The real reasons were more complex, like the previous purges.)
> It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; their Bolshevism dooms them; we have the will to power; we will succeed
Though, because Stalin had decimated the Red Army leadership (including most of the veteran officers who had Russian Civil War experience) during the Moscow-trials purges, the Germans almost succeeded.
Benchmarks are super impressive, as usual. Interesting to note in table 3 of the paper (p. 15): DS-Speciale is 1st or 2nd in accuracy on all tests, but has much higher token output (50% more, or 3.5x vs Gemini 3 in the Codeforces test!).
The higher token output is not by accident. Certain kinds of logical reasoning problems are solved by longer thinking output. Thinking chain output is usually kept to a reasonable length to limit latency and cost, but if pure benchmark performance is the goal you can crank that up to the max until the point of diminishing returns. DeepSeek being 30x cheaper than Gemini means there’s little downside to max out the thinking time. It’s been shown that you can further scale this by running many solution attempts in parallel with max thinking then using a model to choose a final answer, so increasing reasoning performance by increasing inference compute has a pretty high ceiling.
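A minimal sketch of that parallel best-of-N pattern, assuming an OpenAI-compatible endpoint (the base URL and model name below are placeholders, not real config): fire off N independent attempts with a generous thinking budget, then have a second call act as the judge.

```python
import asyncio
from openai import AsyncOpenAI

# Hypothetical setup: any OpenAI-compatible endpoint serving a reasoning model.
client = AsyncOpenAI(base_url="https://example-provider/v1", api_key="...")

async def attempt(problem: str) -> str:
    # One independent solution attempt: high temperature for diverse samples,
    # generous max_tokens so the thinking chain isn't cut short.
    resp = await client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user", "content": problem}],
        temperature=1.0,
        max_tokens=32768,
    )
    return resp.choices[0].message.content

async def best_of_n(problem: str, n: int = 8) -> str:
    # Run n attempts in parallel; cheap inference makes this affordable.
    candidates = await asyncio.gather(*(attempt(problem) for _ in range(n)))
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    # A second call acts as the judge that chooses the final answer.
    judge = await client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{
            "role": "user",
            "content": f"Problem:\n{problem}\n\nCandidate solutions:\n"
                       f"{numbered}\n\nReply with only the index of the best one.",
        }],
        temperature=0.0,
    )
    return candidates[int(judge.choices[0].message.content.strip())]

# asyncio.run(best_of_n("Prove that ..."))
```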
What is the ballpark VRAM/GPU requirement to run this?
For just the model itself: 4 bytes per parameter at F32, 2 at F16/BF16, or 1 at F8, e.g. 685GB at F8 for this model. It will be smaller for quantizations, but I'm not sure how to estimate those.
For a Mixture of Experts (MoE) model you only need to have the memory size of a given expert. There will be some swapping out as it figures out which expert to use, or to change expert, but once that expert is loaded it won't be swapping memory to perform the calculations.
You'll also need space for the context window; I'm not sure how to calculate that either.
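As a back-of-envelope calculator for the rule of thumb above: quantizations follow the same params-times-bytes logic (Q4 is roughly 0.5 bytes/param), and the KV-cache formula below is the standard-attention approximation, with made-up layer/head numbers for illustration (DeepSeek's MLA compresses the cache well below this).

```python
def model_mem_gb(params_b: float, bytes_per_param: float) -> float:
    # Weights only: params x bytes/param. F32=4, F16/BF16=2, F8=1, Q4~0.5.
    return params_b * bytes_per_param  # billions of params * bytes == GB

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_val: float = 2.0) -> float:
    # Rough standard-attention estimate: K and V per layer per token.
    return 2 * layers * kv_heads * head_dim * context * bytes_per_val / 1e9

print(model_mem_gb(685, 1))    # ~685 GB at F8, matching the figure above
print(model_mem_gb(671, 0.5))  # ~335 GB for a Q4 quant of the 671B R1
# Illustrative KV cache at 128k context (layer/head numbers are invented):
print(kv_cache_gb(layers=61, kv_heads=8, head_dim=128, context=131072))
```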
I think your understanding of MoE is wrong. Depending on the settings, each token can actually be routed to multiple experts (in what's called an expert-choice architecture). This makes it easier to parallelize inference (each expert on a different device, for example), but it's not simply keeping one expert in memory.
I think your idea of MoE is incorrect. Despite the name, they're not "experts" at anything in particular, and the active experts change more or less on every token -- so swapping them into VRAM is not viable; they just get executed on the CPU (llama.cpp).
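A toy sketch of the common top-k ("token choice") routing, showing why the set of active experts changes from token to token (real DeepSeek-style routing adds shared experts and load balancing on top of this; the dimensions here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2

# Toy expert MLPs (one weight matrix each) plus a router projection.
experts = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts)) * 0.1

def moe_layer(tokens: np.ndarray) -> np.ndarray:
    out = np.zeros_like(tokens)
    for i, x in enumerate(tokens):
        logits = x @ router
        chosen = np.argsort(logits)[-top_k:]   # top-k experts, chosen per token
        weights = np.exp(logits[chosen])
        weights /= weights.sum()               # softmax over the chosen experts
        # Each token mixes the outputs of *its own* top-k experts.
        out[i] = sum(w * np.maximum(x @ experts[e], 0)  # ReLU expert MLP
                     for w, e in zip(weights, chosen))
        print(f"token {i} -> experts {sorted(chosen.tolist())}")
    return out

moe_layer(rng.standard_normal((4, d)))  # routing differs from token to token
```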
3.2-Exp came out in September: this is 3.2, along with a special checkpoint (DeepSeek-V3.2-Speciale) for deep reasoning that they're claiming surpasses GPT-5 and matches Gemini 3.0
https://x.com/deepseek_ai/status/1995452641430651132