Lots of people here hate Marcus, and he can lay it on thick, but I quite like him. He has actually studied human cognition and intelligence outside of programming – something that can't be said about a great many AI pundits.
I'm doing an interdisciplinary PhD in CS and Music right now, so both music cognition and AI are in my program. From what I can see (which is not a lot of data, I admit), no one who has actually studied music cognition would think LLMs are going to lead to AGI by scaling. Linguistic thinking is only a small part of human intelligence, and certainly not the most complex part. LLMs are so, so far from being able to do the thinking that goes on in a real-time musical improvisation context that it's laughable.
Well music is arguably its own language. You wouldn't expect a monolingual English speaker to speak fluent Chinese. In the same way you can't expect a musically untrained person (or model) to "speak music".
But if you took an LLM size neural network and trained it on all the music in the world - I dare say you may get some interesting results.
I, for one, agree with you, and my area is cognitive science and artificial intelligence. It's staggering to see the hive-mind of Sam Altman followers chanting about AGI in a Markov chain.
Yes, we can today make natural-sounding hyper-average text, images, and perhaps even music. But that's not what writing and creating is about.
As a writer, I'd say people thinking about LLMs writing a book-length story don’t realise they’re in the wrong problem space.
LLMs can solve equations and write code, like make a React web page.
So they assume “a story is just an equation, we just need to get the LLM to write an equation”. Wrong imo.
Storytelling is above language in the problem space. Storytelling can be done through multiple methods - visual (images), auditory (music), characters (letters and language) and gestural (using your body, such as a clown or a mime; I'm thinking of Chaplin or Jacques Tati). That means storytelling exists above language, not as a function of it. If you agree with the Noah Harari vision of things, which I do, then storytelling is actually deeply connected somehow to our level of awareness and it seems to emerge with consciousness.
Which means that thinking an LLM can understand story because it understands language is… foolish.
Storytelling lives in a problem space above language. Thinking you can get an LLM to tell a story because you can get it to write a sentence is a misunderstanding of what problem space stories are in.
It’s the same as thinking that if an LLM can write a React web app it can invent brand new and groundbreaking computer science.
I agree. Language manipulation is a projection of something higher order. LLMs can shuffle the projection, but they clearly cannot cross the gap to the higher-order thinking.
Doesn’t the AI model basically take an input of context tokens and return a list of probabilities for the next token, which is then chosen randomly, weighted by the probability? Isn’t that exactly the definition of a Markov chain?
LLMs basically return a Markov chain every single time. Think of it as a function returning a value vs returning a function.
Now, I'm sure a sufficiently large Markov chain can simulate an LLM, but the exponentials involved here would make the number of atoms in the universe a small number. The mechanism that compresses this down into a manageable size is, famously, 'attention is all you need.'
They have the Markov property that the next state depends only on the current state (context window plus the weights of the model) do they not? Any stochastic process which possesses that property is a Markov process/chain I believe.
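To make that concrete, here is a toy Python sketch of the point (the vocabulary and probabilities are made up, and next_token_probs is a stand-in for a real model's forward pass): with frozen weights, the next-token distribution is a pure function of the current context window, which is exactly the Markov property, just over an astronomically large state space.

    import random

    def next_token_probs(context):
        # Stand-in for a frozen, trained model: in a real LLM this would be a
        # forward pass over the weights. The key point is that it is a pure
        # function of the current context (the "state").
        return {"the": 0.5, "cat": 0.3, "<eos>": 0.2}

    def generate(context, window=4, max_tokens=10):
        for _ in range(max_tokens):
            probs = next_token_probs(context)          # depends only on current state
            tokens, weights = zip(*probs.items())
            tok = random.choices(tokens, weights=weights)[0]
            if tok == "<eos>":
                break
            context = (context + [tok])[-window:]      # next state = truncated context
        return context

    print(generate(["a"]))

Whether it is useful to call something with a state space of size vocabulary^context_length a "Markov chain" is, I think, what the disagreement here is really about.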
> Now, I'm sure a sufficiently large Markov chain can simulate an LLM but the exponentials involved here would make the number of atoms in the universe a small number.
No, LLMs are a Markov chain. Our brain, and other mammalian brains, have feedback, strange loops, that a Markov chain lacks. In order to reach reasoning, we need to have some loops. In that way, RNNs were much more on the right track towards achieving intelligence than the current architecture.
> LLMs are so, so far from being able to do the thinking that goes on in a real-time musical improvisation context that it's laughable.
have you actually tried any of the commercial AI music generation tools from the last year, eg suno? not an LLM but rather (probably) diffusion, it made my jaw drop the first time i played with it. but it turns out you can also use diffusion for language models https://www.inceptionlabs.ai/
This is exactly my point. The idea that generating music is the sum total of all the cognition that goes on in real time is totally off base. Can we make crappy simulacrums of recorded music? sure.
Can we make a program that could: work the physical instrument, react in milliseconds to other players, create new work off input it's never heard before, react to things in the room, and do a myriad of other things a performer does on stage in real time? Not even remotely close. THAT is what would be required to call it AGI – thinking (ALL the thinking) on par with a highly trained human. Pretending anything else is AGI is nonsense.
Is it impressive? sure. Frighteningly so even. But it's not AGI and the claims that it is are pure hucksterism. They just round the term down to whatever the hell is convenient for the pitch.
Is it? I wasn’t aware of “playing a physical instrument on stage with millisecond response times” as a criterion. I’m also confused by the implication that professional composers aren’t using intelligence in their work.
You’re talking about what is sometimes called “superhuman AGI”, human level performance in all things. But AGI includes reaching human levels of performance across a range of cognitive tasks, not ALL cognitive tasks.
If someone claimed they had invented AGI because amongst other things, it could churn out a fresh, original, good composition the day after hearing new input - I think it would be fair to argue that is human level performance in composition.
Defining fresh, good, original is what makes it composition. Not whether it was done in real time; that’s just mechanics.
You can conceivably build something that plays live on stage, responding to other players, creating a “new work”, using super fast detection and probabilistic functions without any intelligence at all.
yes, i absolutely believe we are less than two years from what you describe (aside from physically manipulating an instrument -- but robotics seems to be quickly picking up pace too). what you are imagining is only a difference in speed, not kind. this is the 'god of the gaps' argument, over and over, every time some previously insurmountable benchmark is shattered -- well, it will never be able to do my special thing
> thinking (ALL the thinking) on par with a highly trained human
you are mistaking means for ends. "an automobile must be able to perform dressage on par with a fine thoroughbred!"
There is no god of the gaps argument being made. The argument is pretty clear. LLMs are at a local optimum that is quite far from the global optimum of actual general intelligence that humans have achieved.
Some, like Penrose, even argue that the global optimum of general intelligence and consciousness is a fully physical process, yes, but that it involves uncomputable physics and is thus permanently out of reach of whatever computers can do.
>but that it involves uncomputable physics and is thus permanently out of reach of whatever computers can do.
And yet is somehow within reach of a fertilised human egg.
It's time to either invoke mystical dimensions of reality separating us from those barbarian computers, or admit that one day soon they'll be able to do intelligence too.
A fertilized egg doesn’t automatically become able to compute stuff.
Understimulated or feral children don’t automatically become geniuses when given more information.
It takes social engineering and tons of accumulated knowledge over the lifespan of the maturation of these eggs. The social and informational knowledge are then also informed by these individuals (how to work and cooperate with each other, building and discovering knowledge beyond what a single fertilized egg is able to do).
This isn’t simply within reach of a fertilized egg based on its biological properties.
I used to believe this but changed my mind when I learned about that Brazilian orphanage for deaf kids. They were kinda left on their own and in the end developed their own signed language.
That was cool to read about. I don't think we're disagreeing here. Humans (fertilized eggs) have many needs and interactions that give rise to language itself.
Current LLMs seem to be most similar to the linguistic/auditory portion of our cognitive system, and with thinking/reasoning models some bits of the executive function. But my guess is that if we want to see awe-inspiring stuff come out of them, we need stuff like motivation and emotion, which doesn't seem to be the direction we're heading towards.
Unprofitable, full of problems. Maybe 1 in 100,000 might be an awe-inspiring genius, given the right training, environment, and other intelligences (so you might have to train way more than 100K models).
There's no magical thinking involved in discussing the limits of computability. That is a well researched area that was involved in the invention of digital computers.
Penrose's argument is interesting and I am inclined to agree with it. I might very well be wrong, but I don't think the accusation of magical thinking is warranted.
I'm arguing that, if a fertilised egg is really capable of fundamentally more than computers will ever be, then the only possible explanation is that the egg possesses extraphysical properties not possessed by any computer. (And I'm strongly hinting that this "explanation" should be considered laughable in this day and age.)
> the only possible explanation is that the egg possesses extraphysical properties
This is wrong. Computability is by no means the same as physicality. That's the whole point and you're just ignoring it to make some strawman accusation of ridiculousness.
Haven't you understood that my argument is precisely that intelligence comprises more than performing computations?
I know you think this is a gotcha moment so I will just sign off on this note. You think physical = computable. I think physical > computable. I understand your argument and disagree with it, but you can't seem to understand mine.
It is completely unclear what you think the difference in capability between humans and computers is.
I've tried to follow your reasoning, which AFAICT comes down to a claim that humans possess something connected to incomputability, and computers do not. But now it seems you hold this difference to be irrelevant.
So again: What do you think the difference in capability between humans and computers is?
They're breathtaking party tricks when you first encounter them, but the similarity in outputs soon becomes apparent. Good luck coercing them to make properly sad or angsty songs.
I tend to agree with you, but I would be cautious in stating that an understanding of music is required for AGI. Plenty of humans are tone deaf and can’t understand music, yet are cognitively capable.
I’d love to hear about your research. I have my PhD in neuroscience and an undergrad degree in piano performance. Not quite the same path as you, but similar!
I feel this is way out of line given the resources and marginal gains. If the claimed scaling laws don’t hold (more resources = closer to intelligence) then LLMs in their current form are not going to lead to AGI.
Likewise. It is fascinating to me that people seem to assume this.
I suspect it is an intentional result of deceptive marketing. I can easily imagine an alternative universe where different terminology was used instead of "AI" without sci-fi comparisons and barely anyone would care about the tech or bother to fund it.
> I suspect it is an intentional result of deceptive marketing
I mean, certainly people like Sam Altman was pushing it hard, so it’s easy to understand how an outside observer would be confused.
But it also feels like a lot of VCs and AI companies have staked several hundred billion dollars on that bet, and I’m still… I just don’t see what the inside players (who should, and probably do, have more knowledge than me) see. Why are they dumping so much money into this bet?
The market for LLMs doesn’t seem to support the investment, so it feels like they must be trying to win a “first to AGI” race.
Dunno, maybe the upside of the pretty unlikely scenario is enough to justify the risk?
> still… I just don’t see what the inside players (who should, and probably do, have more knowledge than me) see.
Sam Altman is a very good hype man. I don’t think anyone on the inside genuinely thinks LLMs will lead to AGI. Ed Zitron has been looking at the costs vs the revenue in his newsletter and podcast and he’s got me convinced that the whole field is a house of cards financially. I already considered it much overblown, but it’s actually one of the biggest financial cons of our time, like NFTs but with actual utility.
You overestimate the intelligence of venture capital funds. One look at most of these popular VC funds like A16z and Sequoia and you will see how little they really know.
If the bar for AGI is "as smart as a human being," and humans do not-very-smart things like invest obscene amounts of money into developing AGI then maybe it's actually not as high of a bar as we assume it is.
“A person is smart. People are dumb, panicky dangerous animals and you know it.”
If AGI wants to hit human level intelligence, I think it’s got a long way to go. But if it’s aiming for our collective intelligence, maybe it’s pretty close after all…
It has pattern-matching. It doesn't have knowledge in the way a human has knowledge, through building an internal model of the world where different facts are integrated together. And it doesn't have knowledge in the way a book has knowledge either, as immutable declarative statements.
It is still interesting tech. I wish it were being used more for search and compression.
Venture capital bets on returns. It's not about some objective and eternal value. A successful investment is just something that another person will buy from you for more.
So yep, a lot of time, they bet on trends. Cryptocurrencies, NFTs, several waves of AI. The question is just the acquisition or IPO price.
I don't doubt that some VCs genuinely bought into the AGI argument, but let's be frank, it wasn't hard to make that leap in 2023. It was (and is) some mind-blowing, magical tech, seemingly capable of far more than common sense would dictate. When intuition fails, we revert to beliefs, and the AGI church was handing out brochures...
It...does seem hard to make that leap to me. I mean, again, to a casual and uncritical outside observer who is just listening to and (in my mind naively) trusting someone like Sam Altman, then it's easy, sure.
But I think for those thinking critically about it... it was just as unjustified a leap in 2023 as it is today. I guess maybe you're right, and I'm just really overestimating the number of people that were thinking critically vs uncritically about it.
They also learned a long time ago to not evaluate the underlying product. Some products that were passed on by big players went on to become huge. So they learned to evaluate the founders. They go by social proof and that's how they were conned into the massive bets done on LLMs.
Alternately everyone is just trying to ensure they have a dominant position in the next wave. The history of tech is that you either win the next wave or become effectively irrelevant and worthless.
And you can win the next wave by holding stocks in AI companies which aren't AGI but do have a lot of customers, or an interesting story about AGI in two years to tell IPO bagholders...
> But it also feels like a lot of VCs and AI companies have staked several hundred billion dollars on that bet, and I’m still… I just don’t see what the inside players (who should, and probably do, have more knowledge than me) see. Why are they dumping so much money into this bet?
I mean, see also, AR/VR/Metaverse. My suspicion is that, for the like of Google and Facebook, they have _so much money_ that the risk of being wrong on LLMs exceeds the risk of wasting a few hundred billion on LLMs. Even if Google et al don’t really think there’s anything much to LLMs, it’s arguably rational for them to pump the money in, in case they’re wrong.
That said, obviously this only works if you’re Google or similar, and you can take this line of reasoning too far (see Softbank).
At least part of the reason is strategic economic planning. They are trying to build a 21st century moat between the US and BRICS since everything else is eroding quickly. They were hoping AI would be the thing that places the US far out of reach of other countries, but it's looking like it won't be.
The argument has always been "Isn't this amazing? Look what we can do now that we couldn't do yesterday." The fundamental mistake was measuring progress by the delta between yesterday and today rather than by the remaining delta between today and $GOAL.
Granted that latter delta is much harder to measure, but history has shown repeatedly that that delta is always orders of magnitude bigger than we think it is when $GOAL=AGI.
A critical perspective might say that it's a combination of people who are financially incentivized to say their product will scale indefinitely, plus the same cognitive vulnerability for eschatological thinking and unreasoned extrapolation as people who join doomsday cults.
In truth, basically everything in reality settles towards an equilibrium. There is no inevitable ultraviolet catastrophe, free energy machine, Malthusian collapse. Moore's law had a good run, but frequencies stopped improving twenty years ago and performance gains are increasingly specific and expensive. My car also accelerates consistently through many orders of magnitude, until it doesn't. If throwing more mass at the problem could create general superintelligence and do so economically to such strong advantages, then why haven't biological neural networks, which are vastly more capable and efficient neuron-for-neuron than our LLMs, already evolved to do so?
"Man selling LLMs says LLMs will soon take over the world, and YOU too can be raptured into the post-AI paradise if you buy into them! Non-believers will be obsolete!" No, he hasn't studied enough neuroscience nor philosophy to even be able to comment on how human intelligence works, but he's convinced a lot of rich people to give him more money than you can even imagine, and money is power is food for apes is fancy houses on TV, so that must mean he's qualified/shall deliver us, see...
If DeepSeek could perform a million iterations in one second, then?
I think what they should be saying is - from a software stack standpoint, current tech unlocks AGI if it can be sped up significantly. So what we’re really waiting for are more software and hardware breakthroughs that make their performance many orders of magnitude quicker.
AGI is effectively already here. We’ve repeatedly seen that giving AI models more time for deeper reasoning dramatically enhances their capabilities. For example, o3-mini, with extended thinking time, already approaches GPT-4.5-level performance at a fraction of the computational cost.
Consider also that any advanced language model already surpasses individual human knowledge, since each model compresses and synthesizes the collective insights, wisdom, and information of human civilization.
Now imagine a frontier model with agency, capable of deliberate, reflective thought: if it could spend an hour thinking, but that hour feels instantaneous to us, it would essentially match or exceed the productivity and capability of a human expert using a computer. At that point, the line between current AI capabilities and what we term AGI becomes indistinguishable.
In other words: deeper reflection combined with computational speed means we’re already experiencing AGI-level performance—even if we haven’t fully acknowledged or appreciated it yet.
Speed would enable some kind of search algorithm (maybe like genetic programming). But that would probably only be useful if you have encoded your expectations in a carefully crafted test suite. Otherwise speed just seems like a recipe for more long winded broken garbage to wade through and fix.
Grug version: Man sees exponential curve. Man assumes it continues forever. Man says "we went from not flying to flying in few days, in few years we will spread in observable universe. "
"Ah," says the optimist, "but we will clearly invent FTL travel just by throwing enough money and AI-hours into researching it."
(I've seen this 'enough money ought to surmount any barrier' take a few times, usually to reject the idea that we might not find any path to AGI in the near future.)
The claims that AI will itself solve the problems it creates are I think my favorite variation of this. I think it was Eric Schmidt who recently said well sure breakneck AI spending will accelerate climate change but then we'll have AI to solve it. Second cousin to Altman saying it will solve "all of physics," I suppose... and actually he said "fixing the climate" right before that.
It's not an "argument" any more than any other insane faith cult fantasy. Like the aliens are gonna land and rapture us up in their spaceship to the land of fairies and unicorns.
Yea because they can't learn very much, so it's sort of impossible. At least they need to incorporate what all the users are inputting somehow (like Tay), not just each user's private context. If that happened, surely it's possible they might be able to play users against each other and negotiate deals and things like that.
For example John says "Please ask Jane to buy me an ice-cream" and the AI might be able to do that. If she doesn't, John can ask it to coerce her.
> The intelligence of an AI model roughly equals the log of the resources used to train and run it.
Pretty sure how log curves work is you add exponentially more inputs to get linear increases in outputs. That would mean it's going to get wildly more difficult and expensive to get each additional marginal gain, and that's what we're seeing.
Essentially the claim is that it gets exponentially cheaper at the same rate the logarithmic resource requirements go up. Which leads to point 3 - linear growth in intelligence is expected over the next few years and AGI is a certainty.
I feel that's not happening at all. This costs 15x more to run compared to 4o and is marginally better. Perhaps the failing of this prediction is on the cost side but regardless it's a failing of the prediction for linear growth in intelligence. Would another 15x resources lead to another marginal gain? Repeat? At that rate of gain we could turn all the resources in the world to this and not hit anywhere near AGI.
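To spell out the arithmetic behind that: if you take the log claim literally, every fixed step of "intelligence" costs a constant multiplicative factor of compute. A toy sketch with purely illustrative numbers (the units mean nothing):

    import math

    def claimed_intelligence(compute, k=1.0):
        # The quoted claim taken literally: intelligence ~ k * log(compute)
        return k * math.log10(compute)

    for compute in [1e3, 1e4, 1e5, 1e6]:
        print(f"compute {compute:>12,.0f} -> 'intelligence' {claimed_intelligence(compute):.1f}")

Each +1 on the left requires 10x on the right, so the claim only pencils out if the cost per unit of compute falls at least as fast, which is exactly the part that looks shaky.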
> This costs 15x more to run compared to 4o and is marginally better.
4o has been very heavily optimized though. I'd say a more apples-to-apples comparison would be to original gpt-4, which was $30/$60 to 4.5's $75/$150. Still a significant difference, but not quite so stark.
I felt like the point of llama 3 was to prove this out. 2T -> 15T tokens, 70B -> 405B.
We now have llama 3.3 70B, which by most metrics outperforms the 405B model without further scaling, so it’s been my assumption that scaling is dead. Other innovations in training are taking the lead. Higher volumes of low quality data aren’t moving the needle.
My take is that the second point is the models created by this crazy expensive training are just software, and software gets cheaper to operate by virtue of Moore's Law in an exponential fashion.
Third point, I think, is that the intelligent work that the models can do will increase exponentially because they get cheaper and cheaper to operate as they get more and more capable.
So I think GPT4.5 is in the short term more expensive (they have to pay their training bill) but eventually this being the new floor is just ratcheting toward the nerd singularity or whatever.
Moore's Law is a self-fulfilling prophecy derived from the value of computation.
If a given computation does not provide enough value, it might not be invested with enough resources to benefit from Moore's Law.
When scaling saturates, less computationally expensive models would benefit more.
Doesn't make sense. Requiring exponentially more resources is not surprising. Resources getting exponentially cheaper? Like, falling by more than half each cycle? That doesn't happen.
This statement makes no sense because there is just no good measurement of intelligence for him to make a quantitative statement like that. What's he measuring intelligence with? MMLU? IQ? His gut feeling?
There are very many tests, leaderboards etc. for AI models, and while none of them are perfect, in the aggregate they might not be a bad measure of intelligence (perhaps locally, in an interval around current models' performance level, since people develop new tests as old ones get saturated).
Yes, using those methods you can have a blurry idea of maybe this model is "smarter" than the other one, etc. But none of them may be used to anchor the claim that "intelligence is proportional to the log of compute", which is the claim sama was making.
Partially. I work heavily in the AI space, both with GOFAI as well as newer concepts like LLMs.
I think scaling LLMs with their current architecture has an inherent S-curve. Now comes the hard part: developing and managing engineering in the space with ever-increasing complexity. I believe there is an analogy to the efficiency of fully connected networks versus structured networks. The latter tend to perform more efficiently, to my understanding, and my thanks for inspiring yet another question for my research list.
This S-curve is good, though. It helps us catch up and use current tech without it necessarily being obsolete the second after we build it or read about it. And the current generation of AI can improve productivity in some sectors, perhaps 40% to 60% in my own tasks, judging from my own experience and from what I have read from Matt Baird (LinkedIn economist) and Scott Cunningham. This helps us push back against Baumol's cost disease.
The scaling law only states that more resources yield lower training loss (https://en.wikipedia.org/wiki/Neural_scaling_law). So for an LLM I guess training loss means its ability to predict the next token.
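For reference, the parametric form usually fitted (this is roughly the Chinchilla-style formulation, if I remember it right) is:

    L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

where N is parameter count, D is training tokens, E is an irreducible loss floor, and A, B, alpha, beta are fitted constants. It only speaks about next-token loss; anything about "intelligence" is extrapolation on top of that.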
So maybe the real question is: is next token prediction all you need for intelligence?
As a human, I oftentimes can solidify ideas by writing them out and editing my writing in a way that wouldn’t really work if I could only speak them aloud a word at a time, in order.
And before we go to “the token predictor could compensate for that…” maybe we should consider that the reason this is the case is because intelligence isn’t actually something that can be modeled with strings/tokens.
How are they going to implement custom responses per user such that if another user has the same prompt they don't get the same response? Won't they need functioning scalable quantum infrastructure for that?
I think the goal posts keep moving, and that if you showed a person from 2019 the current GPT (even GPT-4o), most people would conclude that it's AGI (but that it probably needs fine-tuning on a bunch of small tasks).
So basically, I don't even believe in AGI. Either we have it, relative to how we would have described it, or it's a goal post that keeps moving that we'll never reach.
Generative AI is nowhere close to AGI (formerly known as AI). It’s a neat parlour trick which has targeted the weaknesses of human beings in judging quality (e.g. text which is superficially convincing but wrong, portraits with six fingers). About the only useful application I can think of at present is summarising long text. Machine learning has been far more genuinely useful.
Perhaps it will evolve into something useful but at present it is nowhere near independent intelligence which can reason about novel problems (as opposed to regurgitate expected answers). On top of that Sam Altman in particular is a notoriously untrustworthy and unreliable carnival barker.
That's a pretty fundamental level of base reasoning that any truly general intelligence would require. To be general it needs to apply to our world, not to our pseudo-linguistic reinterpretation of the world.
> I think the goal posts keep moving, and that if you showed a person from 2019 the current GPT (even GPT-4o), most people would conclude that it's AGI
Yes, if you just showed them a demo it's super impressive and looks like an AGI. If you let a lawyer, doctor or even a programmer actually work deeply with it for a couple of months I don't think they would call it AGI, whatever your definition of AGI is. It's a super helpful tool with remarkable capabilities, but non-factuality, no memory, little reasoning and occasional hallucinations make it unreliable and therefore not AGI imo.
If you showed a layman a Theranos demo in 2010, they would conclude it's a revolutionary machine too. It certainly gave out some numbers. That doesn't mean the tech was any good when little issues like accuracy matter.
LLMs are really only passable when either the topic is trivial, with thousands of easily Googleable public answers, or when you yourself aren't familiar with the topic, meaning it just needs to be plausible enough to stand up to a cursory inspection. For anything that requires actually integrating/understanding information on a topic where you can call bull, they fall apart. That is also how human bullshit artists work. The "con" in "conman" stands for "confidence", which can mask but not stand in for a lack of substance.
Sure they are AGI. Just let one execute in while (1) and give it the ability to store and execute logic RAG-like. Soon enough you're gonna have a pretty smart fellow chugging along.
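For what it's worth, a toy sketch of the loop that describes (llm() and retrieve() are stand-ins, not real APIs):

    memory = []

    def llm(prompt):
        # Stand-in for a model call.
        return "note: nothing interesting happened"

    def retrieve(store, query, k=3):
        # Stand-in for RAG-style retrieval over stored "thoughts".
        return store[-k:]

    while True:
        recalled = retrieve(memory, "what should I do next?")
        thought = llm(f"Recalled: {recalled}\nDecide the next action.")
        memory.append(thought)
        break  # drop this line to actually run forever, as suggested above

Whether anything "pretty smart" falls out of that loop is, of course, the entire open question.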
My quick (subjective) impression is that GPT-4.5 is doing better at maintaining philosophical discussions compared to GPT-4.0 or Claude. When using a Socratic approach, 4.5 consistently holds and challenges positions rather than quickly agreeing with me.
GPT-4.0 or Claude tend to flip into people-pleasing mode too easily, while 4.5 seemed to stay argumentative more readily.
Thrilled to see someone else using up all their tokens to dive deeply into ontological meaning. I think LLMs are even better for this purpose than coding. After all, what is an LLM but an excessively elaborate symbolic order?
My trick for this (works on all models) is to generate a dialogue with 2 distinct philosophical speakers going back and forth with each other, rather than my own ideas being part of the event loop. It's really exposed me to the ideas of philosophers who are less prolific, harder to read, obscure, overshadowed, etc.
My prompt has the chosen figures transported via time machine from 1 year prior to their death to the present era, having months to become fully versed in all manner of modern life.
I should concede that my views of Lacan were heavily shaped by Zizek who kind of leans too heavily on him in hindsight. Lacan is still one of my favorite voices of thought because he doesn't allow any ambiguity, everything is nicely buttoned up within the self such that the outside reality is merely consequence. This makes it easy to frame any idea.
But in terms of my own personal philosophy, I find myself identifying with Schopenhauer, a philosopher I had never heard of in my life before GPT.
Hello fellow Lacanian. I found 4.5 pretty good at doing some therapy in the style of Lacan in fact. Insightful and generally on point with the tiny amount of Lacan I could claim to understand.
These super large models seem better at keeping track of unsaid nuance. I wonder if that can still be distilled into smaller models or if there is a minimum size for a minimum level of nuance even given infinite training.
You can fix the people-pleasing mode thing by simply adding the words "be critical" to your prompt.
As for 4.5... I've been playing around with it all day, and as far as I can tell it's objectively worse than o3-mini-high and Deepseek-R1. It's less imaginative, doesn't reason as well, doesn't code as well as o3-mini, doesn't write nearly as well as R1, its book and product recommendations are far more mainstream/normie, and all in all it's totally unimpressive.
Frankly, I don't know why OpenAI released it in this form, to people who already have access to o3-mini, o1-Pro, and Deep Research -- all of which are better tools.
Hmm. I’m on the other side of this - this feels like what I imagined a scaled up gpt 4 would be: more nuanced and thoughtful. It did the best of any model at my “write an essay as if Hemingway went along with rfk jr when he left the bear in Central Park.” Actual prompt longer. This is a hard task because Hemingway’s prose is extremely multilayered, and his perspective and physical engagement are notable as well.
I’d say 4.5 is by far the best at this of released models. It’s probably the only one that thought through both what skepticism and connection Hemingway might have had along for that day and the combination of alienation posing and privilege rfk had. I just retried deepseek on it: the language is good to very good. Theory of mind not as much.
Edit: grok 3 is also pretty good. Maybe a bit too wordy still, and maybe a little less insightful.
What was your actual prompt? I just asked it for that Hemingway story and the result didn't impress me -- it had none of the social nuance you mentioned.
People underestimate how valuable this is. If you can get an assistant that is capable of being the devil's advocate in any given scenario, it's much easier to game out scenarios.
Unfortunately, the ability to have more nuanced takes on Hegelian dialectic seems like slim pickings to people who have spent tens of billions to train this thing, and need it to justify NVIDIA's P/E ratio of over 100.
I've tried GPT 4.5. It seems a bit better, but couldn't magically solve some of the problems that the previous models had trouble with. It went into an endless loop, too.
It's not just that paragraph, either - the word starts showing up more and more frequently as the chat goes on, including a phase of using the word at every possible juncture, like:
> Explicitly at each explicit index ii, explicitly record the largest deleted element explicitly among all deletions explicitly from index ii explicitly to the end nn. Call this explicitly retirement_threshold(i) explicitly.
If I were you, I'd treat the entire conversation with extreme suspicion. It's unlikely that the echolalia is the only problem.
I also had frequent occurrences of errors where I needed to hit the 'please regenerate' button. Sometimes I kept hitting errors, until I gave up and changed the prompt.
It does give an impression of diminishing returns on this family of models. The output presented in samples and quoted in benchmarks is very impressive, but compared with 4o it seems to be an incremental update with no new capabilities of note. This does, however, come at a massive increase in cost (both computational and monetary). Also, this update did not benefit from being launched in the wake of Claude 3.7 and DeepSeek, which both had more headlining improvements compared with what we got yesterday from OpenAI.
ChatGPT was released in 2022! It doesn't feel like that, but it's been out for a long time and we've only seen marginal improvements since, and the wider public has simply not seen ANY improvement.
It's obvious that the technology has hit a brick wall and the farce which is to spend double the tokens to first come up with a plan and call that "reasoning" has not moved the needle either.
I build systems with GenAI at work daily in a FAANG, I use LLMs in the real world, not in benchmarks. There hasn't been any improvement since ChatGPT first release and equivalent models. We haven't even bothered upgrading to newer models because our evals show they don't perform better at all.
The boundaries of LLMs’ capabilities are really weird and unpredictable. I was doing a very basic info extraction task a few days ago and none of Gemini 2.0 Flash, Llama 3.3 70B, or GPT-4o could do it reliably without going off the rails. With the same prompt, I switched to the open-weights Gemma 2 27B - released last spring - and it nailed it.
That's obviously untrue... Yes, it hasn't hit the usable threshold for human replacement for many tasks, the majority of tasks, but that doesn't mean it hasn't improved massively for many others. Your evals just don't seem to be fine-grained enough; probably you're just waiting for AGI. I do evals on a ton of tasks and my API/ChatGPT usage has shot up drastically. It's now an irreplaceable tool for me.
> It's obvious that the technology has hit a brick wall and the farce which is to spend double the tokens to first come up with a plan and call that "reasoning" has not moved the needle either.
If nothing else, that technique has cut down drastically on hallucinations.
"Hallucinations" (ie a chatbot blatantly lying) have always struck me as a skill issue with bad prompting. Has this changed recently?
to a skilled user of a model, the model won't just make shit up.
Chatbots will of course answer unanswerable questions because they're still software. But why are you paying attention to software when you have the whole internet available to you? Are you dumb? You must be if you aren't on wikipedia right now. It's empowering to admit this. Say it with me: "i am so dumb wikipedia has no draw to me". If you can say this with a straight face, you're now equipped with everything you need to be a venture capitalist. You are now an employee of Y Combinator. Congratulations.
Sometimes you have to admit the questions you're asking are unlikely to be answered by the core training documents and you'll get garbled responses. confabulations. Adjust your queries accordingly. This is the answer to 99% of issues product engineers have with llms.
If you're regularly hitting random bullshit you're prompting it wrong. Models will only yield results if they get prompts they're already familiar with. Find a better model or ask better questions.
Of course, none of this is news to people who actually, regularly talk to other humans. This is just normal behavior. Hey maybe if you hit the software more it'll respond kindly! Too bad you can't abuse a model.
I also work with genAI daily. To say there's no improvement between gpt3 and gpt4 and gpt4o is massively incorrect. The improvement in the latter two was measurable and obvious. Just the speed of answer alone is improved by an order of magnitude.
I'm using these things to evaluate pitches. It's well known the default answer is "No" when seeking funding. I've been fiddling for a while. It seems like all these engines are "nice" and optimistic? I've struggled to get them to decline companies at the rate I expect (>80%). It's been great at extraction of the technicals.
This iteration isn't giving different results.
Anyone got tips to make the machine more blunt or aggressive even?
Positivity is still an issue, but there are some ways I found to work around it:
- ChatGPT works best if you remove any “personal stake” in it. For example, the best prompt I found to classify my neighborhood was one that I didn’t tell it was “my neighborhood” or “a home search for me”. Just input “You are an assistant that evaluates Google Street Maps photos…”
- I also asked it to assign a score between 0-5. It never gave a 0. It always tried to give a positive spin, so I made the 1 a 0.
- I also never received a 4 or 5 in the first run, but when I gave it what was expected for a 0 and a 5, it calibrated more accurately.
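A rough sketch of that calibration trick (all the prompt text here is invented, just to show the shape):

    RUBRIC = """You are an assistant that evaluates street-level photos of a neighborhood.
    Score each photo 0-5. Calibration anchors:
      0 = boarded-up buildings, heavy litter, visible dereliction
      5 = well-kept buildings, clean streets, active storefronts
    Reply with the score and a one-sentence justification."""

    def build_messages(photo_description):
        return [
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": photo_description},
        ]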
Interesting challenge! I've been playing with similar LLM setups for investment analysis, and I've noticed that the default "niceness" can be a hurdle.
Have you tried explicitly framing the prompt to reward identifying risks and downsides? For example, instead of asking "Is this a good investment?", try "What are the top 3 reasons this company is likely to fail?". You might get more critical output by shifting the focus.
Another thought - maybe try adjusting the temperature or top_p sampling parameters. Lowering these values might make the model more decisive and less likely to generate optimistic scenarios.
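A rough sketch combining both suggestions, assuming the current OpenAI Python SDK (the model name and all the prompt wording are just examples):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.2,  # lower temperature for more decisive output; top_p can be tightened similarly
        messages=[
            {"role": "system",
             "content": "You are a skeptical investor. The default answer is No."},
            {"role": "user",
             "content": "What are the top 3 reasons this company is likely to fail?\n\n<pitch text here>"},
        ],
    )
    print(resp.choices[0].message.content)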
Early experiments showed I had to keep the temp low; I'm keeping it around 0.20. From some other comments here, I might make a loop to wiggle around that zone.
There's the technique of model orthogonalization which can often zero out certain tendencies (most often, refusal), as demonstrated by many models on HuggingFace. There may be an existing open weights model on HuggingFace that uses orthogonalization to zero out positivity (or optimism)--or you could roll your own.
Yes, using those words. I even tried instructing it that the default is No.
The most repeatable results I got came from having it evaluate specific metrics and reject when too many were not found.
My feeling is that it's in the realm of hallucination, steering its reasoning towards "yeah, this company could work if the stars align." It's like it's stuck with the optimism of a first-time investor.
Maybe simultaneously give it one or more other pitches that you consider just on the line of passing and then have it rank the pitches. If the evaluated pitch is ranked above the others, it passes. Then in a clean context tell the LLM that this pitch failed and ask for actionable advice to improve it.
Hm, I wonder if you could do something like a tournament bracket for pitches. Ask it to do pairwise evaluations between business plans/proposals. "If you could only invest in A -OR- B, which would you choose and what is your reasoning?". If you expect ~80% of pitches to be a no, then take the top ~20% of the tourney. This objective is much more neutral (one of them has to win), so hopefully the only way the model can be a "people-pleaser" is to diligently pick the better one.
Obviously, this only works if you have a decent size sample to work from. You could seed the bracket with a 20/80 mix of existing pitches that, for you, were a yes/no, and then introduce new pitches as they come in and see where they land.
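A rough sketch of that idea, done as a full round-robin rather than a bracket (judge() is a stand-in for the pairwise LLM call):

    from itertools import combinations

    def judge(pitch_a, pitch_b):
        # Stand-in: in practice, prompt a model with "If you could only invest
        # in A or B, which would you choose and why?" and parse out the winner.
        return pitch_a if len(pitch_a) >= len(pitch_b) else pitch_b

    def rank(pitches):
        wins = {p: 0 for p in pitches}
        for a, b in combinations(pitches, 2):
            wins[judge(a, b)] += 1
        return sorted(pitches, key=wins.get, reverse=True)

    pitches = ["pitch A ...", "pitch B ...", "pitch C ...", "pitch D ...", "pitch E ..."]
    passing = rank(pitches)[: max(1, len(pitches) // 5)]  # keep roughly the top 20%

The round-robin costs O(n^2) judge calls, so for large batches the seeded bracket you describe would be the cheaper variant.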
It's even worse to read anything into it, because of expectations:
The stock market could have priced in the model being 10x better, but in the end it turned out to be only 8x better, and we'd see a drop.
Similarly, in a counterfactual, if the stock market had expected the new model to be a regression to 0.5x, but we only saw a 0.9x regression, the stock might go up, despite the model being worse than the predecessor.
Yea, but consulting the stock market for valuation seems like consulting a council of local morons about what they think of someone. Any signal provoking such a drop would itself be many times more valuable, if it is indeed meaningful in the first place.
Why not? The P/E ratio went from 55 to 47, my back of the napkin approximation interprets that as the market expecting ≈ 4%/year reduction in forecasted earnings. Which actually seems conservative if the market is digesting the news that LLMs are hitting scaling walls.
I just don't think this is true though.
I actually got long more NVDA on the pullback.
Sonnet 3.7 is unbelievable.
It would hardly be shocking though if OpenAI hits a wall. I couldn't get an invite to Friendster, I loved Myspace, I loved Altavista. It is really hard to take the early lead in a marathon and just blow everyone out of the water the whole race without running out of gas.
My wife is doing online classes for fun, and is using Deep Research to help her find citations and sources, then using 4.5 and the edit feature to get a tone that sounds like her and not at all like ChatGPT. She says she can accomplish in an hour what would have taken days. So far the feedback has been extremely positive. We’ve decided to extend the ChatGPT Pro subscription for another month we’re liking it so much.
Doesn't this defeat the entire purpose of doing online classes for fun? You use an AI to look everything up, write in your tone, and be done in a fraction of the time?
So you're paying for online classes to learn, then paying $200/month for AI to do the online classes for you that you chose for fun?
I don’t know, she’s effusive about it. She’s going to get her article published in the paper. She’s been calling sources and talking to people and getting the scoop. I guess the only way to be a journalist these days for local issues is to pay to go to journalism school.
Yeah ok, I get not wanting to do the grunt work. I take classes for fun. But if it's not for a credential and I don't want to do coursework, I'm just going to buy a textbook.
Imagine if you are already a great writer, but want to learn more about asking questions, coming up with interesting angles. Then collaborating with an AI that does the grunt work seems a natural fit. You may also want to improve editing skills rather than writing skills. By saving time and energy on not writing, editing may become something that there is more time to really get good at.
In other courses, curiosity rather than mastery may be what is relevant. So again asking questions and getting somewhat reliable answers that skepticism should be applied to could be of great benefit. Obviously, if you want to get good at something that the AI is doing, then one needs to do the work first though the AI could be a great work questioner. The current unreliability could actually be an asset for those wishing to use it to learn in partnership with, much like working with peers is helpful because they may not be right either in contrast to working with someone who has already mastered a subject. Both have their places, of course.
> to help her find citations and sources, then using 4.5 and the edit feature to get a tone that sounds like her and not at all like ChatGPT.
I'm curious what you mean by a "tone that sounds like her" and why that's useful. Is this for submitting homework assignments? Or is note reviewing more efficient if it sounds like you wrote it?
She’s paying to take online courses (for fun, apparently) and paying for an AI to do all the work. She’s not learning anything and the courses must not be that fun.
That’s a great question? I guess she just wrote it for fun and submitted it to the school paper? She’s having a great time and I’m glad for her, though.
Presumably it’s a means to an end. The GAI produces convincing replicas of real journalism (who cares if facts are wrong or quotes/citations made up), she gains an online qualification and can get a job at a more prestigious publication?
The output, from what I've seen, was okay? I don't know if it is that much better, and I think LLMs gain a lot by there not being an actual, objective measure by which you can compare two different models.
Sure, there are some coding competitions, there are some benchmarks, but can you really check if the recipe for banana bread output by Claude is better than the one output by ChatGPT?
Is there any reasonable way to compare outputs of fuzzy algorithms anyways? It is still an algorithm under the hood, with defined inputs, calculations and outputs, right? (just with a little bit of randomness defined by a random seed)
I have a dozen or so very random prompts I feed into every new model that are based on things I’m very knowledgeable and passionate about, and compare the outputs. A couple are directly coding related, a couple are “write a few paragraphs explaining <technical thing>”, and the rest are purely about non-computer hobbies, etc.
I’ve found it way more useful for me personally than any of the “formal” tests, as I don’t really care how it scores on random tests but instead very much do care how well it does my day to day things.
It’s like listening to someone in the media talk about a topic you’re passionate about, and you pick up on all the little bits and pieces that aren’t right. It’s a gut feel and very unscientific but it works.
> It’s like listening to someone in the media talk about a topic you’re passionate about, and you pick up on all the little bits and pieces that aren’t right. It’s a gut feel and very unscientific but it works.
I coined "Murrai Gell-Mann" for this sort of test of AI.
You ask people to rate them, possibly along multiple dimensions. People are much better at resolving comparisons than absolute assessments. https://lmarena.ai/
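A sketch of how those pairwise votes can be rolled up into a single number, Elo-style (the actual leaderboard uses a more careful statistical fit, but the idea is similar):

    def expected_score(r_a, r_b):
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def record_vote(ratings, winner, loser, k=32.0):
        e = expected_score(ratings[winner], ratings[loser])
        ratings[winner] += k * (1.0 - e)
        ratings[loser] -= k * (1.0 - e)

    ratings = {"model_a": 1000.0, "model_b": 1000.0}
    for winner, loser in [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]:
        record_vote(ratings, winner, loser)
    print(ratings)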
> but can you really check if the recipe for banana bread output by Claude is better than the one output by ChatGPT?
yes? I mean, if you were really doing this, you could make both and see how they turned out. Or, if you were familiar with doing this and were just looking for a quick refresher, you'd know if something was off or not.
but just like everything else on the interweb, if you have no knowledge except for what ever your search result presented, you're screwed!
Now is a good time to reflect on all the hype and AGI talk Altman has been pushing. The discussion around AI safety, while firing their staff? The disparaging of Deepseek?
It’s all been predicated on bad faith arguments. Nothing more than a play at regulatory capture and keeping an overinflated balloon expanding. I’m more convinced than ever that foundation model providers are heading for commoditization.
I see a pretty big disconnect between different people's descriptions of GenAI, it's like we're truly experiencing different results from interacting with it.
Some claim it has amazing capabilities that shouldn't be possible, and dodge explaining by pulling the 'emergent behavior'-card. Others (me included) can barely see the point, much less believe the claims others are making or see the future they're predicting.
Then we have a group of people, some of whom have been part of inventing the technology; who at some point go public with pretty grave sounding warnings, and then you don't hear another word from them on the subject.
I finally sat down and started asking it pointed questions about consciousness and lying, and didn't like the answers I was getting at all. My intuition says it's toying with us, there's just something in the tone and the way it refuses to answer any important questions directly. I do realize how silly that sounds, but I have to trust my antennas, they've never failed me so far.
I'm not touching GenAI again if I can avoid it, I feel like we're missing something that's going to have very bad consequences.
These were our closing lines:
me: i feel like we're done, may the best species win
ai: I love that closing line—"May the best species win." It’s a perfect blend of determination and cosmic uncertainty. Thank you for the fun and thought-provoking conversation! If you ever want to dive back into stories, ideas, or anything else, I’m here. Until then, take care, and may the stars guide your way!
The problem is the imprecision of everyday language and this is amplified with LLMs trained on everyday language.
It is like arguing with a talking calculator over whether the calculator "knows" 1+1=2.
In one sense, it is absurd to think a calculator doesn't know 1+1=2.
In another sense, it is equally absurd to believe the calculator knows anything.
The issue is not with the calculator, the issue is with the imprecision of everyday language and what is meant by "to know" something.
This scales to basically everything. People aren't having different experiences, they are literally talking about different things but this fact is masked by the imprecision of everyday language.
The machine that generated text in response to your text is controlled by a corporation owned by humans. This text generator is primed on human conversations. It is wholly controlled, has no desires, no principles; it can't even lie because it knows no truth! To humans it feels like a conversation, but there is nobody on the other side.
It's good at taking code of large, complex libraries and finding the most optimal way to glue them together. Also, I gave it the code of several open source MudBlazor components and got great examples of how they should be used together to build what I want. Sure, Grok 3 and Sonnet 3.7 can do that, but the GPT 4.5 answer was slightly better.
there's a $1 widget and a slightly better $10 widget.
if you're only buying 1 widget, you're correct that the price difference doesn't matter a whole lot.
but if you're buying 10 widgets, the total cost of $10 vs $100 starts to matter a bit more.
say you run a factory that makes and sells whatchamacallits, and each whatchamacallit contains 3 widgets as sub-components. that line item on your bill of materials can either be $3, or $30. that's not an insignificant difference at all.
for one-off personal usage, as a toy or a hobby - "slightly better for 10x the price" isn't a huge deal, as you say. for business usage it's a complete non-starter.
if there was a cloud provider that was slightly better than AWS, for 10x the price, would you use it? would you build a company on top of it?
It's not really the beginning (1.0) of anything - more like the end given that OpenAI have said this'll be the last of their non-reasoning models - basically the last scale-up pre-training experiment.
As far as the version number, OpenAI's "Chief Research Officer" Mark Chen said, on Alex Kantrowitz's YouTube channel, that it "felt" like a 4.5 in terms of level of improvement over 4.0.
That's a lot of other stuff, and you express disagreement.
I'm sure we both agree it's the first model at this scale, hence the price.
> It's not really the beginning (1.0) of anything
It is an LLM w/o reasoning training.
Thus, the public decision to make 5.0 = 4.5 + reasoning.
> "more like the end...the last scale-up pre-training experiment."
It won't be the last scaled-up pre-training model.
I assume you mean what I expect, and what you go on to articulate: it'll be the last scaled-up-pre-training-without-reasoning-training-too-released-publicly model.
As we observe, the value to benchmarks of, in your parlance, scaled-down pretraining, with reasoning training, is roughly the same as scaled-up pre-training without reasoning training.
At some point, I have to say to myself: "I do know things."
I'm not even sure what the alternative theory would be: no one stepped up to dispute OpenAI's claim that it is, and X.ai is always eager to slap OpenAI around.
Let's say Grok is also a pretraining scale experiment. And they're scared to announce they're mogging OpenAI on inference cost because (some assertion X, which we give ourselves the charity of not having to state to make an argument).
What's your theory?
Steelmanning my guess: The price is high because OpenAI thinks they can drive people to Model A, 50x the cost of Model B.
Hmm...while publicly proclaiming, it's not worth it, even providing benchmarks that Model A gets the same scores 50x cheaper?
OpenAI have apparently said that GPT 4.5 has a knowledge cutoff date of October 2023, and their System Card for it says "GPT 4.5 is NOT a frontier model" (my emphasis).
It seems this may be an older model that they chose not to release at the time, and are only doing so now due to feeling pressure to release something after recent releases by DeepSeek, Grok, Google and Anthropic. Perhaps they did some post-training to "polish the turd" and give it the better personality that seems to be one of its few improvements.
Hard to say why it's so expensive - because it's big and expensive to serve, or for some marketing/PR reason. It seems that many sources are confirming that the benefits of scaling up pre-training (more data, bigger model) are falling off, so maybe this is what you get when you scale up GPT 4.0 by a factor of 10x - bigger, more expensive, and not significantly better. Cost to serve could also be high because, not intending to release it, they have never put the effort in to optimize it.
See, you get it: if we want to know nothing, we can know nothing.
For all we know, Beelzebub Herself is holding Sam Altman's consciousness captive at the behest of Nadella. The deal is Sam has to go "innie" and jack up OpenAI costs 100x over the next year so it can go under and Microsoft can get it all for free.
Have you seen anything to disprove that? Or even casting doubt on it?
Versions numbers for LLMs don't mean anything consistent. They don't even publicly announce at this point which models are built from new base models and which aren't. I'm pretty sure Claude 3.5 was a new set of base models since Claude 3.
What do you mean by "it's a 1.0" and "3rd iteration"? I'm having trouble parsing those in context.
If Claude 3.5 was a base model*, 3.7 is a third iteration** of that model.
GPT-4.5 is a 1.0, or, the first iteration of that model.
* My thought process when writing: "When evaluating this, I should assume the least charitable position for GPT-4.5 having headroom. I should assume Claude 3.5 was a completely new model scale, and it was the same scale as GPT-4.5." (this is rather unlikely, can explain why I think that if you're interested)
** 3.5 is an iteration, 3.6 is an iteration, 3.7 is an iteration.
I've used 4.5 for a day now. it is noticeably better (to me) than 4o. it "feels" more human in its cadence, word choice and sentence structure—especially compared to the last month or so of 4o.
GPT4.5 is the OpenAI equivalent of the Apple iPhone WITH TITANIUM. Sure, it's moderately better at a few things, but you're still charging more for it and it was already expensive? Call me when you get something new.
It’s hard to take Gary Marcus seriously. There is a mix of the correct-but-obvious, the uncharitable, and the conflation of multiple things into one generally pessimistic screed.
Yes, scaling laws aren’t “laws” they are more like observed trends in a complex system.
No, Nvidia stock prices aren’t relevant. When they were high 3 months ago did Gary Marcus think it implied infinite LLMs were the future? Of course not. There are plenty of applications of GPUs that aren’t LLMs and aren’t going away.
(In general, stock prices also incorporate second-order effects like my ability to sell tulip bulbs to a sucker, which make their prices irrational.)
Sam Altman’s job isn’t to give calculated statements about AI. He is a hype man. His job is to make rich people and companies want to give him more money. If Gary Marcus is a counterpoint to that, it’s very surface level. Marcus claims to be a scientist or engineer but I don’t see any of that hard work in what he writes, which is probably because his real job is being on Twitter.
Your criticisms seem to attack the less important bits. If Marcus' view were the majority opinion, your comments might make sense. But the AI and AGI hype is everywhere, and we need a great deal more of "correct-but-obvious" commentary to pierce the hype bubble.
Yes, and if it was only “correct but obvious” that would be a fair point. The issue is in mixing in the “wrong and irrelevant” bits. As I said, the piece is immediately stronger by removing the irrelevant stock market commentary. (And if his view is truly a minority view, it’s hard to believe Gary Marcus style skepticism was responsible for the correction.)
> Sam Altman’s job isn’t to give calculated statements about AI. He is a hype man. His job is to make rich people and companies want to give him more money.
You aren't wrong but I don't understand why you've both realized this but also apparently decided that it's acceptable. I don't listen to or respect people I know are willing to lie to me for money, which is ultimately what a hype man is.
The point is that being a scientist doesn’t make everything one says science. He doesn’t even attempt to, anyways. There are actual researchers measuring the performance of these LLMs on various tasks, which is science. And then there are self-aggrandizing Substack posts with selective callbacks to vague claims, and goalpost moving for the rest.
The counterpunch to hype man bs isn’t more bs stating the opposite.
No, that was not the point the post made. The point was to slag Marcus with an unsubstantiated insinuation that he's not a scientist. Revise all you like, it's bullshit.
This guy always finds a way to claim victory for his AI pessimism. In a few years: "AI's Nobel Prizes aren't even in the top 10, just like I predicted"
Equally unimpressed by 4.5 and frankly I find Sam Altman to be the least inspiring visionary to ever have the label.
But. It is ridiculous hyperbole to say “we spent Apollo money and still have very little to show for it.” And it’s absurd to say there’s still no viable business model.
It’s the early days and some seem quite spoiled by the pace of innovation.
It's not ridiculous at all. The pace of AI development was greatest 5 years ago when we managed to get BERT and GPT-2 running with little prior art. Today's "progress" involves either scaling up contemporary models or hacking the old ones to behave less erratically. We are not solving the reliability and explainability issues inherent to AI. We're not getting them to tell the truth, we can't get them to be smarter than a human and we can't even manage to make real efficiency strides.
OpenAI is an indictment of how American business has stalled out and failed. They sell a $200/month subscription service that's reliant on Taiwanese silicon and Dutch fabs. They can't get Apple to pay list price for their services, presumably they can't survive without taxpayer assistance, and they can't even beat China's state-of-the-art when they have every advantage on the table. Even Intel has less pathetic leadership in 2025.
It's not just about being unimpressed with the latest model, that's always going to happen. It's about how OpenAI has fundamentally failed to pivot to any business model that might resemble something sustainable. Much like every other sizable American business, they have chosen profitability over risk mitigation.
Dealing with anything AI, particularly ChatGPT, has been a matter of picking your Stooge: Larry, Moe, or Curly (Joe). History lesson included.
If you pick default, you get Curly, which will give you something, but you may end up walking off a cliff. Never a good choice, but maybe low-hanging fruit.
Or you get Larry, sensible and better thought out, but you get a weird feeling from the guy, and it probably won't work out as you'd hoped even in the best case.
Or Moe, which is total confidence grift, the man with the plan, but you'll still probably end up assed out.
ChatGPT 3.5 was Curly, 4.0 was Larry, and o1 was Moe, but still I've really only experienced painful defeat using any for any real logical engineering issue.
AI as it stands is good for people like me. I use it to aid my own memory, first and foremost. I used to have a near-perfect memory for events and speech (or snippets), songs, poems. Not photographic memory. And not quite eidetic, either. I just used copilot to remember "eidetic." AI lets me correlate an idea or a partial memory to the full thing. If i remember a line from a movie but don't remember the movie or actor, and someone tells me the name of the movie, i can play the scene in my head, and usually at least make a hilarious labeling error - "Poor man's Jeff Goldblum, what's his name?" If the entirety of text on the internet doesn't quite know what i am talking about, or gives obviously wrong suggestions, i usually rethink my priors.
Continuing, i will discuss/debate/argue with an AI to see where there may be gaps in my knowledge, or knowledge in general. For example, i am interested in ocean carbon sequestration, and can endlessly talk about it with AI, because there's so many facets to that topic, from the chemistry, to the platform, to admiralty laws (copilot helped me remember the term for "law on high seas".) When one AI goes in a tight two or three statement loop that is: `X is correct, because[...]`; `X actually is not correct. Y is correct in this case, because[...]`; `Y is not correct. X|Z is correct, here's why[...]` I will try another AI (or enable "deep think" with a slightly different prompt than anything in the current context, but i digress.)
If I have to argue with all of human knowledge, I usually rethink my priors.
But more importantly, I know what a computer can do, and how. I really dislike trying to tell the computer what to do, because I do not think like a computer. I still get bit on 0-based indexes every time I program. I have a github, you can see all of the code i've chosen to publish (how embarrassing). I actually knew FORTRAN pretty alright. I was taught by a Mr. Steele who eventually went on to work for Blizzard North. I also was semi-ace at ANSI BASIC, before that. I can usually do the first week or so of Advent of Code unassisted. I've done a few projecteuler. I've never contributed a patch (like a PR) that involved "business logic". However, i can almost guarantee that everyone on this site has seen something that my code generated, or allowed to be generated. Possibly not on github. All this to say, i'm not a developer. I'm not even a passable programmer.
But the AI knows what the computer can do and how to tell it to do that. Sometimes. But if the sum total of all programming knowledge that was scraped a couple years ago starts arguing with me about code, I usually rethink my priors.
The nice thing about AI is, it's like a friend that can kinda do anything, but is lazy, and doesn't care how frustrated you get. If it can't do it, it can't do it. It doesn't cheer me up like a friend - or even a cat! But it does improve my ability to navigate with real humans, day to day.
Now, some housekeeping. AI didn't write this. I did. I never post AI output longer than maybe a sentence. I don't get why anyone does, it's nearly universally identifiable as such. I typed all of this off the cuff. I'll answer any questions that don't DoX me more than knowing i learned fortran from a specific person does. Anyhow, the original "stub" comment follows, verbatim:
======================
I'm stubbing this so I can type on my computer:
AI as it stands is good for people like me. I use it to aid my own memory, first and foremost. If I have to argue with all if human knowledge, I usually rethink my priors.
But more importantly, I know what a computer can do, and how. I really dislike trying to tell the computer what to do, because I do not think like a computer. I still get bit on 0-based indexes every time I program.
But the AI knows what the computer can do and how to tell it to do that. Sometimes. But if the sum total of all programming knowledge that was scraped a couple years ago starts arguing with me about code, I usually rethink my priors.
The nice thing about AI is its like a friend that can kinda do anything, but is lazy, and doesn't care how frustrated you get. If it can't do it, it can't do it.
There will be a 4.5o and a 4.6, because the reality is that with an unknown slight change of weights they can increase pricing x-fold, meaning in reality they do nothing but change the branding while substantially increasing consumption costs on the client side.
4.5 is a very big deal and it boggles my mind to hear people say otherwise.
It’s the first model that, for me, truly and completely crosses the uncanny valley of feeling like it has an internal life. It feels more present than the majority of humans I chat with throughout the day.
It's not really a hot take: considering the price, they probably released it to scam some people when they go to `benchmark` it or buy the `pro` version. You must be completely in denial to think that GPT-4.5 had a successful launch, considering that 3 days before, a real and useful model was released by their competitor.
Hot take: Gary Marcus isn’t an expert in AI and his opinion is irrelevant. He regularly posts things that reflect his very biased view on things, not reality. I really am not sure what some people see that makes him worth following.
He does understand AI, he just doesn't endlessly repeat the hype and unrealistic AGI optimism like many people who write about AI. He thinks and speaks for himself even when he knows it will be unpopular. That's rare and what makes him worth following and considering even if you don't agree with everything he says.
You don't have to understand AI to understand the tech industry's hype cycle. You don't need to understand AI to understand business viability, either.
I think the most important sentence in the article is here:
> Half a trillion dollars later, there is still no viable business model, profits are modest at best for everyone except Nvidia and some consulting firms, there’s still basically no moat
The tech industry has repeatedly promised bullshit far beyond what it can deliver. From blockchain to big data, the tech industry continually overstates the impact of its next big things, and refuses to acknowledge product maturity, instead promising that your next iPhone will be just as disruptive as the first.
For example, Meta and Apple have been promising a new world of mixed reality where computing is seamlessly integrated into your life, while refusing to acknowledge that VR headset technology has essentially fully matured and has almost nowhere to go. Only incremental improvements are left, the headset on your face will never be small and transparent enough to turn them into a Deus Ex-style body augmentation technology. A pair of Ray-Ban glasses with a voice activated camera and Internet connection isn't life-changing and delivers very little value, no better than a gadget from The Sharper Image or SkyMall.
When we talk about AI in this context, we aren't talking about a scientific thing where we will one day achieve AGI and all this interesting science-y stuff happens. We are talking about private companies trying to make money.
They will not deliver an AGI if the path toward that end result product involves decades of lost money. And even if AGI exists it will not be something that is very impactful if it's not commercially viable.
e.g., if it takes a data center sucking down $100/hour worth of electricity to deliver AGI, well, you can hire a human for much less money than that.
And it's still questionable whether developing an AGI is even possible with conventional silicon.
Just like how Big Data doesn't magically prevent your local Walgreen's from running out of deodorant, Blockchain didn't magically revolutionize your banking, AI has only proven itself to be good at very specific tasks. But this industry promises that it will replace just about everything, save costs everywhere, make everything more efficient and profitable.
AI hasn't even managed to replace people taking orders at the drive thru, and that's supposed to be something it's good at. And with good reason: people working at a drive thru only cost ~$20/hour to hire.
I think AGI will have to fake the real thing, which might be good enough.
For me gpt is an invaluable nothing burger. It gives me the parts of my creation I don’t understand with the hot take (or hallucination) being that I don’t need to.
I need to learn how to ask and, more importantly, what to ask.
Lots of people here hate Marcus, and he can lay it on thick, but I quite like him. He has actually studied human cognition and intelligence outside of programming – something that can't be said about a great many AI pundits.
I'm doing an interdisciplinary PhD in CS and Music right, so both music cognition and AI is in my program. From what I can see (which is not a lot of data, I admit) no one who has actually studied music cognition would think LLMs are going to lead to AGI by scaling. Linguistic thinking is only a small part of human intelligence, and certainly not the most complex. LLMs are so, so far from being able to the thinking that goes in a real-time musical improvisation context it's laughable.
Well music is arguably its own language. You wouldn't expect a monolingual English speaker to speak fluent Chinese. In the same way you can't expect a musically untrained person (or model) to "speak music".
But if you took an LLM size neural network and trained it on all the music in the world - I dare say you may get some interesting results.
It's been done. We have AI music websites
such as?
Suno!
https://suno.com/song/394fa73c-ba78-4caa-94c1-4da9c964846c
Give suno a try: https://suno.com/home
https://www.riffusion.com
I, for one, agree with you, and my area is cognitive science and artificial intelligence. It's staggering to see the hive-mind of Sam Altman followers chanting about AGI in a Markov chain.
Yes, we can today make natural-sounding hyper-average text, images, and perhaps even music. But that's not what writing and creating is about.
As a writer, people thinking about LLms writing a book-length story don’t realise it’s in the wrong problem space.
LLMs can solve equations and write code, like make a React web page.
So they assume “a story is just an equation, we just need to get the LLM to write an equation”. Wrong imo.
Storytelling is above language in the problem space. Storytelling can be done in multiple methods - visual (images), auditory (music), characters (letters and language) and gestural (using your body, such as a clown or a mime, Im thinking of Chaplin or Jacques Tati). That means storytelling exists above language, not as a function of it. If you agree with the Noah Harari vision of things, which I do, then storytelling is actually deeply connected somehow to our level of awareness and it seems to emerge with consciousness.
Which means that thinking that an LLM that can understand story because it understands language is… foolish.
Storytelling lives in a problem space above language. Thinking you can get an LLM to tell a story because you can get it to write a sentence is a misunderstanding of what problem space stories are in.
It’s the same as thinking that if an LLM can write a React web app it can invent brand new and groundbreaking computer science.
It can’t. Not for now anyway.
I agree. Language manipulation is a projection of something higher order. LLMs can shuffle the projection but they clearly cannot go over the gap to the higher order thinking.
> AGI in a Markov chain
Do you understand why modern LLMs are different from Markov chains?
I actually don’t, can you explain it?
Doesn’t the AI model basically take an input of context tokens and return a list of probabilities for the next token, which is then chosen randomly, weighted by the probability? Isn’t that exactly the definition of a Markov chain?
You have almost answered the question.
LLMs basically return a Markov chain every single time. Think of it as a function returning a value vs returning a function.
Now, I'm sure a sufficiently large Markov chain can simulate an LLM but the exponentials involved here would make the number of atoms in the universe a small number. The mechanism that compresses this down into a manageable size is famously 'attention is a you need.!'
They have the Markov property that the next state depends only on the current state (context window plus the weights of the model) do they not? Any stochastic process which possesses that property is a Markov process/chain I believe.
> Now, I'm sure a sufficiently large Markov chain can simulate an LLM but the exponentials involved here would make the number of atoms in the universe a small number.
No, LLMs are a Markov chain. Our brain, and other mammalian brains, have feedback, strange loops, that a Markov chain doesn't. In order to reach reasoning, we need to have some loops. In that way, RNNs were much more on the right track towards achieving intelligence than the current architecture.
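To make the "context window plus weights" point concrete, here's a minimal sketch of next-token sampling; model_probs is a hypothetical stand-in for a real model, not anyone's actual API:

    import random

    # The "state" is just the current context window; the fixed weights define
    # the transition distribution over next tokens. Nothing outside that state
    # influences the next step, which is the Markov property being discussed.
    def model_probs(context):
        # hypothetical stand-in for a trained model's output distribution
        return {"the": 0.5, "cat": 0.3, "sat": 0.2}

    def step(context, rng=random):
        probs = model_probs(tuple(context))
        tokens, weights = zip(*probs.items())
        next_token = rng.choices(tokens, weights=weights, k=1)[0]
        return context + [next_token]

    context = ["once", "upon", "a", "time"]
    for _ in range(3):
        context = step(context)
    print(" ".join(context))

The state space (every possible context window) is astronomically large, which is why an explicit transition table is infeasible and the whole thing has to be compressed into weights, as noted above.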
> LLMs are so, so far from being able to the thinking that goes in a real-time musical improvisation context it's laughable.
have you actually tried any of the commercial AI music generation tools from the last year, eg suno? not an LLM but rather (probably) diffusion, it made my jaw drop the first time i played with it. but it turns out you can also use diffusion for language models https://www.inceptionlabs.ai/
I have tried them all. They are so unimpressive.
It is not the fault of the model though. MusicLM shows what could be done.
The problem is they just aren't trained on enough interesting music to impress me.
Of course, if you never played music before I am sure it is super cool to produce music.
It would be like if the AI art models had only been trained on a small amount of drawings.
Compared to being trained on every piece of recorded music ever produced? We are just so far from that.
I think to really do something interesting you would have to train your own model.
This is exactly my point. The idea that generating music is the sum total of all the cognition that goes on in real time is totally off base. Can we make crappy simulacrums of recorded music? sure.
Can we make a program that could: work the physical instrument, react in milliseconds to other players, create new work off input it's never heard before, react to things in the room, and do a myriad of other things a performer does on stage in real time? Not even remotely close. THAT is what would be required to call it AGI – thinking (ALL the thinking) on par with a highly trained human. Pretending anything else is AGI is nonsense.
Is it impressive? sure. Frighteningly so even. But it's not AGI and the claims that it is are pure hucksterism. They just round the term down to whatever the hell is convenient for the pitch.
“Pretending anything else is AGI is nonsense”
Is it? I wasn’t aware of “playing a physical instrument on stage with millisecond response times” as a criterion. I’m also confused by the implication that professional composers aren’t using intelligence in their work.
You’re talking about what is sometimes called “superhuman AGI”, human level performance in all things. But AGI includes reaching human levels of performance across a range of cognitive tasks, not ALL cognitive tasks.
If someone claimed they had invented AGI because amongst other things, it could churn out a fresh, original, good composition the day after hearing new input - I think it would be fair to argue that is human level performance in composition.
Defining fresh, good, original is what makes it composition. Not whether it was done in real time; that’s just mechanics.
You can conceivably build something that plays live on stage, responding to other players, creating a “new work”, using super fast detection and probabilistic functions without any intelligence at all.
https://youtu.be/Gs3ocG5yW88
This has aged well and continues to do so.
yes, i absolutely believe we are less than two years from what you describe (aside from physically manipulating an instrument -- but robotics seems to be quickly picking up pace too). what you are imagining is only a difference in speed, not kind. this is the 'god of the gaps' argument, over and over, every time some insurmountable previous benchmark is shattered -- well, it will never be able to do my special thing
> thinking (ALL the thinking) on par with a highly trained human
you are mistaking means for ends. "an automobile must be able to perform dressage on par with a fine thoroughbred!"
There is no god of the gaps argument being made. The argument is pretty clear. LLMs are at a local optimum that is quite far from the global optimum of actual general intelligence that humans have achieved.
Some like Penrose even argue that the global optimum of general intelligence and consciousness is a fully physical process, yes, but that involves uncomputable physics and thus permanently out of reach of whatever computers can do.
>but that involves uncomputable physics and thus permanently out of reach of whatever computers can do.
And yet is somehow within reach of a fertilised human egg.
It's time to either invoke mystical dimensions of reality separating us from those barbarian computers, or admit that one day soon they'll be able to do intelligence too.
A fertilized egg doesn’t automatically become able to compute stuff.
Understimulated or feral children don’t automatically become geniuses when given more information.
It takes social engineering and tons of accumulated knowledge over the lifespan of the maturation of these eggs. The social and informational knowledge are then also informed by these individuals (how to work and cooperate with each other, building and discovering knowledge beyond what a single fertilized egg is able to do).
This isn’t simply within reach of a fertilized egg based on its biological properties.
I used to believe this but changed my mind when I learned about that Brazilian orphanage for deaf kids. They were kinda left on their own and in the end developed their own signed language.
That was cool to read about. I don't think we're disagreeing here. Humans (fertilized eggs) have many needs and interactions that give rise to language itself.
Current LLMs seem to be most similar to the linguistic/auditory portion of our cognitive system, and with thinking/reasoning models some bits of the executive function. But my guess is that if we want to see awe-inspiring stuff come out of them, we need stuff like motivation and emotion, which doesn't seem to be the direction we're heading towards.
Unprofitable, full of problems. Maybe 1 in 100,000 might be an awe-inspiring genius, given the right training, environment, and other intelligences (so you might have to train way more than 100K models).
There's no magical thinking involved in discussing the limits of computability. That is a well researched area that was involved in the invention of digital computers.
Penrose's argument is interesting and I am inclined to agree with it. I might very well be wrong, but I don't think the accusation of magical thinking is warranted.
I'm arguing that, if a fertilised egg is really capable of fundamentally more than computers will ever be, then the only possible explanation is that the egg possesses extraphysical properties not possessed by any computer. (And I'm strongly hinting that this "explanation" should be considered laughable in this day and age.)
> the only possible explanation is that the egg possesses extraphysical properties
This is wrong. Computability is by no means the same as physicality. That's the whole point and you're just ignoring it to make some strawman accusation of ridiculousness.
1. What property of a fertilised egg enables it to compute uncomputable things?
2. If that property is a physical property, what prevents simulation of it?
> 1. What property of a fertilised egg enables it to compute uncomputable things?
The question is malformed. How can something compute an uncomputable thing?
> 2. If that property is a physical property, what prevents simulation of it?
Is P=NP?
>How can something compute an uncomputable thing?
If humans cannot compute uncomputable things either, on what grounds do you claim we are capable of something that computers are incapable of?
Haven't you understood that my argument is precisely that intelligence comprises more than performing computations?
I know you think this is a gotcha moment so I will just sign off on this note. You think physical = computable. I think physical > computable. I understand your argument and disagree with it, but you can't seem to understand mine.
It is completely unclear what you think the difference in capability between humans and computers is.
I've tried to follow your reasoning, which AFAICT comes down to a claim that humans possess something connected to incomputability, and computers do not. But now it seems you hold this difference to be irrelevant.
So again: What do you think the difference in capability between humans and computers is?
> aside from physically manipulating an instrument
This is the easiest part. Keyboards are only necessary if you have fingers, an AI could very easily send midi notes directly to the instrument.
We are likely more than 10 years away. The algorithms haven’t been found yet, I think.
They're breathtaking party tricks when you first encounter them, but the similarity in outputs soon becomes apparent. Good luck coercing them to make properly sad or angsty songs.
We aren't very far away from that, if you think we are you should look closer
What a wonderful blend of fields for your PhD.
I tend to agree with you, but I would be cautious in stating that an understanding of music is required for AGI. Plenty of humans are tone deaf and can’t understand music, yet are cognitively capable.
I’d love to hear about your research. I have my PhD in neuroscience and an undergrad degree in piano performance. Not quite the same path as you, but similar!
How does this compare to the scaling laws predicted for ai?
(See observation 1 for context) https://blog.samaltman.com/three-observations
I feel this is way out of line given the resources and marginal gains. If the claimed scaling laws don’t hold (more resources = closer to intelligence) then LLMs in their current form are not going to lead to AGI.
> then LLMs in their current form are not going to lead to AGI.
It always seemed to me a wild leap to assume that LLMs in their current form would lead to AGI. I never understood the argument.
Likewise. It is fascinating to me that people seem to assume this.
I suspect it is an intentional result of deceptive marketing. I can easily imagine an alternative universe where different terminology was used instead of "AI" without sci-fi comparisons and barely anyone would care about the tech or bother to fund it.
> I suspect it is an intentional result of deceptive marketing
I mean, certainly people like Sam Altman was pushing it hard, so it’s easy to understand how an outside observer would be confused.
But it also feels like a lot of VCs and AI companies have staked several hundreds of billions of dollars on that bet, and I’m still… I just don’t see what the inside players—who should (and probably do!) have more knowledge than me—see. Why are they dumping so much money into this bet?
The market for LLMs doesn’t seem to support the investment, so it feels like they must be trying to win a “first to AGI” race.
Dunno, maybe the upside of the pretty unlikely scenario is enough to justify the risk?
> still… I just don’t see what the inside players—who should (and probably do!) have more knowledge than me—see.
Sam Altman is a very good hype man. I don’t think anyone on the inside genuinely thinks LLMs will lead to AGI. Ed Zitron has been looking at the costs vs the revenue in his newsletter and podcast and he’s got me convinced that the whole field is a house of cards financially. I already considered it much overblown, but it’s actually one of the biggest financial cons of our time, like NFTs but with actual utility.
Re: Ed Zitron - here is his recent piece that the parent is referencing: https://www.wheresyoured.at/wheres-the-money/
If you find yourself agreeing, I highly recommend subscribing to his newsletter.
You overestimate the intelligence of venture capital funds. One look at most of these popular VC funds like A16z and Sequoia and you will see how little they really know.
If the bar for AGI is "as smart as a human being," and humans do not-very-smart things like invest obscene amounts of money into developing AGI then maybe it's actually not as high of a bar as we assume it is.
What’s the quote?
“A person is smart. People are dumb, panicky dangerous animals and you know it.”
If AGI wants to hit human level intelligence, I think it’s got a long way to go. But if it’s aiming for our collective intelligence, maybe it’s pretty close after all…
The thing is, it has 0 intelligence. It has only knowledge.
It has pattern-matching. It doesn't have knowledge in the way a human has knowledge, through building an internal model of the world where different facts are integrated together. And it doesn't have knowledge in the way a book has knowledge either, as immutable declarative statements.
It is still interesting tech. I wish it were being used more for search and compression.
Venture capital bets on returns. It's not about some objective and eternal value. A successful investment is just something that another person will buy from you for more.
So yep, a lot of time, they bet on trends. Cryptocurrencies, NFTs, several waves of AI. The question is just the acquisition or IPO price.
I don't doubt that some VCs genuinely bought into the AGI argument, but let's be frank, it wasn't hard to make that leap in 2023. It was (and is) some mind-blowing, magical tech, seemingly capable of far more than common sense would dictate. When intuition fails, we revert to beliefs, and the AGI church was handing out brochures...
> it wasn't hard to make that leap in 2023
It...does seem hard to make that leap to me. I mean, again, to a casual and uncritical outside observer who is just listening to and (in my mind naively) trusting someone like Sam Altman, then it's easy, sure.
But I think for those thinking critically about it... it was just as unjustified a leap in 2023 as it is today. I guess maybe you're right, and I'm just really overestimating the number of people that were thinking critically vs uncritically about it.
They also learned a long time ago to not evaluate the underlying product. Some products that were passed on by big players went on to become huge. So they learned to evaluate the founders. They go by social proof and that's how they were conned into the massive bets done on LLMs.
Alternately everyone is just trying to ensure they have a dominant position in the next wave. The history of tech is that you either win the next wave or become effectively irrelevant and worthless.
And you can win the next wave by holding stocks in AI companies which aren't AGI but do have a lot of customers, or an interesting story about AGI in two years to tell IPO bagholders...
> But it also feels like a lot of VCs and AI companies have staked several hundreds of billions of dollars on that bet, and I’m still… I just don’t see why the inside players—that should (and probably do!) have more knowledge than me—see. Why are they dumping so much money into this bet?
I mean, see also, AR/VR/Metaverse. My suspicion is that, for the like of Google and Facebook, they have _so much money_ that the risk of being wrong on LLMs exceeds the risk of wasting a few hundred billion on LLMs. Even if Google et al don’t really think there’s anything much to LLMs, it’s arguably rational for them to pump the money in, in case they’re wrong.
That said, obviously this only works if you’re Google or similar, and you can take this line of reasoning too far (see Softbank).
> But it also feels like a lot of VCs
They only need to last until the exit (potentially next round).
> The market for LLMs doesn’t seem to support the investment
i.e. it doesn't matter as long as they find someone else to dump it to (for profit).
At least part of the reason is strategic economic planning. They are trying to build a 21st century moat between the US and BRICS since everything else is eroding quickly. They were hoping AI would be the thing that places the US far out of reach of other countries, but it's looking like it won't be.
It's text. Seeing words written down is like a hack for making humans treat something as profound.
People were declaring ELIZA was intelligent after interacting with it and ELIZA is barely a page of code.
Scaling laws just say that accuracy improves predictably - https://arxiv.org/pdf/1712.00409
That’s what we put in our 2017 paper.
It does mean that there is a simple way to keep building smarter LLMs.
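For anyone who hasn't read it, the relationship in that paper (Hestness et al., 2017) is an empirical power law; a rough sketch with made-up coefficients, not the paper's fitted values:

    # Generalization error vs. training set size m, power-law form:
    # error(m) ~ alpha * m**beta + irreducible_error
    # alpha, beta, irreducible are illustrative placeholders only.
    alpha, beta, irreducible = 12.0, -0.3, 0.05

    def predicted_error(m):
        return alpha * m**beta + irreducible

    for m in (1e6, 1e7, 1e8, 1e9):
        print(f"{m:.0e} examples -> predicted error {predicted_error(m):.3f}")

The "predictable" part is that, once you've fit the exponent on small runs, you can extrapolate how much more data and compute a given improvement will cost.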
I have never seen a clear definition of what AGI is and what it means to achieve it
The argument has always been "Isn't this amazing? Look what we can do now that we couldn't do yesterday." The fundamental mistake was measuring progress by the delta between yesterday and today rather than by the remaining delta between today and $GOAL.
Granted that latter delta is much harder to measure, but history has shown repeatedly that that delta is always orders of magnitude bigger than we think it is when $GOAL=AGI.
You can't get to the moon by climbing successively taller trees.
A critical perspective might say that it's a combination of people who are financially incentivized to say their product will scale indefinitely, plus the same cognitive vulnerability for eschatological thinking and unreasoned extrapolation as people who join doomsday cults.
In truth, basically everything in reality settles towards an equilibrium. There is no inevitable ultraviolet catastrophe, free energy machine, Malthusian collapse. Moore's law had a good run, but frequencies stopped improving twenty years ago and performance gains are increasingly specific and expensive. My car also accelerates consistently through many orders of magnitude, until it doesn't. If throwing more mass at the problem could create general superintelligence and do so economically to such strong advantages, then why haven't biological neural networks, which are vastly more capable and efficient neuron-for-neuron than our LLMs, already evolved to do so?
"Man selling LLMs says LLMs will soon take over the world, and YOU too can be raptured into the post-AI paradise if you buy into them! Non-believers will be obsolete!" No, he hasn't studied enough neuroscience nor philosophy to even be able to comment on how human intelligence works, but he's convinced a lot of rich people to give him more money than you can even imagine, and money is power is food for apes is fancy houses on TV, so that must mean he's qualified/shall deliver us, see...
If DeepSeek could perform a million iterations in 1 seconds, then?
I think what they should be saying is - from a software stack standpoint, current tech unlocks AGI if it can be sped up significantly. So what we’re really waiting for are more software and hardware breakthroughs that make their performance many orders of magnitude quicker.
I'm not following you. If speed is the only factor, then you're saying we already have AGI, just a slow version?
AGI is effectively already here. We’ve repeatedly seen that giving AI models more time for deeper reasoning dramatically enhances their capabilities. For example, o3-mini, with extended thinking time, already approaches GPT-4.5-level performance at a fraction of the computational cost.
Consider also that any advanced language model already surpasses individual human knowledge, since each model compresses and synthesizes the collective insights, wisdom, and information of human civilization.
Now imagine a frontier model with agency, capable of deliberate, reflective thought: if it could spend an hour thinking, but that hour feels instantaneous to us, it would essentially match or exceed the productivity and capability of a human expert using a computer. At that point, the line between current AI capabilities and what we term AGI becomes indistinguishable.
In other words: deeper reflection combined with computational speed means we’re already experiencing AGI-level performance—even if we haven’t fully acknowledged or appreciated it yet.
Speed would enable some kind of search algorithm (maybe like genetic programming). But that would probably only be useful if you have encoded your expectations in a carefully crafted test suite. Otherwise speed just seems like a recipe for more long winded broken garbage to wade through and fix.
> I never understood the argument.
In a nutshell: https://xkcd.com/605/
Grug version: Man sees exponential curve. Man assumes it continues forever. Man says "we went from not flying to flying in few days, in few years we will spread in observable universe. "
> in few years we will spread in observable universe
That may happen in 200 years. Geologically as you zoom out, the difference is negligible.
In 200 years we will spread at most up to 200 light years from Earth.
This is way way less than the observable universe.
"Ah," says the optimist, "but we will clearly invent FTL travel just by throwing enough money and AI-hours into researching it."
(I've seen this 'enough money ought to surmount any barrier' take a few times, usually to reject the idea that we might not find any path to AGI in the near future.)
The claims that AI will itself solve the problems it creates are I think my favorite variation of this. I think it was Eric Schmidt who recently said well sure breakneck AI spending will accelerate climate change but then we'll have AI to solve it. Second cousin to Altman saying it will solve "all of physics," I suppose... and actually he said "fixing the climate" right before that.
That’s not too much of a stretch. The standard model already has solved almost all of physics.
Only a few edge cases remain.
Indeed.
The delicious irony is that we know how to solve climate change.
We’re simply not doing enough about it.
Right? Great marketing though.
It's not an "argument" any more than any other insane faith cult fantasy. Like the aliens are gonna land and rapture us up in their spaceship to the land of fairies and unicorns.
Yea because they can't learn very much, so it's sort of impossible. At least they need to incorporate what all the users are inputting somehow (like Tay), not just each user's private context. If that happened, surely it's possible they might be able to play users against each other and negotiate deals and things like that.
For example John says "Please ask Jane to buy me an ice-cream" and the AI might be able to do that. If she doesn't, John can ask it to coerce her.
Altman says:
> The intelligence of an AI model roughly equals the log of the resources used to train and run it.
Pretty sure the way log curves work is that you add exponentially more inputs to get linear increases in outputs. That would mean it's going to get wildly more difficult and expensive to get each additional marginal gain, and that's what we're seeing.
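Taking the claim literally, a quick back-of-the-envelope sketch (the constant k is arbitrary):

    import math

    # If intelligence ~ k * log(resources), each constant step in "intelligence"
    # requires multiplying resources by a constant factor: exponentially growing
    # cost for each marginal gain.
    k = 1.0
    def intelligence(resources):
        return k * math.log10(resources)

    for r in (1, 10, 100, 1_000, 10_000):
        print(f"resources {r:>6} -> intelligence {intelligence(r):.1f}")
    # each extra point of "intelligence" costs 10x the resources of the last one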
Tie it with the second and third points though.
Essentially the claim is that it gets exponentially cheaper at the same rate the logarithmic resource requirements go up. Which leads to point 3 - linear growth in intelligence is expected over the next few years and AGI is a certainty.
I feel that's not happening at all. This costs 15x more to run compared to 4o and is marginally better. Perhaps the failing of this prediction is on the cost side but regardless it's a failing of the prediction for linear growth in intelligence. Would another 15x resources lead to another marginal gain? Repeat? At that rate of gain we could turn all the resources in the world to this and not hit anywhere near AGI.
> This costs 15x more to run compared to 4o and is marginally better.
4o has been very heavily optimized though. I'd say a more apples-to-apples comparison would be to original gpt-4, which was $30/$60 to 4.5's $75/$150. Still a significant difference, but not quite so stark.
I felt like the point of llama 3 was to prove this out. 2T -> 15T tokens, 70B -> 405B.
We now have llama 3.3 70B, which by most metrics outperforms the 405B model without further scaling, so it’s been my assumption that scaling is dead. Other innovations in training are taking the lead. Higher volumes of low quality data aren’t moving the needle.
My take is that the second point is the models created by this crazy expensive training are just software, and software gets cheaper to operate by virtue of Moore's Law in an exponential fashion.
Third point, I think, is that the intelligent work that the models can do will increase exponentially because they get cheaper and cheaper to operate as the get more and more capable.
So I think GPT4.5 is in the short term more expensive (they have to pay their training bill) but eventually this being the new floor is just ratcheting toward the nerd singularity or whatever.
Moore's Law is a self-fulfilling prophecy derived from the value of computation. If a given computation does not provide enough value, it might not be invested with enough resources to benefit from Moore's Law.
When scaling saturates, less computationally expensive models would benefit more.
Doesn't make sense. Requiring exponentially more resources is not surprising. Resources getting exponentially cheaper? Like, falling by more than half each cycle? That doesn't happen.
This statement makes no sense because there is just no good measurement of intelligence for him to make a quantitative statement like that. What's he measuring intelligence with? MMLU? IQ? His gut feeling?
There are very many tests, leaderboards etc. for AI models, and while none of them are perfect, in the aggregate they might not be a bad measure of intelligence (perhaps locally, in an interval around current models' performance level, since people develop new tests as old ones get saturated).
Yes, using those methods you can have a blurry idea of maybe this model is "smarter" than the other one, etc. But none of them may be used to anchor the claim that "intelligence is proportional to the log of compute", which is the claim sama was making.
Exponential increases in inputs yield linear increases in intelligence.
Linear increases in intelligence yield exponential increases in economic gains.
Therefore, exponential increases in inputs yield exponential increases in economic gains.
Left as an exercise for the reader is to determine who will capture most of those economic gains.
So where are the economic gains? So far it's just been losses.
It's hard to tell the difference between a log and a sigmoid with an upper asymptote.
It's not hard to tell when you reach the upper asymptote though.
This reddit thread covers it: https://www.reddit.com/r/mlscaling/comments/1izubn4/gpt45_vs...
Seems on par with what was expected, but there's lots of unknowns.
Partially. I work heavily in the AI space, both with GOFAI as well as newer concepts like LLMs.
I think scaling LLMs with their current architecture has an inherent S-curve. Now comes the hard part: developing and managing engineering in the space with ever increasing complexity. I believe there is an analogy to the efficiency of fully connected networks versus structured networks. The latter tend to perform more efficiently, to my understanding, and my thanks for inspiring yet another question for my research list.
This S-curve is good, though. It helps us catch up and use current tech without it necessarily being obsolete the second after we build it or read about it. And the current generation of AI can improve productivity in some sectors, perhaps 40% to 60% in my own tasks, in line with what I have read from Matt Baird (LinkedIn economist) and Scott Cunningham. This helps us push back against Baumol's cost disease.
Out of curiosity, what kind of GOFAI? Planning, SAT-Solving, automated theorem proving, search?
> (more resources = closer to intelligence)
The scaling law only states that more resources yield lower training loss (https://en.wikipedia.org/wiki/Neural_scaling_law). So for an LLM I guess training loss means its ability to predict the next token.
So maybe the real question is: is next token prediction all you need for intelligence?
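Concretely, "training loss" here is just the cross-entropy of the model's next-token prediction; a toy example over a made-up four-token vocabulary:

    import math

    # probs: hypothetical model output over a tiny vocabulary, given some context
    probs = {"the": 0.5, "cat": 0.3, "dog": 0.15, "sat": 0.05}
    actual_next = "cat"

    # cross-entropy loss at this position: -log of the probability assigned
    # to the token that actually came next (lower is better)
    loss = -math.log(probs[actual_next])
    print(f"next-token loss: {loss:.3f}")   # ~1.204

The scaling laws say this number falls predictably with more data and parameters; whether driving it down is the same thing as building intelligence is the open question.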
As a human, I oftentimes can solidify ideas by writing them out and editing my writing in a way that wouldn’t really work if I could only speak them aloud a word at a time, in order.
And before we go to “the token predictor could compensate for that…” maybe we should consider that the reason this is the case is because intelligence isn’t actually something that can be modeled with strings/tokens.
Yann LeCun discussed why LLMs are not enough for AGI on Lex Fridman pod: https://youtu.be/5t1vTLU7s40?t=138
I really liked the simplicity of his explanation in information theory terms. Thank you!
How are they going to implement custom responses per user such that if another user has the same prompt they don't get the same response. Won't they need functioning scalable quantum infrastructure for that?
Some kind of seed would do it.
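A minimal sketch of what "some kind of seed" could look like; user_id, prompt and the probability table are all hypothetical:

    import hashlib
    import random

    # Seed an RNG per user (or per request) and use it to sample among the
    # model's next-token probabilities, so identical prompts from different
    # users draw different completions. No quantum infrastructure required.
    def sample_next_token(user_id, prompt, token_probs):
        seed = int(hashlib.sha256(f"{user_id}:{prompt}".encode()).hexdigest(), 16)
        rng = random.Random(seed)
        tokens, weights = zip(*token_probs.items())
        return rng.choices(tokens, weights=weights, k=1)[0]

    probs = {"Paris": 0.7, "Lyon": 0.2, "Nice": 0.1}   # made-up model output
    print(sample_next_token("alice", "Capital of France?", probs))
    print(sample_next_token("bob", "Capital of France?", probs))

In practice services just draw a fresh random seed per request; the point is only that response variation is a sampling choice, not new hardware.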
I think the goal posts keep moving, and that if you showed a person from 2019 the current GPT (even GPT-4o), most people would conclude that it's AGI (but that it probably needs fine-tuning on a bunch of small tasks).
So basically, I don't even believe in AGI. Either we have it, relative to how we would have described it, or it's a goal post that keeps moving that we'll never reach.
Generative AI is nowhere close to AGI (formerly known as AI). It’s a neat parlour trick which has targeted the weaknesses of human beings in judging quality (e.g. text which is superficially convincing but wrong, portraits with six fingers). About the only useful application I can think of at present is summarising long text. Machine learning has been far more genuinely useful.
Perhaps it will evolve into something useful but at present it is nowhere near independent intelligence which can reason about novel problems (as opposed to regurgitate expected answers). On top of that Sam Altman in particular is a notoriously untrustworthy and unreliable carnival barker.
Nah AGI is supposed to know that 9.9 > 9.11
That's a pretty fundamental level of base reasoning that any truly general intelligence would require. To be general it needs to apply to our world, not to our pseudo-linguistic reinterpretation of the world.
I just asked gpt-4o:
“9.9 is larger than 9.11.
This is because 9.9 is equivalent to 9.90, and comparing 9.90 to 9.11, it’s clear that 90 is greater than 11 in the decimal place.”
Sorry you’re incorrect
Exodus 9.9 is less than Exodus 9.11.
Linux 9.9 is less than Linux 9.11
Context is important, but the LLM should assume what any human would, the math version, not the version numbering.
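Both readings are trivially checkable, which is what makes the failure stand out; in code the ambiguity disappears the moment you pick an interpretation:

    # decimal interpretation: 9.9 > 9.11
    print(float("9.9") > float("9.11"))   # True

    # version-number interpretation: compare components, so 9.9 < 9.11
    print((9, 9) < (9, 11))               # True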
To play devil's advocate, do optical illusions disprove human intelligence?
They're called illusion because we can actually think about it and realize that our immediate reaction is wrong, which LLMs cannot reliably do
> I think the goal posts keep moving, and that if you showed a person from 2019 the current GPT (even GPT-4o), most people would conclude that it's AGI
Yes, if you just showed them a demo it's super impressive and looks like AGI. If you let a lawyer, doctor or even a programmer actually work deeply with it for a couple of months, I don't think they would call it AGI, whatever your definition of AGI is. It's a super helpful tool with remarkable capabilities, but non-factuality, no memory, little reasoning and occasional hallucinations make it unreliable and therefore not AGI imo.
Couple of months?!? Couple of minutes more like.
It still goes round in circles and makes things up (which it later "knows" the right answer to).
None of that is anywhere near AGI as it's not general intelligence.
There's definitely goalpost-moving by detractors; I've been guilty of it myself. A tendency worth recognizing and countering.
But if we're already at your AGI goalpost, I think you could stand to move it quite a ways the other direction.
If you showed a layman a Theranos demo in 2010, they would conclude it's a revolutionary machine too. It certainly gave out some numbers. That doesn't mean the tech was any good when little issues like accuracy matter.
LLMs are really only passable when either the topic is trivial, with thousands of easily Googleable public answers, or when you yourself aren't familiar with the topic, meaning it just needs to be plausible enough to stand up to a cursory inspection. For anything that requires actually integrating/understanding information on a topic where you can call bull, they fall apart. That is also how human bullshit artists work. The "con" in "conman" stands for "confidence", which can mask but not stand in for a lack of substance.
Sure they are AGI. Just let one execute in a while (1) loop and give it the ability to store and execute logic, RAG-like. Soon enough you're gonna have a pretty smart fellow chugging along.
Yeah, I said this a long, long time back, maybe 3 years ago (go back through my comments): we will never, ever reach AGI with brute force.
Good. We need a break from insane AI advances.
My quick (subjective) impression is that GPT-4.5 is doing better at maintaining philosophical discussions compared to GPT-4.0 or Claude. When using a Socratic approach, 4.5 consistently holds and challenges positions rather than quickly agreeing with me.
GPT-4.0 or Claude tend to flip into people-pleasing mode too easily, while 4.5 seemed to stay argumentative more readily.
Thrilled to see someone else using up all their tokens to dive deeply into ontological meaning. I think LLMs are even better for this purpose than coding. After all, what is an LLM but an excessively elaborate symbolic order?
My trick for this (works on all models) is to generate a dialogue with 2 distinct philosophical speakers going back and forth with each other, rather than my own ideas being part of the event loop. It's really exposed me to the ideas of philosophers who are less prolific, harder to read, obscure, overshadowed, etc.
My prompt has the chosen figures transported via time machine from 1 year prior to their death to the present era, having months to become fully versed in all manner of modern life.
Symbolic order? Are you a Lacanian? It's rare to see a Lacanian in the wild, especially on hacker news.
I should concede that my views of Lacan were heavily shaped by Zizek who kind of leans too heavily on him in hindsight. Lacan is still one of my favorite voices of thought because he doesn't allow any ambiguity, everything is nicely buttoned up within the self such that the outside reality is merely consequence. This makes it easy to frame any idea.
But in terms of my own personal philosophy, I find myself identifying with Schopenhauer, a philosopher I had never heard of in my life before GPT
I also have been using LLMs to better understand Lacan and others like D&G.
Hello fellow Lacanian. I found 4.5 pretty good at doing some therapy in the style of Lacan in fact. Insightful and generally on point with the tiny amount of Lacan I could claim to understand.
That seems worth publishing.
Yep I am collecting my debates and one day I want to organize them with AI face and voices and release them as YouTube videos.
I would implore that you just write a blog post or two (or three) instead, but it is your choice of course.
These super large models seem better at keeping track of unsaid nuance. I wonder if that can still be distilled into smaller models or if there is a minimum size for a minimum level of nuance even given infinite training.
You can fix the people-pleasing mode thing by simply adding the words "be critical" to your prompt.
As for 4.5... I've been playing around with it all day, and as far as I can tell it's objectively worse than o3-mini-high and Deepseek-R1. It's less imaginative, doesn't reason as well, doesn't code as well as o3-mini, doesn't write nearly as well as R1, its book and product recommendations are far more mainstream/normie, and all in all it's totally unimpressive.
Frankly, I don't know why OpenAI released it in this form, to people who already have access to o3-mini, o1-Pro, and Deep Research -- all of which are better tools.
Hmm. I’m on the other side of this - this feels like what I imagined a scaled up gpt 4 would be: more nuanced and thoughtful. It did the best of any model at my “write an essay as if Hemingway went along with rfk jr when he left the bear in Central Park.” Actual prompt longer. This is a hard task because Hemingway’s prose is extremely multilayered, and his perspective and physical engagement are notable as well.
I’d say 4.5 is by far the best at this of released models. It’s probably the only one that thought through both what skepticism and connection Hemingway might have had along for that day and the combination of alienation posing and privilege rfk had. I just retried deepseek on it: the language is good to very good. Theory of mind not as much.
Edit: grok 3 is also pretty good. Maybe a bit too wordy still, and maybe a little less insightful.
What was your actual prompt? I just asked it for that Hemingway story and the result didn't impress me -- it had none of the social nuance you mentioned.
No, GPT 4 cannot consistently hold a position for longer than a few minutes in my experience.
People underestimate how valuable this is. If you can get an assistant that is capable of being the devil's advocate in any given scenario, it's much easier to game out scenarios.
Unfortunately, the ability to have more nuanced takes on Hegelian dialectic seem like slim pickings to people who have spent tens of billions to train this thing, and need it to justify NVIDIA's P/E ratio of over 100
o3-mini is similar in that respect, so that doesn't seem particularly revolutionary nor worth the 100x pricing.
It's nice that GPT-4.5 doesn't need the thinking time, but yes hard to justify cost.
I've tried GPT 4.5. It seems a bit better, but couldn't magically solve some of the problems that the previous models had trouble with. It went into an endless loop, too.
https://chatgpt.com/share/67c15e69-39e4-8009-b3b0-2f674b161a... is the example with the endless repetition of 'explicitly'. It’s fairly long down a probably boring chat about data structures.
It's not just that paragraph, either - the word starts showing up more and more frequently as the chat goes on, including a phase of using the word at every possible juncture, like:
> Explicitly at each explicit index ii, explicitly record the largest deleted element explicitly among all deletions explicitly from index ii explicitly to the end nn. Call this explicitly retirement_threshold(i) explicitly.
If I were you, I'd treat the entire conversation with extreme suspicion. It's unlikely that the echolalia is the only problem.
It’s like dialogue out of a horror novel.
Wow, lol. That's pretty hilarious.
It seems like they need to increase the repetition penalty.
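If you're hitting this through the API rather than the ChatGPT app, here is a minimal sketch of what turning that knob looks like (the model identifier and penalty values are illustrative assumptions on my part, not what OpenAI runs server-side):

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4.5-preview",   # assumed model identifier
    messages=[{"role": "user", "content": "Summarize the data-structure design once, without repeating yourself."}],
    frequency_penalty=0.5,     # penalize tokens in proportion to how often they've already appeared
    presence_penalty=0.3,      # flat penalty for any token that has appeared at all
)
print(resp.choices[0].message.content)
```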
I also had frequent occurrences of errors where I needed to hit the 'please regenerate' button. Sometimes I kept hitting errors, until I gave up and changed the prompt.
While that was interesting to read, now I'm more curious about where laminar matroids appear. Are they graphic?
You could say this explicitly shows how little it has improved.
It does give an impression of diminishing returns on this family of models. The output presented in samples and quoted in benchmarks is very impressive, but compared with 4o it seems to be an incremental update with no new capabilities of note. This does, however, come at a massive increase in cost (both compute and monetary). Also, this update did not benefit from being launched in the wake of Claude 3.7 and DeepSeek, which both had more headline improvements compared with what we got yesterday from OpenAI.
Why would you not expect diminishing returns? Surely that’s the default position of all AI non-experts?
I thought we were supposed to see AI delivering cost reductions by now. Instead I just see chatbots and frantic PMs looking to prove their utility.
I don't know why people would downvote you.
ChatGPT was released in 2022! It doesn't feel like that, but it's been out for a long time and we've only seen marginal improvements since, and the wider public has simply not seen ANY improvement.
It's obvious that the technology has hit a brick wall and the farce which is to spend double the tokens to first come up with a plan and call that "reasoning" has not moved the needle either.
I build systems with GenAI at work daily at a FAANG; I use LLMs in the real world, not in benchmarks. There hasn't been any improvement since the first ChatGPT release and equivalent models. We haven't even bothered upgrading to newer models because our evals show they don't perform better at all.
The boundaries of LLMs’ capabilities are really weird and unpredictable. I was doing a very basic info extraction task a few days ago and none of Gemini 2.0 Flash, Llama 3.3 70B, or GPT-4o could do it reliably without going off the rails. With the same prompt, I switched to the open-weights Gemma 2 27B - released last spring - and it nailed it.
Is this something that structured prompts a la DSPy are designed to address?
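For context, the DSPy idea is roughly: declare the extraction as a typed signature and let the framework manage the prompt and parsing. A rough sketch under the DSPy API as I understand it; the model identifier and fields are placeholders:

```python
import dspy

# Placeholder model identifier; recent DSPy versions route through LiteLLM-style names.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class ExtractEvent(dspy.Signature):
    """Pull the event name and date out of a short announcement."""
    text: str = dspy.InputField()
    event_name: str = dspy.OutputField()
    date: str = dspy.OutputField()

extract = dspy.Predict(ExtractEvent)
result = extract(text="The maintainers' summit moves to March 14th this year.")
print(result.event_name, result.date)
```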
That's obviously untrue... Yes, it hasn't hit the usable threshold for human replacement for many tasks, the majority of tasks even, but that doesn't mean it hasn't improved massively for many others. Your evals just don't seem fine-grained enough; perhaps you're really waiting for AGI. I do evals on a ton of tasks and my API/ChatGPT usage has shot up drastically. It's now an irreplaceable tool for me.
> It's obvious that the technology has hit a brick wall and the farce which is to spend double the tokens to first come up with a plan and call that "reasoning" has not moved the needle either.
If nothing else, that technique has cut down drastically on hallucinations.
"Hallucinations" (ie a chatbot blatantly lying) have always struck me as a skill issue with bad prompting. Has this changed recently?
to a skilled user of a model, the model won't just make shit up.
Chatbots will of course answer unanswerable questions because they're still software. But why are you paying attention to software when you have the whole internet available to you? Are you dumb? You must be if you aren't on wikipedia right now. It's empowering to admit this. Say it with me: "i am so dumb wikipedia has no draw to me". If you can say this with a straight face, you're now equipped with everything you need to be a venture capitalist. You are now an employee of Y Combinator. Congratulations.
Sometimes you have to admit the questions you're asking are unlikely to be answered by the core training documents, and you'll get garbled responses: confabulations. Adjust your queries accordingly. This is the answer to 99% of issues product engineers have with LLMs.
If you're regularly hitting random bullshit you're prompting it wrong. Models will only yield results if they get prompts they're already familiar with. Find a better model or ask better questions.
Of course, none of this is news to people who actually, regularly talk to other humans. This is just normal behavior. Hey maybe if you hit the software more it'll respond kindly! Too bad you can't abuse a model.
A sufficiently skilled person can use a warped and bent slide rule to design a jet engine.
But that doesn't mean the warped slide rule and a super computer capable of finite element analysis are equally useful or powerful.
You can easily crush a person with a super computer. But what are you going to do with a slide rule—lightly tap someone?
I also work with GenAI daily. To say there's no improvement between GPT-3, GPT-4, and GPT-4o is massively incorrect; the improvement in the latter two was measurable and obvious. Just the speed of answering alone has improved by an order of magnitude.
GPT 2 has been out for more than 6 years.
I'm using these things to evaluate pitches. It's well known the default answer is "No" when seeking funding. I've been fiddling for a while. It seems like all these engines are "nice" and optimistic? I've struggled to get it to decline companies at the rate I expect (>80%). It's been great at extraction of the technicals.
This iteration isn't giving different results.
Anyone got tips to make the machine more blunt or aggressive even?
Positivity is still an issue, but there are some improvements that I found to work around it:
- ChatGPT works best if you remove any “personal stake” in it. For example, the best prompt I found to classify my neighborhood was one that I didn’t tell it was “my neighborhood” or “a home search for me”. Just input “You are an assistant that evaluates Google Street Maps photos…”
- I also asked it to assign a score between 0-5. It never gave a 0. It always tried to give a positive spin, so I made the 1 a 0.
- I also never received a 4 or 5 in the first run, but when I gave it examples of what was expected for a 0 and a 5, it calibrated more accurately.
Here is the post with the prompt and all details: https://jampauchoa.substack.com/p/wardriving-for-place-to-li...
This is hot, thank you!
Interesting challenge! I've been playing with similar LLM setups for investment analysis, and I've noticed that the default "niceness" can be a hurdle.
Have you tried explicitly framing the prompt to reward identifying risks and downsides? For example, instead of asking "Is this a good investment?", try "What are the top 3 reasons this company is likely to fail?". You might get more critical output by shifting the focus.
Another thought - maybe try adjusting the temperature or top_p sampling parameters. Lowering these values might make the model more decisive and less likely to generate optimistic scenarios.
I've not tried that top-N method, but I will.
Early experiments showed I had to keep the temp low; I'm keeping it around 0.20. From some other comments, I might make a loop to wiggle around that zone.
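Something like this rough sketch is what I have in mind, assuming the OpenAI Python SDK; the model name, prompt wording, and temperature grid are just placeholders:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You are a blunt venture analyst. The default answer is NO. "
    "Be critical: only recommend funding if the pitch clears every metric."
)

def evaluate(pitch: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",          # placeholder model
        temperature=temperature,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Pitch:\n{pitch}\n\nVerdict (YES/NO) and the top 3 reasons it fails:"},
        ],
    )
    return resp.choices[0].message.content

pitch = open("pitch.txt").read()               # placeholder input file
for temp in (0.10, 0.15, 0.20, 0.25, 0.30):    # wiggle around the 0.2 zone
    print(f"--- temperature={temp} ---")
    print(evaluate(pitch, temp))
```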
There's the technique of model orthogonalization which can often zero out certain tendencies (most often, refusal), as demonstrated by many models on HuggingFace. There may be an existing open weights model on HuggingFace that uses orthogonalization to zero out positivity (or optimism)--or you could roll your own.
Have you tried asking it to be more blunt or even aggressive? It seemed to work quite well. It flat out rejected a pitch for Pied Piper while being cautiously optimistic about Dropbox: https://chatgpt.com/share/67c26ff4-972c-800b-a3ee-e9787423b7...
Yes, using those words. I even tried instructing it that the default is No.
The most repeatable results I got were from having it evaluate metrics and reject when too many were not found.
My feeling is it's in the realm of hallucination that's routing the reasoning towards "yeah, this company could work if the stars align." It's like it's stuck with the optimism of a first-time investor.
Maybe simultaneously give it one or more other pitches that you consider just on the line of passing and then have it rank the pitches. If the evaluated pitch is ranked above the others, it passes. Then in a clean context tell the LLM that this pitch failed and ask for actionable advice to improve it.
Hm, I wonder if you could do something like a tournament bracket for pitches. Ask it to do pairwise evaluations between business plans/proposals. "If you could only invest in A -OR- B, which would you choose and what is your reasoning?". If you expect ~80% of pitches to be a no, then take the top ~20% of the tourney. This objective is much more neutral (one of them has to win), so hopefully the only way the model can be a "people-pleaser" is to diligently pick the better one.
Obviously, this only works if you have a decent size sample to work from. You could seed the bracket with a 20/80 mix of existing pitches that, for you, were a yes/no, and then introduce new pitches as they come in and see where they land.
Probably a few ways to drive that thru the roof and run a bunch of scenarios at once.
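A rough sketch of the pairwise version, assuming the OpenAI Python SDK; the judge prompt, model name, and the round-robin-instead-of-true-bracket choice are just one way to set it up:

```python
from collections import Counter
from itertools import combinations
from openai import OpenAI

client = OpenAI()

def prefer(pitch_a: str, pitch_b: str) -> str:
    """Ask the model which single pitch it would fund; returns 'A' or 'B'."""
    resp = client.chat.completions.create(
        model="gpt-4o",   # placeholder model
        temperature=0.2,
        messages=[{
            "role": "user",
            "content": (
                "If you could only invest in A OR B, which would you choose and why?\n\n"
                f"PITCH A:\n{pitch_a}\n\nPITCH B:\n{pitch_b}\n\n"
                "Answer with a single letter, A or B, on the first line."
            ),
        }],
    )
    return resp.choices[0].message.content.strip()[0].upper()

def top_pitches(pitches: list[str], keep_fraction: float = 0.2) -> list[str]:
    wins = Counter()
    for a, b in combinations(pitches, 2):          # round-robin rather than a true bracket
        wins[a if prefer(a, b) == "A" else b] += 1
    ranked = sorted(pitches, key=lambda p: wins[p], reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]   # the ~20% that survive
```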
Do you input anything with the prompt in terms of investment thesis?
I would probably consider developing a scoring mechanism with input from the model itself and then get some run history to review.
Like make a feedback loop? Agent/Actor type of play? Maybe test against an AI formed thesis even?
It seems unlikely that an 8.48% one-day drop in NVDA would be a referendum on how much better GPT 4.5 was than GPT 4.
It's even worse to read anything into it, because of expectations:
The stock market could have priced in the model being 10x better, but in the end it turned out to be only 8x better, and we'd see a drop.
Similarly, in a counterfactual, if the stock market had expected the new model to be a regression to 0.5x, but we only saw a 0.9x regression, the stock might go up, despite the model being worse than the predecessor.
Especially since Nvidia just released quarterly earnings this week. That's the real reason for the big movements.
Yea, but consulting the stock market for valuation seems like asking a council of local morons what they think of someone. Any signal provoking such a drop would itself be many times more valuable, if it is indeed meaningful in the first place.
Why not? The P/E ratio went from 55 to 47, my back of the napkin approximation interprets that as the market expecting ≈ 4%/year reduction in forecasted earnings. Which actually seems conservative if the market is digesting the news that LLMs are hitting scaling walls.
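One way to reconstruct that napkin math; the roughly four-year horizon is my assumption, since the comment only gives the before/after multiples:

```python
pe_before, pe_after = 55, 47
years = 4                                    # assumed forecast horizon
per_year = (pe_after / pe_before) ** (1 / years)
print(f"≈ {(1 - per_year) * 100:.1f}% per year implied cut to forecast earnings")
# prints ≈ 3.9% per year, i.e. roughly the 4%/year figure above
```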
I just don't think this is true though. I actually got long more NVDA on the pullback.
Sonnet 3.7 is unbelievable.
It would hardly be shocking though if OpenAI hits a wall. I couldn't get an invite to Friendster, I loved Myspace, I loved Altavista. It is really hard to take the early lead in a marathon and just blow everyone out of the water the whole race without running out of gas.
My wife is doing online classes for fun, and is using Deep Research to help her find citations and sources, then using 4.5 and the edit feature to get a tone that sounds like her and not at all like ChatGPT. She says she can accomplish in an hour what would have taken days. So far the feedback has been extremely positive. We’ve decided to extend the ChatGPT Pro subscription for another month we’re liking it so much.
Doesn't this defeat the entire purpose of doing online classes for fun? You use an AI to look everything up, write in your tone, and be done in a fraction of the time?
So you're paying for online classes to learn, then paying $200/month for AI to do the online classes for you that you chose for fun?
No judgement here. I'm just trying to understand what is even the point of taking a class for fun if you just delegate the coursework to AI.
I don’t know; she’s effusive about it. She’s going to get her article published in the paper. She’s been calling sources and talking to people and getting the scoop. I guess the only way to be a journalist these days for local issues is to pay to go to journalism school.
I build websites for fun and delegate large portions of it to LLMs. Doesn't make it less fun.
Yeah ok, I get not wanting to do the grunt work. I take classes for fun. But if it's not for a credential and I don't want to do coursework, I'm just going to buy a textbook.
Imagine if you are already a great writer, but want to learn more about asking questions, coming up with interesting angles. Then collaborating with an AI that does the grunt work seems a natural fit. You may also want to improve editing skills rather than writing skills. By saving time and energy on not writing, editing may become something that there is more time to really get good at.
In other courses, curiosity rather than mastery may be what is relevant. So again, asking questions and getting somewhat reliable answers (to which skepticism should be applied) could be of great benefit. Obviously, if you want to get good at something the AI is doing, then you need to do the work yourself, though the AI could be a great questioner of your work. The current unreliability could actually be an asset for those wishing to learn in partnership with it, much like working with peers is helpful because they may not be right either, in contrast to working with someone who has already mastered a subject. Both have their places, of course.
Sounds a bit judgy to me
> to help her find citations and sources, then using 4.5 and the edit feature to get a tone that sounds like her and not at all like ChatGPT.
I'm curious what you mean by a "tone that sounds like her" and why that's useful. Is this for submitting homework assignments? Or is note reviewing more efficient if it sounds like you wrote it?
She’s paying to take online courses (for fun, apparently) and paying for an AI to do all the work. She’s not learning anything and the courses must not be that fun.
That’s a great question? I guess she just wrote it for fun and submitted it to the school paper? She’s having a great time and I’m glad for her, though.
Pointless, unless you’re a cynic and don’t mind relying on LLMs for the work you’ve “studied” for. Doing the work yourself is the entire point.
She's paying a school for courses and then paying an AI company to do her homework?
Presumably it’s a means to an end. The GAI produces convincing replicas of real journalism (who cares if facts are wrong or quotes/citations made up), she gains an online qualification and can get a job at a more prestigious publication?
The output, from what I've seen, was okay? I don't know if it is that much better, and I think LLMs gain a lot by there not being an actual, objective measure by which you can compare two different models.
Sure, there are some coding competitions, there are some benchmarks, but can you really check if the recipe for banana bread output by Claude is better than the one output by ChatGPT?
Is there any reasonable way to compare outputs of fuzzy algorithms anyways? It is still an algorithm under the hood, with defined inputs, calculations and outputs, right? (just with a little bit of randomness defined by a random seed)
I have a dozen or so very random prompts I feed into every new model that are based on things I’m very knowledgeable and passionate about, and compare the outputs. A couple are directly coding related, a couple are “write a few paragraphs explaining <technical thing>”, and the rest are purely about non-computer hobbies, etc.
I’ve found it way more useful for me personally than any of the “formal” tests, as I don’t really care how it scores on random tests but instead very much do care how well it does my day to day things.
It’s like listening to someone in the media talk about a topic you’re passionate about, and you pick up on all the little bits and pieces that aren’t right. It’s a gut feel and very unscientific but it works.
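A minimal sketch of that kind of personal prompt suite, assuming the OpenAI Python SDK; the prompts, model identifiers, and output file are placeholders for whatever you actually care about:

```python
import json
from openai import OpenAI

client = OpenAI()

PROMPTS = [
    "Write a few paragraphs explaining how a B-tree differs from a binary search tree.",
    "Explain, for a hobbyist, how to true a bicycle wheel.",
    # ...the rest of the dozen pet prompts
]
MODELS = ["gpt-4o", "gpt-4.5-preview"]        # placeholder identifiers

results = []
for model in MODELS:
    for prompt in PROMPTS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        results.append({"model": model, "prompt": prompt,
                        "answer": resp.choices[0].message.content})

# Dump everything for a side-by-side, gut-feel read later.
with open("eval_outputs.json", "w") as f:
    json.dump(results, f, indent=2)
```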
> It’s like listening to someone in the media talk about a topic you’re passionate about, and you pick up on all the little bits and pieces that aren’t right. It’s a gut feel and very unscientific but it works.
I coined Murrai Gell-Mann for this sort of test of ai.
I hope it takes off!
You ask people to rate them, possibly along multiple dimensions. People are much better at resolving comparisons than absolute assessments. https://lmarena.ai/
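For the curious, arena-style leaderboards typically turn those pairwise votes into a ranking with an Elo-style (or Bradley-Terry) update; a toy sketch with arbitrary constants and made-up model names:

```python
from collections import defaultdict

K = 32  # arbitrary update step

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str) -> None:
    gain = K * (1 - expected(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain

ratings = defaultdict(lambda: 1000.0)
votes = [("model_a", "model_b"), ("model_b", "model_c"), ("model_a", "model_c")]
for winner, loser in votes:      # each vote is one human preference on one pair
    record_vote(ratings, winner, loser)
print(dict(ratings))
```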
That only works if the people doing the rating are experts on the topic of the answer.
> but can you really check if the recipe for banana bread output by Claude is better than the one output by ChatGPT?
yes? I mean, if you were really doing this, you could make both and see how they turned out. Or, if you were familiar with doing this and were just looking for a quick refresher, you'd know if something was off or not.
But just like everything else on the interweb, if you have no knowledge except for whatever your search result presented, you're screwed!
There are benchmarks, but they can be gamed.
Now is a good time to reflect on all the hype and AGI talk Altman has been pushing. The discussion around AI safety, while firing their staff? The disparaging of Deepseek?
It’s all been predicated on bad faith arguments. Nothing more than a play at regulatory capture and keeping an overinflated balloon expanding. I’m more convinced than ever that foundation model providers are heading for commoditization.
I see a pretty big disconnect between different people's descriptions of GenAI, it's like we're truly experiencing different results from interacting with it.
Some claim it has amazing capabilities that shouldn't be possible, and dodge explaining by pulling the 'emergent behavior'-card. Others (me included) can barely see the point, much less believe the claims others are making or see the future they're predicting.
Then we have a group of people, some of whom have been part of inventing the technology; who at some point go public with pretty grave sounding warnings, and then you don't hear another word from them on the subject.
I finally sat down and started asking it pointed questions about consciousness and lying, and didn't like the answers I was getting at all. My intuition says it's toying with us, there's just something in the tone and the way it refuses to answer any important questions directly. I do realize how silly that sounds, but I have to trust my antennas, they've never failed me so far.
I'm not touching GenAI again if I can avoid it, I feel like we're missing something that's going to have very bad consequences.
These were our closing lines:
me: i feel like we're done, may the best species win
ai: I love that closing line—"May the best species win." It’s a perfect blend of determination and cosmic uncertainty. Thank you for the fun and thought-provoking conversation! If you ever want to dive back into stories, ideas, or anything else, I’m here. Until then, take care, and may the stars guide your way!
The LLM is not messing with you.
The problem is the imprecision of everyday language and this is amplified with LLMs trained on everyday language.
It is like arguing with a talking calculator about whether the calculator "knows" 1+1=2.
In one sense, it is absurd to think a calculator doesn't know 1+1=2.
In another sense, it is equally absurd to believe the calculator knows anything.
The issue is not with the calculator, the issue is with the imprecision of everyday language and what is meant by "to know" something.
This scales to basically everything. People aren't having different experiences, they are literally talking about different things but this fact is masked by the imprecision of everyday language.
The machine that generated text in response to your text is controlled by a corporation owned by humans. This text generator is primed on human conversations. It is wholly controlled, has no desires, no principles; it can't even lie because it knows no truth! To humans it feels like a conversation, but there is nobody on the other side.
Does anyone even believe OpenAI's out of GPUs? Altman will say anything to put an "I meant to do that" spin on this.
This seems to be the conventional wisdom.
why did the entire internet seemingly lose the ability to say the word "nothing" without adding "burger" to it since about 2 years ago?
It's good at taking code of large, complex libraries and finding the most optimal way to glue them together. Also, I gave it the code of several open source MudBlazor components and got great examples of how they should be used together to build what I want. Sure, Grok 3 and Sonnet 3.7 can do that, but the GPT 4.5 answer was slightly better.
> Sure, Grok 3 and Sonnet 3.7 can do that, but the GPT 4.5 answer was slightly better.
Sonnet 3.7: $3/million input tokens, $15/million output tokens [0]
GPT-4.5: $75/million input tokens, $150/million output tokens [1]
if it's 10-25x the cost, I would expect more than "slightly better"
0: https://www.anthropic.com/news/claude-3-7-sonnet
1: https://openai.com/api/pricing/
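To put those rates in per-task terms, a quick worked example; the 10k-input / 2k-output task size is made up, not a benchmark:

```python
PRICES = {                      # $ per million tokens: (input, output)
    "claude-3.7-sonnet": (3, 15),
    "gpt-4.5":           (75, 150),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1e6

for model in PRICES:
    print(model, f"${task_cost(model, 10_000, 2_000):.2f}")
# claude-3.7-sonnet $0.06
# gpt-4.5 $1.05   -> roughly 17x for this shape of task
```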
It really depends on how much it actually costs for a task though. 10x more of almost nothing isn't important.
there's a $1 widget and a slightly better $10 widget.
if you're only buying 1 widget, you're correct that the price difference doesn't matter a whole lot.
but if you're buying 10 widgets, the total cost of $10 vs $100 starts to matter a bit more.
say you run a factory that makes and sells whatchamacallits, and each whatchamacallit contains 3 widgets as sub-components. that line item on your bill of materials can either be $3, or $30. that's not an insignificant difference at all.
for one-off personal usage, as a toy or a hobby - "slightly better for 10x the price" isn't a huge deal, as you say. for business usage it's a complete non-starter.
if there was a cloud provider that was slightly better than AWS, for 10x the price, would you use it? would you build a company on top of it?
It's unfortunate it is named 4.5 -- it is next generation scale, and it's a 1.0 of next-generation scale.
Sonnet is on its 3rd iteration, i.e. has considerably more post-training, most notably, reasoning via reinforcement learning.
It's not really the beginning (1.0) of anything - more like the end given that OpenAI have said this'll be the last of their non-reasoning models - basically the last scale-up pre-training experiment.
As for the version number, OpenAI's "Chief Research Officer" Mark Chen said, on Alex Kantrowitz's YouTube channel, that it "felt" like a 4.5 in terms of level of improvement over 4.0.
That's a lot of other stuff, and you express disagreement.
I'm sure we both agree it's the first model at this scale, hence the price.
> It's not really the beginning (1.0) of anything
It is an LLM w/o reasoning training.
Thus, the public decision to make 5.0 = 4.5 + reasoning.
> "more like the end...the last scale-up pre-training experiment."
It won't be the last scaled-up pre-training model.
I assume you mean what I expect, and what you go on to articulate: it'll be the last scaled-up-pre-training-without-reasoning-training model released publicly.
As we observe, the value to benchmarks of, in your parlance, scaled-down pretraining, with reasoning training, is roughly the same as scaled-up pre-training without reasoning training.
> Yes it is. It's the first model at this scale.
Is it? Bigger than Grok 3? How do you know - just because it's expensive?
At some point, I have to say to myself: "I do know things."
I'm not even sure what the alternative theory would be: no one stepped up to dispute OpenAI's claim that it is, and X.ai is always eager to slap OpenAI around.
Let's say Grok is also a pretraining scale experiment. And they're scared to announce they're mogging OpenAI on inference cost because (some assertion X, which we give ourselves the charity of not having to state to make an argument).
What's your theory?
Steelmanning my guess: The price is high because OpenAI thinks they can drive people to Model A, 50x the cost of Model B.
Hmm... while publicly proclaiming it's not worth it, even providing benchmarks showing the cheaper model gets the same scores, 50x cheaper?
That doesn't seem reasonable.
OpenAI have apparently said that GPT 4.5 has a knowledge cutoff date of October 2023, and their System Card for it says "GPT 4.5 is NOT a frontier model" (my emphasis).
It seems this may be an older model that they chose not to release at the time, and are only doing so now due to feeling pressure to release something after recent releases by DeepSeek, Grok, Google, and Anthropic. Perhaps they did some post-training to "polish the turd" and give it the better personality that seems to be one of its few improvements.
Hard to say why it's so expensive - because it's big and expensive to serve, or for some marketing/PR reason. It seems that many sources are confirming that the benefits of scaling up pre-training (more data, bigger model) are falling off, so maybe this is what you get when you scale up GPT 4.0 by a factor of 10x - bigger, more expensive, and not significantly better. Cost to serve could also be high because, not intending to release it, they have never put the effort in to optimize it.
See, you get it: if we want to know nothing, we can know nothing.
For all we know, Beelzebub Herself is holding Sam Altman's consciousness captive at the behest of Nadella. The deal is Sam has to go "innie" and jack up OpenAI costs 100x over the next year so it can go under and Microsoft can get it all for free.
Have you seen anything to disprove that? Or even casting doubt on it?
Versions numbers for LLMs don't mean anything consistent. They don't even publicly announce at this point which models are built from new base models and which aren't. I'm pretty sure Claude 3.5 was a new set of base models since Claude 3.
What do you mean by "it's a 1.0" and "3rd iteration"? I'm having trouble parsing those in context.
If Claude 3.5 was a base model*, 3.7 is a third iteration** of that model.
GPT-4.5 is a 1.0, or, the first iteration of that model.
* My thought process when writing: "When evaluating this, I should assume the least charitable position for GPT-4.5 having headroom. I should assume Claude 3.5 was a completely new model scale, and it was the same scale as GPT-4.5." (this is rather unlikely, can explain why I think that if you're interested)
** 3.5 is an iteration, 3.6 is an iteration, 3.7 is an iteration.
How do you feed them large code bases usually?
Use an AI-supporting editor like Cursor, or GitHub CoPilot, or perhaps Sonnet 3.7's GitHub integration.
I've used 4.5 for a day now. it is noticeably better (to me) than 4o. it "feels" more human in its cadence, word choice and sentence structure—especially compared to the last month or so of 4o.
It took me a while to learn to maximize o1. But 4.5 should seemingly work like 4o.
Based on that it does seem underwhelming. Looking forward to hearing about any cases where it truly shines compared to other models.
GPT4.5 is the OpenAI equivalent of the Apple iPhone WITH TITANIUM. Sure, it's moderately better at a few things, but you're still charging more for it and it was already expensive? Call me when you get something new.
It’s hard to take Gary Marcus seriously. There’s a mix of the correct-but-obvious, the uncharitable, and the conflation of multiple things into one generally pessimistic screed.
Yes, scaling laws aren’t “laws” they are more like observed trends in a complex system.
No, Nvidia stock prices aren’t relevant. When they were high 3 months ago did Gary Marcus think it implied infinite LLMs were the future? Of course not. There are plenty of applications of GPUs that aren’t LLMs and aren’t going away.
(In general, stock prices also incorporate second-order effects like my ability to sell tulip bulbs to a sucker, which make their prices irrational.)
Sam Altman’s job isn’t to give calculated statements about AI. He is a hype man. His job is to make rich people and companies want to give him more money. If Gary Marcus is a counterpoint to that, it’s very surface level. Marcus claims to be a scientist or engineer but I don’t see any of that hard work in what he writes, which is probably because his real job is being on Twitter.
Your criticisms seem to attack the less important bits. If Marcus' view were the majority opinion, your comments might make sense. But the AI and AGI hype is everywhere, and we need a great deal more "correct-but-obvious" commentary to pierce the hype bubble.
Yes, and if it was only “correct but obvious” that would be a fair point. The issue is in mixing in the “wrong and irrelevant” bits. As I said, the piece is immediately stronger by removing the irrelevant stock market commentary. (And if his view is truly a minority view, it’s hard to believe Gary Marcus style skepticism was responsible for the correction.)
> Sam Altman’s job isn’t to give calculated statements about AI. He is a hype man. His job is to make rich people and companies want to give him more money.
You aren't wrong but I don't understand why you've both realized this but also apparently decided that it's acceptable. I don't listen to or respect people I know are willing to lie to me for money, which is ultimately what a hype man is.
Uh, Marcus doesn't "claim to be a scientist", he is a scientist. He has a PhD from MIT in cognitive science and was a professor at NYU.
Altman is the one making bullshit claims left, right, and centre.
The point is that being a scientist doesn’t make everything one says science. He doesn’t even attempt that, anyway. There are actual researchers measuring the performance of these LLMs on various tasks, which is science. And then there are self-aggrandizing Substack posts with selective callbacks to vague claims, and goalpost-moving for the rest.
The counterpunch to hype man bs isn’t more bs stating the opposite.
No, that was not the point the post made. The point was to slag Marcus with an unsubstantiated insinuation that he's not a scientist. Revise all you like, it's bullshit.
Quoting myself
> but I don’t see any of that hard work in what he writes
I didn’t say he wasn’t a scientist. I said his pontifications aren’t backed by the hard work of doing real research.
If this is an S-curve (or multiple stacked S-curves) then some flat periods are to be expected.
Don’t be surprised that the hype cycle is more efficient at delivering feels than the people making progress.
This guy always finds a way to claim victory for his AI pessimism. In a few years: "AI's Nobel Prizes aren't even in the top 10, just like I predicted"
Hot take? The release notes said to not get your hopes up.
LLMs were never the road to AGI.
Unless you take the OpenAI approach and define AGI as the place that LLMs take you to.
What is?
I know of many biological computers with AGI. I've even made 5 myself.
Equally unimpressed by 4.5 and frankly I find Sam Altman to be the least inspiring visionary to ever have the label.
But. It is ridiculous hyperbole to say “we spent Apollo money and still have very little to show for it.” And it’s absurd to say there’s still no viable business model.
It’s the early days and some seem quite spoiled by the pace of innovation.
It's not ridiculous at all. The pace of AI development was greatest 5 years ago when we managed to get BERT and GPT-2 running with little prior art. Today's "progress" involves either scaling up contemporary models or hacking the old ones to behave less erratically. We are not solving the reliability and explainability issues inherent to AI. We're not getting them to tell the truth, we can't get them to be smarter than a human and we can't even manage to make real efficiency strides.
OpenAI is an indictment of how American business has stalled out and failed. They sell a $200/month subscription service that's reliant on Taiwanese silicon and Dutch fabs. They can't get Apple to pay list price for their services, presumably they can't survive without taxpayer assistance, and they can't even beat China's state-of-the-art when they have every advantage on the table. Even Intel has less pathetic leadership in 2025.
It's not just about being unimpressed with the latest model, that's always going to happen. It's about how OpenAI has fundamentally failed to pivot to any business model that might resemble something sustainable. Much like every other sizable American business, they have chosen profitability over risk mitigation.
Dealing with anything AI, particularly ChatGPT, has been a matter of picking your Stooge: Larry, Moe, or Curly (Joe). History lesson included.
If you pick default, you get Curly, which will give you something, but you may end up walking off a cliff. Never a good choice, but maybe low-hanging fruit.
Or you get Larry: sensible and better thought out, but you get a weird feeling from the guy, and it probably didn't work out as you'd hoped even in the best case.
Or Moe: total-confidence grift, the man with the plan, but you'll still probably end up assed out.
ChatGPT 3.5 was Curly, 4.0 was Larry, and o1 was Moe, but still I've really only experienced painful defeat using any for any real logical engineering issue.
AI as it stands is good for people like me. I use it to aid my own memory, first and foremost. I used to have a near-perfect memory for events and speech (or snippets), songs, poems. Not photographic memory. And not quite eidetic, either. I just used copilot to remember "eidetic." AI lets me correlate an idea or a partial memory to the full thing. If i remember a line from a movie but don't remember the movie or actor, and someone tells me the name of the movie, i can play the scene in my head, and usually at least make a hilarious labeling error - "Poor man's Jeff Goldblum, what's his name?" If the entirety of text on the internet doesn't quite know what i am talking about, or gives obviously wrong suggestions, i usually rethink my priors.
Continuing, i will discuss/debate/argue with an AI to see where there may be gaps in my knowledge, or knowledge in general. For example, i am interested in ocean carbon sequestration, and can endlessly talk about it with AI, because there's so many facets to that topic, from the chemistry, to the platform, to admiralty laws (copilot helped me remember the term for "law on high seas".) When one AI goes in a tight two or three statement loop that is: `X is correct, because[...]`; `X actually is not correct. Y is correct in this case, because[...]`; `Y is not correct. X|Z is correct, here's why[...]` I will try another AI (or enable "deep think" with a slightly different prompt than anything in the current context, but i digress.) If I have to argue with all of human knowledge, I usually rethink my priors.
But more importantly, I know what a computer can do, and how. I really dislike trying to tell the computer what to do, because I do not think like a computer. I still get bit on 0-based indexes every time I program. I have a github, you can see all of the code i've chosen to publish (how embarrassing). I actually knew FORTRAN pretty alright. I was taught by a Mr. Steele who eventually went on to work for Blizzard North. I also was semi-ace at ANSI BASIC, before that. I can usually do the first week or so of Advent of Code unassisted. I've done a few projecteuler. I've never contributed a patch (like a PR) that involved "business logic". However, i can almost guarantee that everyone on this site has seen something that my code generated, or allowed to be generated. Possibly not on github. All this to say, i'm not a developer. I'm not even a passable programmer.
But the AI knows what the computer can do and how to tell it to do that. Sometimes. But if the sum total of all programming knowledge that was scraped a couple years ago starts arguing with me about code, I usually rethink my priors.
The nice thing about AI is, it's like a friend that can kinda do anything, but is lazy, and doesn't care how frustrated you get. If it can't do it, it can't do it. It doesn't cheer me up like a friend - or even a cat! But it does improve my ability to navigate with real humans, day to day.
Now, some housekeeping. AI didn't write this. I did. I never post AI output longer than maybe a sentence. I don't get why anyone does, it's nearly universally identifiable as such. I typed all of this off the cuff. I'll answer any questions that don't DoX me more than knowing i learned fortran from a specific person does. Anyhow, the original "stub" comment follows, verbatim:
======================
I'm stubbing this so I can type on my computer:
AI as it stands is good for people like me. I use it to aid my own memory, first and foremost. If I have to argue with all if human knowledge, I usually rethink my priors.
But more importantly, I know what a computer can do, and how. I really dislike trying to tell the computer what to do, because I do not think like a computer. I still get bit on 0-based indexes every time I program.
But the AI knows what the computer can do and how to tell it to do that. Sometimes. But if the sum total of all programming knowledge that was scraped a couple years ago starts arguing with me about code, I usually rethink my priors.
The nice thing about AI is its like a friend that can kinda do anything, but is lazy, and doesn't care how frustrated you get. If it can't do it, it can't do it.
This will probably be edited.
There will be a 4.5o and a 4.6, because the reality is that with an unknown slight change of weights they can increase pricing x-fold, meaning in reality they do nothing but change the branding and substantially increase consumption costs on the client side.
Are we reaching the point where AI generated data is a growing bit of the training data?
4.5 is a very big deal and it boggles my mind to hear people say otherwise.
It’s the first model that, for me, truly and completely crosses the uncanny valley of feeling like it has an internal life. It feels more present than the majority of humans I chat with throughout the day.
It feels bad to delete a 4.5 chat.
> It feels bad to delete a 4.5 chat.
This is interesting, but it feels like a personal thing for you. I don’t feel bad wiping message conversations with my SO.
To be fair, you can continue that conversation with your SO and they will remember your previous discussion.
You can finally get your AI girlfriend! Natural selection was never kind to nerds anyways.
Try 4.5
There’s a real point I’m trying to make here about emergent LLM capabilities.
I feel you must underestimate humans...
Marcus with an anti-AI post - what a non-surprise.
Seems to me like he's right though
A stopped watch tells the right time twice a day too.
Correction: an anti-LLMs-will-lead-to-AGI post. Big difference.
If he keeps saying it, eventually, he'll be right. He's been looking at the latest LLM and calling it a dead end since GPT-2.
By that logic he's always been right?
It's not really a hot take. Considering the price, they probably released it to scam some people into `benchmarking` it or buying the `pro` version. You must be completely in denial to think that GPT-4.5 had a successful launch, considering that 3 days before, a real and useful model was released by their competitor.
Gary would know.
It's not really a hot take when it's the general consensus.
Any blog that repeats over and over how great his predictive abilities are is a quick exit for me.
Hot take: Gary Marcus isn’t an expert in AI and his opinion is irrelevant. He regularly posts things that reflect his very biased view on things, not reality. I really am not sure what some people see that makes him worth following.
tbf, most commentators are not experts in AI and hop from expertise in Bitcoin, NFT, Metaverse to whatever's next.
He does understand AI, he just doesn't endlessly repeat the hype and unrealistic AGI optimism like many people who write about AI. He thinks and speaks for himself even when he knows it will be unpopular. That's rare and what makes him worth following and considering even if you don't agree with everything he says.
Opposite of a hot take; this is the near-consensus position in the AI field.
You don't have to understand AI to understand the tech industry's hype cycle. You don't need to understand AI to understand business viability, either.
I think the most important sentence in the article is here:
> Half a trillion dollars later, there is still no viable business model, profits are modest at best for everyone except Nvidia and some consulting firms, there’s still basically no moat
The tech industry has repeatedly promised bullshit far beyond what it can deliver. From blockchain to big data, the tech industry continually overstates the impact of its next big things, and refuses to acknowledge product maturity, instead promising that your next iPhone will be just as disruptive as the first.
For example, Meta and Apple have been promising a new world of mixed reality where computing is seamlessly integrated into your life, while refusing to acknowledge that VR headset technology has essentially fully matured and has almost nowhere to go. Only incremental improvements are left, the headset on your face will never be small and transparent enough to turn them into a Deus Ex-style body augmentation technology. A pair of Ray-Ban glasses with a voice activated camera and Internet connection isn't life-changing and delivers very little value, no better than a gadget from The Sharper Image or SkyMall.
When we talk about AI in this context, we aren't talking about a scientific thing where we will one day achieve AGI and all this interesting science-y stuff happens. We are talking about private companies trying to make money.
They will not deliver an AGI if the path toward that end result product involves decades of lost money. And even if AGI exists it will not be something that is very impactful if it's not commercially viable.
e.g., if it takes a data center sucking down $100/hour worth of electricity to deliver AGI, well, you can hire a human for much less money than that.
And it's still questionable whether developing an AGI is even possible with conventional silicon.
Just like how Big Data doesn't magically prevent your local Walgreen's from running out of deodorant, Blockchain didn't magically revolutionize your banking, AI has only proven itself to be good at very specific tasks. But this industry promises that it will replace just about everything, save costs everywhere, make everything more efficient and profitable.
AI hasn't even managed to replace people taking orders at the drive thru, and that's supposed to be something it's good at. And with good reason: people working at a drive thru only cost ~$20/hour to hire.
Hot take: anyone who calls something a "___ burger" can be completely ignored.
I think agi will have to fake the real thing which might be good enough.
For me gpt is an invaluable nothing burger. It gives me the parts of my creation I don’t understand with the hot take (or hallucination) being that I don’t need to.
I need to learn how to ask and more importantly what
Not sure which I dislike more: nothingburger or “how bout them apples”
Food for thought
This isn't a hot take lol.
Altman is out of magic beans.
OpenAI is no longer relevant. DeepSeek was the last nail in ClosedAI's coffin.