__MatrixMan__ a day ago

This was posted yesterday: https://news.ycombinator.com/item?id=42860770

And since X sucks today just about as much as it sucked yesterday: https://nitter.poast.org/carrigmat/status/188424436990727810...

  • mrinterweb 21 hours ago

    Thank you for the alternative to X. I feel somehow complicit in supporting someone likely aligned with something unspeakable any time I click an X link.

  • alias_neo a day ago

    I don't know if it's just me, but I can't seem to view the nitter link; it says the tweet isn't found. I can't use the OP's Twitter link because I can't see anything without logging in.

    • __MatrixMan__ a day ago

      I ran across a thread once indicating that the guy who hosts that instance has a bit of a fight on his hands keeping it usable. He frequently fights hordes of bots, that sort of thing. Your signal probably got mistaken for some noise.

      We need Nitter and BitTorrent to have a baby, p2p so we can share the load.

      • alias_neo 20 hours ago

        Ah, that would make sense, because the CSS also didn't load on the page.

        It's probably for the same reason I can't view many other sites like Reddit, Imgur etc, because our VPN exit IP is from a cloud hosting provider.

      • ForOldHack 21 hours ago

        They did, and it works well: IPFS.

        • __MatrixMan__ 19 hours ago

          IPFS would be a good layer to build this on top of, but IPFS itself doesn't handle the correspondence between URL and CID. As it is, web DOMs are too variable to handle via content addressing.

          This would involve eliminating parts of the page that are unlikely to be stable (e.g. ads, pagination based on screen size, session-specific details like usernames) and using what remains to compute the CID. That way users only end up storing separate copies of the page when it has actually changed substantially. You'd probably also need some code to fluff the normalized content back up into something that makes sense for the device you're reading it on.
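
          Purely as a sketch of the normalization idea (the tag list and hashing scheme below are my own assumptions, not how IPFS or any existing tool does it):

            import hashlib
            from bs4 import BeautifulSoup  # assumes bs4 is available

            def stable_digest(html: str) -> str:
                soup = BeautifulSoup(html, "html.parser")
                # Drop elements that are unlikely to be stable between fetches.
                for tag in soup(["script", "style", "iframe", "noscript"]):
                    tag.decompose()
                # Collapse the remaining text to a canonical form and hash it.
                text = " ".join(soup.get_text(separator=" ").split())
                return hashlib.sha256(text.encode("utf-8")).hexdigest()

          Two fetches of the same article with different ads or session chrome would then normalize to the same digest, which is what you'd map to a CID.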

    • johnmaguire a day ago

      twiiit.com (three i's) will act as a round-robin to a working Nitter instance.

  • bilsbie a day ago

    [flagged]

    • bildung a day ago

      The twitter link doesn't work if you don't have an account, so I personally found the alternatives helpful.

    • leros a day ago

      X links are near useless if you don't have an account or are logged out since you can't view threads.

morphle a day ago

As this is HN, I'm curious if there is anyone here who is interested in starting a business hosting these large open source LLMs?

I just finished a test of running this 768 GB Deepseek-R1 model locally on a cluster of computers with 800 GB/s memory bandwidth (faster than the machine in the twitter post). I can now extrapolate to a cluster with 6000 GB/s aggregate memory bandwidth, and I'm sure we can reach higher speeds than Groq and Cerebras [1] on these large models.

We might even be cheap enough in OPEX to retrain these models.

Would anyone with cofounder or commercial skills be willing to set up this hosting service with me? It would take less than a $30K investment but could be profitable in weeks.

[1] https://hc2024.hotchips.org/assets/program/conference/day2/7...

  • ForOldHack 21 hours ago

    The Azure(tm) and AWS versions of rent-a-second are in the works as we speak. So yes, rent-a-brain/vegetable, and no, I will bet you $40k you will not beat either AWS or Microsoft to the punch. Zero chance of that. They will have their excess computational power with extremely discounted electric rates in place before Friday morning.

    • DrScientist 21 hours ago

      I wonder if the real market is actually bringing this stuff in-house.

      Given the propensity for these big tech companies to hoover up/steal any information they can gather, running these models locally, with local fine tuning looks quite attractive.

      • rrix2 21 hours ago

        > Given the propensity for these big tech companies to hoover up/steal any information they can gather

        at the end of the day you still have to sell this product to the sorts of companies that are far and away all microsoft 365/google workspace clients and we're gonna have to figure that out one day or another

      • themanmaran 21 hours ago

        I'd wager this is the real market. Ship some company a server rack with Deepseek R1 for a $1M annual rental fee + upgrades to the latest models.

        think inside the box

        • morphle 20 hours ago

          Cerebras already does this.

          • spacemanspiff01 19 hours ago

            I thought cerebras had moved to a cloud model so that they could more easily manage/patch their systems?

            • morphle 19 hours ago

              Both. The cloud model is also for renting models across several Cerebras Wafer Scale Integrations.

              • spacemanspiff01 19 hours ago

                Oh, they will deploy the racks at customers sites?

                • morphle 18 hours ago

                  I don't know what their policy is now, but they talk about it in the Hot Chips 2024 presentation.

                  I myself just have proof of a single customer having their own private Cerebras rack. There are rumors about several more customers with on-prem Cerebras.

      • valiant55 16 hours ago

        This is what has me most excited. AI has its limited uses for now, but with the requirement of handing over all your data to big brother it was not even worth considering. Now that on-prem is reasonable and doesn't require you to beg Nvidia for H100s, it might actually be usable.

      • willseth 20 hours ago

        AWS already has platforms for running and fine tuning OSS models that can run privately inside a VPC. If Azure and GCP don’t have equivalent capabilities already, it is surely imminent. Seems pretty hard or impossible to beat cloud providers at their own game.

        • guiambros 12 hours ago

          > If Azure and GCP don’t have equivalent capabilities already, it is surely imminent

          GCP offers hundreds of models in its Vertex AI, including all "open source" (actually open weights) models, and the ability to fine tune for your specific needs. This blog post is from 2023 [1].

          (disclaimer: I work at Google, but not on the Cloud team)

          [1] https://cloud.google.com/blog/products/ai-machine-learning/s...

        • HeatrayEnjoyer 20 hours ago

          If the hardware isn't in your physical possession you can't know that your data isn't being hoovered up. You can't end to end encrypt compute tasks (homomorphic processing is fiendishly uneconomical).

          • TeMPOraL 20 hours ago

            True, but at this point we're leaving the realm of cryptography and theoretical infosec, and enter the realm of real-world security. In this realm, permissions are established by armies of lawyers across organizations and governments defining who can or cannot do things, and what happens when transgressions occur; here, "defense in depth" carries all the way to the threat of men with guns escorting you to jail.

            So it's true that you can't encrypt compute tasks of this type end-to-end, so you can't know if unauthorized parties mine your data. However, Microsoft is very unlikely to mine your data (for "you" being e.g. any of the many multinational corporations that already run all their office work through Azure-hosted Outlook, Office, SharePoint, etc.), or to let others mine it, because if it ever came out, your customers' lawyers would be after you, your lawyers would be after Microsoft, and the whole thing would explode into a multiple-billion-dollars shitshow and might even get a government or two involved.

            That's the working assumption that makes Microsoft well-positioned to eat any fledgling self-hosted DeepSeek market in the business space. They already have things set up at a level that is trusted by governments as well as corporations in critical industrial sectors, with huge financial and legal exposure.

            (Presumably Google and Amazon are in a similar position here, though I've only seen this personally with Microsoft/Azure, so that's what I can comment on.)

            • DrScientist 4 hours ago

              Unfortunately US law appears to compel American companies to share your data without your knowledge to the US government.

              Given the current US government is headed by a person that just looks to take what he wants - your assurances aren't comforting.

              • TeMPOraL 2 hours ago

                > Unfortunately US law appears to compel American companies to share your data without your knowledge to the US government.

                Sure, but that's not some unexpected gotcha - it's just a plain fact of geopolitical reality, managed by international treaties and accounted for in laws and contracts around the world. A multinational enterprise isn't like a person subscribing to a free plan of a random SaaS because the "sign up" button was the right shade of green - there are armies of lawyers on both sides, tasked with navigating applicable regulations (including GDPR and export control laws) and finding out a way to make things work.

                When they can't, the deal simply doesn't happen.

            • zie 18 hours ago

              > "defense in depth" carries all the way to the threat of men with guns escorting you to jail.

              For a civil matter like contract breach, there is zero chance it ends with jail time.

              • TeMPOraL 17 hours ago

                That's the typical case, true - but for many (most?) of the big multinationals, the worst case scenario for a hack involves people dying or some piece of critical infrastructure exploding.

                On top of that, "everything is securities fraud" - and since that does carry potential jail time, corporations generally try to avoid pissing off parties that would be able to frame a contract breach (and its consequences) in terms of investment fraud.

                EDIT:

                For starters, almost all data a multinational corporation generates and processes is subject to export control regulations, which are broad, full of special cases, vary over time, space and politics, and most importantly, violations of them come with huge fines and criminal penalties[0] for both businesses and individuals involved. The only reason Microsoft can get a corporation like this to migrate to O365 and run their back-office in Azure cloud is by solid, tested contractual guarantees that the data will be processed in ways that will keep the customer compliant with applicable regulations. Now, I'm not a lawyer, but it's not particularly hard to draw a line from "Microsoft snooping on enterprise customers" to securities fraud.

                I mean, even in context of hosting a DeepSeek derivative, we're talking about a cloud service offering enterprise customers secure training on company data. "Company data" may involve, e.g. detailed documentation or specs for software for designing advanced optical systems, which may sound benign until you make the connection[1]: "advanced optics" includes applications in advanced laser systems, which basically means weapons (e.g. ranging, missile targeting, anti-missile countermeasures). Obviously, regulators around the world (and the US in particular) would be very unhappy to see such information crossing through the wrong borders. For both the affected customers and the cloud service, this is high stakes game; a random startup isn't in a position to enter it.

                --

                [0] - E.g. in US, up to $1M per violation and up to 20 years in prison, possibly at the same time; see https://www.bis.doc.gov/index.php/enforcement/oee/penalties.

                [1] - This was a real intro example used in export control training I went through some years ago.

                • zie 16 hours ago

                  Yes, technically you can go to prison for securities fraud, and everything could be securities fraud, if you have multiple shareholders and play in that sandbox.

                  A small random startup is unlikely to play in the securities sandbox until it has enough resources to hire enough lawyers to keep itself out of prison and keep the fines "reasonable" (i.e. not enough to incentivize actually doing anything about the fined behaviour beyond at least temporarily stopping it).

                  When was the last time securities fraud ended in jail time for any S&P 500 company? My quick web search returned no instances ever (but I could be wrong).

                  • TeMPOraL 15 hours ago

                    Sure - but that's just, to the extent of our knowledge, regulations working as intended.

                    My point here is that OP's startup won't be able to compete with incumbents for enterprise money, and since the incumbents already provide this kind of service cheaply and reliably for customers of any size, all while handling applicable security concerns, OP's startup won't be able to compete with them for smaller customers either.

                    • zie 15 hours ago

                      Agreed, unless they can add a hook(some cool unique feature/thing to get them traction). It probably won't work out well.

                      • TeMPOraL 15 hours ago

                        FWIW, I think one possible hook would be to package up "training and deploying model on site" into a product - because after Azure, GCP and AWS, the next set of players best-positioned to make use of cheap frontier model training are... the very enterprise customers who would buy from aforementioned cloud providers instead of doing it themselves. Simplifying internal deployments could convince at least some of them to pay you instead of the Big Cloud.

          • threeseed 20 hours ago

            > you can't know that your data isn't being hoovered up

            There is no evidence of this happening in the last 20 years. None.

            And if there was it would be the complete unravelling of the entire cloud concept.

            So you're talking about solving a problem no one has.

            • DrScientist 4 hours ago

              It's not so much about hoovering as targeted spying.

              Plenty of evidence of companies and governments using spying for commercial/national (sometimes the same) advantage.

              So let's say you are a big company, and suddenly the US government decides you are a competitor in a nationally strategic industry - is your data safe if held by a US company?

      • antupis 21 hours ago

        Pretty much. Especially in Europe there are lots of big companies and public sector institutions that would pay serious € if they could run these.

        • morphle 20 hours ago

          Spot on! I concur that most European businesses and public sector institutions would be eager to rent this because they are not allowed by law to use US datacenters like AWS or Azure.

          • kiviuq 20 hours ago

            That's not the only issue. They want a guarantee that the model wasn't trained on copyrighted material.

            • TeMPOraL 19 hours ago

              Now that is a real feature. A lot of hesitation in embracing generative AI in large enterprises stems from uncertainty about copyright issues. Anyone who trained an o1-level model from scratch on public/properly licensed data only would be able to provide a very valuable service to those enterprise customers.

              However, if both training and operating costs of a DeepSeek-like model are as small as they are, the companies best able to offer this service are... Microsoft, Amazon and Google. And second best are... teams inside the would-be customer enterprises themselves. $6M to train and $6K to run is effectively free for such companies; there is no moat here. The services that enterprise customers would happily buy instead of building are... operations, and assuming legal liability if the model turns out not to be safe from copyright infringement lawsuits. But those are exactly the services those companies are already buying from Microsoft, Amazon and Google.

            • fulafel 19 hours ago

              This would result in some refreshing models, I guess they would be trained mostly on out-of-copyright stuff from 75+ years ago and wouldn't have knowledge of the modern world.

              Maybe they could skin the robotic bureaucrats in vintage sci-fi appearance as well to have the whole consistent experience when you go to the building permits bot; there could be small talk about the latest Beatles record etc.

              • TeMPOraL 15 hours ago

                Enforcing copyright on training data to this extent would actually create a temporary moat for the biggest players - they can afford to hire a lot of cheap labor to supplement the training dataset with human-authored original works that skirt IP protections by interpreting, parodying, commenting on or otherwise describing the protected works without actually infringing on them. As long as they keep those datasets private, everyone else is shit out of luck.

                (I'm reiterating my prediction wrt. AI and moats - the only mid-term moat there can be is in human labor. Hardware vendors benefit from selling better hardware to more people for less; software and research are cheap to scale, datasets eventually leak or get reproduced. Human labor is the one thing that doesn't scale, and except for an economic crisis, only ever gets more expensive with time. Whatever edge one can get by applying human labor that cannot be substituted by AI - like RLHF and its evolutions - is the one that will last all the way to AGI; past that, moats won't matter anymore.)

                One of the many reasons I'm firmly on the side of making the training of large neural models exempt of copyright considerations for everyone.

                • fulafel 10 hours ago

                  Isn't the training already exempt from copyright? Copyright is in the core about enabling licenses related to who's allowed to distribute copies of content (not ideas, but the exact same text, etc).

                  edit: apparently in the EU the situation is complicated by new AI specific legislation in the works: https://www.morganlewis.com/pubs/2024/02/eu-ai-act-how-far-w...

    • morphle 21 hours ago

      I think the important metric will be whether we can compete against the price of AWS or Microsoft in running large LLMs, not their time to market. Competing on cost against overpriced hyperscalers is not very hard, and $30K is a small investment, not a gamble. If it fails, worst case you would only lose $3000-$6500 or so.

      • DoingIsLearning 21 hours ago

        > $40K is a small investment, not a gamble. If it would fail, worst case you would only lose $3000-$6500 or so.

        As someone not familiar with investment sourcing or SME financing, could you break down the maths/accounting? How do you go from sinking 40k into a business to losing 6.5k if you turn the lights off at the end?

        • morphle 21 hours ago

          You buy the hardware (48 servers), rent part of a colocation rack with a 10 Gbps or 100 Gbps internet transit link, get a payment processor, and make a webpage and GitHub demo with the API. Breakdown: $3000 labour, $20.5K hardware, $800 monthly rental fees, $376 car fees. When you shut down within a year, the $20.5K of popular off-the-shelf hardware can easily be sold for $17K, a fact you can check against 25 years of data.

          I would invest more than the initial $30K on optimization after the servers have found paying customers and thus have proven commercial viability. I would invest in software development, finetuning, retraining and above all reverse engineering GPU and neural engine instruction sets and adapting these open source models to the more than 2 quadrillion operations per second that these 48 servers can do.

          • seanp2k2 21 hours ago

            So, $0 budget for software dev / sales / support?

            • tonyhart7 19 hours ago

              Well, when you run an AI company, you must test your product, right? What better way to test it than by building your own webpage, admin panel, etc.?

            • morphle 21 hours ago

              I broke down the first $30K investment cost for the release of the online API product, which does not need further software development, sales or support.

              You would be wise to do the software development I mentioned, and to do more sales and support than was covered under my initial $3000 labour fee. But you can pay for that out of revenues; it would not be part of the initial investment to see if it is viable as a business.

            • gloflo 20 hours ago

              That's what the AI is for, no? /s

          • lossolo 13 hours ago

            Where can you get 10 or 100 Gbps flat with a full 48U rack and power for $800?

            Because if that exists, I want to buy them all.

            • morphle 10 hours ago

              Email in my profile. 100 Gbps plus rack is 5 times the cost of 10 Gbps. You'll need 100 Gbps routers and SmartNICs too.

              • menaerus 7 hours ago

                48 servers at 60K also does not add up for me at all. Even if I consider second-hand ones, I could hardly find Zen 4 under ~10K. And this is without GPUs.

  • rainclouds 20 hours ago

    Hardware requirements? I think I can hit the memory bandwidth building from parts I have in my house. Maybe even 2x. Asking for fun not profit.

    • morphle 19 hours ago

      I'd love to visit your house then. You have 768-1400 GB DRAM with 6000 GB/s memory bandwidth? Nice house.

      In my house I currently have almost 900 GB/s of memory bandwidth in aggregate but only 132 GB total DRAM.

      • rainclouds 14 hours ago

        I’ve got a terabyte of DDR4 and a bunch of old Threadrippers. They can take 256 GB each and have 8 channels.

        • morphle 10 hours ago

          Yep, that's the right stuff. Now simply cross-connect all the free PCIe lanes of all the Threadrippers and you have a nice cluster for LLMs.

  • ryao 20 hours ago

    Did you implement token generation for Deepseek R1 using PBLAS?

horsawlarway a day ago

6 to 8 t/s is decent for a GPU-less build, but speaking from experience... it will feel slow. Usable - but slow.

Especially because these models <think> for a long bit before actually answering, so they generate for a longer period. (to be clear - I find the <think> section useful, but it also means waiting for more tokens)

Personally - I end up moving down to lower quality models (quants/less params) until I hit about 15 tokens/second. For chat, that seems to be the magical spot where I stop caring and it's "fast enough" to keep me engaged.

For inline code helpers (ex - copilot) you really need to be up near 30 tokens/second to make it feel fast enough to be helpful.

  • UncleOxidant 20 hours ago

    Is a token a character or a word? I don't think I can read 30 words/second so 8/second would be fine. But if it's characters then yes, it would be slow.

    • horsawlarway 18 hours ago

      It varies a lot by model, tokenizer, and language.

      For DeepSeek, 1 token ~= 3 English characters.

      See: https://api-docs.deepseek.com/quick_start/token_usage/

      It comes out to around 1-3 words/second. This is not so slow that it's maddening (e.g. 2 tokens/second is frustratingly slow, like walk-away-and-make-coffee-while-it's-answering slow), but it's still slow enough to make it hard to use functionally without breaking flow state. You get bored and distracted reading at that pace.
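
      Rough arithmetic, if it helps (the chars-per-word figure is just an assumption):

        tokens_per_sec = 6       # low end of the 6-8 t/s range in this build
        chars_per_token = 3      # the DeepSeek figure above
        chars_per_word = 6       # ~5 letters plus a space, assumed
        print(tokens_per_sec * chars_per_token / chars_per_word)  # 3.0 - the top of the range above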

    • ryao 20 hours ago

      It's typically somewhere between 3 and 5 characters.

  • SkyPuncher 18 hours ago

    I've found that with RooCode the slower token speeds are acceptable.

    I'll ask it to do a task, then it will do it in a few steps and notify me when it's done. I use that time to take care of something else.

  • k__ a day ago

    Would DeepSeek-V3 be cheaper?

    • sigmoid10 a day ago

      Only in the sense that it will generate fewer tokens.

  • guerrilla 21 hours ago

    This is the same I get from a Core i5-9400 for smaller models. Is there no prosumer board that can take that much RAM? There must be a ThreadRipper that can do it, right? Why did he need EPYC?

    • hollerith 21 hours ago

      It's not just the amount of RAM: RAM bandwidth also matters.

    • UncleOxidant 20 hours ago

      RAM channels is key here. That board has 24 channels of DDR5 RAM. A lot of lowend boards only have 2 channels. Most come in at 4. Some high end boards have 8, so 24 channels is quite wide.

ryao 21 hours ago

A better setup on paper would be this CPU and motherboard combination with 12x64GB DIMMs:

https://www.newegg.com/p/N82E16819113866

https://www.newegg.com/supermicro-h13ssl-nt-amd-epyc-9004-se...

As for memory, these two kits should work (both are needed for the full 12 DIMMs):

https://www.newegg.com/owc-256gb/p/1X5-005D-001G0

https://www.newegg.com/owc-512gb/p/1X5-005D-001G4

Since it would be a 2DPC configuration, the memory would be limited to 4400MT/sec unless you overclock it. That would give 422.4GB/sec, which should be enough to run the full model at 11 tokens per second according to a simple napkin math calculation. In practice, it might not run that fast. If the memory is overclocked, getting to 16 tokens per second might be possible (according to napkin math).
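
The napkin math, spelled out (the ~37GB activated-weights figure is from elsewhere in the thread; treat everything as a rough estimate):

  channels = 12
  mt_per_sec = 4400e6            # DDR5-4400, transfers per second per channel
  bytes_per_transfer = 8         # 64-bit channel
  bandwidth = channels * mt_per_sec * bytes_per_transfer    # ~422.4e9 B/s

  activated_bytes = 37e9         # ~37GB of weights touched per token at Q8
  print(round(bandwidth / activated_bytes, 1))               # ~11.4 tokens/second, pre-overhead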

The subtotal for the linked parts alone is $5,139.98. It should stay below $6000 even after adding the other things needed, although perhaps it would be more after tax.

Note that I have not actually built this to know how it works in practice. My description here is purely hypothetical.

  • menaerus 6 hours ago

    I am not sure that would work well since 8 cores is really small; it can't really scale well wrt the attention head algorithm, and more importantly this particular EPYC part can only achieve ~50% of the theoretical memory bandwidth, ~240 GB/s. Other EPYC parts run close to ~70%, at ~400 GB/s, which is what OP is using.

  • bildung 21 hours ago

    I think the point of the two-socket solution is the doubled memory bandwidth. You propose using just a single one of the same CPU, or am I missing something?

    • ryao 20 hours ago

      llama.cpp’s token generation speed does not scale with multiple CPU sockets just like it does not scale with multiple GPUs. Matthew Carrigan wrote:

      > Also, an important tip: Go into the BIOS and set the number of NUMA groups to 0. This will ensure that every layer of the model is interleaved across all RAM chips, doubling our throughput. Don't forget!

      This does not actually make sense. It is well known that there is a penalty for accessing memory attached to a different CPU. You don’t get more bandwidth from disabling the NUMA node information and his token generation performance reflects that. If there was a doubling effect from using two CPU sockets, he should be getting twice the performance, but he is not.

      Additionally, llama.cpp’s NUMA support is suboptimal, so he is likely taking a performance hit:

      https://github.com/ggerganov/llama.cpp/issues/11333

      When llama.cpp fixes its NUMA support, using two sockets should be no worse than using one socket, but it will not become better unless some new way of doing the calculations is devised that benefits from NUMA. This might be possible (particularly if you can get GEMV to run faster using NUMA), but it is not how things are implemented right now.

      • freeqaz 19 hours ago

        Do you get more bandwidth at the cost of latency?

        Also how much would stuffing a GPU or 3 (3090/4090) improve speeds, even with heavy CPU layer offloading, or would the penalty be too big? I know in some cases you're swapping data into the GPU, but in others you're just doing parts on the CPU. I'm curious what the comparison for speed would be.

        • ryao 18 hours ago

          I would suspect the infinity fabric links are already saturated with the local RAM’s memory bandwidth such that you will not get more by accessing another socket’s RAM.

          Chips and Cheese suggests things are even worse than this as the per CCD bandwidth is limited to around 120GB/sec, which probably ruins the idea of using the 9015, as that only has 2 CCDs:

          https://old.chipsandcheese.com/2024/10/11/amds-turin-5th-gen...

          https://www.techpowerup.com/cpu-specs/epyc-9015.c3903

          Anyway, leveraging both sockets’ memory bandwidth would require splitting the layers into partitions for each NUMA node and doing that partition’s part of each GEMV calculation on the local CPU cores. PBLAS might be useful in implementing something like that.

          As for a speed-up from using 3090/4090 cards, that is a bit involved to estimate. The model has 61 layers. The way llama.cpp works is that it will offload layers and the computation will move from device to device depending on where the layers are in memory. You would need to calculate roughly how long it takes for each device to do a layer. Then multiply by the number of layers processed by that device and sum across the devices. Finally, normalize to get the number of tokens per second and you will have your answer. DeepSeek R1 has 61 layers (although I think llama.cpp will say 62 due to the embedding layer if it counts for DeepSeek like it does for llama 3). It has 37GB of activated weights, so you can do 37GB / 61 / memory bandwidth to get the time per layer. You probably want to multiply by 1.25 as a fudge factor to account for the fact that these things never run at the full speed that these calculations predict. Then you can plug these numbers into the earlier calculation I described to get your answer.
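
          A rough sketch of that procedure (the bandwidth figures and layer split below are illustrative assumptions, not measurements):

            LAYERS = 61
            ACTIVATED_BYTES = 37e9        # ~37GB touched per token at Q8
            FUDGE = 1.25                  # nothing runs at theoretical bandwidth

            def time_per_layer(bandwidth):
                return ACTIVATED_BYTES / LAYERS / bandwidth * FUDGE

            # Example split: 10 layers on a ~936GB/s 3090-class card, the
            # remaining 51 on a ~400GB/s dual-EPYC host.
            devices = [(10, 936e9), (51, 400e9)]   # (layers, memory bandwidth)
            seconds_per_token = sum(n * time_per_layer(bw) for n, bw in devices)
            print(round(1 / seconds_per_token, 1))  # ~9.5 tokens/second with this split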

999900000999 a day ago

I'm so hyped for this.

It's going to take some time, but the farce is gone. We'll have parity with ChatGPT on consumer hardware soon enough. 6k is still too much. I suspect the community will be able to get this down to 2K.

I'm tempted to cancel my ChatGPT subscription!

  • binary132 20 hours ago

    My theory is that in the future we will have much more “friends circle cloud” type ops, where that hardware cost is spread out among a small community and access is private. What it won’t look like is every Tom, Dick, and Harry running their own $10k hardware to have a chuckle at the naughty jokes and grade-C+ programmer IDE assistance offered by open-source LLMs.

  • seanp2k2 21 hours ago

    Orders of magnitude make the difference. “Consumer-level” will be once it’s around $300-700ish in a nice little box like an Intel NUC or similar. Ubiquiti just did this with their AI-Key to help classify stuff on their video camera surveillance platform.

    • 999900000999 21 hours ago

      2K is reasonable.

      Not every single person needs to have it. But if someone in your circle has the needed hardware...

mysteria a day ago

Would adding a single GPU help with prompt processing here? When I run a llama.cpp GPU build with no layers offloaded prompt processing is still way faster as all the matrix multiplies are done on the accelerator. The actual memory bound inference continues to run on the CPU.

Since tensor cores are so fast you still come out ahead when you send the weights over PCIE to the GPU and return the completed products back to main memory.

  • ryao 21 hours ago

    Maybe. It likely depends on whether GEMM can operate at/near full speed with streaming weights over PCI-E. I am not sure offhand how to stream the weights over PCI-E for use by CUDA/PTX code. It would be an R&D project.
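
    Napkin math for why it matters (assuming PCIe 4.0 x16 at roughly 32GB/s and ~37GB of activated Q8 weights):

      pcie_bw = 32e9
      activated_bytes = 37e9

      # Token-by-token generation: weights would cross the bus every token.
      print(activated_bytes / pcie_bw)          # ~1.2 s/token -> PCI-E becomes the bottleneck

      # Prompt processing: the same transfer is amortized over a whole batch,
      # so the per-token cost drops by orders of magnitude.
      print(activated_bytes / pcie_bw / 512)    # ~2.3 ms per prompt token at batch 512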

permanent a day ago

-- copied

Complete hardware + software setup for running Deepseek-R1 locally. The actual model, no distillations, and Q8 quantization for full quality. Total cost, $6,000. All download and part links below:

Motherboard: Gigabyte MZ73-LM0 or MZ73-LM1. We want 2 EPYC sockets to get a massive 24 channels of DDR5 RAM to max out that memory size and bandwidth. https://t.co/GCYsoYaKvZ

CPU: 2x any AMD EPYC 9004 or 9005 CPU. LLM generation is bottlenecked by memory bandwidth, so you don't need a top-end one. Get the 9115 or even the 9015 if you really want to cut costs https://t.co/TkbfSFBioq

RAM: This is the big one. We are going to need 768GB (to fit the model) across 24 RAM channels (to get the bandwidth to run it fast enough). That means 24 x 32GB DDR5-RDIMM modules. Example kits: https://t.co/pJDnjxnfjg https://t.co/ULXQen6TEc

Case: You can fit this in a standard tower case, but make sure it has screw mounts for a full server motherboard, which most consumer cases won't. The Enthoo Pro 2 Server will take this motherboard: https://t.co/m1KoTor49h

PSU: The power use of this system is surprisingly low! (<400W) However, you will need lots of CPU power cables for 2 EPYC CPUs. The Corsair HX1000i has enough, but you might be able to find a cheaper option: https://t.co/y6ug3LKd2k

Heatsink: This is a tricky bit. AMD EPYC is socket SP5, and most heatsinks for SP5 assume you have a 2U/4U server blade, which we don't for this build. You probably have to go to Ebay/Aliexpress for this. I can vouch for this one: https://t.co/51cUykOuWG

And if you find the fans that come with that heatsink noisy, replacing with 1 or 2 of these per heatsink instead will be efficient and whisper-quiet: https://t.co/CaEwtoxRZj

And finally, the SSD: Any 1TB or larger SSD that can fit R1 is fine. I recommend NVMe, just because you'll have to copy 700GB into RAM when you start the model, lol. No link here, if you got this far I assume you can find one yourself!

And that's your system! Put it all together and throw Linux on it. Also, an important tip: Go into the BIOS and set the number of NUMA groups to 0. This will ensure that every layer of the model is interleaved across all RAM chips, doubling our throughput. Don't forget!

Now, software. Follow the instructions here to install llama.cpp https://t.co/jIkQksXZzu

Next, the model. Time to download 700 gigabytes of weights from @huggingface! Grab every file in the Q8_0 folder here: https://t.co/9ni1Miw73O

Believe it or not, you're almost done. There are more elegant ways to set it up, but for a quick demo, just do this. llama-cli -m ./DeepSeek-R1.Q8_0-00001-of-00015.gguf --temp 0.6 -no-cnv -c 16384 -p "<|User|>How many Rs are there in strawberry?<|Assistant|>"

If all goes well, you should witness a short load period followed by the stream of consciousness as a state-of-the-art local LLM begins to ponder your question:

And once it passes that test, just use llama-server to host the model and pass requests in from your other software. You now have frontier-level intelligence hosted entirely on your local machine, all open-source and free to use!

And if you got this far: Yes, there's no GPU in this build! If you want to host on GPU for faster generation speed, you can! You'll just lose a lot of quality from quantization, or if you want Q8 you'll need >700GB of GPU memory, which will probably cost $100k+
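
(Not part of the copied tweet: a minimal client sketch for that last llama-server step, assuming the default port and the OpenAI-compatible chat endpoint.)

  import requests

  resp = requests.post(
      "http://localhost:8080/v1/chat/completions",
      json={
          "messages": [{"role": "user", "content": "How many Rs are there in strawberry?"}],
          "temperature": 0.6,
      },
      timeout=600,  # R1 thinks for a while at 6-8 tokens/second
  )
  print(resp.json()["choices"][0]["message"]["content"])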

  • ijk a day ago

    I'd assume that the existing llama.cpp ability to split layers out to the GPU still applies, so you could have some fraction in VRAM and speed up those layers.

    The memory bandwidth might be an issue, and it would be a pretty small percentage of the model, but I'd guess the speedup would be apparent.

    Maybe not worth the few thousand for the card + more power/cooling/space, of course.

  • SlavikCA 21 hours ago

    A 2x CPU system may be slower for LLM inference than a 1x CPU system.

    Because in a 2x CPU system, the model may have to be accessed via the NUMA interconnect, which has only 10%-30% of the local memory bandwidth.

UncleOxidant 20 hours ago

> "Complete hardware + software setup for running Deepseek-R1 locally. The actual model, no distillations, and Q8 quantization for full quality. Total cost, $6,000."

These guys[0] say they got really good results with 2.51-bit quantization of the original R1. The original has 671B params weighing in at 720GB - that's what's being run on this $6000 setup. The 2.51-bit dynamic quant would be this model[1], which is still 671B params but weighs in at 212GB.
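
Back-of-envelope check on those sizes (ignoring the small share of tensors kept at higher precision):

  params = 671e9

  def size_gb(bits_per_weight):
      return params * bits_per_weight / 8 / 1e9

  print(round(size_gb(8)))      # ~671 GB -> in the ballpark of the ~720GB Q8 files
  print(round(size_gb(2.51)))   # ~211 GB -> matches the ~212GB dynamic quant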

[0] https://unsloth.ai/blog/deepseekr1-dynamic [1] https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/De...

  • segmondy 18 hours ago

    That's not what they are running. The dynamic quants are Q1, Q2. The $6000 build is running Q8.

    • UncleOxidant 18 hours ago

      Yes, exactly. What I'm saying is that the blog referenced in [0] says you can get good results with 2.51-bit quantization. The $6000 rig is running Q8. You can probably get similar results with a lesser rig if you use the quantized model.

ComputerGuru a day ago

CPU-only is, very unfortunately, infeasible for reasoning models. This setup would be great for deepseek v3 or (more fittingly) the 405B llama 3.1 model, but 6-7 tokens per second on a reasoning model is 100% getting (well) into seconds-per-token territory if you consider only the final answer.

(You don’t have to take it from me: if CPU were good enough, AMD’s valuation would be 100x its current value.)

  • samvher a day ago

    Given what we just saw in terms of the DeepSeek team squeezing a lot of extra performance out of more efficient implementation on GPU, and the model still being optimized for GPU rather than CPU - is it unreasonable to think that in the $6k setup described, some performance might still be left on the table that could be squeezed out with some better optimization for these particular CPUs?

    • ryao 20 hours ago

      The answer to your question is yes. There is an open issue with llama.cpp about this very thing:

      https://github.com/ggerganov/llama.cpp/issues/11333

      The TLDR is that llama.cpp’s NUMA support is suboptimal, which is hurting performance versus what it should be on this machine. A single socket version likely would perform better until it is fixed. After it is fixed, a dual socket machine would likely run at the same speed as a single socket machine.

      If someone implemented a GEMV that scales with NUMA nodes (i.e. PBLAS, but for the data types used in inference), it might be possible to get higher performance from a dual socket machine than we get from a single socket machine.
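
      Illustratively, that GEMV would be a row partition per socket, something like this (numpy only shows the math; a real implementation would pin each partition's memory and worker threads to its own node):

        import numpy as np

        def numa_gemv(weight_parts, x):
            # Each "node" owns a contiguous block of rows and produces its
            # slice of the output from purely local memory.
            return np.concatenate([w @ x for w in weight_parts])

        w = np.random.rand(8192, 4096).astype(np.float32)
        x = np.random.rand(4096).astype(np.float32)
        parts = np.array_split(w, 2, axis=0)   # one block per socket
        assert np.allclose(numa_gemv(parts, x), w @ x, rtol=1e-3)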

    • snovv_crash 21 hours ago

      No, because the bottleneck is RAM bandwidth. This is already quantized and otherwise is essentially random so can't be compressed in any meaningful way.

      • menaerus 6 hours ago

        How much bandwidth do we actually need per generated token? Let's take one open-source model as a starting point, since not all models are created the same.
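
        Napkin math, treating required bandwidth as bytes of weights read per token times the target token rate (R1's ~37B activated params at Q8 as the starting point, with a dense 405B for contrast; both are published figures):

          models = {
              "DeepSeek-R1 (MoE, ~37B activated)": 37e9,
              "Llama 3.1 405B (dense)": 405e9,
          }
          target_tps = 10
          for name, bytes_per_token in models.items():
              print(f"{name}: ~{bytes_per_token * target_tps / 1e9:.0f} GB/s for {target_tps} tok/s")

        So roughly ~370 GB/s for R1 at 10 tok/s, versus ~4 TB/s for a dense 405B model at the same rate.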

    • telotortium 21 hours ago

      Maybe a little, but FLOPs and memory bandwidth don't lie.

  • brandall10 19 hours ago

    "This setup would be great for deepseek v3 or (more fittingly) the 405B llama 3.1 model"

    v3 yes, w/ 37B activated params, but terrible on 405B as it's a dense model.

  • qingcharles 21 hours ago

    It honestly depends on your use case. I often run bigger, slower models on my PC and let them just tootle along in the background grinding out their response while I work on something else.

    • grahamj 21 hours ago

      Yeah this is what I was thinking, or maybe use smaller models to work on the prompt then fire it off to the biggie while you do something else.

drooby a day ago

The amount of compute required for R1 to determine that strawberry has 3 Rs is hilarious..

I'm not so convinced that the Nvidia panic is justified.

  • ceejayoz a day ago

    The Nvidia panic stems, in part, from the possibility that this is just the first of many potential significant optimization leaps.

    • drooby a day ago

      Idgi, optimization leaps are virtually guaranteed.

      I think this is a deeper question about the bounds of human desire... which seem virtually limitless. We seem to have an unlimited appetite for answering questions on the complexity of existence. Pair that with arms race issues, and you have an obvious need for massive compute regardless of how efficient the algorithms are.

      • kovacs 21 hours ago

        That's fair but look at the assumptions currently built into NVDA's stock price. Hard to say for certain but to some it's priced as if it has a monopoly on all things AI for a decade. If you can spend $6K (and less in the future) on a full system that runs a model for you where does that leave the assumptions baked into that stock price? I'm too lazy to have come up with a model myself but off the cuff it seems like there might be some dislocation in a lot of assumptions. I dunno. I'm kind of a luddite because of the .com bubble and this has the same feel.

  • layer8 a day ago

    I wondered how I should feel about that thought process. However, I might have a similar thought process when inebriated.

  • disgruntledphd2 a day ago

    I mean, to be fair, given the way tokenization works, this isn't that surprising.

    I do agree that it's very funny though.

    • cookingrobot a day ago

      I think it’s like asking someone “how many times does your pen change direction when writing the word strawberry on paper”.

      You need to think really hard to get to an answer, because that’s more fine grained than the way you usually think about words and letters.

  • UncleEntity 20 hours ago

    > The amount of compute required for R1 to determine that strawberry has 3 Rs is hilarious..

    How about the amount of compute to take a buggy hextree implementation I've been poking at for a few years and completely rewrite it into a fully functioning implementation (with test cases) without me having to write a single line of code? Well...other than me having to break out the printf debugger to help with tracking down some deep bugs that is.

    I've been trying to get the original implementation to work correctly for so long I don't even remember what I wanted to use it for in the first place, I just mess with it a bit here and there when I have nothing better to do.

    And that's just me, a half-assed self-taught junior woodchuck coder, playing around to see what all the hype is about.

    I can only imagine that I'm giving them valuable training data as some of the bugs were very deep and took a whole lot of 'thinking' to track down the root cause. It does take a bit of prodding to get it to look in the right place but so far it has found and fixed them all. The last bug was an overflow in the tree iterator's stack it uses to track state across iterations that I was concerned would time out as it was thinking for a long, long time.

    I'm not really one to defend the robots but this one is actually useful.

    --edit--

    Oh... I guess that's a meme now. Nothing to see here, move along...

swiftcoder a day ago

Does anyone have the performance delta between running this on a 768 GB setup like this where the whole thing fits in RAM, versus running it on an M4 Mac with the maxed out 128 GB?

  • oynqr 17 hours ago

    Running the ollama 671b 4 bit quant on a 7950X3D with 128GiB RAM, I get like 1-2 t/s.

  • qingcharles 21 hours ago

    My other question is.. can you jam 768GB of RAM into an M2 Ultra Mac Pro?

    And M4 Ultra Mac Pros are probably only weeks away too.

    • twoodfin 21 hours ago

      No, you cannot. All M-series Macs to date—including the Mac Pro—have RAM fixed at manufacturing.

    • qingcharles 21 hours ago

      OK, well, turns out you can't upgrade the RAM in the M2 Mac Pros and they top out at 192GB spec from factory, so that's that.

      • ryao 21 hours ago

        It might be possible to desolder the chips and solder larger capacity ones, but it is a risky thing to do, especially since there is no guarantee that it will work out.

        That said, a similar upgrade has been done on the raspberry pi 4, so it is theoretically possible:

        https://hackaday.com/2023/03/05/upgrade-ram-on-your-pi-4-the...

nexus_six a day ago

What would the context length look like for this setup? How quick would the 6-7 tps degrade once you hit say 20k tokens?

JonChesterfield a day ago

Working through this now. The directions are to download the contents of the Q8_0 at https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main. That turns out to be a git lfs repo. `git clone` followed by `git lfs pull` is downloading all of it which will have to do. (fetchinclude = DeepSeek-R1-Q8_0 seems to be limiting it to the directory of interest). If there's a cleverer way to get the files please reply - I looked for a torrent and failed to find one.
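
One possibly cleverer route (untested by me end-to-end): huggingface_hub can pull just the Q8_0 shards, with resume support built in:

  from huggingface_hub import snapshot_download

  snapshot_download(
      repo_id="unsloth/DeepSeek-R1-GGUF",
      allow_patterns=["DeepSeek-R1-Q8_0/*"],   # only the ~700GB Q8_0 directory
      local_dir="DeepSeek-R1-GGUF",
  )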

Not completely clear what changing NUMA nodes per socket from 1 to 0 does; possibly it gives Linux less information about when to migrate threads across the cores? (It didn't upset LLVM compile time, so I'll leave it on NPS0.)

  • kgwgk a day ago

    Can’t you use the download icons in each object at https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/De... ?

    • JonChesterfield 21 hours ago

      The icons resolve to things like https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main... which wget understands. Presumably there's a greater-than-average risk of corruption in transit when the files are big and git does some sort of integrity checking that one would lose out on? It's the verify-local-data feature I'm really missing from torrent here.

      • Bognar 15 hours ago

        It's HTTPS. Not only are you getting checksumming from TCP, but any block that had bitflips would fail TLS decryption and would fail the entire transfer. You're not going to see silent corruption in transit.

      • ryao 20 hours ago

        TCP checksums usually prevent corruption, unless the data is already corrupt prior to the checksum computation (which does happen rarely).

      • kgwgk 20 hours ago

        I’ve never seen any « corruption in transit » when downloading big files.

sheepscreek 21 hours ago

Update: Scratch that. Two of them together would only be able to run something half as big (~400B parameters) and cost as much as this rig. Maybe the next gen of DIGITS could do it. Keeping that in mind, this rig is pretty darn impressive for $6k!

This is going to drop by half when Nvidia starts shipping DIGITS. I think we’re all going to want one. It’ll probably have a much bigger impact than the Apple Vision Pro, which costs the same.

I can already think of using it as a much more intelligent local Siri/Alexa to control devices. It’s something that can actually keep the kids engaged with useful trivia/knowledge (better than watching mindless trash on YT), or it can just humour me whenever I want - all without needing to worry about privacy.

indeed30 a day ago

So, can somebody in the know speculate about how Deepseek (or OpenAI, or whoever really) is actually running their API?

If I wanted to run a production-grade service using the full Deepseek model, with good tokens/sec and the ability to serve concurrent requests, what sort of hardware are we looking at?

  • MurkyLabs a day ago

    Racks and racks of servers (likely Nvidia HGX H100/H200 8-GPU servers) connected with at least 100Gb (but more likely 400Gb and 800Gb) links. The servers alone start at about $350k. Then you need to supply power, cooling, networking and a technical team to support the program.

niwtsol a day ago

Is anyone aware of a site that shows various builds or off-the-shelf systems (new and old) and how they handle various models? Like I’d love to see the above vs a Mac Studio 192GB vs an old M1 Studio vs other models. I don’t have $6000, but what is a good happy medium I could get to on a used system with a smaller model?

sgt101 a day ago

Hmm - if only Mac Pros had memory that could be upgraded like in the olden days.

  • qingcharles 21 hours ago

    Wow, I had to go confirm this for myself. This sucks. They max out at 192GB on the current M2 line, which is absurd.

    Hopefully this year's M4 Ultra systems will at least allow a much higher top spec.

  • aurareturn a day ago

    Part of the reason it has high memory bandwidth is that it is soldered.

    • menaerus 6 hours ago

      The M1 Max for example has ~240 GB/s of memory bandwidth. Do you think this is because they're using a 256-bit LPDDR5-6400 part in combination with however many memory channels, or because they soldered it to the chip?

      • sgt101 6 hours ago

        hmm that's a hard one - let's ask Deepseek!

  • SV_BubbleTime a day ago

    If it could, it wouldn’t be so tightly coupled, and then it would be identical to everything else out there.

    • sgt101 19 hours ago

      I wonder if that's really set in stone or just a stance that suited Apple's marketing for a few years. Maybe there's a business case for some bigger-memory Macs now!

    • bitwize a day ago

      It also wouldn't be as fast. Soldered RAM is faster than socketed RAM. RAM that is in the CPU package is faster still.

      Apple is going in the direction of total integration. Louis Rossmann and iFixit will hem and haw, but soon there will be a MacBook whose motherboard, besides cooling, PSU, and ports, consists simply of a single component that houses CPU, GPU, RAM, storage, I/O port controllers, radios (wifi, Bluetooth, etc.), firmware for all of the above, and a security module plus keys, all directly on the CPU bus, and it will be glorious. It will absolutely lap any PC laptop, and in single-core performance will smoke even high-end AMD Epyc beastbox builds because of the aggressive elimination of inter-component latency.

      But there won't be a need to fix it. If it breaks you just recycle it and buy new, being sure to sync your data back from Apple Cloud -- but it probably won't break. Kinda like how unibody cars are both safer and more reliable than the much more fixable cars of the 60s, even if they crumple like tinfoil and must be totaled upon experiencing any sort of impact.

wg0 a day ago

Or - Have your own "OpenAI" at home. Train, fine-tune, distill in the cloud if and when necessary.

Now this is basically the moment of Apache/nginx being free and open source.

You then get the shared hosting phenomenon out of it.

monobot12 a day ago

If you don't mind a speed of 1 token per second, you can run the largest R1 model on a 2021 iMac, as I just did.

  • btbuildem a day ago

    Largest R1, as in the 671B? How do you accomplish that feat?

    • oynqr 17 hours ago

      Just do it? Llama.cpp doesn't load the entire thing into RAM. It mmaps the file and the kernel takes care of the rest.

  • jeffbee a day ago

    Are we speaking of a 2020-edition Intel 27" iMac or a 2021 M1?

mrbonner 21 hours ago

Is there a build that would allow me to run Llama 3.3 locally? Something around $2500 or below.

  • wmf 21 hours ago

    If you're talking about 3.3 70B Q4, any PC with 64 GB RAM could run it.

    • mrbonner 21 hours ago

      I can use an RTX card to speed up the inference. My budget is up to $2500.

      • wmf 17 hours ago

        Maybe you could run some layers on a 3090; I'm not sure how much speedup it would give.

kfcjligmom a day ago

Does it really take these things that long to tell me how many Rs are in strawberry?

  • the_sleaze_ a day ago

    Behind the internet bubble was the modern internet.

  • recursive 21 hours ago

    Sick burn bro

    • kfcjligmom 16 hours ago

      It's a genuine question. I haven't used AI before. The linked video is maybe the first time I've seen it in action. I'm underwhelmed.

      Is this at all a fair representation, or did it lock up or something? Or is it particularly bad at this type of question for some reason?

      Surely this can't be the mighty AI that has the whole world going bananas..?

      • recursive 16 hours ago

        I can't find the video, but I have an idea what's going on. The quality of the output from these things is very inconsistent. Sometimes it seems to have surprising "insight". Sometimes it's incoherent nonsense. People that want to be impressed cherry-pick the good results. People that expect things to "just work" notice the duds more. You can find very good and very bad results. If you're a starry-eyed technologist, you'll publicize the good outliers because of the potential they represent. If you're a skeptic, you'll point out the seemingly brain-dead failures.

Jotalea a day ago

My Ryzen 7 3700U is giving it all with the 7b model.

42772827 a day ago

Is DeepSeek R1 as fashioned here censored?

erichocean 21 hours ago

Unless I'm missing something, the full 1TB bandwidth isn't used because the memory layout is wrong.

But that's fixable.

Since it's memory bound, it might be possible to reach 15 tok/sec with this build.

  • r14c 20 hours ago

    I'm curious, what's wrong with the memory layout? You mean the ollama settings?

m3kw9 a day ago

This is more an experiment than practical use at the stated 6 tok/s. Paying 6 grand for that and days of setup when the next model may come in a month.

_giorgio_ a day ago

> the generation speed on this build is 6 to 8 tokens per second

> ...if you want Q8 you'll need >700GB of GPU memory, which will probably cost $100k+

  • alias_neo a day ago

    I clicked through because that $6000 price tag seemed insane, achievable even.

    Now it makes sense.

    Still undecided how I feel about having the ability to use all that quality in the full size model if one could only retrieve it at 6-8 tokens per second.

tantalor 21 hours ago

Off topic, but I'd be fine with not seeing direct links to Twitter on here anymore. It's not very useful or user friendly. Similar to Pinterest, Instagram, or TikTok.

Screenshots or mirrors (without the login requirement) are okay.

ForOldHack a day ago

$6000. Oh that's impressive considering my gaming PC cost all of $1100. What a deal! Can I get two? Or three? How about training? Is that free? Air? Is air free? Does the economic model (profit, profit and profit) depend on stealing published works? Is that intelligent?

Once the brain-dead greedy MBAs get involved, it's just a question of how much you can steal. It should all be sold short, as we watch the world burn.

  • culi a day ago

    Hope you're doing okay...

    $6k is much less than the millions that would be required to run anything by OpenAI. And it's a first pass. It could get much lower by the end of the year

ssahoo a day ago

I have been using deepseek-r1:1.5b/8b on a MacBook m1 pro max. The performance has been pretty good, on par with o1. So for the last 2 days I have been running them side by side. I'm satisfied with the results and performance. That's not-very-cheap $2k hardware, and it does the job.

  • kgwgk a day ago

    It does the job of running something which is derived from - but is not - the DeepSeek R1 model. They say the real thing is way better (as it should be).

  • jibbers 20 hours ago

    M1 Pro or M1 Max?