Yes, I know; I became another economist talking about AI.1 I’ll try to keep it short.

Alex Imas recently posted that we’ve been living through a period of abundant AI and should get ready for compute scarcity, rationing, multi-tiered markets, and new token allocation infrastructure.


I really don’t think we’ll run into a compute scarcity issue anymore2 but I was worried about this too until ~12 months ago; it was clear that the big labs wouldn’t be able to keep burning cash3 and there weren’t really comparable alternatives. There is still no sign of the frontier labs becoming profitable at current prices, so if we had stayed on that path, we would have run into some problem in the near future.

Since then, open weight models have been improving steadily. There were already open weight models back in 2023-24, but they were either too large to run on consumer GPUs or not good enough to be really useful for coding (in my experience). The gap between the best proprietary model and the best open weight model is generally believed to be 3-6 months4, so, not a big deal. However, since last year, small models that we can run on our own computers have also become serious competitors, especially for coding tasks, one of the more obvious and popular LLM use cases.

The tooling is still not as good as Claude Code (although there are open source forks), so they are not as pleasant to use yet. But it turns out that, for daily coding tasks, we don’t really need trillions of parameters. Qwen’s much smaller models work just fine, even if you need to be a bit more involved (which is not necessarily a bad thing).

This is just the coding side though. If you just want to ask a model about anything, you currently need either a large model with a good knowledge base or at least very good search capability. There are open source improvements in the search space too; just yesterday Chroma released a new model trained on agentic search tasks.5 Unless we can somehow compress large model capabilities fully into small models, I can see two directions from here: either lots of small specialized models, or small models that are very comfortable doing research. Both would be slower to work with, but then again, they can be free or cheap, and we can set them up to do most of the work when we are not looking. For many tasks, cheaper and slower is a perfectly acceptable trade. Besides, I am sure the open weight models, the GPUs, and the open source ecosystem that runs these models on those GPUs will keep getting better, at least for a while.

I tried Qwen, and even on my (personal) computer, which is fairly old at this point, it runs at an okay tokens-per-second and the quality is different but not bad. I basically use these models only for weekend projects, mostly to see how well they perform, and I am really impressed. To be clear, given the choice, I obviously prefer Claude, but I think it is mostly the tooling that makes it worth it for now. Claude Code has other short-term advantages too: (i) getting used to a particular model’s behavior and being able to predict it, and (ii) just more general knowledge. But for coding specifically, I think the gap is much smaller than in other areas (including economics).

So, for coding, it might make sense to buy local GPUs or a Mac Studio/Mini6, and I think in the long run the scarcity problem will often be solved on the consumer side. I only talked about coding above, but I also use local models for OCR, translation, etc., and they are amazing; Qwen-3.5-27b in particular works very well on my local GPU.
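Whether buying hardware beats paying for tokens is, in the end, a break-even calculation. Here is a minimal sketch; the hardware cost, power bill, and API spend below are made-up illustrative numbers, not estimates of anyone’s actual costs:

```python
# Back-of-envelope: when does buying local hardware pay for itself
# versus renting tokens from an API? All numbers are illustrative
# assumptions, not real price data.

hardware_cost = 2000.0        # hypothetical GPU or Mac Mini price, USD
extra_power_per_month = 10.0  # hypothetical added electricity cost, USD
api_bill_per_month = 100.0    # hypothetical API spend it would replace

monthly_savings = api_bill_per_month - extra_power_per_month
breakeven_months = hardware_cost / monthly_savings

print(f"break-even after ~{breakeven_months:.1f} months")  # ~22.2 months
```

The real comparison also depends on the quality gap and on API prices staying where they are, which, as the footnote on lab losses suggests, is not guaranteed.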

By the way, despite all its benefits, local inference is inefficient. Compared to the parallel processing and optimization that happens in the cloud,7 we run local models on our GPUs wastefully. This increases power usage, and it also means millions of GPUs sit idle or asleep most of the time. Given the inefficiency, you’d expect someone to start clearing the market. This is already happening to some extent: you can rent out your GPU when you are not actively using it, or you can use third party APIs to do inference with smaller models. However, I think the efficiency gap is still fairly large. For most people, I suspect that using third party APIs for inference on open weight models wouldn’t make a lot of sense: you get the open weight quality, you still have no privacy, and the price is not that great. But we’ve spent a few decades designing “centralized decentralized marketplaces” (Uber, Airbnb, etc.), so we’ll see how the market plays out over the next few years.
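To see why batching (footnote 7) drives the efficiency gap, a back-of-envelope comparison helps. The wattages and query rates here are made-up illustrative numbers, chosen only so the resulting ratio lands inside the 1.4x-7.4x range the Stanford study reports:

```python
# Illustrative sketch of the batching advantage: a cloud GPU draws more
# power but amortizes it over far more queries. All figures are
# assumptions for illustration, not measurements.

def energy_per_query_wh(gpu_watts: float, queries_per_hour: float) -> float:
    """Average watt-hours per query for a GPU that is powered on
    continuously and serves `queries_per_hour` queries."""
    return gpu_watts / queries_per_hour

# Single-user local GPU: modest power draw, few queries per hour.
local_wh = energy_per_query_wh(gpu_watts=350, queries_per_hour=100)

# Cloud GPU: higher power draw, but batched requests from many users.
cloud_wh = energy_per_query_wh(gpu_watts=700, queries_per_hour=1000)

print(f"local: {local_wh:.2f} Wh/query")               # 3.50 Wh/query
print(f"cloud: {cloud_wh:.2f} Wh/query")               # 0.70 Wh/query
print(f"cloud advantage: {local_wh / cloud_wh:.1f}x")  # 5.0x
```

The naive case the footnote worries about, a GPU that stays powered while serving almost no queries, pushes the local number, and hence the gap, far higher.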

I also didn’t even talk about edge computing, but once the RAM market adjusts production and pricing, and the cellphone market in turn responds to the demand for on-device computation, I think we’ll also be doing a lot of computation on our mobile devices.

So the scarcity scenarios assume that the frontier labs are the only source of tokens, but even refrigerators, toasters, and microwaves are doing their best to prove that wrong.


  1. I’ve been somewhat unintentionally maintaining my anonymity on this blog, especially if you ignore the first post, and now I’ve outed myself as an economist. Oh well. ↩︎

  2. Well, if the war continues and energy prices keep increasing, we could see some compute scarcity due to energy scarcity. But that’s a geopolitical problem, not related to AI or the labs, or anything like that. ↩︎

  3. They still can’t, really. OpenAI is projecting $14 billion in losses for this year alone. Anthropic is growing revenue faster than anyone but also spending roughly what it earns. If anything, this makes the case for open alternatives stronger, not weaker; the current API pricing is subsidized by venture capital and there is no guarantee it stays this low. ↩︎

  4. This is actually true for the larger models but a bit optimistic for the ones that actually fit on your GPU. Epoch AI puts the top-line number at ~3 months; for models you can run at home, it’s more like 6-12 months depending on what you’re doing with them. But the consumer-tier gap is shrinking faster than the top-line gap, if you follow the benchmarks (although the only benchmark I trust at this point is my own experience, or that of a handful of people). ↩︎

  5. Context-1. I haven’t had the chance to try it yet but the idea of a retrieval model specifically designed for agentic workflows is exactly the kind of thing that could make small local models much more useful for non-coding tasks. ↩︎

  6. As much as I hate to say this, the Mac Mini (or Studio) with 64GB unified memory is probably the most interesting option right now. It draws about 30-40 watts running a 32B model, roughly what an old-style lightbulb uses, compared to the space-heater experience of running an RTX 5090. (By the way, I wouldn’t exchange my RTX 5090 for even 256GB of unified memory on a Mac; I love my GPU, as it can train much, much faster than a Mac. But I wouldn’t mind having another machine where I can run larger models, since GPU VRAM is a hard limit in the NVIDIA world, at least for now.) ↩︎

  7. A Stanford study quantified this: cloud inference is somewhere between 1.4x and 7.4x more energy-efficient per query, depending on the hardware. I suspect that in the case of realistic, naive usage of local models by less technical users, the efficiency gap is much, much larger. Most of the advantage comes from batching: when you serve many users at once, the cost of loading the model into memory gets amortized, as does the fixed cost of keeping the GPU running. Your personal GPU, on the other hand, sits idle most of the day, doing nothing for anyone. ↩︎