
Maybe try NAD+ boosters next? They seem to reduce addictive behaviors quite a bit: liposomal NAD+, nicotinamide riboside, or IV NAD+. The theory is that there is an energetic deficit in the brain which drugs/addictions override temporarily but deepen long-term, and NAD+ essentially restores that energy. Maybe GLPs do something similar by flooding the body with broken-down fat?

Do GLPs flood the body with broken down fat? I thought they just suppressed appetite and the like.

People overstate some of the secondary effects, but in a nutshell that’s more or less what they do.

Metabolic dysfunction is at the root of many diseases, and addiction is one of them.

Strix Halo has awful token prefill speed. Only suitable for very small contexts.

Basic datacenter technicians will be the new astronauts, swapping burnt CPUs and failed hard drives in space.

Strix Halo can only allocate 96GB of RAM to the GPU, so GPT-OSS 120B can be run at Q6 at best (and even then activations would need to be partially stored in CPU memory).

It can only use 96GB of RAM on Windows; on Linux people have allocated up to 120GB. Here's one source: https://www.reddit.com/r/LocalLLaMA/comments/1nmlluu/comment...

GPT-OSS 120B uses a native 4-bit representation, so it fits fine.
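
For a rough sense of why native 4-bit fits in a 96GB GPU allocation while Q6 doesn't, here's a back-of-envelope sketch (the parameter count and bits-per-weight figures are approximations, not exact numbers for GPT-OSS 120B):

    # Rough weight-memory estimate for a ~120B-parameter model at different
    # quantization levels. Real footprints also include the KV cache,
    # activations, and per-block scale/zero-point overhead.
    PARAMS = 120e9

    def weights_gb(bits_per_weight: float) -> float:
        return PARAMS * bits_per_weight / 8 / 1e9

    for name, bits in [("BF16", 16), ("Q8", 8.5), ("Q6", 6.5), ("MXFP4/Q4", 4.25)]:
        print(f"{name:>8}: ~{weights_gb(bits):.0f} GB of weights")

    # ~64 GB at 4-bit leaves headroom in a 96 GB allocation for the KV cache;
    # ~98 GB at Q6 already overflows it before activations are counted.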

I bet you're confusing VRAM (the old fixed carve-out) and GTT (dynamic) memory allocation. Linux amdgpu handles GTT just fine; amdgpu_top is one monitoring app that shows them separately.

More: https://news.ycombinator.com/item?id=44859582
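
If you don't want a full monitoring app, amdgpu also exposes the two pools through sysfs. A minimal sketch, assuming the usual mem_info_* nodes and that the iGPU is card0:

    from pathlib import Path

    # amdgpu reports VRAM (the fixed carve-out) and GTT (the dynamic
    # system-memory pool) separately; adjust card0 to whichever node is the iGPU.
    DEV = Path("/sys/class/drm/card0/device")

    def read_gib(name: str) -> float:
        return int((DEV / name).read_text()) / 2**30

    for pool in ("vram", "gtt"):
        used = read_gib(f"mem_info_{pool}_used")
        total = read_gib(f"mem_info_{pool}_total")
        print(f"{pool.upper():>4}: {used:.1f} / {total:.1f} GiB")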


>Strix Halo can only allocate 96GB RAM to the GPU.

Are you referring to exclusive or shared allocation? I think shared allocation allows using all available memory.


That's not really true; the latest autoregressive image models build a codebook of patches which are then encoded as tokens, and the image is assembled out of them.
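
For anyone unfamiliar with the approach, here's a toy sketch of the codebook step: a minimal nearest-neighbor vector quantizer with made-up codebook and patch sizes, not any particular model's tokenizer:

    import numpy as np

    # Toy VQ-style patch tokenizer: each patch embedding is mapped to the index
    # of its nearest codebook vector. Those indices are the "image tokens" an
    # autoregressive model predicts; decoding starts from the looked-up vectors.
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(1024, 64))   # 1024 code vectors, 64-dim
    patches = rng.normal(size=(256, 64))     # e.g. a 16x16 grid of patch embeddings

    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = dists.argmin(axis=1)            # shape (256,), ints in [0, 1024)

    reconstructed = codebook[tokens]         # what a decoder would reassemble from
    print(tokens[:8], reconstructed.shape)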

That won't work at elite schools like Stanford, where the average in a hard class can be something like 98% and 94% will get you a B+ because the curve is applied in the opposite direction.

I went to Stanford and that was absolutely not the case. I once got an A on a midterm with a 65%.

What I mentioned was the case in some hard CS classes I took there.

Wouldn't this restrict memory to 128GB, wasting the M3 Ultra's potential?


Blog author here. Actually, no. The model can be streamed into the DGX Spark, so we can run prefill for models much larger than 128GB (e.g. DeepSeek R1) on the DGX Spark. This feature is coming in EXO 1.0, which will be open-sourced soon™.
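
The general idea (an illustrative sketch of streaming prefill, not EXO's actual implementation) is that prefill only needs one layer's weights resident at a time, so weights can be streamed in from disk or another host while the activations stay on-device:

    import numpy as np

    # Illustrative streaming prefill: load each layer's weights just before use
    # and drop them afterwards, so the resident working set is one layer plus
    # activations instead of the full model. load_layer_weights is a stand-in
    # (hypothetical) for fetching a layer from disk or over the network.
    def load_layer_weights(i: int) -> np.ndarray:
        return np.random.default_rng(i).normal(size=(4096, 4096)).astype(np.float32)

    def streamed_prefill(prompt_embeddings: np.ndarray, n_layers: int) -> np.ndarray:
        hidden = prompt_embeddings
        for i in range(n_layers):
            w = load_layer_weights(i)      # stream in one layer
            hidden = np.tanh(hidden @ w)   # stand-in for the real transformer block
            del w                          # weights can be freed before the next layer
        return hidden

    out = streamed_prefill(np.zeros((128, 4096), dtype=np.float32), n_layers=4)
    print(out.shape)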

Excellent! Good luck!

M5 is supposed to support FP4 natively, which would explain the speedup on Q4-quantized models (quantized down from BF16).


DGX Spark is not for training, only for inference (FP4).


M3 Ultra has a slow GPU and no HW FP4 support, so its initial prompt processing (prefill) is going to be slow, practically unusable for 100k+ context sizes. For token generation, which is memory-bound, M3 Ultra would be much faster, but who wants to wait 15 minutes for the context to be processed? Spark will be much faster at initial token processing, giving you a much better time to first token, but then ~3x slower (273 vs 800 GB/s) in token-generation throughput. You need to decide which matters more to you. Strix Halo is IMO the worst of both worlds at the moment: it has the weakest specs in both dimensions and the least mature software stack.
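
A back-of-envelope comparison makes the trade-off concrete; the throughput figures below are made-up illustrative numbers, not measured benchmarks:

    # Prefill vs. decode time for a long prompt; tokens/s values are assumptions.
    PROMPT_TOKENS = 100_000
    OUTPUT_TOKENS = 2_000

    machines = {
        # (prefill tokens/s, decode tokens/s) -- hypothetical figures
        "DGX Spark": (1_500, 30),
        "M3 Ultra":  (110, 90),
    }

    for name, (prefill_tps, decode_tps) in machines.items():
        ttft = PROMPT_TOKENS / prefill_tps   # time to first token
        gen = OUTPUT_TOKENS / decode_tps     # time to generate the reply
        print(f"{name:>9}: TTFT ~{ttft/60:.1f} min, generation ~{gen/60:.1f} min")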


This is 100% the truth, and I am really puzzled to see people push Strix Halo so hard for local inference. For about $1200 more you can build a DDR5 + 5090 machine that will crush a Strix Halo on just about every MoE model (equal decode and 10-20x faster prefill for large models, and huge gaps for any MoE that fits in 32GB of VRAM). I'd also have a lot more confidence reselling a 5090 in the future than a Strix Halo machine.

