
All synchronous code is an illusion created in software, as is the very notion of "blocking". The CPU doesn't block for IO. An OS thread is a (scheduled) "stackful coroutine" implemented in the OS that gives the illusion of blocking where there is none.

The only problem is that the OS implements that illusion in a way that's rather costly, allowing only a relatively small number of threads (typically, you have no more than a few thousand frequently-active OS threads), while languages, which know more about how they use the stack, can offer the same illusion in a way that scales to a much larger number of concurrent operations. But there's really no more magic in how a language implements this than in how the OS implements it, and no more illusion. Both are broadly similar implementations of the same illusion. "Blocking" is always a software abstraction over machine operations that don't actually block.

The only question is how important it is for software to distinguish between the OS's implementation of this shared abstraction and the language's.

Unfortunately, the illusion of an OS thread relies on keeping a single consistent stack. Stackful coroutines (implemented on top of kernel threads) break this model in a way that has many detrimental effects; stackless ones do not.

It is true that some languages could run into difficulties due to idiosyncrasies of their implementation, but it's not an intrinsic difficulty. We've implemented virtual threads in the JVM, and we've used the same Thread API without issue.
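A minimal sketch of that point, assuming Java 21+: a virtual thread is created and joined through the same java.lang.Thread API as a platform thread (the class name here is just for illustration).

```java
// Sketch: virtual threads reuse the ordinary Thread API (Java 21+).
public class VirtualThreadDemo {

    // Run a task on a virtual thread and wait for its result.
    static String runOnVirtualThread() {
        StringBuilder result = new StringBuilder();
        Thread vt = Thread.ofVirtual().start(() -> {
            try {
                Thread.sleep(10); // "blocks" only the virtual thread; the carrier OS thread is freed
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            result.append("virtual=").append(Thread.currentThread().isVirtual());
        });
        try {
            vt.join(); // the same join() as for an OS-backed thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(runOnVirtualThread()); // prints "virtual=true"
    }
}
```

The only API-visible difference is the factory (`Thread.ofVirtual()` vs. `Thread.ofPlatform()`); blocking calls like `sleep` and `join` behave identically from the caller's point of view.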

Yep, the JVM's structured concurrency implementation is amazing. One thing I started wondering, though, especially when reading this post on HN, is whether stackless coroutines could have fit the JVM in some way to get even better performance for those who may care.

They wouldn't have had better performance, though. There is no significant performance penalty we're paying, although there's a nuance here that may be worth pointing out.

There are two different use cases for coroutines that may tempt implementors to address both with a single construct, but they are sufficiently different to warrant two separate implementations. One is the generator use case. What makes it special is that there are exactly two communicating parties, and the state of both may fit in the CPU cache. The other use case is general concurrency, primarily for IO. In that situation, a scheduler juggles a large number of user-mode threads, and because of that, there is likely a cache miss on every context switch, no matter how efficient the switch itself is. However, in the second case, almost all of the performance comes from Little's law rather than context-switch time (see my explanation here: https://inside.java/2020/08/07/loom-performance/).
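To make the Little's law point concrete, here is a tiny worked calculation (the numbers are hypothetical, chosen only for illustration): L = λ · W says that sustaining a throughput λ with per-request latency W requires λ · W requests in flight at once.

```java
// Little's law: L = lambda * W.
// L      = average number of requests in flight (required concurrency)
// lambda = throughput (requests per second)
// W      = average time each request spends in the system (seconds)
public class LittlesLaw {

    static double concurrencyNeeded(double throughputPerSec, double latencySec) {
        return throughputPerSec * latencySec; // L = lambda * W
    }

    public static void main(String[] args) {
        // Hypothetical server: 50,000 req/s, each spending 200 ms mostly waiting on IO.
        double inFlight = concurrencyNeeded(50_000, 0.2);
        System.out.println(inFlight); // 10,000 concurrent requests in flight
    }
}
```

With numbers like these, a few thousand OS threads become the throughput ceiling regardless of how fast context switches are, which is why raising the number of cheap user-mode threads matters far more than shaving switch time.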

That means that a "stackful" implementation of user-mode threads can have no significant performance penalty for the second use case (which, BTW, I think has much more value than the first), even though a more performant implementation is possible for the first use case. In Java we decided to tackle the second use case with virtual threads, and so far we've not offered something for the first (for which the demand is significantly lower).

What happens in languages that choose to tackle both use cases with the same construct is that they gain negligible performance in the second use case (at best), but they're paying for that negligible benefit with a substantial degradation in user experience. That's just a bad tradeoff, but some languages (especially low-level ones) may have little choice: in those languages a stackful solution does carry a significant performance cost that it doesn't in Java, thanks to Java's very efficient heap memory management.


The OS allocates your thread stack in a very similar way to how a coroutine runtime allocates the coroutine stack. The OS will swap the stack pointer and a bunch more things on each context switch; the coroutine runtime will also swap the stack pointer and some other things. It's really the same thing. The only difference is that the runtime in a compiled language knows more about your code than the OS does, so it can make assumptions the OS can't, and that's what makes user-space coroutines lighter. The mechanisms are the same.

And the stackless runtime will use some other register than the stack pointer to access the coroutine's activation frame, leaving the stack pointer register free for OS and library use, and avoiding the many drawbacks of fiddling with the system stack as stackful coroutines do. It's the same thing.
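A hedged sketch of what a stackless transform roughly produces, written as hand-rolled Java (Java has no such transform; the class and field names are invented for illustration): the "activation frame" (hoisted locals plus a resume point) lives in a heap object, and each resume is a plain method call that never touches the system stack between suspensions.

```java
// Hand-written equivalent of a stackless coroutine that yields n, n-1, ..., 1.
// The frame (state + locals) is a heap object; resume() is an ordinary call,
// so the stack pointer is left alone between resumptions.
public class CountdownCoroutine {
    private int state = 0; // resume point of the state machine
    private int n;         // a "local variable" hoisted into the frame

    CountdownCoroutine(int n) { this.n = n; }

    // Resumes the coroutine; returns the next yielded value, or -1 when done.
    int resume() {
        switch (state) {
            case 0:
                state = 1;
                return n;     // first "yield"
            case 1:
                if (n > 1) {
                    n--;
                    return n; // subsequent "yields"
                }
                state = 2;
                return -1;    // completed
            default:
                return -1;    // already completed
        }
    }
}
```

Usage: `new CountdownCoroutine(3)` yields 3, 2, 1 across successive `resume()` calls, then -1; a compiler performing the transform generates exactly this kind of switch-on-state dispatch from straight-line yielding code.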
