
All synchronous code is an illusion created in software, as is the very notion of "blocking". The CPU doesn't block for IO. An OS thread is a (scheduled) "stackful coroutine" implemented in the OS that gives the illusion of blocking where there is none.

The only problem is that the OS implements that illusion in a way that's rather costly, allowing only a relatively small number of threads (typically, you have no more than a few thousand frequently-active OS threads), while languages, which know more about how they use the stack, can offer the same illusion in a way that scales to a much larger number of concurrent operations. But there's really no more magic in how a language implements this than in how the OS implements it, and no more illusion. Both are broadly similar implementations of the same illusion. "Blocking" is always a software abstraction over machine operations that don't actually block.

The only question is how important it is for software to distinguish between the OS's implementation of this shared abstraction and the language's.

Unfortunately, the illusion of an OS thread relies on keeping a single consistent stack. Stackful coroutines (implemented on top of kernel threads) break this model in a way that has many detrimental effects; stackless ones do not.

It is true that some languages could run into difficulties due to idiosyncrasies of their implementation, but it's not an intrinsic difficulty. We've implemented virtual threads in the JVM, and we've used the same Thread API without issue.
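A minimal sketch of that point, assuming Java 21+: a virtual thread is created and joined through the same java.lang.Thread API as a platform thread (the class name here is just for illustration).

```java
// Sketch: virtual threads reuse the ordinary Thread API (Java 21+).
public class VirtualThreadDemo {

    // Run a task on a virtual thread and wait for its result.
    static String runOnVirtualThread() {
        StringBuilder result = new StringBuilder();
        Thread vt = Thread.ofVirtual().start(() -> {
            try {
                Thread.sleep(10); // "blocks" only the virtual thread; the carrier OS thread is freed
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            result.append("virtual=").append(Thread.currentThread().isVirtual());
        });
        try {
            vt.join(); // the same join() as for an OS-backed thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(runOnVirtualThread()); // prints "virtual=true"
    }
}
```

The only API-visible difference is the factory (`Thread.ofVirtual()` vs. `Thread.ofPlatform()`); blocking calls like `sleep` and `join` behave identically from the caller's point of view.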

Yep, the JVM's structured concurrency implementation is amazing. One thing I started wondering, though, especially when reading this post on HN, is whether stackless coroutines could have fit the JVM in some way to get even better performance for those who may care.

They wouldn't have had better performance, though. There is no significant performance penalty we're paying, although there's a nuance here that may be worth pointing out.

There are two different use cases for coroutines that may tempt implementors to address both with a single construct, but they are sufficiently different to warrant two separate implementations. One is the generator use case. What makes it special is that there are exactly two communicating parties, and the state of both may fit in the CPU cache. The other use case is general concurrency, primarily for IO. In that situation, a scheduler juggles a large number of user-mode threads, and because of that, there is likely a cache miss on every context switch, no matter how efficient the switch itself is. However, in the second case, almost all of the performance comes from Little's law rather than context-switch time (see my explanation here: https://inside.java/2020/08/07/loom-performance/).
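To make the Little's law point concrete, here is a tiny worked calculation (the numbers are hypothetical, chosen only for illustration): L = λ · W says that sustaining a throughput λ with per-request latency W requires λ · W requests in flight at once.

```java
// Little's law: L = lambda * W.
// L      = average number of requests in flight (required concurrency)
// lambda = throughput (requests per second)
// W      = average time each request spends in the system (seconds)
public class LittlesLaw {

    static double concurrencyNeeded(double throughputPerSec, double latencySec) {
        return throughputPerSec * latencySec; // L = lambda * W
    }

    public static void main(String[] args) {
        // Hypothetical server: 50,000 req/s, each spending 200 ms mostly waiting on IO.
        double inFlight = concurrencyNeeded(50_000, 0.2);
        System.out.println(inFlight); // 10,000 concurrent requests in flight
    }
}
```

With numbers like these, a few thousand OS threads become the throughput ceiling regardless of how fast context switches are, which is why raising the number of cheap user-mode threads matters far more than shaving switch time.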

That means that a "stackful" implementation of user-mode threads can have no significant performance penalty for the second use case (which, BTW, I think has much more value than the first), even though a more performant implementation is possible for the first use case. In Java we decided to tackle the second use case with virtual threads, and so far we've not offered something for the first (for which the demand is significantly lower).

What happens in languages that choose to tackle both use cases with the same construct is that they gain negligible performance in the second use case (at best), but they're paying for that negligible benefit with a substantial degradation in user experience. That's just a bad tradeoff, but some languages (especially low-level ones) may have little choice: in those languages a stackful solution does carry a significant performance cost that it doesn't in Java, thanks to Java's very efficient heap memory management.


The OS allocates your thread stack in a very similar way to how a coroutine runtime allocates the coroutine stack. The OS will swap the stack pointer and a bunch more things on each context switch; the coroutine runtime will also swap the stack pointer and some other things. It's really the same thing. The only difference is that the runtime in a compiled language knows more about your code than the OS does, so it can make assumptions the OS can't, and that's what makes user-space coroutines lighter. The mechanisms are the same.

And the stackless runtime will use some other register than the stack pointer to access the coroutine's activation frame, leaving the stack pointer register free for OS and library use, and avoiding the many drawbacks of fiddling with the system stack as stackful coroutines do. It's the same thing.
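A hedged sketch of what a stackless transform roughly produces, written as hand-rolled Java (Java has no such transform; the class and field names are invented for illustration): the "activation frame" (hoisted locals plus a resume point) lives in a heap object, and each resume is a plain method call that never touches the system stack between suspensions.

```java
// Hand-written equivalent of a stackless coroutine that yields n, n-1, ..., 1.
// The frame (state + locals) is a heap object; resume() is an ordinary call,
// so the stack pointer is left alone between resumptions.
public class CountdownCoroutine {
    private int state = 0; // resume point of the state machine
    private int n;         // a "local variable" hoisted into the frame

    CountdownCoroutine(int n) { this.n = n; }

    // Resumes the coroutine; returns the next yielded value, or -1 when done.
    int resume() {
        switch (state) {
            case 0:
                state = 1;
                return n;     // first "yield"
            case 1:
                if (n > 1) {
                    n--;
                    return n; // subsequent "yields"
                }
                state = 2;
                return -1;    // completed
            default:
                return -1;    // already completed
        }
    }
}
```

Usage: `new CountdownCoroutine(3)` yields 3, 2, 1 across successive `resume()` calls, then -1; a compiler performing the transform generates exactly this kind of switch-on-state dispatch from straight-line yielding code.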
