> I’m actually not sure if Zig’s async design even uses hardware call/return pai...

loeg · 2025-10-27T06:12:06 1761545526

Even so. You're talking about storing and loading at least ~16 8-byte registers, including the instruction pointer which is essentially a jump. Even to L1 that takes some time; more than a simple function call (jump + pushed return address).

lukaslalinsky · 2025-10-27T06:17:33 1761545853

Only stack and instruction pointer are explicitly restored. The rest is handled by the compiler, instead of depending on the C calling convention, it can avoid having things in registers during yield.

See this for more details on how stackful coroutines can be made much faster:

https://photonlibos.github.io/blog/stackful-coroutine-made-f...

messe · 2025-10-27T06:39:08 1761547148

> The rest is handled by the compiler, instead of depending on the C calling convention, it can avoid having things in registers during yield.

Yep, the frame pointer as well if you're using it. This is exactly how its implemented in user-space in Zig's WIP std.Io branch green-threading implementation: https://github.com/ziglang/zig/blob/ce704963037fed60a30fd9d4...

On ARM64, only fp, sp and pc are explicitly restored; and on x86_64 only rbp, rsp, and rip. For everything else, the compiler is just informed that the registers will be clobbered by the call, so it can optimize allocation to avoid having to save/restore them from the stack when it can.

flimflamm · 2025-10-27T07:34:07 1761550447

Is this just buttering the cost of switches by crippling the optimization options compiler have?

lukaslalinsky · 2025-10-27T07:56:55 1761551815

If this was done the classical C way, you would always have to stack-save a number of registers, even if they are not really needed. The only difference here is that the compiler will do the save for you, in whatever way fits the context best. Sometimes it will stack-save, sometimes it will decide to use a different option. It's always strictly better than explicitly saving/restoring N registers unaware of the context. Keep in mind, that in Zig, the compiler always knows the entire code base. It does not work on object/function boundaries. That leads to better optimizations.

hawk_ · 2025-10-27T08:28:29 1761553709

This is amazing to me that you can do this in Zig code directly as opposed to messing with the compiler.

lukaslalinsky · 2025-10-27T09:13:43 1761556423

See https://github.com/alibaba/PhotonLibOS/blob/2fb4e979a4913e68... for GNU C++ example. It's a tiny bit more limited, because of how the compilation works, but the concept is the same.

messe · 2025-10-27T08:45:21 1761554721

To be fair, this can be done in GNU C as well. Like the Zig implementation, you'd still have to use inline assembly.

hawk_ · 2025-10-27T09:07:47 1761556067

> If this was done the classical C way, you would always have to stack-save a number of registers

I see, so you're saying that GCC can be coaxed into gathering only the relevant registers to stack and unstack not blindly do all of them?

messe · 2025-10-27T12:22:46 1761567766

Yes, you write inline assembly that saves the frame pointer, stack pointer, and instruction pointer to the stack, and list every other register as a clobber. GCC will know which ones its using at the call-site (assuming the function gets inlined; this is more likely in Zig due to its single unit of compilation model), and save those to the stack. If it doesn't get inlined, it'll be treated as any other C function and only save the ones needed to be preserved by the target ABI.

GoblinSlayer · 2025-10-27T09:34:09 1761557649

I wonder how you see it. Stackful coroutines switch context on syscall in the top stack frame, the deeper frames are regular optimized code, but syscall/sysret is already big context switch. And read/epoll loop has exactly same structure, the point of async programming isn't optimization of computation, but optimization of memory consumption. Performance is determined by features and design (and Electron).

hawk_ · 2025-10-27T08:03:55 1761552235

What do you mean by "buttering the cost of switches", can you elaborate? (I am trying to learn about this topic)

masfuerte · 2025-10-27T10:06:35 1761559595

I think it is

> buttering the cost of switches [over the whole execution time]

The switches get cheaper but the rest of the code gets slower (because it has less flexibility in register allocation) so the cost of the switches is "buttered" (i.e. smeared) over the rest of the execution time.

But I don't think this argument holds water. The surrounding code can use whatever registers it wants. In the worst case it saves and restores all of them, which is what a standard context switch does anyway. In other words, this can be better and is never worse.

ori_b · 2025-10-27T17:19:40 1761585580

Which, with store forwarding, can be shockingly cheap. You may not actually be hitting L1, and if you are, you're probably not hitting it synchronously.

https://easyperf.net/blog/2018/03/09/Store-forwarding

and, section 15.10 of https://www.agner.org/optimize/microarchitecture.pdf

loeg · 2025-10-27T19:59:12 1761595152

Are you talking about context switching every handful of cycles? This is going to be extremely inefficient even with store forwarding.

ori_b · 2025-10-28T09:48:16 1761644896

Sure, and so is calling a function every handful of cycles. That's a big part of why compilers inline.

Either you're context switching often enough that store forwarding helps, or you're not spending a lot of time context switching. Either way, I would expect that you aren't waiting on L1: you put the write into a queue and move on.