
Detecting that the service went down is easy. Notifying every proxy in the fleet that it's down is not. Every proxy in the fleet cannot directly probe every application on the platform.




I believe it is possible within Envoy to detect a bad backend and automatically remove it from the load-balancing pool, so why can't each proxy determine on its own that certain backend instances are unavailable and remove them from its pool? No coordination needed, and it also handles other cases where the backend is bad, such as overload or deadlock.
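To make that concrete, here's a toy sketch of what I mean by per-proxy ejection: each proxy probes its own backend list and drops failures from its local pool, with no coordination at all. The addresses, health endpoint, and interval are made up for illustration; this isn't Envoy's actual mechanism.

    // Sketch of per-proxy health checking: each proxy probes its own backends
    // and ejects failures from its local pool, no coordination required.
    // Addresses, endpoint, and timing are invented for illustration.
    package main

    import (
        "net/http"
        "sync"
        "time"
    )

    type Pool struct {
        mu      sync.Mutex
        healthy map[string]bool // backend address -> currently in rotation?
    }

    func (p *Pool) probe(addr string) {
        client := &http.Client{Timeout: 2 * time.Second}
        resp, err := client.Get("http://" + addr + "/healthz")
        ok := err == nil && resp.StatusCode < 500
        if resp != nil {
            resp.Body.Close()
        }
        p.mu.Lock()
        p.healthy[addr] = ok // eject on failure, restore on recovery
        p.mu.Unlock()
    }

    func (p *Pool) run(backends []string) {
        for {
            for _, addr := range backends {
                go p.probe(addr)
            }
            time.Sleep(5 * time.Second)
        }
    }

    func main() {
        p := &Pool{healthy: map[string]bool{}}
        p.run([]string{"10.0.0.1:8080", "10.0.0.2:8080"}) // hypothetical backends
    }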

It also seems like part of your pain point is that there is an any-to-any relationship between proxy and backend, but that doesn't necessarily need to be the case: a cell-based architecture with shuffle sharding of backends between cells can help alleviate that fundamental pain. Part of the advantage is that config and code changes can then be rolled out cell by cell, which is much safer: if your code or config causes a fault in a cell, it only affects a subset of your infrastructure. And if you did the shuffle sharding correctly, a single cell going down should have a negligible effect.


Ok, again: this isn't a cluster of load balancers in front of a discrete collection of app servers in a data center. It's thousands of load balancers handling millions of applications scattered all over the world, with instances going up and down constantly.

The interesting part of this problem isn't noticing that an instance is down. Any load balancer can do that. The interesting problem is noticing that, and then informing every proxy in the world.

I feel like a lot of what's happening in these threads is people using a mental model that they'd use for hosting one application globally, or, if not one, then a collection of applications they manage. These are customer applications. We can't assume anything about their request semantics.


> The interesting problem is noticing that, and then informing every proxy in the world.

Yes, and that is why I pointed out that the any-to-any relationship of proxy to application is a decision you made, and part of the pain point that led you to this solution. The fact that any proxy box can proxy to any backend is a choice, and that choice created the structure and mental model you are working within. You could batch your proxies into, say, 1024 cells and then assign each customer app to, say, 4 of those 1024 cells using shuffle sharding. That decomposes the problem into maintaining state within a cell instead of globally.
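As a toy sketch of the kind of assignment I mean (the cell count, hashing scheme, and app ids are all invented for illustration, not anything Fly actually does):

    // Toy shuffle-shard assignment: deterministically map each app to 4 of
    // 1024 proxy cells. Constants and app ids are made up for illustration.
    package main

    import (
        "fmt"
        "hash/fnv"
    )

    const (
        numCells  = 1024
        shardSize = 4
    )

    // shardFor picks shardSize distinct cells for an app, seeded by its id,
    // so each app always lands on the same small slice of the proxy fleet.
    func shardFor(appID string) []int {
        h := fnv.New64a()
        h.Write([]byte(appID))
        seed := h.Sum64()

        cells := make([]int, 0, shardSize)
        seen := map[int]bool{}
        for len(cells) < shardSize {
            seed = seed*6364136223846793005 + 1442695040888963407 // LCG step
            c := int(seed % numCells)
            if !seen[c] {
                seen[c] = true
                cells = append(cells, c)
            }
        }
        return cells
    }

    func main() {
        fmt.Println(shardFor("app-chicago-1")) // four stable cell indices
        fmt.Println(shardFor("app-sydney-2"))  // usually a disjoint set
    }

Because the assignment is deterministic, any proxy can recompute an app's cells without asking a central service, and two apps picked at random will usually share no cells at all.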

I'm not saying what you did was wrong or dumb; I am saying you are working within a framework that maybe you are not even consciously aware of.


Again: it's the premise of the platform. If you're saying "you picked a hard problem to work on", I guess I agree.

We cannot in fact assign our customers' apps to 0.3% of our proxies! When you deploy an app in Chicago on Fly.io, it has to work from a Sydney edge. I mean, that's part of the DX; there are deeper reasons why it would have to work that way (due to BGP4), but we don't even get there before becoming a different platform.


I think the impedance mismatch here is that I am assuming we are talking about a hyperscaler cloud, where it would be reasonable to have, say, 1024 proxies per region. Each app would be assigned to 4 of those 1024 proxies in each region.

I have no idea how big fly.io's compute footprint is, and maybe because of that the design I am suggesting makes no sense for you.


The design you are suggesting makes no sense for us. That's OK! It's an interesting conversation. But no, you can't fix the problem we're trying to solve with shuffle shard.

Out of curiosity, what’s your upper bound latency SLO for propagating this state? (I assume this actually conforms to a percentile histogram and isn’t a single value.)

(Hopping in here because the discussion is interesting... feel very free to ignore.)

Thanks for writing this up! It was a very interesting read about a part of networking that I don't get to seriously touch.

That said: I'm sure you guys have thought about this a lot and that I'm just missing something, but "why can't every proxy probe every [worker, not application]?" was exactly one of the questions I had while reading.

Having the workers be the source of truth about applications is a nicely resilient design, and brute-forcing the problem by having, say, 10k proxies each retrieve the state of 10k workers every second... may not be obviously impossible? Somewhat similar to sending/serving 10k DNS requests/s/worker? That's not trivial, but maybe not _that_ hard? (You've been working on modern Linux servers a lot more than I have, but I'm thinking of e.g. https://blog.cloudflare.com/how-to-receive-a-million-packets...)

I did notice the sentence about "saturating our uplinks", but... assuming 1KB=8Kb of compressed critical state per worker, you'd end up with a peak bandwidth demand of about 80 Mbps of data per worker / per proxy; that may not be obviously impossible? (One could reduce _average_ bandwidth a lot by having the proxies mostly send some kind of "send changes since <...>" or "send all data unless its hash is <...>" query.)
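Roughly, with those same assumed numbers (10k proxies, 10k workers, ~1 KB of compressed state per worker, refreshed every second), the back-of-the-envelope works out like this; the per-node figures look survivable, and it's the aggregate mesh traffic that gets large:

    // Back-of-the-envelope check of the full-mesh probe numbers above.
    // All figures (10k proxies, 10k workers, 1 KB state, 1 s interval) are
    // the assumptions from this comment, not measurements of anything real.
    package main

    import "fmt"

    func main() {
        const (
            proxies      = 10_000
            workers      = 10_000
            stateBytes   = 1_000 // ~1 KB of compressed state per worker
            intervalSecs = 1.0
        )

        bitsPerTransfer := float64(stateBytes * 8)
        perProxyMbps := bitsPerTransfer * workers / intervalSecs / 1e6  // inbound per proxy
        perWorkerMbps := bitsPerTransfer * proxies / intervalSecs / 1e6 // outbound per worker
        fleetGbps := perProxyMbps * proxies / 1e3                       // aggregate across the mesh

        fmt.Printf("per proxy:  %.0f Mbps in\n", perProxyMbps)   // 80 Mbps
        fmt.Printf("per worker: %.0f Mbps out\n", perWorkerMbps) // 80 Mbps
        fmt.Printf("whole mesh: %.0f Gbps\n", fleetGbps)         // 800 Gbps
    }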

(Obviously, bruteforcing the routing table does not get you out of doing _something_ more clever than that to tell the proxies about new workers joining/leaving the pool, and probably a hundred other tasks that I'm missing; but, as you imply, not all tasks are equally timing-critical.)

The other question I had while reading was why you need one failure/replication domain (originally one global; soon, one per region); if you shard worker state over 100 gossip (SWIM/Corrosion) instances, your proxies obviously do need to join every sharded instance to build the global routing table - but bugs in replication per se should only take down 1/100th of your fleet, which would hit fewer customers (and, depending on the exact bug, may mean that customers with some redundancy and/or autoscaling stay up). This wouldn't have helped in your exact case - perfectly replicating something that takes down your proxies - but might make a crash-stop of your consensus-ish protocol more tolerable?
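As a sketch of what I mean by sharding the replication domain (the shard count and worker names are invented for illustration):

    // Sketch of the "shard the gossip domain" idea: workers hash into one of
    // 100 independent replication groups, each group runs its own gossip
    // instance, and every proxy merges all 100 views into one routing table.
    // A replication bug then corrupts roughly 1% of the table, not all of it.
    package main

    import (
        "fmt"
        "hash/fnv"
    )

    const shards = 100

    func shardOf(workerID string) int {
        h := fnv.New32a()
        h.Write([]byte(workerID))
        return int(h.Sum32() % shards)
    }

    func main() {
        for _, w := range []string{"worker-ord-42", "worker-syd-7", "worker-fra-3"} {
            fmt.Printf("%s -> replication group %d of %d\n", w, shardOf(w), shards)
        }
    }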

Both of the questions above might lead to a less convenient programming model, which might be enough reason on its own to scupper them; an article isn't necessarily improved by discussing every possible alternative; and again, I'm sure you guys have thought about this a lot more than I did (and/or that I got a couple of things embarrassingly wrong). But, well, if you happen to be willing to entertain my questions I would appreciate it!


(I used to work at Fly, specifically on the proxy, so my info may be slightly out of date, but I've spent a lot of time thinking about this stuff.)

> why can't every proxy probe every [worker, not application]?

There are several divergent issues with this approach (though it can have its place). First, you still need _some_ service discovery to tell you where the nodes are, though it's easy to assume this can be solved via some Consul-esque system. Secondly, there is a lot more data at play here than you might think. A single proxy/host might have many thousands of VMs under its purview. That works out to a lot of data. As you point out, there are ways to solve this:

> One could reduce _average_ bandwidth a lot by having the proxies mostly send some kind of "send changes since <...>" or "send all data unless its hash is <...>" query.

This is definitely an improvement. But we have a new issue. Let's say I have proxies A, B, and C. A and C lose connectivity. Optimally (and in fact Fly has several mechanisms for this) A could send its traffic to C via B. But in this case A might not even know that there is a VM candidate on C at all! It wasn't able to sync data for a while.

There are ways to solve this! We could make it possible for proxies to relay each other's state. To recap:

- We have workers that poll each other
- They exchange diffs rather than the full state
- The state diffs can be relayed by other proxies

We have in practice invented something quite close to a gossip protocol! If we continued drawing the rest of the owl, you might end up with something like SWIM.
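Concretely, the shape of that reinvented protocol might look like this toy sketch: versioned entries, "changes since version N" pulls, and an idempotent merge so a diff relayed through another proxy is as good as one fetched directly. This is not Corrosion's or SWIM's actual design, just the outline of the idea:

    // Minimal sketch of the diff-and-relay scheme recapped above. Names and
    // the data model are invented for illustration.
    package main

    import "fmt"

    type Entry struct {
        Value   string
        Version uint64 // per-key version; highest version wins
    }

    type Node struct {
        Name  string
        State map[string]Entry
    }

    // DiffSince returns every entry newer than the requester's last-seen version.
    func (n *Node) DiffSince(since uint64) map[string]Entry {
        out := map[string]Entry{}
        for k, e := range n.State {
            if e.Version > since {
                out[k] = e
            }
        }
        return out
    }

    // Apply merges a diff, keeping the newer version of each key. Because the
    // merge is idempotent and commutative, a diff relayed through B from C is
    // as good as one fetched from C directly.
    func (n *Node) Apply(diff map[string]Entry) {
        for k, e := range diff {
            if cur, ok := n.State[k]; !ok || e.Version > cur.Version {
                n.State[k] = e
            }
        }
    }

    func main() {
        a := &Node{Name: "proxy-a", State: map[string]Entry{}}
        b := &Node{Name: "proxy-b", State: map[string]Entry{}}
        c := &Node{Name: "worker-c", State: map[string]Entry{
            "app-123": {Value: "instance on worker-c", Version: 7},
        }}

        // A cannot reach C, but B can: B syncs with C, then A syncs with B.
        b.Apply(c.DiffSince(0))
        a.Apply(b.DiffSince(0))
        fmt.Println(a.State["app-123"]) // A learned about C's VM via B
    }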

As for your second question, I think you kinda got it exactly right. A crash of a single Corrosion instance does not generally affect anything else. But if something bad is replicated, or there is a gossip storm, isolating that failure is important.


Thanks a lot for your response!

Hold up, I sniped Dov into answering this instead of me. :)


