(I used to work at Fly, specifically on the proxy, so my info may be slightly out of date, but I've spent a lot of time thinking about this stuff.)
> why can't every proxy probe every [worker, not application]?
There are several distinct issues with this approach (though it can have its place). First, you still need _some_ service discovery to tell you where the nodes are, though it's easy to assume that can be solved via some consul-esque system. Second, there is a lot more data at play here than you might think: a single proxy/host might have many thousands of VMs under its purview, and that works out to a lot of data. As you point out, there are ways to address this:
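To make the scale concrete, here's a rough sketch of the kind of per-VM record every proxy would need for every VM on every other host (hypothetical field names, not Fly's actual schema):

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

// Hypothetical per-VM record a proxy needs in order to route to it.
// Field names are illustrative, not Fly's actual schema.
#[derive(Clone, Debug)]
struct VmState {
    app: String,      // which application the VM belongs to
    addr: SocketAddr, // where to send traffic
    healthy: bool,    // last known health
}

fn main() {
    let mut world: HashMap<String, VmState> = HashMap::new();
    world.insert(
        "vm-1234".into(),
        VmState {
            app: "my-app".into(),
            addr: "10.0.0.7:8080".parse().unwrap(),
            healthy: true,
        },
    );
    // Thousands of VMs per host, times many hosts, polled by every proxy:
    // the numbers multiply quickly.
    println!("{} VMs known", world.len());
}
```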
> One could reduce _average_ bandwidth a lot by having the proxies mostly send some kind of "send changes since <...>" or "send all data unless its hash is <...>" query.
This is definitely an improvement, but it creates a new issue. Let's say I have proxies A, B, and C, and A and C lose connectivity. Optimally (and in fact Fly has several mechanisms for this), A could send its traffic to C via B. But in this case A might not even know there is a candidate VM on C at all, because it hasn't been able to sync data for a while.
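Setting the partition aside for a second, here's a minimal sketch of that "changes since" idea, assuming each host bumps a monotonic version counter on every state change (illustrative types, not Fly's actual protocol):

```rust
use std::collections::HashMap;

#[derive(Clone, Debug)]
struct VmState {
    vm_id: String,
    healthy: bool,
    version: u64, // bumped on every change to this record
}

struct Host {
    vms: HashMap<String, VmState>,
}

impl Host {
    // Return only the entries the caller hasn't seen yet, so average
    // bandwidth drops to roughly the rate of actual churn.
    fn changes_since(&self, since: u64) -> Vec<VmState> {
        self.vms
            .values()
            .filter(|vm| vm.version > since)
            .cloned()
            .collect()
    }
}

fn main() {
    let mut host = Host { vms: HashMap::new() };
    host.vms.insert(
        "vm-1".into(),
        VmState { vm_id: "vm-1".into(), healthy: true, version: 7 },
    );
    // A proxy that last synced at version 5 pulls only the changed rows.
    println!("{:?}", host.changes_since(5));
}
```

But no value of `since` fixes the A/C partition above: if A can't reach C, it never gets to ask.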
There are ways to solve this! We could make it possible for proxies to relay each other's state. To recap:
- We have workers that poll each other
- They exchange diffs rather than the full state
- The state diffs can be relayed by other proxies
We have in practice invented something quite close to a gossip protocol! If we continued drawing the rest of the owl, we might end up with something like SWIM.
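Here's a minimal sketch of that relay/merge step, assuming versioned records like the ones above (again illustrative, not Corrosion's actual model):

```rust
use std::collections::HashMap;

#[derive(Clone, Debug)]
struct VmState {
    healthy: bool,
    version: u64,
}

#[derive(Default)]
struct Proxy {
    view: HashMap<String, VmState>, // vm id -> newest state heard from anyone
}

impl Proxy {
    // Merge state heard about *any* host from *any* peer, keeping the
    // highest version seen. This is what lets A learn about C's VMs via B.
    fn merge(&mut self, heard: &HashMap<String, VmState>) {
        for (vm_id, incoming) in heard {
            let newer = self
                .view
                .get(vm_id)
                .map_or(true, |known| incoming.version > known.version);
            if newer {
                self.view.insert(vm_id.clone(), incoming.clone());
            }
        }
    }
}

fn main() {
    // C knows about one of its own VMs; A can't reach C directly.
    let mut c = Proxy::default();
    c.view.insert("vm-on-c".into(), VmState { healthy: true, version: 3 });

    // B syncs with C, then A syncs with B: the state crosses the partition.
    let (mut a, mut b) = (Proxy::default(), Proxy::default());
    b.merge(&c.view);
    a.merge(&b.view);
    assert!(a.view.contains_key("vm-on-c"));
    println!("A learned about C's VM via B");
}
```

Add a failure detector (direct pings plus indirect probes through peers) and piggyback these updates on the probe messages, and you've roughly re-derived SWIM.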
As for your second question, I think you've got it exactly right. A crash of a single Corrosion node generally doesn't affect anything else. But if something bad is replicated, or there is a gossip storm, isolating that failure is important.