> Take the Nvidia H100 – a massive GPU weighing in at 814mm2. Traditionally this...

jjk166 · 2025-01-16T00:03:09 1736985789

Redundant cores lead to a fault tolerant chip.

ryao · 2025-01-16T15:10:10 1737040210

ECC memory is fault tolerant. It repairs issues on the fly without disabling hardware. This on the other hand is merely redundant to handle manufacturing defects. If they make a mistake and ship a bad core that malfunctions at runtime, it is not going to tolerate that.

jjk166 · 2025-01-16T21:39:37 1737063577

Redundancy is a method of providing fault tolerance, the existence of other methods doesn't make it less fault tolerant.

Nothing is tolerant to all possible faults. Fault tolerance refers to being able to tolerate specific types of faults under specific conditions.

Fault tolerant is the proper term for this.

ryao · 2025-01-17T02:39:31 1737081571

I think it would have been better to write redundant. It is more specific.