Fault Tolerance

VCCL’s fault-tolerance mechanism lets distributed training recover from a NIC-down event or a switch failure and continue within a single iteration, which improves the reliability and availability of large-scale clusters.

Overview

Fault-tolerance Capabilities

Failure Detection: Automatically detects node and link failures.
Automatic Recovery: Transparent failure recovery mechanisms.
Zero Downtime: In-place recovery within a single iteration.
High Compatibility: Highly compatible with traditional solutions.

Supported Failure Types

Failure Type	Recovery Strategy	Recovery Time
NIC down	Fault tolerance	Within 1 iteration
Switch failure	Fault tolerance	Within 1 iteration
NIC flap	Avoid excessive re-attachment	Handled by hardware retransmission mechanisms
GPU failure	Node isolation	Checkpoint-based recovery

Configuration

Basic Enablement

# Enable the fault-tolerance feature: 0 = disabled (default), 1 = enabled
export NCCL_ENABLE_FAULT_TOLERANCE=1

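# NIC configuration must be specified; adjust the device list to your runtime environment.
# The second "=" is part of the value (in upstream NCCL, a leading "=" requests exact device-name matching).
export NCCL_IB_HCA=="mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1"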
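The device list depends on the host, so it can help to generate it rather than type it by hand. The following is a minimal sketch, not part of VCCL: the helper name build_ib_hca is hypothetical, and it assumes that every device listed under /sys/class/infiniband should be used on port 1, which may not hold on your hardware.

```shell
# Hypothetical helper (not part of VCCL): build an NCCL_IB_HCA device list
# from the entries of a sysfs-style directory.
# Assumption: every device found should be used on port 1.
# The function body runs in a subshell so its variables do not leak.
build_ib_hca() (
    dir="${1:-/sys/class/infiniband}"
    list=""
    for dev in "$dir"/*; do
        [ -e "$dev" ] || continue            # skip when the glob matches nothing
        name="${dev##*/}"                    # strip the directory part
        list="${list:+$list,}${name}:1"      # append "name:1", comma-separated
    done
    printf '%s\n' "$list"
)

# Example use, keeping the leading "=" from the setting above:
#   export NCCL_IB_HCA="=$(build_ib_hca)"
```

Devices are listed in the shell's glob order, so review the result before exporting it, especially on hosts that mix training NICs with management NICs.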

Advanced Configuration

# Set retry count (default 7)
export NCCL_IB_RETRY_COUNT=7

# Set the InfiniBand timeout (default 18). In upstream NCCL this value is an exponent, not seconds: the timeout is 4.096 µs × 2^value.
export NCCL_IB_TIMEOUT=18
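To get a feel for what a given NCCL_IB_TIMEOUT value means in wall-clock terms, the sketch below converts it to seconds. It assumes VCCL keeps the upstream NCCL and InfiniBand convention, in which the value is an exponent and the timeout is 4.096 µs × 2^value. The helper name ib_timeout_seconds is hypothetical.

```shell
# Hypothetical helper: convert an NCCL_IB_TIMEOUT exponent to seconds,
# assuming the upstream convention timeout = 4.096 us * 2^exponent.
ib_timeout_seconds() {
    # LC_ALL=C keeps "." as the decimal separator regardless of locale.
    LC_ALL=C awk -v e="$1" 'BEGIN { printf "%.3f\n", 4.096e-6 * 2 ^ e }'
}

ib_timeout_seconds 18   # prints 1.074
```

Under this assumption each increment doubles the timeout, so small changes to the value have a large effect on how long a stalled transfer waits before a retry.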

NIC configuration requirement

The fault-tolerance feature requires the NCCL_IB_HCA environment variable to be specified; otherwise it will not function correctly.
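A launch script can enforce this requirement before training starts. The check below is a minimal sketch under the assumption that the two variables are set in the launching shell. The function name check_fault_tolerance_env is hypothetical and not part of VCCL.

```shell
# Hypothetical preflight check (not part of VCCL): refuse to continue when
# fault tolerance is enabled but NCCL_IB_HCA has not been specified.
check_fault_tolerance_env() {
    if [ "${NCCL_ENABLE_FAULT_TOLERANCE:-0}" = "1" ] && [ -z "${NCCL_IB_HCA:-}" ]; then
        echo "NCCL_ENABLE_FAULT_TOLERANCE=1 requires NCCL_IB_HCA to be set" >&2
        return 1
    fi
    return 0
}

# Example use in a launch script:
#   check_fault_tolerance_env || exit 1
```

Failing early in the launcher surfaces the misconfiguration immediately instead of during a later failure event.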

Advanced configuration

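NCCL_IB_RETRY_COUNT and NCCL_IB_TIMEOUT govern how long the InfiniBand transport waits and retries before it reports an error. Values far from the defaults may delay failure detection or trigger it spuriously, so change them with care.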