Fault Tolerance

VCCL’s fault-tolerance mechanism ensures that, when a NIC goes down or a switch fails, distributed training recovers and continues within a single iteration, which significantly improves the reliability and availability of large-scale clusters.

Overview

Fault-tolerance Capabilities

  • Failure Detection: Automatically detects node and link failures.
  • Automatic Recovery: Transparent failure recovery mechanisms.
  • Zero Downtime: In-place recovery within a single iteration.
  • High Compatibility: Highly compatible with traditional solutions.

Supported Failure Types

Failure Type     Recovery Strategy               Recovery Time
NIC down         Fault tolerance                 Within 1 iteration
Switch failure   Fault tolerance                 Within 1 iteration
NIC flap         Avoid excessive re-attachment   Handled by hardware retransmission mechanisms
GPU failure      Node isolation                  Checkpoint-based recovery

Configuration

Basic Enablement

# Enable the fault-tolerance feature (0 = disabled, the default; 1 = enabled)
export NCCL_ENABLE_FAULT_TOLERANCE=1

# The NIC configuration must be specified; adjust the device list to match the runtime environment
export NCCL_IB_HCA="mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1"

Advanced Configuration

# Set retry count (default 7)
export NCCL_IB_RETRY_COUNT=7

# Set the InfiniBand timeout (default 18); upstream NCCL treats this value as an exponent (4.096 µs × 2^value), not as seconds
export NCCL_IB_TIMEOUT=18
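
In upstream NCCL, NCCL_IB_TIMEOUT is not a duration but an exponent: the effective timeout is 4.096 µs × 2^value. Assuming VCCL inherits this behavior (this page does not confirm it), the following sketch converts a setting into seconds. The function name ib_timeout_seconds is hypothetical and not part of VCCL.

```shell
# Hypothetical helper (not part of VCCL): convert an NCCL_IB_TIMEOUT
# exponent into seconds using the upstream NCCL formula
# 4.096 microseconds * 2^value.
ib_timeout_seconds() {
    awk -v t="$1" 'BEGIN { printf "%.3f\n", 4.096e-6 * 2 ^ t }'
}

# Example: ib_timeout_seconds 18   prints 1.074
```

Under that assumption, the default of 18 corresponds to roughly one second per timeout, and each increment of the value doubles it.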

NIC configuration requirement

The fault-tolerance feature requires the NCCL_IB_HCA environment variable to be specified; otherwise it will not function correctly.
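
Because a missing NCCL_IB_HCA makes the feature misbehave, it can help to check the environment before launching a job. The following is a minimal sketch of such a guard; the function name check_ft_env is hypothetical and not part of VCCL, and the check only inspects the two variables named on this page.

```shell
# Hypothetical pre-launch guard (not part of VCCL).
# Returns 0 when the environment is consistent, 1 otherwise.
check_ft_env() {
    # Nothing to verify when fault tolerance is off (0 is the documented default).
    if [ "${NCCL_ENABLE_FAULT_TOLERANCE:-0}" != "1" ]; then
        return 0
    fi
    # Fault tolerance is on, so the NIC list must be set and non-empty.
    if [ -z "${NCCL_IB_HCA:-}" ]; then
        echo "error: NCCL_ENABLE_FAULT_TOLERANCE=1 requires NCCL_IB_HCA to be set" >&2
        return 1
    fi
    return 0
}
```

A launch script could call `check_ft_env || exit 1` before starting the training processes.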

Advanced configuration

Setting the advanced parameters (NCCL_IB_RETRY_COUNT, NCCL_IB_TIMEOUT) outside reasonable ranges may affect fault-tolerance behavior.
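
This page does not state what the reasonable ranges are. As a rough lower bar, the sketch below rejects values that cannot fit the underlying InfiniBand queue-pair attributes, where the retry count is a 3-bit field (0 to 7) and the timeout is a 5-bit field (0 to 31). These bounds are an assumption taken from the InfiniBand verbs attributes, not from VCCL documentation, and the function name check_ib_tuning is hypothetical.

```shell
# Hypothetical range check (not part of VCCL). The bounds are the
# InfiniBand QP attribute widths: retry_cnt is 3 bits, timeout is 5 bits.
# Unset variables fall back to the defaults documented on this page.
check_ib_tuning() {
    retry="${NCCL_IB_RETRY_COUNT:-7}"
    timeout="${NCCL_IB_TIMEOUT:-18}"
    # Reject anything that is not a plain non-negative integer.
    for value in "$retry" "$timeout"; do
        case "$value" in
            *[!0-9]*)
                echo "error: '$value' is not a non-negative integer" >&2
                return 1
                ;;
        esac
    done
    if [ "$retry" -gt 7 ]; then
        echo "error: NCCL_IB_RETRY_COUNT=$retry exceeds 7" >&2
        return 1
    fi
    if [ "$timeout" -gt 31 ]; then
        echo "error: NCCL_IB_TIMEOUT=$timeout exceeds 31" >&2
        return 1
    fi
    return 0
}
```

Passing this check only means the values are representable; values well inside these bounds may still be unsuitable for a given cluster.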