Flow Telemetry

VCCL Flow Telemetry provides microsecond-level GPU-to-GPU point-to-point traffic measurement, helping users gain deep insights into distributed training communication patterns, identify performance bottlenecks, and perform precise optimizations.

Feature Overview

  • Real-time monitoring: provides microsecond-level GPU-to-GPU point-to-point traffic measurement
  • Congestion awareness: infers network congestion conditions from the measured traffic
  • Developer assistance: aids in R&D tuning and optimization

Config

Basic usage

# Enable telemetry
export NCCL_TELEMETRY_ENABLE=1

# Set data window size (default: 50)
export NCCL_TELEMETRY_WINDOWSIZE=100

# Select the operating mode (default: 0):
#   0: troubleshooting; logs are printed only when performance degradation is detected
#   1: microsecond-level, O(us), monitoring
export NCCL_TELEMETRY_OBSERVE=0

# Set log output path
export NCCL_TELEMETRY_LOG_PATH=/tmp/vccl_telemetry
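
The settings above can be wrapped in a small shell helper that rejects invalid values before a job starts. This is a minimal sketch, not part of VCCL: the function name vccl_telemetry_env is hypothetical, and the validation rules (mode must be 0 or 1, window size must be a positive integer) are assumptions drawn from the descriptions above, not checks that VCCL is known to perform.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of VCCL): validate and export telemetry settings.
# Usage: vccl_telemetry_env [observe_mode] [window_size] [log_path]
vccl_telemetry_env() {
  # Defaults mirror the documented ones (mode 0, window 50); the default
  # log path is simply the example path used on this page.
  local mode="${1:-0}" window="${2:-50}" log_path="${3:-/tmp/vccl_telemetry}"

  # Accept only the two documented modes.
  case "$mode" in
    0|1) ;;
    *) echo "NCCL_TELEMETRY_OBSERVE must be 0 or 1, got '$mode'" >&2; return 1 ;;
  esac

  # Assumption: the window size should be a positive integer.
  case "$window" in
    ''|*[!0-9]*) echo "window size must be a positive integer, got '$window'" >&2; return 1 ;;
  esac
  if [ "$window" -le 0 ]; then
    echo "window size must be a positive integer, got '$window'" >&2
    return 1
  fi

  export NCCL_TELEMETRY_ENABLE=1
  export NCCL_TELEMETRY_WINDOWSIZE="$window"
  export NCCL_TELEMETRY_OBSERVE="$mode"
  export NCCL_TELEMETRY_LOG_PATH="$log_path"
}
```

For example, calling vccl_telemetry_env 1 100 /tmp/vccl_telemetry before launching a training job selects monitoring mode with a window of 100, while vccl_telemetry_env 2 prints an error and returns a non-zero status without exporting anything.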

Changelog

2026.1.8 https://github.com/sii-research/VCCL/pull/21

This PR introduces a new environment variable NCCL_TELEMETRY_OBSERVE to differentiate between troubleshooting mode (value 0, default) and monitoring mode (value 1). The primary goal is to make the global timer log lock-free to prevent performance degradation in the NCCL proxy critical path.