The scale and complexity of advanced AI models make it necessary to distribute AI training across multiple server nodes. Each node also leverages hardware accelerators like GPUs to boost performance. Examples of platforms running these distributed AI workloads include Meta's AI Research SuperCluster, AWS SageMaker, Azure ML and others. Despite the sizable computational power of these clusters, AI applications need to be carefully optimized to make the most of the underlying hardware. This is where performance monitoring and profiling tools become indispensable.
While there are existing solutions for monitoring (such as OpenTelemetry) as well as for profiling CPUs and GPUs, it is challenging to assemble them to get a holistic view of the system. For example, we need to understand whether an inefficiency on one resource, like the communication fabric, is slowing down the overall computation. Additionally, these solutions need to work on real-world applications without causing performance degradation.
To address these challenges, we are open-sourcing Dynolog, a lightweight monitoring daemon for heterogeneous CPU-GPU systems. Dynolog supports always-on performance monitoring as well as deep-dive profiling modes. The latter can be activated by making a remote procedure call to the daemon. To support AI training applications, Dynolog also integrates with the PyTorch Profiler and the Kineto CUDA profiling library. Read about some of Dynolog's features in this post and explore the GitHub repo to access available resources.
Below are some of the key features, which we will explore in more detail later in the post:
Dynolog offers first-class integration with the PyTorch Profiler and provides on-demand remote tracing features. One can issue a single CLI command (dyno) to simultaneously trace hundreds of GPUs and examine the collected traces (available from PyTorch v1.13 onwards).
It incorporates GPU performance monitoring for NVIDIA GPUs using DCGM.
It also manages counters for micro-architecture-specific performance events related to CPU caches, TLBs and memory controllers on Intel and AMD CPUs. Additionally, it collects telemetry from the Linux kernel, including CPU, network and IO resource usage.
We focus on Linux platforms, as Linux is used heavily in cloud environments. The sections below dive into the continuous monitoring and profiling modes in Dynolog.
Continuous monitoring of metrics helps to pinpoint performance or reliability bottlenecks and to see when and where regressions occur. From there, the user can dive deeper to understand the root cause. Dynolog leverages various interfaces exposed by the hardware and the Linux kernel to collect these metrics.
CPU/storage/network: First, we instrument basic counters from the Linux kernel giving us an overview of the CPU, disk and network utilization. This is achieved by reading the Linux procfs API.
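As an illustration of this kind of procfs-based sampling, here is a minimal Python sketch that derives CPU utilization from two snapshots of /proc/stat. It is not Dynolog's implementation; the function names are hypothetical, and the column layout (user, nice, system, idle, iowait, irq, softirq, steal) follows the Linux proc documentation.

```python
import time
from typing import Dict, List


def parse_cpu_times(proc_stat_text: str) -> Dict[str, List[int]]:
    """Map each 'cpu*' row of /proc/stat to its cumulative time counters (clock ticks)."""
    times: Dict[str, List[int]] = {}
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            times[fields[0]] = [int(value) for value in fields[1:]]
    return times


def cpu_utilization(before: List[int], after: List[int]) -> float:
    """Fraction of non-idle time between two samples of the same row."""
    # Only the first eight columns are summed: guest time is already
    # accounted for inside the user and nice columns.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    idle = deltas[3] + deltas[4]  # idle + iowait
    return 0.0 if total <= 0 else (total - idle) / total


def sample_utilization(interval_s: float = 1.0) -> float:
    """Read /proc/stat twice and return the aggregate CPU utilization (Linux only)."""
    with open("/proc/stat") as f:
        first = parse_cpu_times(f.read())["cpu"]
    time.sleep(interval_s)
    with open("/proc/stat") as f:
        second = parse_cpu_times(f.read())["cpu"]
    return cpu_utilization(first, second)
```

The counters are cumulative since boot, so a monitoring loop reports the difference between consecutive samples rather than the raw values. On a Linux host, calling sample_utilization() returns a value between 0.0 and 1.0.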
CPU Performance Monitoring Unit counters: The fraction of time a CPU is utilized may not give a full picture of performance bottlenecks (as pointed out in this blog). For example, a CPU could be waiting on memory to provide data and still show up as utilized. Since there are multiple contention points on a modern CPU, we need to dive deeper into its micro-architecture. Most modern CPUs provide Performance Monitoring Units (PMUs) that can be programmed using the Linux perf_event API. We integrate reading these PMU counters (examples include L1/L2 cache misses, TLB misses and memory reads and writes). Our code currently supports Intel and AMD CPUs, but it is extensible to support more CPU architectures in the future.
GPUs: In heterogeneous systems with GPUs, performance analysis is even more interesting. A GPU could be underutilized due to slow CPU-to-GPU or GPU-to-GPU memory transfers. We may also not be using advanced instructions that provide orders-of-magnitude higher performance. To provide this sort of introspection into the utilization of different resources and interfaces on the GPU, we integrate with the NVIDIA Data Center GPU Manager (DCGM) library. These metrics can help AI developers find bottlenecks at a glance, such as compute (SM occupancy), memory bandwidth (HBM bandwidth), NVLink and PCIe bandwidth, among others.
Caption: This image shows an example of a GPU metric, sm_active_ratio. NVIDIA GPUs include about 100 units called streaming multiprocessors (SMs). The SM active ratio gives us an average measure of the utilization of all SMs in a GPU. In this example we only use 12.8 percent of the available computational resources on the GPU, showing that there is room for improvement. Read more about supported metrics here.
Debugging performance on distributed AI applications is challenging, as one needs to understand how multiple CPUs and GPUs interact with each other. Any wait or delay can lead to a drop in performance and increase the training time. Developers can examine their PyTorch training applications programmatically using the PyTorch Profiler API. The profiler provides a wealth of information on how CPUs and GPUs interact. However, users need to edit their Python code to get this data and will likely have to relaunch their jobs.
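To make the cost of that approach concrete, here is a minimal sketch of the kind of code change the programmatic route requires, using the public torch.profiler API. The model, batch, and output path are placeholders chosen for illustration, not part of any real training job.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile


def train_step(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    """One toy forward/backward pass standing in for a real training step."""
    loss = model(batch).sum()
    loss.backward()
    return loss


# Placeholder model and data, used only to have something to profile.
model = torch.nn.Linear(64, 8)
batch = torch.randn(32, 64)

# Always record CPU activity; add GPU activity when a CUDA device is present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, batch = model.cuda(), batch.cuda()
    activities.append(ProfilerActivity.CUDA)

# The training loop has to be wrapped explicitly, which is the code edit
# (and usually the job relaunch) referred to above.
with profile(activities=activities) as prof:
    for _ in range(3):
        train_step(model, batch)

# Write a timeline trace that can be opened in a Chrome-trace-compatible viewer.
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
prof.export_chrome_trace(trace_path)
```

The wrapper has to be in place before the job starts, so a slowdown noticed in an already running job cannot be traced this way without restarting it.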
Currently, there are no good tools to trigger a CPU-GPU profile on demand, that is, at any point in time without making code changes. CPUs, however, have on-demand profiler tools like Linux perf that have existed for many years.
Dynolog addresses this gap for CPU-GPU profiling by integrating with the PyTorch GPU profiling library Kineto. Essentially, Dynolog provides a service that lets users switch their AI application into profiling mode. The figure below illustrates this flow:
Caption: Diagram showing the profiling mode process flow described above: 1) the user triggers a trace using the dyno command line tool and the request is sent over the network to the Dynolog daemon, 2) Dynolog forwards the profiling request parameters to the Kineto profiling library, 3) Kineto uses CUDA profiling tools (libcupti) to capture the timeline trace and 4) the trace is saved to a shared drive or other remote object store.
We ship a script called unitrace.py that works with SLURM, a popular tool to schedule and run jobs on HPC clusters. The script automatically sends out requests to all compute nodes running a distributed training job. The team is happy to collaborate on adding support for other job schedulers and environments. See GPU profiling support for more details.
Dynolog has a generic Logger class that can be customized to publish data to various stores. For example, one use case might be to create a logger for submitting time series data to a store. The Dynolog team is happy to collaborate on adding support for both specific and generic data stores: please reach out to us.
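The fan-out idea behind such a script can be sketched as follows. This is not the code of unitrace.py: the function names are hypothetical, and the dyno arguments used here (--hostname, gputrace, --job-id, --log-file) are assumptions to be checked against the CLI help of your Dynolog version. The scontrol show hostnames call is the standard SLURM way to expand a compressed node list.

```python
import subprocess
from typing import List


def slurm_hosts(nodelist: str) -> List[str]:
    """Expand a SLURM node list such as 'node[1-4]' into individual host names."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


def build_trace_commands(hosts: List[str], job_id: int, log_dir: str) -> List[List[str]]:
    """Build one trace request per host; the dyno flags are assumed, not verified."""
    commands = []
    for host in hosts:
        log_file = f"{log_dir}/trace_{host}.json"
        commands.append(
            ["dyno", "--hostname", host, "gputrace",
             "--job-id", str(job_id), "--log-file", log_file]
        )
    return commands


def trigger_all(commands: List[List[str]]) -> None:
    """Send every request; each one is a short RPC to the daemon on that host."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

For example, build_trace_commands(slurm_hosts("node[1-2]"), 42, "/shared/traces") yields one command per node, each writing its trace to a separate file on the shared drive.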
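The general shape of such a customizable logger can be sketched in Python. Dynolog's actual Logger is part of its C++ codebase, so the class and method names below are hypothetical and only illustrate the pattern: a base class accumulates key-value pairs for one sample, and a subclass decides where the finished sample goes.

```python
import io
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, TextIO


class Logger(ABC):
    """Collects the metrics of one sample, then hands the sample to a store."""

    def __init__(self) -> None:
        self._sample: Dict[str, Any] = {}

    def log(self, key: str, value: Any) -> None:
        self._sample[key] = value

    def finalize(self) -> None:
        """Publish the current sample and start a fresh one."""
        sample, self._sample = self._sample, {}
        self.publish(sample)

    @abstractmethod
    def publish(self, sample: Dict[str, Any]) -> None:
        """Store-specific delivery, implemented by each subclass."""


class JsonLinesLogger(Logger):
    """Example store: one JSON object per line on any text stream."""

    def __init__(self, sink: TextIO) -> None:
        super().__init__()
        self._sink = sink

    def publish(self, sample: Dict[str, Any]) -> None:
        self._sink.write(json.dumps(sample, sort_keys=True) + "\n")


# Usage: log two metrics for one sampling interval into an in-memory stream.
sink = io.StringIO()
logger = JsonLinesLogger(sink)
logger.log("timestamp", 1700000000)
logger.log("cpu_util", 0.5)
logger.finalize()
```

A logger for a time series database would follow the same pattern, replacing the publish method with a call to that store's client.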
Dynolog provides a holistic approach to system and application monitoring. As we've explored in this blog post, this enables sophisticated workflows that combine continuous monitoring metrics with on-demand profiling. Two key takeaways:
Dynolog instruments performance-monitoring metrics for heterogeneous systems consisting of CPUs and GPUs.
We worked closely with the PyTorch Profiler team to enable on-demand trace collection for PyTorch applications on any platform (AWS, Azure, private clouds).
Our top use case today is enabling observability for the Meta AI Research clusters, including the AI Research SuperCluster. We will strive to ensure these tools are cohesive and well-integrated to enable engineers and researchers to better optimize their AI models.
We have developed the Dynolog client command line tool using Rust. The Dynolog team is enthusiastic about writing new components in Rust, using the foreign function interface to integrate with the existing modules written in C++. Using Rust for a small module, like the client, paves the way for future development in the Rust programming language. The Dynolog server is a complex multi-threaded process. The compile-time correctness checks provided by Rust will enable a more reliable telemetry service going forward.
We will keep Dynolog development open source first, so that features reach the community first. This helps keep the project in sync with companion open source projects like the PyTorch Kineto profiler, Param benchmarks and LLVM (for Intel Processor Trace based debugging).