When running GPU workloads on Kubernetes, GPU performance monitoring is one of the most important concerns. In this article, we'll walk step by step through the whole process of enabling an NVIDIA GPU and collecting detailed GPU metrics with DCGM-Exporter, using a local minikube test environment that mirrors what you would do in a real operating environment.
This guide was tested in the following environments:
Works in other environments: This guide works on most Linux environments with NVIDIA GPUs. The principles are the same regardless of the GPU model or driver version.
At first, I spent a lot of time trying to use the rather complicated GPU Operator, but it turns out minikube has a simple method built in.
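With a recent minikube release and the docker driver, a single start command is enough. The driver and runtime flags below are one possible combination, not the only one:

```bash
# Start minikube with all local NVIDIA GPUs passed through to the cluster
minikube start --driver docker --container-runtime docker --gpus all
```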
💡 About --gpus all
This flag automatically takes care of enabling the nvidia-device-plugin addon, so the NVIDIA device plugin is deployed without any extra steps.

Checking the setup
After the installation is complete, verify that nvidia.com/gpu appears with a value of at least 1 in the node's allocatable resources with the following command:
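A minimal way to check this (the node name minikube is the default for a single-node cluster):

```bash
# The GPU resource should show up under both Capacity and Allocatable,
# e.g. "nvidia.com/gpu: 1"
kubectl describe node minikube | grep -i "nvidia.com/gpu"
```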
DCGM-Exporter can also be installed via a Helm chart or the GPU Operator, but that is overkill for a minikube environment. Installing it directly from a YAML manifest is simpler and easier to understand.
Basic DCGM-Exporter installation
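Here is a minimal sketch of a DaemonSet plus Service for DCGM-Exporter. The image tag, labels, and file name are assumptions for this walkthrough; for anything beyond a test cluster, start from the dcgm-exporter.yaml manifest published in the NVIDIA/dcgm-exporter repository.

```yaml
# dcgm-exporter.yaml -- minimal sketch, not the full official manifest
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  labels:
    app: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
      - name: dcgm-exporter
        # Assumed tag -- use the latest release from NVIDIA's NGC registry
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
        ports:
        - name: metrics
          containerPort: 9400
        securityContext:
          capabilities:
            add: ["SYS_ADMIN"]   # needed for DCGM profiling (DCGM_FI_PROF_*) fields
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: dcgm-exporter
  ports:
  - name: metrics
    port: 9400
    targetPort: 9400
```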
Deploy and verify
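Assuming the manifest above was saved as dcgm-exporter.yaml:

```bash
# Deploy the DaemonSet and Service
kubectl apply -f dcgm-exporter.yaml

# Wait for the exporter pod to become Ready (one pod per GPU node)
kubectl get pods -l app=dcgm-exporter -w
```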
Accessing metric endpoints
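DCGM-Exporter serves Prometheus metrics over HTTP on port 9400. For a quick local check, port-forwarding the Service is the simplest option:

```bash
# Forward local port 9400 to the dcgm-exporter Service
kubectl port-forward svc/dcgm-exporter 9400:9400
```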
Check metric data
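In a second terminal, scrape the endpoint and filter for a well-known field such as GPU utilization:

```bash
# Fetch the metrics page and show the GPU utilization series
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```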
Expected output
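You should see output along these lines; the labels and values will differ depending on your GPU and pod name:

```text
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-xxxxxxxx",device="nvidia0",Hostname="dcgm-exporter-abcde"} 0
```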
📈 Understanding Prometheus metric formats
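Every metric in the Prometheus exposition format follows the same pattern: a # HELP line with a description, a # TYPE line with the metric type, and one sample line per labelled series:

```text
# HELP <metric_name> <human-readable description>
# TYPE <metric_name> <gauge|counter|histogram|summary>
<metric_name>{label1="value1",label2="value2"} <current value>
```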
Why do we need custom metrics?
Beyond the basic metrics (see the full list), the indicators you actually need in production vary from situation to situation. For example, you may need to monitor GPU device metadata or interconnect/network load, as shown below. In such cases you have to collect additional metrics on top of the defaults.
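For example, fields like the following can be added (names come from the DCGM field reference; double-check them against your DCGM version):

- Device metadata: DCGM_FI_DRIVER_VERSION, DCGM_FI_DEV_NAME, DCGM_FI_DEV_SERIAL
- Interconnect traffic: DCGM_FI_PROF_PCIE_TX_BYTES, DCGM_FI_PROF_PCIE_RX_BYTES, DCGM_FI_PROF_NVLINK_TX_BYTES, DCGM_FI_PROF_NVLINK_RX_BYTES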
Configuring metrics via ConfigMap
Step 1: Create an extended metrics ConfigMap
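A sketch of the ConfigMap, using the CSV format dcgm-exporter expects (DCGM field, Prometheus type, help text). The ConfigMap name, the file name custom-counters.csv, and the exact field selection are assumptions for this example:

```yaml
# custom-dcgm-metrics.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-custom-metrics
data:
  custom-counters.csv: |
    # Format: DCGM field, Prometheus metric type, help string
    # Default utilization / memory / thermal metrics
    DCGM_FI_DEV_GPU_UTIL,         gauge,   GPU utilization (in %).
    DCGM_FI_DEV_FB_USED,          gauge,   Framebuffer memory used (in MiB).
    DCGM_FI_DEV_GPU_TEMP,         gauge,   GPU temperature (in C).
    DCGM_FI_DEV_POWER_USAGE,      gauge,   Power draw (in W).
    # Device metadata
    DCGM_FI_DRIVER_VERSION,       label,   Driver version.
    # Interconnect traffic
    DCGM_FI_PROF_PCIE_TX_BYTES,   counter, PCIe bytes transmitted.
    DCGM_FI_PROF_PCIE_RX_BYTES,   counter, PCIe bytes received.
    DCGM_FI_PROF_NVLINK_TX_BYTES, counter, NVLink bytes transmitted.
    DCGM_FI_PROF_NVLINK_RX_BYTES, counter, NVLink bytes received.
```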
Step 2: Deploy the ConfigMap
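Assuming the file above is named custom-dcgm-metrics.yaml:

```bash
# Create the ConfigMap in the same namespace as the DaemonSet
kubectl apply -f custom-dcgm-metrics.yaml
kubectl get configmap dcgm-custom-metrics
```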
Step 3: Update DaemonSet
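The DaemonSet needs two changes: mount the ConfigMap into the container, and point the exporter at the new counters file via its -f flag. A sketch of the relevant fragment, matching the names used in the ConfigMap above:

```yaml
# Fragment of spec.template.spec in the DaemonSet -- merge into the existing manifest
containers:
- name: dcgm-exporter
  image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
  # Point the exporter at the custom counters file
  args: ["-f", "/etc/dcgm-exporter/custom-counters.csv"]
  ports:
  - name: metrics
    containerPort: 9400
  securityContext:
    capabilities:
      add: ["SYS_ADMIN"]
  volumeMounts:
  - name: custom-metrics
    # subPath keeps the rest of /etc/dcgm-exporter (the default CSVs) intact
    mountPath: /etc/dcgm-exporter/custom-counters.csv
    subPath: custom-counters.csv
volumes:
- name: custom-metrics
  configMap:
    name: dcgm-custom-metrics
```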
Step 4: Apply updates
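Assuming the changes were made in the same dcgm-exporter.yaml manifest:

```bash
# Re-apply the modified DaemonSet and wait for the rollout to finish
kubectl apply -f dcgm-exporter.yaml
kubectl rollout status daemonset/dcgm-exporter
```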
Testing new metrics
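With the port-forward from earlier still running, check that one of the newly added series is exposed:

```bash
# Look for the PCIe traffic metrics added via the ConfigMap
curl -s localhost:9400/metrics | grep DCGM_FI_PROF_PCIE
```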
Expected output
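Illustrative output; counter values and labels will differ on your system:

```text
# HELP DCGM_FI_PROF_PCIE_TX_BYTES PCIe bytes transmitted.
# TYPE DCGM_FI_PROF_PCIE_TX_BYTES counter
DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0",UUID="GPU-xxxxxxxx",device="nvidia0",Hostname="dcgm-exporter-abcde"} 1.234567e+06
```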
Now that we're collecting GPU metrics, we'll cover the following topics in the next post.