Introduction
NVIDIA DCGM Exporter is a monitoring tool built on NVIDIA DCGM Go APIs that enables the collection of GPU metrics. It helps users monitor GPU health, analyze workload behavior, and gain visibility into GPU usage across Kubernetes clusters.
The DCGM Exporter exposes GPU metrics through an HTTP endpoint (/metrics), which can be scraped by monitoring systems such as Prometheus.
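If the exporter is already deployed, one way to see what this endpoint returns is to port-forward an exporter pod and request /metrics. This is only a sketch: the namespace (gpu-operator) and label (app=nvidia-dcgm-exporter) below are common defaults for the exporter, not guaranteed values for your cluster.
```bash
# Hedged example: inspect the DCGM Exporter /metrics endpoint.
# Adjust the namespace and label to match your deployment.
POD=$(kubectl -n gpu-operator get pods -l app=nvidia-dcgm-exporter \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n gpu-operator port-forward "$POD" 9400:9400 &
sleep 2
curl -s http://localhost:9400/metrics | grep '^DCGM_FI_DEV_GPU_UTIL' | head
# Stop the background port-forward when finished (e.g. kill %1)
```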
Prerequisites
- The NVIDIA DCGM Exporter pod must be deployed on every Kubernetes node that has GPUs.
- Ensure that the exporter is running and accessible before configuring metric collection (sample checks follow this list).
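One hedged way to check these prerequisites is sketched below; the app=nvidia-dcgm-exporter label and the nvidia.com/gpu.present node label are common in GPU Operator deployments but may differ in your cluster.
```bash
# List DCGM Exporter pods across all namespaces and the nodes they run on
kubectl get pods -A -l app=nvidia-dcgm-exporter -o wide

# List GPU nodes (label commonly set by GPU feature discovery; adjust if yours differs)
kubectl get nodes -l nvidia.com/gpu.present=true

# Every node returned by the second command should host one exporter pod from the first
```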
Kubernetes 2.0 ConfigMap
To enable GPU metric collection, update or append the existing ConfigMap named opsramp-workload-metric-user-config by adding the DCGM Exporter configuration under the workloads section.
Edit the ConfigMap
```bash
kubectl edit cm opsramp-workload-metric-user-config -n opsramp-agent
```
Example Configuration
Add the nvidia-dcgm/prometheus configuration under the workloads section as shown below:
- Use port 9400, which is the default port on which the DCGM Exporter exposes metrics. (You can confirm this by describing the DCGM Exporter pod; see the commands after this list.)
- Specify a matching label in targetPodSelector to correctly identify the DCGM Exporter pods.
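The commands below are one hedged way to confirm the port and labels; the gpu-operator namespace and app=nvidia-dcgm-exporter label are placeholders for whatever your deployment uses.
```bash
# Confirm the container port the exporter listens on (expect 9400)
kubectl -n gpu-operator describe pods -l app=nvidia-dcgm-exporter | grep -i -A 2 'port'

# Show the pod labels that can be referenced in targetPodSelector
kubectl -n gpu-operator get pods -l app=nvidia-dcgm-exporter --show-labels
```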
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: opsramp-workload-metric-user-config
  namespace: opsramp-agent
data:
  workloads: |
    nvidia-dcgm/prometheus:
      - name: nvidia-dcgm
        collectionFrequency: 2m
        port: 9400
        auth: none
        metrics_path: "/metrics"
        scheme: "http"
        targetPodSelector:
          matchLabels:
            - key: app
              operator: ==
              value:
                - nvidia-dcgm-exporter
```
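After saving the edit, a quick (hedged) confirmation is to read the ConfigMap back and check that the new workload entry is present:
```bash
# Verify that the nvidia-dcgm/prometheus block was saved in the ConfigMap
kubectl -n opsramp-agent get cm opsramp-workload-metric-user-config -o yaml \
  | grep -A 8 'nvidia-dcgm/prometheus'
```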
Note
Ensure that all field values are configured according to your Kubernetes cluster environment and DCGM Exporter deployment.
Metrics Filtering
By default, all DCGM Exporter metrics are collected.
If needed, you can apply metric filtering based on:
- Metric name (full name or regular expression)
- Action (include or exclude)
This lets you control which metrics are included in the final metric list; if no filters are defined, every DCGM Exporter metric is included.
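To choose filter names or regular expressions, it can help to list the metric names the exporter currently exposes. This is only a sketch and assumes a port-forward to an exporter pod is serving on localhost:9400, as in the earlier check.
```bash
# List the distinct DCGM metric names currently exposed by the exporter
curl -s http://localhost:9400/metrics | grep -o '^DCGM_[A-Z_0-9]*' | sort -u
```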
Example
```yaml
nvidia-dcgm/prometheus:
  - name: nvidia-dcgm
    collectionFrequency: 2m
    port: 9400
    auth: none
    metrics_path: "/metrics"
    scheme: "http"
    filters: # optional
      - regex: 'DCGM_FI_PROF_GR_ENGINE_ACTIVE'
        action: exclude
    targetPodSelector:
      matchLabels:
        - key: app
          operator: ==
          value:
            - nvidia-dcgm-exporter
```
ConfigMap for Alerting
To receive alerts when GPU metrics collected from the DCGM Exporter breach defined thresholds, configure alert rules in the opsramp-alert-user-config ConfigMap under the rules section.
This configuration enables alerting on DCGM Exporter metrics at the Kubernetes pod level.
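The rule in the example below computes used framebuffer memory as a percentage of the total (used + free), i.e. DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100. One hedged way to sanity-check the expression and thresholds is to compute the same ratio from the raw exporter output; the sketch below assumes a single GPU and the localhost:9400 port-forward used earlier.
```bash
# Compute framebuffer usage percentage from the raw DCGM metrics
# (with multiple GPUs, only the last reported value per metric is used here)
curl -s http://localhost:9400/metrics | awk '
  /^DCGM_FI_DEV_FB_USED/ {used = $NF}
  /^DCGM_FI_DEV_FB_FREE/ {free = $NF}
  END {if (used + free > 0) printf "FB memory used: %.1f%%\n", used * 100 / (used + free)}'
```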
Example
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "opsramp-alert-user-config"
  namespace: {{ include "common.names.namespace" . | quote }}
  labels:
    app: "opsramp-alert-user-config"
data:
  alert-definitions.yaml: |
    alertDefinitions:
      - resourceType: k8s_pod
        rules:
          - name: gpu_fb_memory_used_percentage
            interval: 2m
            expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100
            component: "${labels.gpu}"
            isAvailability: false
            warnOperator: GREATER_THAN_EQUAL
            warnThreshold: '90'
            criticalOperator: GREATER_THAN_EQUAL
            criticalThreshold: '95'
            alertSub: '${severity} - Pod ${resource.name} is consuming ${metric.value}% of GPU framebuffer memory.'
            alertBody: '${severity}. GPU framebuffer memory usage on resource: ${resource.name} is ${metric.value}%.'
```
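As a final, hedged check, you can read the alert ConfigMap back and confirm that the rule is present; the opsramp-agent namespace below is an assumption and should match wherever the agent is installed.
```bash
# Confirm the GPU alert rule was stored in the alert ConfigMap
kubectl -n opsramp-agent get cm opsramp-alert-user-config -o yaml \
  | grep -B 2 -A 6 'gpu_fb_memory_used_percentage'
```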
Supported Metrics
The following metrics are supported for this workload by the Kubernetes 2.0 Agent.
| Metric | Description | Unit |
|---|---|---|
| DCGM_FI_DEV_SM_CLOCK | SM clock frequency | MHz |
| DCGM_FI_DEV_MEM_CLOCK | Memory clock frequency | MHz |
| DCGM_FI_DEV_MEMORY_TEMP | Memory temperature | C |
| DCGM_FI_DEV_GPU_TEMP | Current temperature readings for the device | C |
| DCGM_FI_DEV_POWER_USAGE | Power usage for the device | W |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Total energy consumption for the GPU since the driver was last reloaded | mJ |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | Total number of PCIe retries | count |
| DCGM_FI_DEV_GPU_UTIL | GPU utilization | % |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory utilization | % |
| DCGM_FI_DEV_ENC_UTIL | Encoder utilization | % |
| DCGM_FI_DEV_DEC_UTIL | Decoder utilization | % |
| DCGM_FI_DEV_XID_ERRORS | Value of the last XID error encountered | count |
| DCGM_FI_DEV_FB_FREE | Free frame buffer | MiB |
| DCGM_FI_DEV_FB_USED | Used frame buffer | MiB |
| DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | Number of remapped rows for uncorrectable errors | count |
| DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | Number of remapped rows for correctable errors | count |
| DCGM_FI_DEV_ROW_REMAP_FAILURE | Indicates whether row remapping has failed | count (flag) |
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | Total number of NVLink bandwidth counters for all lanes | count |
| DCGM_FI_DEV_VGPU_LICENSE_STATUS | vGPU license status (0 = not licensed, 1 = licensed) | count (flag) |
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | Ratio of time the graphics engine is active | ratio |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Ratio of cycles any tensor pipe is active | ratio |
| DCGM_FI_PROF_DRAM_ACTIVE | Ratio of cycles the device memory interface is active | ratio |
| DCGM_FI_PROF_PCIE_TX_BYTES | Number of bytes transmitted over PCIe from the GPU | bytes/sec |
| DCGM_FI_PROF_PCIE_RX_BYTES | Number of bytes received over PCIe by the GPU | bytes/sec |