
NVIDIA's Multi-Instance GPU (MIG) technology enables efficient resource utilization by partitioning a single GPU into multiple independent instances. However, MIG mode introduces constraints for traditional GPU utilization monitoring. This article describes the GPU monitoring challenges in MIG environments and presents a solution based on DCGM_FI_PROF_GR_ENGINE_ACTIVE.
According to NVIDIA's official documentation, traditional GPU utilization metrics are not supported on GPUs with MIG mode enabled.

The NVIDIA MIG User Guide states this restriction explicitly:
"GPU utilization is not supported when MIG mode is enabled."

This is where the dilemma arises.
Consider the following scenario: a node has eight physical GPUs, four of which have MIG mode enabled.
How do you calculate the average GPU utilization of the node in such an environment? Since the four MIG-enabled GPUs report N/A for utilization, should you exclude them and calculate based on only the four non-MIG devices? That would ignore half the node's compute capacity.
There is a way to accurately measure GPU utilization even in MIG mode: the DCGM_FI_PROF_GR_ENGINE_ACTIVE metric.

Overall GPU utilization = Σ (utilization of each MIG instance × the compute-slice ratio of that instance)
Example (on an A100, which has 7 compute slices in total):
Total GPU utilization in this example: 50.0% (the full per-instance calculation is shown below).
This approach has a clear technical foundation: both the choice of the DCGM_FI_PROF_GR_ENGINE_ACTIVE metric and the weighted-calculation methodology are grounded in NVIDIA's official documentation and recommendations, as explained below.
Source of evidence: NVIDIA DCGM GitHub Issue #64
An NVIDIA developer responded:
"DCGM_FI_DEV_GPU_UTIL is roughly equal to DCGM_FI_PROF_GR_ENGINE_ACTIVE. DCGM_FI_PROF_GR_ENGINE_ACTIVE is higher precision and works on MIG."
This answer provides the official basis for why DCGM_FI_PROF_GR_ENGINE_ACTIVE can be used as an alternative indicator of GPU utilization.
In practice, DCGM_FI_PROF_GR_ENGINE_ACTIVE also provides finer-grained readings than DCGM_FI_DEV_GPU_UTIL, which often reports only 0% or 100%.
Source of evidence: NVIDIA DCGM documentation, Understanding Metrics
Based on the metric definitions in the NVIDIA DCGM official documentation, a methodology for converting per-MIG-instance utilization into device-level utilization through weighted calculation can be derived.
Total GPU Utilization = Σ (Instance_Utilization × Slice_Ratio)
For each MIG instance:
Instance_Utilization = DCGM_FI_PROF_GR_ENGINE_ACTIVE (range: 0.0 to 1.0)
Slice_Ratio = Instance_Compute_Slices / Total_GPU_Compute_Slices
Weighted_Contribution = Instance_Utilization × Slice_Ratio

On an A100 GPU (7 compute slices in total):
Total GPU utilization: 0.1714 + 0.1286 + 0.0571 + 0.0286 + 0.1000 + 0.0143 = 0.5000 (50.0%)
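The weighted calculation can be sketched in a few lines of Python. The instance mix and utilization readings below are hypothetical example values chosen to illustrate the formula; in a real deployment they would come from dcgm-exporter:

```python
# Device-level utilization from per-MIG-instance readings (weighted sum).
# Each tuple: (compute slices of the instance, DCGM_FI_PROF_GR_ENGINE_ACTIVE reading).
# The mix below (one 2g and five 1g instances) and the readings are hypothetical.
TOTAL_SLICES = 7  # A100: 7 compute (SM) slices in total

instances = [
    (2, 0.60),  # 2g instance at 60% busy
    (1, 0.90),
    (1, 0.40),
    (1, 0.20),
    (1, 0.70),
    (1, 0.10),
]

def total_utilization(instances, total_slices=TOTAL_SLICES):
    """Sum of Instance_Utilization * Slice_Ratio over all MIG instances."""
    return sum(util * slices / total_slices for slices, util in instances)

print(f"{total_utilization(instances):.4f}")  # 0.5000
```

With this particular mix the weighted contributions sum to exactly 0.5, matching the 50% figure above.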
Note on A100 slice counts: The A100 has 7 compute (SM) slices and 8 memory slices. This is why the MIG profile naming convention is {compute}g.{memory}gb: for example, 1g.5gb means 1 compute slice and 5 GB of memory (1/8 of the 40 GB total). The weighting in this methodology is based on the compute-slice count (the number before g), not memory.
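Because the compute-slice count is encoded in the profile name, the slice ratio can be derived directly from it. A minimal sketch, assuming the standard {compute}g.{memory}gb naming:

```python
def compute_slices(profile: str) -> int:
    """Extract the compute-slice count from a MIG profile name, e.g. '3g.20gb' -> 3."""
    return int(profile.split("g", 1)[0])

def slice_ratio(profile: str, total_slices: int = 7) -> float:
    """Weight of one instance on an A100 (7 compute slices in total)."""
    return compute_slices(profile) / total_slices

print(compute_slices("1g.5gb"))          # 1
print(round(slice_ratio("3g.20gb"), 4))  # 0.4286
```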
The broader ecosystem supports this choice as well. One open-source GPU cost-monitoring project switched from DCGM_FI_DEV_GPU_UTIL to DCGM_FI_PROF_GR_ENGINE_ACTIVE for GPU cost calculations, citing higher precision and MIG support; the PR author noted that this was confirmed with a DCGM product manager. Users have likewise asked dcgm-exporter to move from DCGM_FI_DEV_GPU_UTIL to DCGM_FI_PROF_GR_ENGINE_ACTIVE (dcgm-exporter Issue #341) because the legacy metric only reports 0% or 100%.

In summary, GPU monitoring in MIG mode can be solved based on two key foundations:
1. An NVIDIA developer has confirmed that DCGM_FI_PROF_GR_ENGINE_ACTIVE can be used as an alternative indicator for DCGM_FI_DEV_GPU_UTIL, and that it supports MIG with higher precision.
2. The metric definitions in the DCGM documentation allow per-instance utilization to be converted into device-level utilization via a compute-slice-weighted sum.

Combined, these provide an accurate, technically defensible method for measuring overall GPU utilization even when MIG is enabled, closing the gap left by the legacy DCGM_FI_DEV_GPU_UTIL metric.
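Putting both pieces together, node-level average utilization across a mix of MIG and non-MIG GPUs can be sketched as follows. The data layout and sample values are hypothetical; in practice the per-instance and per-device readings would be scraped from dcgm-exporter:

```python
# Hypothetical node: one GPU in MIG mode (per-instance readings) and one plain GPU
# (a single DCGM_FI_PROF_GR_ENGINE_ACTIVE reading). All values are example data.
TOTAL_SLICES = 7  # A100 compute slices

gpus = [
    {"mig": [(3, 0.8), (2, 0.5), (2, 0.3)]},  # MIG GPU: (compute slices, utilization)
    {"util": 0.9},                            # non-MIG GPU
]

def gpu_utilization(gpu):
    """Device-level utilization: weighted sum for MIG GPUs, raw reading otherwise."""
    if "mig" in gpu:
        return sum(u * s / TOTAL_SLICES for s, u in gpu["mig"])
    return gpu["util"]

node_avg = sum(gpu_utilization(g) for g in gpus) / len(gpus)
print(f"{node_avg:.4f}")  # 0.7357
```

This way MIG-enabled GPUs contribute real measurements to the node average instead of being dropped as N/A.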