Tim Wang Tech Blog

Exploring Kubernetes GPU Management, Part 2: Enabling NVIDIA MPS in Kubernetes

Background

The Multi-Process Service (MPS) is a binary-compatible client-server runtime implementation of the CUDA API. It consists of several components:

  • Control daemon: responsible for starting and stopping the server and for coordinating connections between clients and the server.
  • Client runtime: the MPS client runtime is built into the CUDA driver library and can be used transparently by any CUDA application.
  • Server process: the server is the clients' shared connection to the GPU and provides concurrency between clients.

The NVIDIA MPS documentation also describes the new capabilities of Volta MPS compared with MPS on pre-Volta GPUs and points out the differences between the two; running MPS on a Volta GPU automatically enables the new capabilities.

[Figure: MPS client-server architecture]
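
Before looking at Kubernetes, it helps to recall how MPS is driven on a bare node. The following is only a minimal sketch under assumed paths (the /tmp/mps directories and the GPU selection are illustrative), using the standard nvidia-cuda-mps-control CLI and the environment variables MPS understands:

```bash
# Choose which GPUs MPS manages and where its pipes/logs live
export CUDA_VISIBLE_DEVICES=0                 # illustrative: manage GPU 0 only
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps/pipe  # assumed scratch location
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps/log    # assumed scratch location
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

# Start the MPS control daemon in the background
nvidia-cuda-mps-control -d

# Any CUDA application launched with the same CUDA_MPS_PIPE_DIRECTORY is
# transparently routed through the MPS server by the client runtime
./vectorAdd

# Shut the daemon down when finished
echo quit | nvidia-cuda-mps-control
```

The device plugin automates exactly this lifecycle per node, as the control daemon logs later in this post show.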

Using NVIDIA MPS in K8s

According to the notes for NVIDIA GPU Device Plugin v0.15.0, MPS support is not yet ready for production use and comes with several known issues, including:

  • The device plugin may show as started before it is ready to allocate shared GPUs, while it is still waiting for the CUDA MPS control daemon to come online.
  • There is no synchronization between the CUDA MPS control daemon and the GPU device plugin across restarts or configuration changes. This means workloads may crash if they lose access to the shared resources managed by the CUDA MPS control daemon.
  • MPS is only supported for full GPUs.
  • It is not possible to "combine" MPS GPU requests to give a single container access to more memory.

This post verifies the feature so that it can be adopted in a future release.

Environment Preparation

K8s cluster information

  • Kernel Version: 5.15.135-amd64
  • containerd version 1.6.20
  • Kubelet Version: v1.26.11
  • k8s-device-plugin version: v0.15.0
  • nvidia-container-toolkit version: v1.13.4
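
These versions can be cross-checked quickly; the commands below are standard and run partly against the cluster, partly on the node itself:

```bash
# Kubelet, kernel, and runtime versions as reported by Kubernetes
kubectl get nodes -o wide

# Container runtime version, checked directly on the node
containerd --version

# Driver and GPU visibility on the node
nvidia-smi
```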

Installing the NVIDIA GPU Operator

First, we install the NVIDIA GPU Operator to provide monitoring metrics for GPU resources; the driver, toolkit, and device plugin components are disabled here because they are installed separately.

helm install --wait --generate-name -n ssdl-vgpu nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false \
--set devicePlugin.enabled=false 
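
After the install finishes, it is worth confirming that the release is deployed and the operator pods are running in the namespace; a quick check could be:

```bash
# Confirm the Helm release and the operator pods
helm list -n ssdl-vgpu
kubectl get pods -n ssdl-vgpu
```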

Preparing the ConfigMap for MPS

# k apply -f ntr-mps-cm.yaml
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ntr-mps-cm
  namespace: ssdl-vgpu
data:
  any: |
    version: v1
    flags:
      failOnInitError: true
      nvidiaDriverRoot: "/run/nvidia/driver/"
      plugin:
        deviceListStrategy: envvar
        deviceIDStrategy: uuid
    sharing:
      mps:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

With replicas: 4, each physical GPU is advertised to Kubernetes as four schedulable nvidia.com/gpu devices, and the MPS control daemon partitions each GPU accordingly: as its logs below show, the default active thread percentage is set to 25 and the default pinned device memory limit to 4096M per GPU, consistent with a 16 GB V100 split four ways.

Installing nvidia-device-plugin

helm install --wait k8s-vgpu nvdp/nvidia-device-plugin \
  --namespace ssdl-vgpu \
  --version 0.15.0 \
  --set config.name=ntr-mps-cm \
  --set compatWithCPUManager=true

helm list -n ssdl-vgpu | grep k8s-vgpu
k8s-vgpu               	ssdl-vgpu	3       	2024-04-23 14:04:36.885665 +0800 CST	deployed	nvidia-device-plugin-0.15.0	0.15.0

After nvidia-device-plugin has been installed, we can check the status of its DaemonSets.

k get DaemonSet -n ssdl-vgpu | grep k8s-vgpu
k8s-vgpu-nvidia-device-plugin                           1         1         1       1            1           <none>                                                                      6h36m
k8s-vgpu-nvidia-device-plugin-mps-control-daemon        1         1         1       1            1           nvidia.com/mps.capable=true                                                 6h36m

Labeling target nodes to enable MPS

From the output above, the DaemonSet k8s-vgpu-nvidia-device-plugin-mps-control-daemon is running and uses the node selector nvidia.com/mps.capable=true, so we label the target nodes to enable MPS on them.

k label no -l worker.garden.sapcloud.io/group=c32m384-v100g3 nvidia.com/mps.capable=true
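
Once the label is applied and the MPS control daemon is scheduled onto the node, the node's allocatable nvidia.com/gpu count should reflect the replicas setting. A quick check (the node name is a placeholder) might look like:

```bash
# Nodes labeled for MPS
kubectl get nodes -l nvidia.com/mps.capable=true

# With replicas: 4 and three physical V100s per node (GPUs 0-2 in the
# nvidia-smi output below), nvidia.com/gpu should show 12 allocatable devices
kubectl describe node <node-name> | grep -A10 "Allocatable:"
```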

Verifying the Result

MPS control daemon and device plugin logs
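
The logs below can be pulled straight from the two DaemonSets, for example (add -c <container> if a pod has more than one container):

```bash
# MPS control daemon logs
kubectl logs -n ssdl-vgpu ds/k8s-vgpu-nvidia-device-plugin-mps-control-daemon --timestamps

# Device plugin logs
kubectl logs -n ssdl-vgpu ds/k8s-vgpu-nvidia-device-plugin
```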

MPS control daemon logs
2024-04-23T06:04:59.093748766Z I0423 06:04:59.092447      55 main.go:78] Starting NVIDIA MPS Control Daemon 435bfb70
2024-04-23T06:04:59.093839331Z commit: 435bfb70a44b74daca23fe957a0f256afaa3c51e
2024-04-23T06:04:59.093853836Z I0423 06:04:59.092736      55 main.go:55] "Starting NVIDIA MPS Control Daemon" version=<
2024-04-23T06:04:59.093867206Z 	435bfb70
2024-04-23T06:04:59.093878874Z 	commit: 435bfb70a44b74daca23fe957a0f256afaa3c51e
2024-04-23T06:04:59.093889630Z  >
2024-04-23T06:04:59.093900576Z I0423 06:04:59.092874      55 main.go:107] Starting OS watcher.
2024-04-23T06:04:59.094030621Z I0423 06:04:59.093463      55 main.go:121] Starting Daemons.
2024-04-23T06:04:59.094062157Z I0423 06:04:59.093551      55 main.go:164] Loading configuration.
2024-04-23T06:04:59.094686515Z I0423 06:04:59.094616      55 main.go:172] Updating config with default resource matching patterns.
2024-04-23T06:04:59.094774486Z I0423 06:04:59.094718      55 main.go:183] 
2024-04-23T06:04:59.094792480Z Running with config:
2024-04-23T06:04:59.094796229Z {
2024-04-23T06:04:59.094799702Z   "version": "v1",
2024-04-23T06:04:59.094802996Z   "flags": {
2024-04-23T06:04:59.094806479Z     "migStrategy": "none",
2024-04-23T06:04:59.094809741Z     "failOnInitError": true,
2024-04-23T06:04:59.094813226Z     "nvidiaDriverRoot": "/run/nvidia/driver/",
2024-04-23T06:04:59.094816715Z     "gdsEnabled": null,
2024-04-23T06:04:59.094820702Z     "mofedEnabled": null,
2024-04-23T06:04:59.094823952Z     "useNodeFeatureAPI": null,
2024-04-23T06:04:59.094827235Z     "plugin": {
2024-04-23T06:04:59.094830383Z       "passDeviceSpecs": null,
2024-04-23T06:04:59.094833529Z       "deviceListStrategy": [
2024-04-23T06:04:59.094836708Z         "envvar"
2024-04-23T06:04:59.094840338Z       ],
2024-04-23T06:04:59.094844148Z       "deviceIDStrategy": "uuid",
2024-04-23T06:04:59.094847589Z       "cdiAnnotationPrefix": null,
2024-04-23T06:04:59.094850905Z       "nvidiaCTKPath": null,
2024-04-23T06:04:59.094854382Z       "containerDriverRoot": null
2024-04-23T06:04:59.094857781Z     }
2024-04-23T06:04:59.094861393Z   },
2024-04-23T06:04:59.094870585Z   "resources": {
2024-04-23T06:04:59.094873870Z     "gpus": [
2024-04-23T06:04:59.094877070Z       {
2024-04-23T06:04:59.094882321Z         "pattern": "*",
2024-04-23T06:04:59.094885531Z         "name": "nvidia.com/gpu"
2024-04-23T06:04:59.094888738Z       }
2024-04-23T06:04:59.094891938Z     ]
2024-04-23T06:04:59.094895130Z   },
2024-04-23T06:04:59.094898375Z   "sharing": {
2024-04-23T06:04:59.094901556Z     "timeSlicing": {},
2024-04-23T06:04:59.094904765Z     "mps": {
2024-04-23T06:04:59.094908027Z       "failRequestsGreaterThanOne": true,
2024-04-23T06:04:59.094911202Z       "resources": [
2024-04-23T06:04:59.094914378Z         {
2024-04-23T06:04:59.094917668Z           "name": "nvidia.com/gpu",
2024-04-23T06:04:59.094920946Z           "devices": "all",
2024-04-23T06:04:59.094924462Z           "replicas": 4
2024-04-23T06:04:59.094927756Z         }
2024-04-23T06:04:59.094931025Z       ]
2024-04-23T06:04:59.094934399Z     }
2024-04-23T06:04:59.094937659Z   }
2024-04-23T06:04:59.094940881Z }
2024-04-23T06:04:59.094944445Z I0423 06:04:59.094737      55 main.go:187] Retrieving MPS daemons.
2024-04-23T06:04:59.367188348Z I0423 06:04:59.366922      55 daemon.go:93] "Staring MPS daemon" resource="nvidia.com/gpu"
2024-04-23T06:04:59.443650558Z I0423 06:04:59.443468      55 daemon.go:131] "Starting log tailer" resource="nvidia.com/gpu"
2024-04-23T06:04:59.445568425Z [2024-04-23 06:04:59.389 Control    72] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
2024-04-23T06:04:59.445618049Z [2024-04-23 06:04:59.389 Control    72] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
2024-04-23T06:04:59.445627888Z [2024-04-23 06:04:59.415 Control    72] Accepting connection...
2024-04-23T06:04:59.445635791Z [2024-04-23 06:04:59.415 Control    72] NEW UI
2024-04-23T06:04:59.445643954Z [2024-04-23 06:04:59.415 Control    72] Cmd:set_default_device_pinned_mem_limit 0 4096M
2024-04-23T06:04:59.445651895Z [2024-04-23 06:04:59.415 Control    72] UI closed
2024-04-23T06:04:59.445659400Z [2024-04-23 06:04:59.418 Control    72] Accepting connection...
2024-04-23T06:04:59.445667180Z [2024-04-23 06:04:59.418 Control    72] NEW UI
2024-04-23T06:04:59.445674877Z [2024-04-23 06:04:59.418 Control    72] Cmd:set_default_device_pinned_mem_limit 1 4096M
2024-04-23T06:04:59.445682790Z [2024-04-23 06:04:59.419 Control    72] UI closed
2024-04-23T06:04:59.445690190Z [2024-04-23 06:04:59.439 Control    72] Accepting connection...
2024-04-23T06:04:59.445697598Z [2024-04-23 06:04:59.439 Control    72] NEW UI
2024-04-23T06:04:59.445705544Z [2024-04-23 06:04:59.439 Control    72] Cmd:set_default_device_pinned_mem_limit 2 4096M
2024-04-23T06:04:59.445713416Z [2024-04-23 06:04:59.439 Control    72] UI closed
2024-04-23T06:04:59.445720849Z [2024-04-23 06:04:59.442 Control    72] Accepting connection...
2024-04-23T06:04:59.445728295Z [2024-04-23 06:04:59.442 Control    72] NEW UI
2024-04-23T06:04:59.445735742Z [2024-04-23 06:04:59.442 Control    72] Cmd:set_default_active_thread_percentage 25
2024-04-23T06:04:59.445749210Z [2024-04-23 06:04:59.442 Control    72] 25.0
2024-04-23T06:04:59.445757584Z [2024-04-23 06:04:59.442 Control    72] UI closed
2024-04-23T06:05:26.660115113Z [2024-04-23 06:05:26.659 Control    72] Accepting connection...
2024-04-23T06:05:26.660166122Z [2024-04-23 06:05:26.659 Control    72] NEW UI
2024-04-23T06:05:26.660179349Z [2024-04-23 06:05:26.659 Control    72] Cmd:get_default_active_thread_percentage
2024-04-23T06:05:26.660189426Z [2024-04-23 06:05:26.659 Control    72] 25.0
Nvidia-device-plugin logs
I0423 06:04:56.479370      39 main.go:279] Retrieving plugins.
I0423 06:04:56.480927      39 factory.go:104] Detected NVML platform: found NVML library
I0423 06:04:56.480999      39 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0423 06:04:56.558122      39 main.go:301] Failed to start plugin: error waiting for MPS daemon: error checking MPS daemon health: failed to send command to MPS daemon: exit status 1
I0423 06:04:56.558152      39 main.go:208] Failed to start one or more plugins. Retrying in 30s...
I0423 06:05:26.587365      39 main.go:315] Stopping plugins.
I0423 06:05:26.587429      39 server.go:185] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0423 06:05:26.587496      39 main.go:200] Starting Plugins.
I0423 06:05:26.587508      39 main.go:257] Loading configuration.
I0423 06:05:26.588327      39 main.go:265] Updating config with default resource matching patterns.
I0423 06:05:26.588460      39 main.go:276] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/run/nvidia/driver/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {},
    "mps": {
      "failRequestsGreaterThanOne": true,
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 4
        }
      ]
    }
  }
}
I0423 06:05:26.588478      39 main.go:279] Retrieving plugins.
I0423 06:05:26.588515      39 factory.go:104] Detected NVML platform: found NVML library
I0423 06:05:26.588559      39 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0423 06:05:26.660603      39 server.go:176] "MPS daemon is healthy" resource="nvidia.com/gpu"
I0423 06:05:26.661346      39 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I0423 06:05:26.662857      39 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0423 06:05:26.673960      39 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

E2E Verification

✅ Successful case -> cuda-samples/vectorAdd

The cuda-samples vectorAdd logs show that MPS has been enabled successfully; a minimal example pod spec is sketched below, followed by the observed logs.
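
A workload consuming an MPS share simply requests nvidia.com/gpu: 1 as usual; the manifest below is an illustrative sketch (the pod name and sample image are assumptions, not the exact manifest used here). Note that with failRequestsGreaterThanOne: true, requests larger than one share are rejected.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vectoradd-mps-test   # illustrative name
  namespace: ssdl-vgpu
spec:
  restartPolicy: OnFailure
  containers:
    - name: vectoradd
      # assumed CUDA vectorAdd sample image
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1   # one MPS share (1/4 of a physical GPU with replicas: 4)
```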

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Corresponding nvidia-smi output


+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1127738      C   nvidia-cuda-mps-server                       30MiB |
|    0   N/A  N/A   1397691    M+C   /cuda-samples/vectorAdd                      18MiB |
|    0   N/A  N/A   1397696    M+C   /cuda-samples/vectorAdd                      10MiB |
|    1   N/A  N/A   1127738      C   nvidia-cuda-mps-server                       30MiB |
|    1   N/A  N/A   1397689    M+C   /cuda-samples/vectorAdd                      94MiB |
|    1   N/A  N/A   1397714    M+C   /cuda-samples/vectorAdd                      10MiB |
|    2   N/A  N/A   1127738      C   nvidia-cuda-mps-server                       30MiB |
|    2   N/A  N/A   1397712    M+C   /cuda-samples/vectorAdd                      10MiB |
+---------------------------------------------------------------------------------------+

❌ Failed case -> tf-notebook tensorflow/tensorflow:latest-gpu-jupyter

I tried to run this classification.ipynb notebook as an end-to-end test, but the pod hung for 60 minutes with no response, and the logs did not reveal any further detail.

Log output from classification.ipynb

2024-04-25 02:10:19.499819: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[I 2024-04-25 02:10:20.630 ServerApp] Connecting to kernel 562db3f6-327c-4adc-83e9-735fb4d0042c.
[I 2024-04-25 02:10:36.700 ServerApp] Connecting to kernel 562db3f6-327c-4adc-83e9-735fb4d0042c.

Here I tried echo get_default_active_thread_percentage | nvidia-cuda-mps-control (from https://github.com/NVIDIA/k8s-device-plugin/issues/647); everything looked normal, but the hang persisted.

root@gpu-pod:/# echo get_default_active_thread_percentage | nvidia-cuda-mps-control
25.0
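
A couple of further checks that may help narrow down such a hang are listed below; these are suggested next steps under the assumption that the container can reach the MPS pipe directory, not a confirmed diagnosis:

```bash
# Inside the pod: see how the MPS pipe directory is exposed to the container,
# either via an environment variable or a mount at the CUDA default location
env | grep -i CUDA_MPS
ls -l /tmp/nvidia-mps

# On the node: confirm the nvidia-cuda-mps-server process is alive and
# whether the TensorFlow process ever attaches as an M+C client
nvidia-smi
```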

Related GitHub Issues

MPS Reference Documentation