Using an NVIDIA GPU with k3s

k3s

2024-09-14

#Contents

Introduction
Environment
Installing the NVIDIA Container Toolkit
Installing k3s
Summary

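#Introduction

This post describes how to use an NVIDIA GPU with k3s.
I worked through the setup following the official documentation but got stuck along the way, so these are my notes from that process.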

#Environment

OS: Ubuntu 24.04

GPU: NVIDIA GeForce RTX 2060

This is a physical machine, not a VM.

#Installing the NVIDIA Container Toolkit

I followed the linked installation guide.

Configure the apt repository:

# Configure the production repository:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
 && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
 sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
 sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Optionally, configure the repository to use experimental packages:
sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Update the packages list from the repository:
sudo apt-get update

Install the NVIDIA Container Toolkit and related packages:

sudo apt install -y nvidia-container-runtime cuda-drivers-fabricmanager-535 nvidia-driver-535 nvidia-utils-535

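nvidia-container-runtime is the runtime that lets containers use the GPU.

cuda-drivers-fabricmanager-535 appears to be a CUDA-related package, but I am not sure of the details. NVIDIA Fabric Manager is generally associated with NVSwitch-based multi-GPU systems, so it may not be needed for a single GeForce card.

nvidia-driver-535 is the NVIDIA driver.

nvidia-utils-535 contains utilities such as nvidia-smi.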

The k3s documentation specifies nvidia-headless-515-server; the -headless and -server variants appear to be intended for GPU compute use.
The following link explains the difference.

What is the NVIDIA Server Driver? (askubuntu.com)

Since I also want to use this machine as a desktop, I chose nvidia-driver-535.
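Before moving on to containers, it can help to confirm that the host is actually running the driver branch that was just installed. The following is a minimal sketch, not part of the original procedure: the function names driver_major and check_host_driver are hypothetical, and the default branch of 535 is assumed from the packages above. It relies on the nvidia-smi --query-gpu option.

```shell
#!/usr/bin/env bash
# Extract the major version from a driver version string,
# e.g. 535.183.01 -> 535.
driver_major() {
  echo "${1%%.*}"
}

# Compare the host driver branch with the expected one.
# The default of 535 is an assumption taken from the packages installed above.
check_host_driver() {
  local expected="${1:-535}" version
  version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
  if [ "$(driver_major "$version")" = "$expected" ]; then
    echo "ok: driver $version"
  else
    echo "mismatch: driver $version, expected branch $expected"
    return 1
  fi
}
```

After sourcing the snippet, running check_host_driver 535 on the host should report either a match or a mismatch. A reboot may be needed after installing the driver before nvidia-smi responds.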

Following the linked guide, confirm that nvidia-smi can be run from Docker.

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Sat Sep 14 11:46:13 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060        Off | 00000000:26:00.0 Off |                  N/A |
| 29%   34C    P8               9W / 160W |    222MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

#Installing k3s

I followed the k3s documentation for the installation.

curl -sfL https://get.k3s.io | K3S_KUBECONFIG_MODE="600" sh -

Check that k3s has detected the NVIDIA container runtime.

sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
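The same check can be wrapped in a small function for use in a setup script. This is only a sketch: the function name nvidia_runtime_status is hypothetical, and the pattern matches the config layout shown above, which may differ in other k3s or containerd versions.

```shell
#!/usr/bin/env bash
# Print "found" if the given containerd config declares an "nvidia" runtime;
# otherwise print "missing" and return non-zero.
# The default path is the k3s-generated config referenced above.
nvidia_runtime_status() {
  local config="${1:-/var/lib/rancher/k3s/agent/etc/containerd/config.toml}"
  if grep -qF 'containerd.runtimes."nvidia"' "$config"; then
    echo "found"
  else
    echo "missing"
    return 1
  fi
}
```

The k3s config file is normally readable only by root, so the function has to run in a root shell (or be pointed at a copy of the file) to inspect the real config.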

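After this point, the k3s documentation describes configuring the runtime and creating a Pod to verify it, but those steps did not work in my case. Instead, following the linked article, I deployed the NVIDIA GPU Operator, and that made the GPU usable.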

export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

# first, install the helm utility
sudo snap install helm

# we first install the helm repo for nvidia and update
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update

# install the operator
helm install --wait nvidiagpu \
    -n gpu-operator --create-namespace \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true \
    nvidia/gpu-operator
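Once the operator's device plugin is running, GPU nodes should advertise an nvidia.com/gpu resource, the same resource name that Pods use in their resource limits. A rough way to confirm this is sketched here. The function names sum_gpu_counts and cluster_gpu_total are hypothetical, and the sketch assumes kubectl can reach the cluster through the KUBECONFIG set earlier.

```shell
#!/usr/bin/env bash
# Sum a whitespace-separated list of per-node GPU counts, e.g. "1 2" -> 3.
# An empty list yields 0.
sum_gpu_counts() {
  local total=0 n
  for n in $1; do
    total=$((total + n))
  done
  echo "$total"
}

# Print the total number of nvidia.com/gpu resources advertised by all nodes.
# Requires kubectl and a reachable cluster.
cluster_gpu_total() {
  local counts
  counts=$(kubectl get nodes \
    -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}')
  sum_gpu_counts "$counts"
}
```

If cluster_gpu_total prints 0, the device plugin has probably not registered the GPU yet. Giving the operator Pods a few minutes to settle is a reasonable first step before digging further.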

Create a Pod to verify that the GPU works.

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
EOF

Check the logs. It appears to be working properly.

kubectl logs nbody-gpu-benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
	-fullscreen       (run n-body simulation in fullscreen mode)
	-fp64             (use double precision floating point values for simulation)
	-hostmem          (stores simulation data in host memory)
	-benchmark        (run benchmark to measure performance)
	-numbodies=<N>    (number of bodies (>= 1) to run in simulation)
	-device=<d>       (where d=0,1,2.... for the CUDA device to use)
	-numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
	-compare          (compares simulation results running once on the default GPU and once on the CPU)
	-cpu              (run n-body simulation on the CPU)
	-tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Turing" with compute capability 7.5

> Compute 7.5 CUDA device: [NVIDIA GeForce RTX 2060]
30720 bodies, total time for 10 iterations: 49.252 ms
= 191.609 billion interactions per second
= 3832.171 single-precision GFLOP/s at 20 flops per interaction
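For repeated runs, the headline figure can be pulled out of the logs automatically. This is a sketch under some assumptions: the function names extract_gflops and benchmark_gflops are hypothetical, the 120-second timeout is arbitrary, and the log format is taken to match the output above. It uses kubectl wait with a jsonpath condition, which needs a reasonably recent kubectl.

```shell
#!/usr/bin/env bash
# Read nbody benchmark logs on stdin and print the reported GFLOP/s figure.
extract_gflops() {
  awk '/GFLOP\/s/ { print $2; exit }'
}

# Wait for the benchmark Pod to finish, then print its GFLOP/s result.
# Requires kubectl and a reachable cluster; the timeout value is an assumption.
benchmark_gflops() {
  local pod="${1:-nbody-gpu-benchmark}"
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded "pod/$pod" \
    --timeout=120s >/dev/null &&
    kubectl logs "$pod" | extract_gflops
}
```

As a sanity check on the numbers, 191.609 billion interactions per second at 20 flops per interaction works out to roughly 3832 GFLOP/s, which is consistent with the logged value.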

The gpu-operator Pods look like this:

kubectl get po -n gpu-operator
NAME                                                      READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-7b6r9                               1/1     Running     1          16m
gpu-operator-565fdd79b9-26wgt                             1/1     Running     0          17m
nvidia-container-toolkit-daemonset-stw5b                  1/1     Running     0          16m
nvidia-cuda-validator-fn7k9                               0/1     Completed   0          16m
nvidia-dcgm-exporter-c5x2v                                1/1     Running     1          16m
nvidia-device-plugin-daemonset-2k976                      1/1     Running     1          16m
nvidia-operator-validator-v78ns                           1/1     Running     0          16m
nvidiagpu-node-feature-discovery-gc-55bdc4bcc9-9g28d      1/1     Running     0          17m
nvidiagpu-node-feature-discovery-master-7bdb8df6b-j7kmz   1/1     Running     0          17m
nvidiagpu-node-feature-discovery-worker-shcnr             1/1     Running     0          17m
nvidiagpu-node-feature-discovery-worker-vb4hc             1/1     Running     0          17m
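When something goes wrong, it is handy to list only the Pods that are not in a healthy state. This is a minimal sketch: the function name unhealthy_pods is hypothetical, and treating Running and Completed as the only healthy states is a simplification.

```shell
#!/usr/bin/env bash
# Read `kubectl get pods --no-headers` output on stdin and print the names of
# Pods whose STATUS column is neither Running nor Completed.
unhealthy_pods() {
  awk 'NF >= 3 && $3 != "Running" && $3 != "Completed" { print $1 }'
}
```

One possible usage is kubectl get po -n gpu-operator --no-headers | unhealthy_pods, which prints nothing when every Pod looks like the listing above.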

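#Summary

This post described how to use an NVIDIA GPU with k3s.
Using the GPU requires installing the NVIDIA Container Toolkit, and because this is not something you do often, it is easy to stumble.
Now that the GPU is usable on k3s, I plan to move on to deploying AI models and similar workloads.
There is a lot I want to try, including Dify, ollama, and Whisper.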