Installing NVIDIA vGPU 16.5 on ESXi 8 with a Tesla M60
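Tip:
A5000/A6000 cards ship in display mode by default. To use vGPU, the card must first be switched to displayless mode.
Switching modes requires NVIDIA's displaymodeselector tool, which runs only on Windows and Linux; it cannot be run directly on ESXi. If you are on ESXi, either move the card into a physical Windows/Linux machine or pass it through to a Windows/Linux virtual machine, switch the mode there, and then install vGPU.
Tool download
# displaymodeselector tool, Windows version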
https://share.feijipan.com/s/fqFt9mRH
# displaymodeselector tool, Linux version
https://share.feijipan.com/s/AFFt9s8c
Switching modes with displaymodeselector (Windows example)
In a CMD window, run the following command:
.\displaymodeselector.exe --gpumode
Then follow the prompts to turn display mode off. Once it is off, you can proceed with the vGPU installation.
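Installing NVIDIA vGPU
1) In the ESXi web interface, enable the SSH service and put the host into maintenance mode (power off the virtual machines running on the host first, otherwise it cannot enter maintenance mode). The driver is installed later from the ESXi shell over SSH.
2) Extract the downloaded NVIDIA vGPU driver package and upload nvd-gpu-mgmt-daemon_535.161.05-0.0.0000_23230587.zip and NVD-VGPU-800_535.161.05-1OEM.800.1.0.20613240_23233605.zip from the Host_Drivers folder to /tmp on the ESXi host.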
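The upload in step 2 can be done from a workstation with scp once SSH is enabled on the host. The following is a minimal sketch, not part of the original walkthrough: the function name upload_bundles and the variables ESXI_HOST and SRC_DIR are hypothetical, and the default address is simply the one that appears in the session log later in this post.

```shell
#!/bin/sh
# Hypothetical helper: copy the two host driver bundles to an ESXi host.
# ESXI_HOST and SRC_DIR are assumptions; adjust them to your environment.
ESXI_HOST="${ESXI_HOST:-10.10.10.251}"
SRC_DIR="${SRC_DIR:-./Host_Drivers}"

upload_bundles() {
    # Refuse to continue if either bundle is missing locally.
    # An unmatched glob stays literal, so the -f test fails for it.
    for f in "$SRC_DIR"/NVD-VGPU-800_*.zip "$SRC_DIR"/nvd-gpu-mgmt-daemon_*.zip; do
        if [ ! -f "$f" ]; then
            echo "missing bundle: $f" >&2
            return 1
        fi
    done
    # SSH must already be enabled on the host for scp to work.
    scp "$SRC_DIR"/NVD-VGPU-800_*.zip "$SRC_DIR"/nvd-gpu-mgmt-daemon_*.zip \
        "root@$ESXI_HOST:/tmp/"
}
```

Run `upload_bundles` from the directory that contains the extracted driver package, or set SRC_DIR to point at the Host_Drivers folder.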
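3) Log in to the ESXi host over SSH and install the NVIDIA vGPU driver
# Install NVD-VGPU-800_535.161.05-1OEM.800.1.0.20613240_23233605.zip
# For vGPU 13.x, a single command is enough:
esxcli software vib install -d /tmp/NVD-VGPU*.zip
# For vGPU 15.x and later, two packages must be installed, so two commands are needed:
esxcli software vib install -d /tmp/NVD-VGPU*.zip
esxcli software vib install -d /tmp/nvd-gpu-mgmt-daemon*.zip
# Install nvd-gpu-mgmt-daemon_535.161.05-0.0.0000_23230587.zip
# (the recorded session below applies both packages with "esxcli software component apply")
esxcli software component apply -d /tmp/nvd-gpu-mgmt-daemon*.zip
# Reboot once the installation has finished
reboot
The installation session looked like this: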
Last login: Sat Aug 17 23:47:48 on ttys001
july@JulysiMac ~ % ssh root@10.10.10.251
(root@10.10.10.251) Password:
The time and date of this login have been sent to the system logs.
WARNING:
All commands run on the ESXi shell are logged and may be included in
support bundles. Do not provide passwords directly on the command line.
Most tools can prompt for secrets or accept them from standard input.
VMware offers powerful and supported automation tools. Please
see https://developer.vmware.com for details.
The ESXi Shell can be disabled by an administrative user. See the
vSphere Security documentation for more information.
[root@ESXI8:~] ls /tmp
NVD-VGPU-800_535.161.05-1OEM.800.1.0.20613240_23233605.zip vmware-root
nvd-gpu-mgmt-daemon_535.161.05-0.0.0000_23230587.zip vmware-uid_0
[root@ESXI8:~] esxcli software component apply -d /tmp/NVD-VGPU*.zip
Installation Result
Message: Operation finished successfully.
Components Installed: NVD-VGPU-800_535.161.05-1OEM.800.1.0.20613240
Components Removed:
Components Skipped:
Reboot Required: false
DPU Results:
[root@ESXI8:~] esxcli software component apply -d /tmp/nvd-gpu-mgmt-daemon*.zip
Installation Result
Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
Components Installed: nvd-gpu-mgmt-daemon_535.161.05-0.0.0000
Components Removed:
Components Skipped:
Reboot Required: true
DPU Results:
[root@ESXI8:~] reboot
[root@ESXI8:~] Connection to 10.10.10.251 closed by remote host.
Connection to 10.10.10.251 closed.
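The log above shows that each esxcli run reports the installed component and whether a reboot is required. If you want to check that without reading the output by eye, a small filter along these lines may help. It is only a sketch: summarize_install is a hypothetical name, and it assumes the output keeps the "Components Installed:" and "Reboot Required:" labels seen in the log.

```shell
#!/bin/sh
# Hypothetical helper: summarize esxcli installation output read from stdin.
# Prints one "installed:" line per component and a final "reboot:" line.
summarize_install() {
    awk '
        /Components Installed:/ {
            # Strip the label and keep only the component name.
            sub(/^.*Components Installed:[ \t]*/, "")
            print "installed: " $0
        }
        /Reboot Required:[ \t]*true/ { reboot = 1 }
        END { print "reboot: " (reboot ? "yes" : "no") }
    '
}
```

For example, the output of both install commands could be saved with `tee /tmp/install.log` and then passed through `summarize_install < /tmp/install.log` (the log path is an assumption).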
4) After the reboot, run nvidia-smi to verify that the driver is working. The output should look similar to this:
[root@ESXI8:~] nvidia-smi
Sat Aug 17 16:14:41 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.05 Driver Version: 535.161.05 CUDA Version: N/A |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla M60 On | 00000000:03:00.0 Off | 0 |
| N/A 64C P8 26W / 150W | 25MiB / 7680MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla M60 On | 00000000:04:00.0 Off | 0 |
| N/A 50C P8 25W / 150W | 25MiB / 7680MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1050448 G Xorg 4MiB |
| 1 N/A N/A 1050467 G Xorg 4MiB |
+---------------------------------------------------------------------------------------+
5) ECC memory is enabled by default on the Tesla M60, and the A-series and B-series vGPU (mdev) profiles do not support ECC, so it has to be disabled here.
# nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Sat Aug 17 17:50:26 2024
Driver Version : 535.161.05
Attached GPUs : 2
GPU 00000000:03:00.0
[...]
Ecc Mode
Current : Enabled
Pending : Enabled
[...]
# Disable ECC memory. Note: enabling or disabling ECC only takes effect after the ESXi host is rebooted
nvidia-smi -e 0
# Reboot
reboot
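To confirm the ECC state before and after the change, the "Ecc Mode" section of `nvidia-smi -q` can be reduced to two key=value lines per GPU. This is a sketch under the assumption that the section looks like the excerpt above (a header line followed by Current and Pending lines); ecc_status is a hypothetical name, and the header match is case-insensitive in case the driver prints "ECC Mode".

```shell
#!/bin/sh
# Hypothetical helper: extract ECC state from `nvidia-smi -q` output on stdin.
# Prints "current=<state>" and "pending=<state>" for each GPU.
ecc_status() {
    awk '
        # Section header: expect the next two Current/Pending lines.
        tolower($0) ~ /ecc mode/ { in_ecc = 2; next }
        in_ecc > 0 && /Current|Pending/ {
            split($0, kv, ":")
            gsub(/[ \t]/, "", kv[1])
            gsub(/[ \t]/, "", kv[2])
            print tolower(kv[1]) "=" tolower(kv[2])
            in_ecc--
        }
    '
}
```

Usage: `nvidia-smi -q | ecc_status`. After `nvidia-smi -e 0` and before the reboot, the pending value should read disabled while the current value is still enabled.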
6) Once the driver is confirmed to be working, exit maintenance mode and add the ESXi host to vCenter for management.
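Exiting maintenance mode can also be done from the ESXi shell rather than the web interface. The sketch below ties it to the driver check from step 4: it only leaves maintenance mode if nvidia-smi runs successfully. The function name is hypothetical, and it assumes `esxcli system maintenanceMode get` prints Enabled or Disabled.

```shell
#!/bin/sh
# Hypothetical helper: leave maintenance mode only if the driver responds.
leave_maintenance_if_driver_ok() {
    # nvidia-smi exits non-zero when it cannot talk to the driver.
    if ! nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi failed; staying in maintenance mode" >&2
        return 1
    fi
    # Only change the state if the host is currently in maintenance mode.
    if [ "$(esxcli system maintenanceMode get)" = "Enabled" ]; then
        esxcli system maintenanceMode set --enable false
    fi
    # Print the resulting state for confirmation.
    esxcli system maintenanceMode get
}
```

Adding the host to vCenter is still done in the vCenter UI afterwards.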
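Assigning vGPUs in vCenter
1) All vGPU settings are made in vCenter, so the ESXi host must be added to vCenter first. Log in to vCenter and change both the graphics device and the host from the default Shared type (vSGA) to Shared Direct, the type used for NVIDIA vGPU.
Steps: vCenter → ESXi host → Configure → Graphics → Graphics Devices → Edit → Shared Direct → OK
Steps: vCenter → ESXi host → Configure → Graphics → Host Graphics → Edit → Shared Direct → OK
2) A vGPU device can then be added in the virtual machine's settings.
In "NVIDIA GRID vGPU grid_m60-1b", grid_m60-1b is the mdev (profile) name: grid_m60 is the card, 1 means 1 GB of framebuffer, and b stands for vPC.
The trailing letter means:
A = Virtual Applications (vApps): for virtual applications, shared desktops, and similar scenarios
B = Virtual Desktops (vPC): for virtual desktops running standard PC applications, browsers, and multimedia; common in office use
Q = Virtual Workstations (vWS): for professional graphics applications as well as AI, deep learning, and data science; best performance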
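The naming scheme described above can be decoded mechanically. The following sketch splits a profile name such as grid_m60-1b into its parts. parse_vgpu_profile is a hypothetical helper, and it only handles the simple form described here (a whole number of GB followed by one series letter), so profiles with other suffixes are outside its scope.

```shell
#!/bin/sh
# Hypothetical helper: decode a vGPU profile name like "grid_m60-1b".
parse_vgpu_profile() {
    profile=$1
    board=${profile%-*}      # part before the last "-", e.g. grid_m60
    suffix=${profile##*-}    # part after the last "-", e.g. 1b
    # Series letter: drop the digits and upper-case what remains.
    series=$(printf '%s' "$suffix" | tr -d '0-9' | tr 'a-z' 'A-Z')
    # Framebuffer size: keep only the digits.
    fb_gb=$(printf '%s' "$suffix" | tr -cd '0-9')
    case $series in
        A) kind="Virtual Applications (vApps)" ;;
        B) kind="Virtual Desktops (vPC)" ;;
        Q) kind="Virtual Workstations (vWS)" ;;
        *) kind="unknown" ;;
    esac
    echo "board=$board framebuffer=${fb_gb}GB series=$series type=$kind"
}

parse_vgpu_profile grid_m60-1b
# prints: board=grid_m60 framebuffer=1GB series=B type=Virtual Desktops (vPC)
```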
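3) Install the guest driver: copy the driver from the Guest_Drivers folder of the extracted NVIDIA vGPU package into the virtual machine and install it there (Windows 11 is used as the example).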