GPU, CUDA12, Docker and Tensorflow

Whilst AI has grabbed the headlines with tools like ChatGPT and Midjourney, there is more to it than that.  Using your own data and local services, you can train your own models to gain insights from your own tailored data sets.

We announced our vGPU-driven cluster here at the end of last year, and while most of the current use cases revolve around accelerated desktop and RDS for graphics capability, I am more interested in AI/ML-type workloads and how we can use the GPU profiles for those.

Some time ago we had an internal development project based on a workstation with multiple graphics cards, Linux and Docker. While it worked well, it wasn't quite ready for prime time. So how do we fare on a 'proper' solution?

The new cluster uses Dell servers with NVIDIA A40 GRID GPUs. We have enterprise licensing and access to the NVIDIA tools. What do we need to do differently to make use of this?

Firstly, we deploy a VM on the new cluster with a default vGPU profile, boot from the Ubuntu 22.04 ISO and start following the normal setup. Hint: the NVIDIA CUDA tools and their dependencies are quite large, so the boot disk needs to be a reasonable size (100GB).

The first change from a normal install comes when the Ubuntu installer offers to install third-party drivers for the graphics hardware it has detected.

The answer here is "NO". The open-source drivers may look like they detect the hardware, but if they try to run they fail, as they do not support the A40 GPU. So let the install proceed without them; we will add the licensed driver and license token shortly.

After rebooting into the new VM, the first thing we do is install dkms.

apt install -y dkms

This is the Dynamic Kernel Module Support system. It installs a bundle of compiler and build tools so that the NVIDIA kernel module can be built to match whatever kernel version you are currently running.
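There is nothing for dkms to build yet, but the check is worth remembering for later: once the GRID driver is installed, dkms status should list an nvidia module built against the running kernel.

# show the running kernel and any dkms-managed modules built for it
uname -r
dkms status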

The next step is to mount the guest drivers on the VM via an ISO. We have an ISO loaded into MyCloudSpace which contains the guest packages and our NVIDIA license token, so for obvious reasons I'm not posting that here. The guest drivers are available to anyone with access to MCS or the NVIDIA licensing portal.

root@prs-gpu01:~# mount /dev/cdrom /mnt
mount: /mnt: WARNING: source write-protected, mounted read-only.
root@prs-gpu01:~# ls /mnt/guest_drivers
52824_grid_win10_win11_ser.exe nvidia-linux-grid-525_5258.deb nvidia-linux-grid-525-5258.rpm nvidia-linux-x86_64-525850.run
root@prs-gpu01:~# ls /mnt/client_configuration_token.tok
/mnt/client_configuration_token.tok

We need to install the driver, copy the token to /etc/nvidia/ClientConfigToken/, and edit gridd.conf to enable the appropriate feature type:

dpkg -i /mnt/guest_drivers/nvidia-linux-grid-525_5258.deb
cp /mnt/client_configuration_token.tok /etc/nvidia/ClientConfigToken/
sed -i 's/FeatureType=0/FeatureType=4/g' /etc/nvidia/gridd.conf

Setting FeatureType=4 means we want to be licensed as a 'Virtual Compute Server' rather than unlicensed. At this point we either restart nvidia-gridd or reboot, and then do a couple of basic checks to see that the vGPU is working and licensed.
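If you'd rather not reboot, restarting the licensing daemon is enough; nvidia-gridd is the systemd service that ships with the GRID driver, and it logs the licensing result.

# pick up the new FeatureType and the client token without a reboot
sudo systemctl restart nvidia-gridd

# the licensing attempt is logged by the daemon
sudo journalctl -u nvidia-gridd --since "10 minutes ago"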

sysadmin@prs-gpu01:~$ nvidia-smi -q | egrep '(Product|License)'
    Product Name                          : NVIDIA A40-8Q
    Product Brand                         : NVIDIA RTX Virtual Workstation
    Product Architecture                  : Ampere
    vGPU Software Licensed Product
        Product Name                      : NVIDIA RTX Virtual Workstation
        License Status                    : Licensed (Expiry: 2023-3-5 20:3:8 GMT)

Well, that's a great start. We have a Linux VM with the correct NVIDIA GRID drivers installed, and we can see the A40 vGPU profile passed through from the underlying cluster.

But now we need the tools to work. Another gotcha is that if you try to install, for example, libcuda from the Ubuntu repositories, it will uninstall the GRID drivers and install the open-source drivers that we skipped at install time. The same happens if you install the cuda packages from the NVIDIA site. In fact, NVIDIA do post a handy warning, buried deep in the GRID vGPU User Guide.
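One defensive step I take (my own habit, not something from the NVIDIA guide) is to hold the GRID driver package so a stray apt upgrade can't silently replace it; the package name below is assumed from the .deb we installed, so confirm it on your system first.

# stop apt from swapping the licensed GRID driver for the open-source packages
# (confirm the package name with: dpkg -l | grep nvidia-linux-grid)
sudo apt-mark hold nvidia-linux-grid-525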

So we need to grab the runfile package and install that.

wget https://developer.download.nvidia.com/compute/cuda/12.0.1/local_installers/cuda_12.0.1_525.85.12_linux.run

sudo sh cuda_12.0.1_525.85.12_linux.run

This gives you the option to deselect the bundled driver and install just the CUDA toolkit.
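If you'd rather not click through the menu, the runfile also accepts flags for a non-interactive, toolkit-only install (these are the documented runfile options, but check --help against your version).

# toolkit only, no driver: leaves the existing GRID driver untouched
sudo sh cuda_12.0.1_525.85.12_linux.run --silent --toolkit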

Let that complete and note the output. I found that the relevant library configuration had been installed in /etc/ld.so.conf.d/cuda-12-0.conf, and the demo tools work.
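The runfile doesn't touch your shell profile, so as a quick sanity check I add the toolkit to PATH and confirm both the toolkit and the driver respond (the /usr/local/cuda-12.0 prefix is the default install location; adjust if you changed it).

# make the CUDA 12.0 toolkit visible in this shell (default install prefix assumed)
export PATH=/usr/local/cuda-12.0/bin:$PATH

# toolkit and driver should both respond
nvcc --version
nvidia-smi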

Now we just need to give it a thrash. All the cool kids use TensorFlow and Keras, right?

However, there are a heap of dependency issues between the standard Ubuntu repositories, the PyPI pip modules and the NVIDIA GRID drivers.

Some of these use the NVIDIA-specific packages that allow us to use the newer hardware, but with tensorrt pinned to older 7.2.x versions.

There are also mentions of tensorflow-gpu as a separate package, but that has all been rolled up into the main package, and as long as we are using TensorFlow 2.11 it should have full GPU and Python 3.10 support (as of Feb 2023).

But even then, the normal tensorflow (2.11) package is still linked to CUDA 11.2 and TensorRT 7.2.x, so we need to use tf-nightly. But that does not find the NVIDIA CUDA drivers!
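For reference, this is the check I keep coming back to when trying any of these combinations: install a candidate wheel and ask TensorFlow what GPUs it can actually see (an empty list means it could not find a usable CUDA/cuDNN stack).

# stock TensorFlow 2.11 from PyPI, then list the GPUs it can see
pip install "tensorflow==2.11.*"
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"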

To use the latest drivers and TensorFlow, and to make use of the Ampere GPU, we need to use the NVIDIA NGC Hub, and from here on using the Docker images is much easier.

So we install docker and the NVIDIA Container Toolkit as per the docs.

....

We get the expected output...
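For anyone following along, the steps boil down to something like the below. This is a sketch from memory of the Docker and Container Toolkit docs as they stood at the time (the apt repository setup is omitted, and the CUDA image tag is my assumption), so treat the official docs as authoritative.

# Docker from the Ubuntu repos plus the NVIDIA Container Toolkit
# (assumes the NVIDIA container toolkit apt repository has already been added per the docs)
sudo apt install -y docker.io nvidia-container-toolkit

# wire the NVIDIA runtime into Docker and restart it
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# verification: nvidia-smi from inside a CUDA base container should show the A40-8Q profile
sudo docker run --rm --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi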

Running the TensorFlow image pulls a heap of layers down. Again, we need to pick an image where the container's bundled driver components match the host driver.

23.01-tf2-py3 is no good with the host driver we have installed.

sudo docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm nvcr.io/nvidia/tensorflow:22.12-tf2-py3

Use the 22.12 image to ensure the driver components in the container match the host driver. The downside for us as a service provider is that if we update the hosts, then customer VMs will also need to be updated.
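A quick way to check what driver version the guest is running before picking an NGC tag (standard nvidia-smi query options):

# report the driver version that the container's CUDA libraries need to match
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader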

My Keras training info originally came from here. But that Docker image is five years old; that's one of the pitfalls of relying on 3rd-party images that go out of date. However, we have redone all the hard work that was in that image. We just want to use the same training, since I want to compare the new platform against our older dev platforms. So we use the same Python script that grabs test data and does a short training run.

In the container, use wget to grab the script above and execute it.

Training took ~7.3 seconds.   That's for a container, on a VM, on a host.

This is faster than the high-end multi-graphics-card platform we had, and this vGPU is faster than the last generation (P40: 24GB, 3840 CUDA cores, training ~20s) from when I last ran the test with access to the full hardware (Docker container directly on hardware).

It's also faster than TensorFlow 1 with CUDA 11.2 on a VM (without the Docker layer) on the same hardware.

So while the vGPU offering has less memory (8GB), it has roughly three times the CUDA cores (10,752) and training takes less than half the time. I'd call that a win.

Using the latest NVIDIA GRID drivers, whilst a bit of a version minefield, does yield noticeable performance benefits.