I just installed Linux on an old computer, one that used to be powerful and that has an Nvidia Quadro 2000D GPU. I wanted to see how to use the GPU to speed up computation done in a simple Python program. It took me some time and some hand-holding to get there. Let me share the journey and the results.

In a nutshell: using the GPU has overhead costs. If the computation is not heavy enough, then the cost (in time) of using a GPU might be larger than the gain. On the other hand, if the computation is heavy, you can see a huge improvement in speed.

Installation

Once I had Ubuntu 17.10 (desktop edition) installed, I went on to install all the Python stuff. Ubuntu already comes with Python 3. I installed virtualenv using apt, and then used pip to install all the Python modules I needed in the virtualenv. That did not work out well. When I ran the code that was supposed to use the GPU (see the code below), I got an error:

numba.cuda.cudadrv.error.NvvmSupportError: libNVVM cannot be found. Do `conda install cudatoolkit`:
library nvvm not found

OK. I deactivated the virtualenv and installed Miniconda.

Some stuff I needed for the rest of the installation:

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo aptitude install nvidia-cuda-dev
sudo aptitude install python3-dev

Install Miniconda:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc

Install the tools I needed:

conda install cudatoolkit
conda install numpy
conda install numba
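
A quick way to check that Numba can actually see the GPU is to ask it directly (the exact output depends on the driver and the card):

from numba import cuda

# True if Numba found the CUDA driver and at least one supported GPU
print(cuda.is_available())

# print the devices Numba detected (the Quadro 2000D in my case) and whether they are supported
cuda.detect()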

Once I managed to install everything, I wrote a short Python script based on this article.

examples/python/vector_addition_with_gpu.py

import numpy as np
from timeit import default_timer as timer
from numba import vectorize
import sys

if len(sys.argv) != 3:
    exit("Usage: " + sys.argv[0] + " [cuda|cpu] N(100000-11500000)")


@vectorize(["float32(float32, float32)"], target=sys.argv[1])
def VectorAdd(a, b):
    return a + b

def main():
    N = int(sys.argv[2])
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)

    start = timer()
    C = VectorAdd(A, B)
    elapsed_time = timer() - start
    #print("C[:5] = " + str(C[:5]))
    #print("C[-5:] = " + str(C[-5:]))
    print("Time: {}".format(elapsed_time))

main()

That was an utter disappointment. The GPU actually slowed down the operations, though if we look closely we can see that the slow-down ratio is much worse for the smaller vectors. I tried creating bigger and bigger arrays, but there is a limit to the size this GPU can handle. Whenever I supplied a number that created an array that was too big, I got an exception:

numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE

Even when I got close to that limit, the CPU was still a lot faster than the GPU.

$ python speed.py cpu 100000
Time: 0.0001056949986377731
$ python speed.py cuda 100000
Time: 0.11871792199963238

$ python speed.py cpu 11500000
Time: 0.013704434997634962
$ python speed.py cuda 11500000
Time: 0.47120747699955245
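
Part of that gap is overhead that has nothing to do with the addition itself: the first call to a @vectorize'd function also pays for JIT compilation, and every call to the CUDA version copies the input arrays to the GPU and the result back. As a rough sketch (not what I measured above), Numba lets you move the data to the device yourself and keep the transfers out of the timed section; I'm assuming here that CUDA ufuncs accept device arrays and an out parameter, as described in the Numba documentation:

from numba import cuda

d_A = cuda.to_device(A)                       # copy the inputs to the GPU once
d_B = cuda.to_device(B)
d_C = cuda.device_array(N, dtype=np.float32)  # pre-allocate the output on the GPU

VectorAdd(d_A, d_B, out=d_C)                  # warm-up call; also triggers the JIT compilation

start = timer()
VectorAdd(d_A, d_B, out=d_C)                  # now only the kernel launch itself is timed
cuda.synchronize()                            # wait for the GPU to finish before reading the clock
elapsed_time = timer() - start

C = d_C.copy_to_host()                        # bring the result back only when we need it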

In the meantime I was monitoring the GPU using nvidia-smi. Occasionally it showed that the Python process was running, but otherwise it was not very useful to me.

watch -n 0.5 nvidia-smi

I asked on Stack Overflow and got some good suggestions, especially from Ignacio Vergara Kausel. He pointed me to a notebook (apparently also linked from a comment on the original article I had skimmed) that showed a much better example. It even demonstrated the problem I was facing: the GPU slowing down the operations.

Based on that article, I created a new script that demonstrates the speed improvement.

examples/python/math_with_gpu.py

import numpy as np
from numba import vectorize
import math
from timeit import default_timer as timer

# the same function compiled for the CPU...
@vectorize(['float32(float32, int32)'], target='cpu')
def with_cpu(x, count):
    for _ in range(count):
        x = math.sin(x)
    return x

# ...and compiled for the CUDA (GPU) target
@vectorize(['float32(float32, int32)'], target='cuda')
def with_cuda(x, count):
    for _ in range(count):
        x = math.sin(x)
    return x

data = np.random.uniform(-3, 3, size=1000000).astype(np.float32)

for c in [1, 10, 100, 1000]:
    print(c)
    for f in [with_cpu, with_cuda]:
        start = timer()
        r = f(data, c)
        elapsed_time = timer() - start
        print("Time: {}".format(elapsed_time))

$ python math_with_gpu.py
1
Time: 0.011146817007102072
Time: 0.13940549999824725
10
Time: 0.12172191697754897
Time: 0.023090153001248837
100
Time: 1.2920606719853822
Time: 0.03889427299145609
1000
Time: 12.961911355989287
Time: 0.1976439240097534

That is, if we calculate the sine once, the CPU is still faster, but if we calculate it 10 times, the GPU is already faster. (I think they were roughly the same speed at around 6 iterations.)
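
For what it's worth, the break-even point can be probed by simply scanning the iteration count. This is just a sketch that reuses the with_cpu and with_cuda functions and the data array from the script above:

# scan the iteration counts to see where the GPU starts to win
for c in range(1, 21):
    start = timer()
    with_cpu(data, c)
    cpu_time = timer() - start

    start = timer()
    with_cuda(data, c)
    gpu_time = timer() - start

    print("{:2d} iterations  cpu: {:.4f}s  gpu: {:.4f}s".format(c, cpu_time, gpu_time))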

The ratios (CPU time divided by GPU time) at the 4 measurement points are:

 0.08
 5.27
33.22
65.58
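
These are simply the CPU times divided by the corresponding GPU times from the output above:

cpu_times = [0.0111468, 0.1217219, 1.2920607, 12.9619114]
gpu_times = [0.1394055, 0.0230902, 0.0388943, 0.1976439]
for cpu, gpu in zip(cpu_times, gpu_times):
    print("{:5.2f}".format(cpu / gpu))    # prints 0.08, 5.27, 33.22, 65.58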

It surprised me that the speed improvement only doubled when we went from 100 iterations to 1000 iterations. I thought it would have a much bigger impact. My guess is that once the fixed overhead (compilation, data transfer, kernel launch) stops dominating, the GPU time also starts to grow with the amount of work, so the ratio levels off, but I'll need to create more complex examples to understand that part better.

In any case, we can see that for tasks with relatively low computational intensity we might actually lose time by using the GPU, but as our computations become more and more intensive, we do see a gain even from a relatively old and weak GPU.