CUDA Programming

CUDA Memory Model

CUDA Memory Model Summary

Coding

Running a kernel on the GPU usually has this form:

  1. Allocate CPU data structure
  2. Initialize data on CPU
  3. Allocate GPU data structure
  4. Copy data from CPU to GPU
  5. Define execution configuration
  6. Run Kernel
  7. CPU synchronizes with GPU
  8. Copy data from GPU to CPU
  9. Deallocate GPU and CPU memory

CPU Memory Allocation

Use malloc
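
Example (a hypothetical host array a of N floats, matching the names used in the snippets below):

float *a = (float*)malloc(N * sizeof(float));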

GPU Memory Allocation

Use cudaMalloc. It has the interface:

cudaMalloc(void **ptr, size_t nbytes);

Example: (Note that the pointer is passed to the function as a pointer to a pointer and that it’s explicitly cast to void**)

float *da;
cudaMalloc((void**)&da, N * sizeof(float));

Note that cudaMalloc allocates space in the global memory of the GPU.

To free GPU memory, use cudaFree:

cudaFree(void *ptr);

Example:

cudaFree(da);

Another valuable function is cudaMemset, which sets each byte of a region of GPU memory to a specific value:

cudaMemset(void *ptr, int value, size_t nbytes);

Example:

cudaMemset(da, 0, N * sizeof(float));

Copying Data Between CPU and GPU

To move data between host and device, use cudaMemcpy:

cudaMemcpy(void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction);

Example:

cudaMemcpy((void*)da, (void*)a, N * sizeof(float), cudaMemcpyHostToDevice);

Regarding the direction parameter, it can take the following values:

  • cudaMemcpyHostToDevice: Copy data from host (CPU) to device (GPU)
  • cudaMemcpyDeviceToHost: Copy data from device (GPU) to host (CPU)
  • cudaMemcpyDeviceToDevice: Copy data between two locations on the device (GPU)

Note that plain cudaMemcpy is blocking from the CPU perspective: it returns only after the copy has completed (the lecture describes it as asynchronous, but that behavior belongs to cudaMemcpyAsync). Either way, the GPU executes the copy in order with respect to other CUDA requests.
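
For reference, a sketch of the non-blocking variant, cudaMemcpyAsync, which takes an additional stream argument (0 is the default stream; the pointers are the same hypothetical ones as above):

cudaMemcpyAsync((void*)da, (void*)a, N * sizeof(float), cudaMemcpyHostToDevice, 0);

(For the copy to actually overlap with CPU work, the host buffer generally needs to be pinned, e.g. allocated with cudaMallocHost.)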

Defining Execution Configuration

When launching a kernel, we need to specify the number of blocks and the number of threads per block. If you want to use \(N\) threads and each block has \(M\) threads, then you need \(\left\lceil \frac{N}{M} \right\rceil\) blocks, which can be computed as (N + M - 1) / M in integer division.
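
A minimal sketch of this computation (the values of N and M are illustrative):

int N = 1000000;              // total number of threads needed
int M = 256;                  // threads per block
int blocks = (N + M - 1) / M; // ceil(N / M) in integer arithmetic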

Running a Kernel and Synchronization

You use the function name followed by <<<blocks, threads_block>>> to launch a kernel. For example:

myKernel<<<blocks, threads_block>>>(args);

Here threads_block is the number of threads per block and blocks is the number of blocks.

To synchronize the CPU with the GPU (ensure GPU execution is complete), use cudaDeviceSynchronize (the lecture uses cudaThreadSynchronize, which is deprecated in favor of cudaDeviceSynchronize):

cudaDeviceSynchronize();

The CPU doesn’t block on kernel launches; kernel requests are queued and processed by the GPU in order.

Copy Data Back to CPU and Deallocate Memory

Just use cudaMemcpy (with cudaMemcpyDeviceToHost) to copy data back to the CPU, and then cudaFree and free to deallocate GPU and CPU memory respectively.
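
Putting the nine steps together, here is a minimal end-to-end sketch (the scale kernel and the variable names are illustrative, not from the slides; the __global__ qualifier is explained in the next section):

#include <stdlib.h>

// Illustrative kernel: multiply every element by 2
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;  // guard against the padded last block
}

int main(void) {
    int N = 1024;
    float *a = (float*)malloc(N * sizeof(float));  // 1. allocate CPU data structure
    for (int i = 0; i < N; i++) a[i] = (float)i;   // 2. initialize data on CPU

    float *da;
    cudaMalloc((void**)&da, N * sizeof(float));    // 3. allocate GPU data structure
    cudaMemcpy(da, a, N * sizeof(float),
               cudaMemcpyHostToDevice);            // 4. copy data from CPU to GPU

    int M = 256;                                   // 5. define execution configuration
    int blocks = (N + M - 1) / M;
    scale<<<blocks, M>>>(da, N);                   // 6. run kernel
    cudaDeviceSynchronize();                       // 7. CPU synchronizes with GPU

    cudaMemcpy(a, da, N * sizeof(float),
               cudaMemcpyDeviceToHost);            // 8. copy data from GPU to CPU

    cudaFree(da);                                  // 9. deallocate GPU and CPU memory
    free(a);
    return 0;
}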

Defining the GPU Kernel

To define a kernel, you use the __global__ qualifier.

In most kernels, you will need to access the unique thread ID, by which I mean the ID that is unique across the whole grid (all blocks). You can compute it using gridDim.x, gridDim.y, blockIdx.x, blockIdx.y, blockDim.x, blockDim.y, blockDim.z, threadIdx.x, threadIdx.y, and threadIdx.z. The definitions are quite straightforward (somethingDim.d is the number of elements allocated along dimension d), and we are assuming 2D grids and 3D blocks. Of course, this can be simplified to lower dimensions if needed, where you won’t need to utilize all the dimensions. Needless to say, fill x, then y, then z.

The slides include extensive examples of how to compute the unique thread ID based on the dimensions of the grid and blocks (6 possible combinations). I won’t include them here, but generally speaking, they all first compute the block ID, then compute the thread ID within the block, and then combine the two to get the unique thread ID.

I will include only one example here, for the case of a 2D grid and 3D blocks (note that all other cases are just simplifications of this one; replace the unused dimensions with 1 and the unused indices with 0). (Notice how the parentheses make it beautiful 😎)

UniqueBlockIndex = blockIdx.y * (gridDim.x)
                     + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * (blockDim.x * blockDim.y * blockDim.z)
                     + threadIdx.z * (blockDim.x * blockDim.y)
                     + threadIdx.y * (blockDim.x) 
                     + threadIdx.x;
Warning

The lecture seems to be mistaken in this case: it has threadIdx.z * (blockDim.y * blockDim.z) where the correct term is threadIdx.z * (blockDim.x * blockDim.y).
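
As a sketch, here is a hypothetical kernel that uses the corrected formula (the kernel name and the out array are mine, not from the slides):

__global__ void writeUniqueIndex(int *out) {
    int uniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
    int uniqueThreadIndex =
          uniqueBlockIndex * (blockDim.x * blockDim.y * blockDim.z)
        + threadIdx.z * (blockDim.x * blockDim.y)
        + threadIdx.y * blockDim.x
        + threadIdx.x;
    out[uniqueThreadIndex] = uniqueThreadIndex; // each thread writes its own ID
}

It would be launched with dim3 arguments for the 2D grid and 3D blocks, e.g. writeUniqueIndex<<<dim3(4, 2), dim3(8, 4, 2)>>>(dout);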

Tip

A cool side-project + blog would be to create an algorithm that takes an \(N\)-level-deep grid structure (not just grids containing blocks, but grids containing grids containing grids, etc.) and a vector of dimensions for each level, \(d \in \mathbb{N}^N\), and then generates the code to compute the unique thread ID for that structure.

Using a variable for the size of each level would be quite helpful.

CUDA Function Declarations

There are different qualifiers for functions in CUDA:

Qualifier    Executed On    Called From
__global__   Device (GPU)   Host (CPU)
__device__   Device (GPU)   Device (GPU)
__host__     Host (CPU)     Host (CPU)

__global__ defines a kernel function. It must return void and can only call __device__ functions.

For functions executed on the device:

  • No pointer to the function can be generated
  • No recursion is allowed
  • No static variables are allowed
  • No variable number of arguments is allowed
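
A minimal sketch of these qualifiers in use (the function names are illustrative):

// Executed on the GPU, callable only from GPU code
__device__ float square(float x) {
    return x * x;
}

// Kernel: executed on the GPU, launched from the CPU, must return void
__global__ void squareAll(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = square(d[i]); // a __global__ function calling a __device__ function
}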

Predefined Vector Data Types

Important

I have stopped the notes at slide 40

Should be completed at a later time