CUDA Programming

CUDA Memory Model

CUDA Memory Model Summary

Coding

Running a kernel on the GPU usually has this form:

  1. Allocate CPU data structure
  2. Initialize data on CPU
  3. Allocate GPU data structure
  4. Copy data from CPU to GPU
  5. Define execution configuration
  6. Run Kernel
  7. CPU synchronizes with GPU
  8. Copy data from GPU to CPU
  9. Deallocate GPU and CPU memory

CPU Memory Allocation

Use malloc
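
Example (a hypothetical host array a of N floats, matching the names used in the snippets below):

float *a = (float*)malloc(N * sizeof(float));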

GPU Memory Allocation

Use cudaMalloc. It has the interface:

cudaMalloc(void **ptr, size_t nbytes);

Example: (Note that the pointer is passed to the function as a pointer to a pointer and that it’s explicitly cast to void**)

float *da;
cudaMalloc((void**)&da, N * sizeof(float));

Note that cudaMalloc allocates space in the global memory of the GPU.

To free GPU memory, use cudaFree:

cudaFree(void *ptr);

Example:

cudaFree(da);

Another valuable function is cudaMemset, which sets each byte of a region of GPU memory to a specific value:

cudaMemset(void *ptr, int value, size_t nbytes);

Example:

cudaMemset(da, 0, N * sizeof(float));

Copying Data Between CPU and GPU

To move data between host and device, use cudaMemcpy:

cudaMemcpy(void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction);

Example:

cudaMemcpy((void*)da, (void*)a, N * sizeof(float), cudaMemcpyHostToDevice);

Regarding the direction parameter, it can take the following values:

  • cudaMemcpyHostToDevice: Copy data from host (CPU) to device (GPU)
  • cudaMemcpyDeviceToHost: Copy data from device (GPU) to host (CPU)
  • cudaMemcpyDeviceToDevice: Copy data between two locations on the device (GPU)

Note that plain cudaMemcpy is blocking from the CPU perspective: it returns only after the copy has completed (the lecture describes it as asynchronous, but that behavior belongs to cudaMemcpyAsync). Either way, the GPU executes the copy in order with respect to other CUDA requests.
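
For reference, a sketch of the non-blocking variant, cudaMemcpyAsync, which takes an additional stream argument (0 is the default stream; the pointers are the same hypothetical ones as above):

cudaMemcpyAsync((void*)da, (void*)a, N * sizeof(float), cudaMemcpyHostToDevice, 0);

(For the copy to actually overlap with CPU work, the host buffer generally needs to be pinned, e.g. allocated with cudaMallocHost.)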

Defining Execution Configuration

When launching a kernel, we need to specify the number of blocks and the number of threads per block. If you want to use \(N\) threads and each block has \(M\) threads, then you need \(\left\lceil \frac{N}{M} \right\rceil\) blocks, which can be computed as (N + M - 1) / M in integer division.
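
A minimal sketch of this computation (the values of N and M are illustrative):

int N = 1000000;              // total number of threads needed
int M = 256;                  // threads per block
int blocks = (N + M - 1) / M; // ceil(N / M) in integer arithmetic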

Running a Kernel and Synchronization

You use the function name followed by <<<blocks, threads_block>>> to launch a kernel. For example:

myKernel<<<blocks, threads_block>>>(args);

Here threads_block is the number of threads per block and blocks is the number of blocks.

To synchronize the CPU with the GPU (ensure GPU execution is complete), use cudaDeviceSynchronize (the lecture uses cudaThreadSynchronize, which is deprecated in favor of cudaDeviceSynchronize):

cudaDeviceSynchronize();

The CPU doesn’t block on kernel launches; kernel requests are queued and processed by the GPU in order.

Copy Data Back to CPU and Deallocate Memory

Just use cudaMemcpy (with cudaMemcpyDeviceToHost) to copy data back to the CPU, and then cudaFree and free to deallocate GPU and CPU memory respectively.
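
Putting the nine steps together, here is a minimal end-to-end sketch (the scale kernel and the variable names are illustrative, not from the slides; the __global__ qualifier is explained in the next section):

#include <stdlib.h>

// Illustrative kernel: multiply every element by 2
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;  // guard against the padded last block
}

int main(void) {
    int N = 1024;
    float *a = (float*)malloc(N * sizeof(float));  // 1. allocate CPU data structure
    for (int i = 0; i < N; i++) a[i] = (float)i;   // 2. initialize data on CPU

    float *da;
    cudaMalloc((void**)&da, N * sizeof(float));    // 3. allocate GPU data structure
    cudaMemcpy(da, a, N * sizeof(float),
               cudaMemcpyHostToDevice);            // 4. copy data from CPU to GPU

    int M = 256;                                   // 5. define execution configuration
    int blocks = (N + M - 1) / M;
    scale<<<blocks, M>>>(da, N);                   // 6. run kernel
    cudaDeviceSynchronize();                       // 7. CPU synchronizes with GPU

    cudaMemcpy(a, da, N * sizeof(float),
               cudaMemcpyDeviceToHost);            // 8. copy data from GPU to CPU

    cudaFree(da);                                  // 9. deallocate GPU and CPU memory
    free(a);
    return 0;
}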

Defining the GPU Kernel

To define a kernel, you use the __global__ qualifier.

In most kernels, you will need to access the unique thread ID, by which I mean the ID that is unique across the whole grid (all blocks). You can compute it using gridDim.x, gridDim.y, blockIdx.x, blockIdx.y, blockDim.x, blockDim.y, blockDim.z, threadIdx.x, threadIdx.y, and threadIdx.z. The definitions are quite straightforward (somethingDim.d is the number of elements allocated along dimension d), and we are assuming 2D grids and 3D blocks. Of course, this can be simplified to lower dimensions if needed, where you won’t need to utilize all the dimensions. Needless to say, fill x, then y, then z.

The slides include extensive examples of how to compute the unique thread ID based on the dimensions of the grid and blocks (6 possible combinations). I won’t include them here, but generally speaking, they all first compute the block ID, then compute the thread ID within the block, and then combine the two to get the unique thread ID.

I will include only one example here, for the case of a 2D grid and 3D blocks (note that all other cases are just simplifications of this one; replace the unused dimensions with 1 and the unused indices with 0). (Notice how the parentheses make it beautiful 😎)

UniqueBlockIndex = blockIdx.y * (gridDim.x)
                     + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * (blockDim.x * blockDim.y * blockDim.z)
                     + threadIdx.z * (blockDim.x * blockDim.y)
                     + threadIdx.y * (blockDim.x) 
                     + threadIdx.x;
Warning

The lecture seems to be mistaken in this case: it has threadIdx.z * (blockDim.y * blockDim.z) where the correct term is threadIdx.z * (blockDim.x * blockDim.y).
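
As a sketch, here is a hypothetical kernel that uses the corrected formula (the kernel name and the out array are mine, not from the slides):

__global__ void writeUniqueIndex(int *out) {
    int uniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
    int uniqueThreadIndex =
          uniqueBlockIndex * (blockDim.x * blockDim.y * blockDim.z)
        + threadIdx.z * (blockDim.x * blockDim.y)
        + threadIdx.y * blockDim.x
        + threadIdx.x;
    out[uniqueThreadIndex] = uniqueThreadIndex; // each thread writes its own ID
}

It would be launched with dim3 arguments for the 2D grid and 3D blocks, e.g. writeUniqueIndex<<<dim3(4, 2), dim3(8, 4, 2)>>>(dout);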

Tip

A cool side-project + blog would be to create an algorithm that takes an \(N\)-level-deep grid structure (not just grids containing blocks, but grids containing grids containing grids, etc.) and a vector of dimensions for each level, \(d \in \mathbb{N}^N\), and then generates the code to compute the unique thread ID for that structure.

Using a variable for the size of each level would be quite helpful.

CUDA Function Declarations

There are different qualifiers for functions in CUDA:

Qualifier    Executed On    Called From
__global__   Device (GPU)   Host (CPU)
__device__   Device (GPU)   Device (GPU)
__host__     Host (CPU)     Host (CPU)

__global__ defines a kernel function. It must return void and can only call __device__ functions.

For functions executed on the device:

  • No pointer to the function can be generated
  • No recursion is allowed
  • No static variables are allowed
  • No variable number of arguments is allowed
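
A minimal sketch of these qualifiers in use (the function names are illustrative):

// Executed on the GPU, callable only from GPU code
__device__ float square(float x) {
    return x * x;
}

// Kernel: executed on the GPU, launched from the CPU, must return void
__global__ void squareAll(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = square(d[i]); // a __global__ function calling a __device__ function
}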

Predefined Vector Data Types

Important

I have stopped the notes at slide 40

Should be completed at a later time