### **Introduction to High Performance Computing**

\\

SDS406 – Fall semester, 2024 - 2025

//

L05: Introduction to GPU programming, 1<sup>st</sup> November 2024

# Outline

### Introductory part

- Review of GPU architecture
- Review of GPU programming and CUDA
- Some details of the GPU nodes of our cluster

# Outline

### **Introductory part**

- Review of GPU architecture
- Review of GPU programming and CUDA
- Some details of the GPU nodes of our cluster

### **Practical examples on GPUs**

Covering:

- GPU performance vs CPU performance —— this lecture
- Memory coalescing on GPUs —— this lecture
- Shared memory next week
- Details of GPU thread scheduling (warps) and why you should care next week

### **GPU architecture**



#### CPU

- Few heavy cores
- Large memory
- Moderate BW to memory
- Optimized for serial execution

### GPU

- Many light "cores"
- Smaller memory
- High BW to memory
- Optimized for parallel execution



#### Some numbers from our cluster's nodes: cyc nodes (NVIDIA definitions)

- NVIDIA P100 "Pascal" GPUs
- 56 Streaming Multiprocessors (SM) per GPU
- 32 FP64 or 64 FP32 "cores" per SM
- GPU memory: 16 GBytes
- Clock rate: ~1.5 GHz
- Cache (L2): 4 MBytes
- BW: 732 GB/s



#### Some numbers from our cluster's nodes: cyc nodes (NVIDIA definitions)

- NVIDIA P100 "Pascal" GPUs
- 56 Streaming Multiprocessors (SM) per GPU
- 32 FP64 or 64 FP32 "cores" per SM
- GPU memory: 16 GBytes
- Clock rate: ~1.5 GHz
- Cache (L2): 4 MBytes
- BW: 732 GB/s

We will come back to these numbers during the practical



#### "Offload" model of programming

- CPU starts program (runs main())
- CPU copies data to GPU memory (over e.g. PCIe, ~16 GB/s)
- CPU dispatches "kernels" for execution on GPU
  - Kernels read/write to GPU memory (~732 GB/s)
  - Kernels run on GPU threads (thousands) which share fast memory [O(10) times faster compared to GPU memory]
- Kernel completes; CPU copies data back from GPU (over e.g. PCIe, ~16 GB/s)



#### **GPU** memory model (NVIDIA model)

- GPU threads: slow access to global, constant, and texture memory
- Each thread has registers (fast) and local memory (slow)
- Threads are grouped into blocks; Threads within the same block: shared memory (fast)



#### GPU memory model (NVIDIA model); some numbers for context

- Threads per block: 1024 (max)
- Register memory (per block): 256 KB
- Shared memory (per block): 64 KB



#### **GPU** memory model (NVIDIA model)

- Assumptions about execution order
  - Threads within the same block can be assumed to run concurrently
  - No assumption about the order by which blocks are executed

# **CUDA programming model**

#### NVIDIA programming framework for NVIDIA GPUs

- Compute Unified Device Architecture
- C-like programming language for writing CUDA Kernels
  - Includes C/C++ and Fortran variants
  - Compiler for C/C++: nvcc
- Functions for transferring data to/from GPUs, starting kernels, etc.
- Some higher-level functionality also available (linear algebra, random number generations, etc.)
- Concepts generalizable to other accelerator programming frameworks (OpenCL, OpenACC, HiP, etc.)

#### Nomenclature

- "Host" is the CPU
- "Device" is the GPU

#### Allocate memory on GPU

err = cudaMalloc(&d\_ptr, size);

- Call from host (CPU)
- Allocate size bytes of memory on GPU and store the starting address in d\_ptr
- d\_ptr is a variable that holds an address to GPU memory i.e. a "device pointer"
- If err ≠ cudaSuccess then something went wrong

#### Free GPU memory

cudaFree(d\_ptr);

#### Nomenclature

- "Host" is the CPU
- "Device" is the GPU

### Copy data to GPU

cudaMemcpy(d\_ptr, ptr, size, cudaMemcpyHostToDevice);

- Call from host (CPU)
- Copy data on host pointed to by ptr to device at address pointed to by d\_ptr
- Device memory should have been allocated using cudaMalloc() to obtain d\_ptr

### Copy data from GPU

cudaMemcpy(ptr, d\_ptr, size, cudaMemcpyDeviceToHost);

- Call from host (CPU)
- Copy data on device pointed to by d\_ptr to host at address pointed to by ptr
- Host memory should have been allocated using e.g. malloc() to obtain ptr

#### Declare a CUDA kernel

Example:

```
__global__ void
func(int n, double a, double *x)
{
    ...
    return;
}
```

#### Call a CUDA kernel

• Call from host. Example:

func <<< nblck, nthr >>> (n, a, x);

- nthr: number of threads per block; can be scalar or a dim3 type
- nblck: number of blocks; can be scalar or a dim3 type
- Example of dim3 type:

```
dim3 nthr(1024, 8, 8); /* No. of threads in (x, y, z) */
```

#### Call a CUDA kernel

• Call from host. Example:

func <<< nblck, nthr >>> (n, a, x);

- nthr: number of threads per block; can be scalar or a dim3 type
- nblck: number of blocks; can be scalar or a dim3 type
- Example of dim3 type:

dim3 nthr(1024, 8, 8); /\* No. of threads in (x, y, z) \*/

#### Thread coordinates within kernel

#### Example:

```
__global__ void
func(int n, double a, double *x)
{
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    ...
    return;
}
```

### Threads, blocks, grids



### Threads, blocks, grids



### Threads, blocks, grids



### Threads, blocks, grids



### Threads, blocks, grids



### Threads, blocks, grids



#### Threads, blocks, grids



#### Threads, blocks, grids

dim3 blcks( 4, 3, bz); dim3 thrds(16, 8, tz); func<<<blcks, thrds>>>( ... );



#### Variables available within kernel

- threadIdx.{x,y,z}
- blockIdx.{x,y,z}
- blockDim.{x,y,z}
- gridDim.{x,y,z}

#### Exercise: port a simple code to GPU and investigate performance

Exercise directory: /onyx/data/sds406f24/l05/ex01

• axpy.cu implements this course's favorite BLAS operation, "axpy":

 $y_i \leftarrow a \cdot x_i + y_i, \ i = 0, \dots, n-1$ 

with  $\alpha$  scalar and y and x vectors of length n.

#### **Exercise:** port a simple code to GPU and investigate performance

Exercise directory: /onyx/data/sds406f24/l05/ex01

• axpy.cu implements this course's favorite BLAS operation, "axpy":

 $y_i \leftarrow a \cdot x_i + y_i, i = 0, \dots, n-1$ 

with  $\alpha$  scalar and y and x vectors of length n.

• Currently Implements this on the CPU. Uses OpenMP to multi-thread over the 8 cores (per socket) of our educational nodes

#### **Exercise:** port a simple code to GPU and investigate performance

Exercise directory: /onyx/data/sds406f24/l05/ex01

• axpy.cu implements this course's favorite BLAS operation, "axpy":

$$y_i \leftarrow a \cdot x_i + y_i, \ i = 0, \dots, n-1$$

with  $\alpha$  scalar and y and x vectors of length n.

- Currently Implements this on the CPU. Uses OpenMP to multi-thread over the 8 cores (per socket) of our educational nodes
- We will proceed step-by-step, to port this simple application to the GPU using CUDA

#### **Exercise:** port a simple code to GPU and investigate performance

Exercise directory: /onyx/data/sds406f24/l05/ex01

• axpy.cu implements this course's favorite BLAS operation, "axpy":

$$y_i \leftarrow a \cdot x_i + y_i, \ i = 0, \dots, n-1$$

with  $\alpha$  scalar and y and x vectors of length n.

- Currently Implements this on the CPU. Uses OpenMP to multi-thread over the 8 cores (per socket) of our educational nodes
- We will proceed step-by-step, to port this simple application to the GPU using CUDA

This will cover:

• Allocation of memory on the GPU;

#### **Exercise:** port a simple code to GPU and investigate performance

Exercise directory: /onyx/data/sds406f24/l05/ex01

• axpy.cu implements this course's favorite BLAS operation, "axpy":

$$y_i \gets a \cdot x_i + y_i, \ i = 0, \dots, n-1$$

with  $\alpha$  scalar and y and x vectors of length n.

- Currently Implements this on the CPU. Uses OpenMP to multi-thread over the 8 cores (per socket) of our educational nodes
- We will proceed step-by-step, to port this simple application to the GPU using CUDA

This will cover:

- Allocation of memory on the GPU;
- Transferring memory to/from GPU;

#### **Exercise:** port a simple code to GPU and investigate performance

Exercise directory: /onyx/data/sds406f24/l05/ex01

• axpy.cu implements this course's favorite BLAS operation, "axpy":

$$y_i \gets a \cdot x_i + y_i, \ i = 0, \dots, n-1$$

with  $\alpha$  scalar and y and x vectors of length n.

- Currently Implements this on the CPU. Uses OpenMP to multi-thread over the 8 cores (per socket) of our educational nodes
- We will proceed step-by-step, to port this simple application to the GPU using CUDA

This will cover:

- Allocation of memory on the GPU;
- Transferring memory to/from GPU;
- Invoking kernels;

#### **Exercise:** port a simple code to GPU and investigate performance

Exercise directory: /onyx/data/sds406f24/l05/ex01

• axpy.cu implements this course's favorite BLAS operation, "axpy":

 $y_i \leftarrow a \cdot x_i + y_i, \ i = 0, \dots, n-1$ 

with  $\alpha$  scalar and y and x vectors of length n.

- Currently Implements this on the CPU. Uses OpenMP to multi-thread over the 8 cores (per socket) of our educational nodes
- We will proceed step-by-step, to port this simple application to the GPU using CUDA

This will cover:

- Allocation of memory on the GPU;
- Transferring memory to/from GPU;
- Invoking kernels;
- Placement of threads and memory access

#### File: ex01/axpy.cu

- Contains the C program we will begin with: axpy.cu
- Even though the file extension is . cu, the program contains no CUDA. Only OpenMP
- Allocates four arrays: x0[n], x1[n], y0[n], and y1[n], with n read from the command line
- x0 and y0 are initialized to random numbers
- x1 and y1 are initialized to x0 and y0 respectively
- The program:
  - performs y0[:] = a\*x0[:] + y0[:] in the first part marked with A:
  - performs y1[:] = a\*x1[:] + y1[:] in the second part marked with B:
  - reports the timing for part A and for B
  - reports the difference between y0 and y1

#### File: ex01/axpy.cu

- Contains the C program we will begin with: axpy.cu
- Even though the file extension is .cu, the program contains no CUDA. Only OpenMP
- Allocates four arrays: x0[n], x1[n], y0[n], and y1[n], with n read from the command line
- x0 and y0 are initialized to random numbers
- x1 and y1 are initialized to x0 and y0 respectively
- The program:
  - performs y0[:] = a\*x0[:] + y0[:] in the first part marked with A:
  - performs y1[:] = a\*x1[:] + y1[:] in the second part marked with B:
  - $\circ~$  reports the timing for part A and for  ${\mbox{\tiny B}}$
  - $\circ~$  reports the difference between y0 and y1

Take some time to inspect *axpy.cu* before we compile and run

• Copy the exercise from this week's lesson directory:

[ikoutsou@front02 l05]\$ cp -r /onyx/data/sds406f24/l05/ex01 .
[ikoutsou@front02 l05]\$ cd ex01/
[ikoutsou@front02 ex01]\$ ls -1
axpy.cu

• Copy the exercise from this week's lesson directory:

```
[ikoutsou@front02 l05]$ cp -r /onyx/data/sds406f24/l05/ex01 .
[ikoutsou@front02 l05]$ cd ex01/
[ikoutsou@front02 ex01]$ ls -1
axpy.cu
```

• Compile with nvcc including OpenMP:

[ikoutsou@front02 ex01]\$ module load gompi/2023a
[ikoutsou@front02 ex01]\$ module load CUDA
[ikoutsou@front02 ex01]\$ nvcc -O3 -Xcompiler -fopenmp -o axpy axpy.cu

• -Xcompiler -fopenmp: tells nvcc to pass -fopenmp to the underlying C compiler (here gcc)

#### • Copy the exercise from this week's lesson directory:

```
[ikoutsou@front02 l05]$ cp -r /onyx/data/sds406f24/l05/ex01 .
[ikoutsou@front02 l05]$ cd ex01/
[ikoutsou@front02 ex01]$ ls -1
axpy.cu
```

#### • Compile with nvcc including OpenMP:

[ikoutsou@front02 ex01]\$ module load gompi/2023a
[ikoutsou@front02 ex01]\$ module load CUDA
[ikoutsou@front02 ex01]\$ nvcc -03 -Xcompiler -fopenmp -o axpy axpy.cu

- -Xcompiler -fopenmp: tells nvcc to pass -fopenmp to the underlying C compiler (here gcc)
- Run on the CPUs of a GPU node
- Use srun to run interactively, e.g.:

[ikoutsou@front02 ex01]\$ export OMP\_PROC\_BIND="close" [ikoutsou@front02 ex01]\$ export OMP\_PLACES="cores" [ikoutsou@front02 ex01]\$ export OMP\_NUM\_THREADS=16 [ikoutsou@front02 ex01]\$ srun -n 1 --cpus-per-task=16 -p p100 --gres=gpu:1 ./axpy \$((1024\*1024\*64)) CPU: nthr = 16 t0 = 0.0145 sec P = 9.232 Gflop/s B = 55.393 GB/s CPU: nthr = 16 t0 = 0.0142 sec P = 9.437 Gflop/s B = 56.620 GB/s Diff = 0.000000e+00

#### • Copy the exercise from this week's lesson directory:

```
[ikoutsou@front02 l05]$ cp -r /onyx/data/sds406f24/l05/ex01 .
[ikoutsou@front02 l05]$ cd ex01/
[ikoutsou@front02 ex01]$ ls -1
axpy.cu
```

#### • Compile with nvcc including OpenMP:

[ikoutsou@front02 ex01]\$ module load gompi/2023a
[ikoutsou@front02 ex01]\$ module load CUDA
[ikoutsou@front02 ex01]\$ nvcc -03 -Xcompiler -fopenmp -o axpy axpy.cu

- -Xcompiler -fopenmp: tells nvcc to pass -fopenmp to the underlying C compiler (here gcc)
- Run on the CPUs of a GPU node
- Use srun to run interactively, e.g.:

[ikoutsou@front02 ex01]\$ export OMP\_PROC\_BIND="close" [ikoutsou@front02 ex01]\$ export OMP\_PLACES="cores" [ikoutsou@front02 ex01]\$ export OMP\_NUM\_THREADS=16 [ikoutsou@front02 ex01]\$ srun -n 1 --cpus-per-task=16 -p p100 --gres=gpu:1 ./axpy \$((1024\*1024\*64)) CPU: nthr = 16 t0 = 0.0145 sec P = 9.232 Gflop/s B = 55.393 GB/s CPU: nthr = 16 t0 = 0.0142 sec P = 9.437 Gflop/s B = 56.620 GB/s Diff = 0.000000e+00

• Compare ~56 GB/s achieved vs ~80 GB/s peak memory bandwidth (single socket)

#### Use a GPU to replace part B of the calculation

- Edits outside of main():
  - 1. Add the cuda\_runtime.h header file
  - 2. Add the GPU axpy kernel, naming it gpu\_axpy()
  - 3. Add a function similar to ualloc() that allocates memory on the GPU and checks whether an error occured
- Edits within main():
  - Allocate arrays on GPU
     Copy x1[:] and y1[:] to GPU
     Call gpu\_axpy()
     Copy y1[:] from GPU

#### Edits outside of main() 1/3

• Add the cuda\_runtime.h header file on line 5:

#include <cuda\_runtime.h>

#### Edits outside of main() 2/3

• Add the GPU axpy kernel, naming it gpu\_axpy(), after the CPU axpy, around line 64:

```
/***
 * Do y ← a*x + y on the GPU
 ***/
__global__ void
gpu_axpy(int n, float a, float *x, float *y)
{
   for(int i=0; i<n; i++)
     y[i] = a*x[i] + y[i];
   return;
}</pre>
```

#### Edits outside of main() 3/3

• At around line 30 add a function similar to ualloc() that allocates memory on the GPU and checks whether an error occurred

```
/***
 * Allocate memory on GPU; print error if not successful
 ***/
void *
gpu_alloc(size_t size)
{
    void *ptr;
    cudaError_t err = cudaMalloc(&ptr, size);
    if(err ≠ cudaSuccess) {
        fprintf(stderr, "cudaMalloc() returned %d; quitting...\n", err);
        exit(-2);
    }
    return ptr;
}
```

#### Edits within main() 1/4

• Allocate arrays on GPU, within B part. Free arrays before closing B part:

```
/*
 * B: Run axpy(), return to y1, report performance
 */
 {
    /* Allocate GPU memory */
    float *d_x = (float *)gpu_alloc(n*sizeof(float));
    float *d_y = (float *)gpu_alloc(n*sizeof(float));
    ...
    cudaFree(d_x);
    cudaFree(d_y);
 }
```

#### Edits within main() 2/4

• Copy x1[:] and y1[:] to GPU

cudaMemcpy(d\_x, x1, sizeof(float)\*n, cudaMemcpyHostToDevice); cudaMemcpy(d\_y, y1, sizeof(float)\*n, cudaMemcpyHostToDevice);

#### Edits within main() 3/4

• Call gpu\_axpy(). For the moment use 1 thread and 1 block. Replace axpy(n, a, x, y) of part B with:

```
double t0 = stop_watch(0);
gpu_axpy <<<1, 1>>> (n, a, d_x, d_y);
t0 = stop_watch(t0);
```

Note we need to pass the *device pointers* since it is these pointers that point to the memory allocated on the GPU

#### Edits within main() 4/4

• Copy y1[:] from GPU:

```
/* Copy y1 back from GPU */
cudaMemcpy(y1, d_y, sizeof(float)*n, cudaMemcpyDeviceToHost);
```

• Also change:

```
printf(" CPU: nthr = %4d ...);
```

to:

printf(" GPU: ...);

and remove OpenMP parallel region.

#### Compile and run

• Compile as before:

[ikoutsou@front02 ex01]\$ nvcc -arch=sm\_60 -03 -Xcompiler -fopenmp -o axpy axpy.cu

- We need to specify -arch=sm\_60, the architecture of the GPUs on the cyc nodes (P100 architecture), because by default nvcc compiles against a newer GPU version, compatible with newer nodes
- Run as before (I'm assuming OMP\_BIND, OMP\_PLACES, and OMP\_NUM\_THREADS were set before):

CPU: nthr = 16 t0 = 0.0150 sec P = 8.945 Gflop/s B = 53.670 GB/s GPU: t0 = 0.0001 sec P = 2630.607 Gflop/s B = 15783.644 GB/s Diff = 1.021564e-15

#### Compile and run

• Compile as before:

[ikoutsou@front02 ex01]\$ nvcc -arch=sm\_60 -03 -Xcompiler -fopenmp -o axpy axpy.cu

- We need to specify -arch=sm\_60, the architecture of the GPUs on the cyc nodes (P100 architecture), because by default nvcc compiles against a newer GPU version, compatible with newer nodes
- Run as before (I'm assuming OMP\_BIND, OMP\_PLACES, and OMP\_NUM\_THREADS were set before):

```
CPU: nthr = 16 t0 = 0.0150 sec P = 8.945 Gflop/s B = 53.670 GB/s
GPU: t0 = 0.0001 sec P = 2630.607 Gflop/s B = 15783.644 GB/s
Diff = 1.021564e-15
```

This performance is infeasible. What's going on?

#### Edits within main() 3/4

• The problem is here:

```
double t0 = stop_watch(0);
gpu_axpy <<<1, 1>>> (n, a, d_x, d_y);
t0 = stop_watch(t0);
```

• CUDA kernels return **immediately**; the kernel is still being executed on the device when stop\_watch(t0) is called. We are **not** timing the kernel execution time, but the time it takes to dispatch the kernel to the GPU.

#### Edits within main() 3/4

• The problem is here:

```
double t0 = stop_watch(0);
gpu_axpy <<<1, 1>>> (n, a, d_x, d_y);
t0 = stop_watch(t0);
```

- CUDA kernels return **immediately**; the kernel is still being executed on the device when stop\_watch(t0) is called. We are **not** timing the kernel execution time, but the time it takes to dispatch the kernel to the GPU.
- Correct this by adding cudaDeviceSynchronize(); after the CUDA kernel, which blocks until all running CUDA kernels are complete:

```
double t0 = stop_watch(0);
gpu_axpy<<<1, 1>>>(n, a, d_x, d_y);
cudaDeviceSynchronize();
t0 = stop_watch(t0);
```

#### • Compile and run again:

#### • Compile and run again:

• This performance is of course extremely poor;

#### • Compile and run again:

- This performance is of course extremely poor;
- We're using only one GPU thread in the kernel

#### Use more threads

• In this step, we will use 512 GPU threads. First, change the call to the GPU kernel:

double t0 = stop\_watch(0); gpu\_axpy<<<1, 512>>>(n, a, d\_x, d\_y); cudaDeviceSynchronize(); t0 = stop\_watch(t0);

#### Use more threads

• In this step, we will use 512 GPU threads. First, change the call to the GPU kernel:

```
double t0 = stop_watch(0);
gpu_axpy<<<1, 512>>>(n, a, d_x, d_y);
cudaDeviceSynchronize();
t0 = stop_watch(t0);
```

• Then we need to change the kernel. We need in each GPU thread to calculate which elements it will operate on:

```
/***
 * Do y ← a*x + y on the GPU
 ***/
_global__ void
gpu_axpy(int n, float a, float *x, float *y)
{
    int ithr = threadIdx.x;
    int nthr = blockDim.x;
    int lt = n/nthr;
    for(int i=ithr*lt; i<(ithr+1)*lt; i++)
      y[i] = a*x[i] + y[i];
    return;
}</pre>
```

• With the above, each thread operated on n/nthr contiguous elements

• Compile and run again:

• Better than before, but still very poor performance. Can we do better?

#### **Optimized GPU memory access**

Always keep in mind that on GPUs, it is more optimal if contiguous threads access contiguous memory locations

This represents the order by which elements are accessed currently

- The same thread accesses continuous elements
- Very common approach on CPUs
- On GPUs, this results in so-called bank conflicts
- Suboptimal!

#### **Optimized GPU memory access**

Always keep in mind that on GPUs, it is more optimal if contiguous threads access contiguous memory locations



This represents an optimal data access pattern

- Different threads accesses continuous elements
- Each thread is served by a different memory bank

#### **Optimized GPU memory access**

Always keep in mind that on GPUs, it is more optimal if contiguous threads access contiguous memory locations

In our example:

```
/***
 * Do y ← a*x + y on the GPU
 ***/
__global__ void
gpu_axpy(int n, float a, float *x, float *y)
{
    int ithr = threadIdx.x;
    int nthr = blockDim.x;
    for(int i=0; i<n; i+=nthr)
       y[i+ithr] = a*x[i+ithr] + y[i+ithr];
    return;
}</pre>
```

• Compile and run:

```
[ikoutsou@front02 ex01]$ nvcc -arch=sm_60 -03 -Xcompiler -fopenmp -o axpy axpy.cu
[ikoutsou@front02 ex01]$ srun -n 1 --cpus-per-task=16 -p p100 --gres=gpu:1 ./axpy $((1024*1024*64))
CPU: nthr = 16 t0 = 0.0150 sec P = 8.950 Gflop/s B = 53.698 GB/s
GPU: t0 = 0.0565 sec P = 2.377 Gflop/s B = 14.260 GB/s
Diff = 1.021564e-15
```

#### **Blocks and threads**

Now let's use blocks. Let's use as many blocks and threads as we can

- Upper limit of 1024 threads
- Upper limit of  $2^{31} 1$  blocks (practically infinite)

```
double t0 = stop_watch(0);
int nthr = 1024;
gpu_axpy<<<n/nthr, nthr>>>(n, a, d_x, d_y);
cudaDeviceSynchronize();
t0 = stop_watch(t0);
```

```
/***
 * Do y ← a*x + y on the GPU
 ***/
__global__ void
gpu_axpy(int n, float a, float *x, float *y)
{
    int ithr = threadIdx.x;
    int nthr = blockDim.x;
    int iblk = blockIdx.x;
    int idx = ithr + iblk*nthr;
    y[idx] = a*x[idx] + y[idx];
    return;
}
```

#### **Blocks and threads**

• Compile and run:

• ~520 GB/s is ~70% of peak bandwidth (which is 732 GB/s)

#### **Blocks and threads**

• Compile and run:

- ~520 GB/s is ~70% of peak bandwidth (which is 732 GB/s)
- Try varying the number of threads per block. E.g. read it from the command line and scan for the optimal value.