The idea for this blog is to help users get familiar with CUDA terminology as well as learn GPGPU programming. I also intend to show the basics of NVIDIA Nsight for debugging CUDA kernels. Let's dive in & understand CUDA terminology. Some familiarity with C++ & threading (on the CPU) will help.
Basic GPU constructs
A GPU consists of multiple SMs (streaming multiprocessors), and each SM contains SPs (streaming processors). An SM is responsible for running multiple thread blocks, and its shared memory is divided between the blocks currently running on it. The SM schedules threads to run on its SPs; each SP runs one thread and hence also gets its own set of registers.
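If you're curious what these numbers look like on your own card, here is a minimal sketch (assuming the CUDA runtime API) that queries them via cudaGetDeviceProperties:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("SM count:             %d\n",        prop.multiProcessorCount);
    printf("Max threads per SM:   %d\n",        prop.maxThreadsPerMultiProcessor);
    printf("Shared memory per SM: %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Registers per SM:     %d\n",        prop.regsPerMultiprocessor);
    return 0;
}
```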
The diagram below should be useful.

Host vs Device
Host usually refers to the CPU, while device refers to the GPU. Usually the host & device have separate memories, so data must be copied between them explicitly. Some GPUs (e.g., Pascal) have a feature called unified memory, where the programmer doesn't need to worry about data transfers from host to device.
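Here is a minimal sketch of the explicit transfer model (hypothetical buffer names; with unified memory via cudaMallocManaged, the copies below would go away):

```cpp
#include <cuda_runtime.h>

int main() {
    const int n = 1024;
    float hostData[n];                          // lives in host (CPU) memory
    for (int i = 0; i < n; ++i) hostData[i] = float(i);

    float* deviceData = nullptr;
    cudaMalloc(&deviceData, n * sizeof(float)); // allocate device (GPU) memory

    // Explicit host -> device copy.
    cudaMemcpy(deviceData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);

    // ... launch kernels that read/write deviceData ...

    // Copy results back: device -> host.
    cudaMemcpy(hostData, deviceData, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(deviceData);
    return 0;
}
```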
Kernel
A kernel is just a function that can be executed in parallel; think of it as a stateless functor that achieves data parallelism. It's similar to a vertex or fragment shader in OpenGL/Vulkan.
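As an illustration, a hypothetical kernel (square is just an example name) where every thread independently squares one array element:

```cpp
// Each thread squares exactly one element. Like a shader, the function holds
// no cross-invocation state; a thread only knows which element it owns.
__global__ void square(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // this thread's unique index
    if (i < n)                                     // guard against extra threads
        data[i] = data[i] * data[i];
}
```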
Warp
A warp is a CUDA construct that refers to a group of threads from a block; a warp on NVIDIA GPUs usually consists of 32 threads. A warp must run on a single SM. The threads in a warp are bound together, meaning each of them executes every instruction in a given piece of code. For example, if you have an if/else block, every thread in the warp steps through the instructions produced for both branches, irrespective of which condition it actually met (this is known as divergence). Threads in a warp also fetch data from memory at the same time (in the same cycle). This is important: if all threads fetch from the same memory segment, your cost is much lower than each thread fetching from a random location (multiple memory segments).
You can think of the warp as the basic scheduling unit of an SM; the SM is responsible for launching/scheduling warps.
Remember: an instruction is issued per warp (not per thread).
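A hypothetical kernel that makes divergence concrete: the even and odd lanes of each warp take different branches, so the warp steps through both paths:

```cpp
// Even and odd threads take different branches. The warp executes BOTH
// paths, masking off the lanes that did not take each one, so this code
// costs roughly as much as running the two branches back to back.
__global__ void divergent(int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        out[i] = i * 2;   // even lanes active, odd lanes masked off
    else
        out[i] = i * 3;   // odd lanes active, even lanes masked off
}
```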
Thread Block
A thread block is a group of threads that can be scheduled on an SM (the hardware splits it into warps); it can be a 1D, 2D, or 3D array.
Any number of blocks can be assigned to an SM, but an SM can only execute/schedule a single warp at a time. For example, say we have 4 blocks assigned to an SM, and each block was given 128 threads. The number of warps created for each block would be 4 (128/32); since we have 4 blocks, we get a total of 16 (4*4) warps. At any point in time, only one of these warps can be executing/scheduled on this SM.
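The same arithmetic, written out as code:

```cpp
int threadsPerBlock = 128;
int warpSize        = 32;                         // warp size on NVIDIA GPUs
int warpsPerBlock   = threadsPerBlock / warpSize; // 128 / 32 = 4
int blocksOnSM      = 4;
int totalWarps      = blocksOnSM * warpsPerBlock; // 4 * 4 = 16
```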
SMs are very fast at scheduling; this is called zero-overhead warp scheduling. Warps that are stalled are replaced by warps that are ready, which keeps the GPU cores productive even when a warp stalls (say, on a memory access).
The maximum number of threads per block can't exceed 1024. A CUDA device will usually let you specify up to 1024 on the X dim, 1024 on the Y dim, and 64 on the Z dim as maximum block dimensions, but that doesn't mean you can have 1024x1024x64 threads: the total still can't exceed 1024.
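A quick sketch of what that means in practice (the exact limits for your device are reported in cudaDeviceProp's maxThreadsPerBlock and maxThreadsDim fields):

```cpp
dim3 ok1(1024, 1, 1);     // 1024 threads: allowed
dim3 ok2(32, 32, 1);      // 32*32 = 1024 threads: allowed
dim3 bad(1024, 1024, 64); // every axis is within its per-axis limit, but
                          // 1024*1024*64 threads far exceeds 1024, so a
                          // launch with this block shape would fail
```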
Grid
A grid is simply a collection of blocks; a CUDA kernel is executed as a grid of thread blocks. All of these threads can share global memory (which is very slow). As a programmer, you express the grid by providing the grid size/dimensions, the block shape (1D, 2D, or 3D), and the block dimensions (the number of threads in each block) when launching a CUDA kernel.
A CUDA kernel is launched as foo<<<gridDimension, blockDimension>>>(parameters);
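Putting it together, a sketch that launches the hypothetical square kernel from earlier over n elements (deviceData is assumed to hold n floats, allocated as in the host/device example above):

```cpp
int n = 100000;
int blockDimension = 128;                                       // threads per block
int gridDimension  = (n + blockDimension - 1) / blockDimension; // ceiling division, so
                                                                // every element gets a thread
square<<<gridDimension, blockDimension>>>(deviceData, n);
cudaDeviceSynchronize(); // wait for the kernel to finish before using the results
```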

Thread/Memory
Each thread has a unique id within its block and runs on an SP. CUDA stores threadIdx, blockIdx, blockDim, and gridDim in special registers available to every thread. These are useful & required to calculate unique (global) or local thread ids.
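A hypothetical kernel showing how these built-ins combine into local and global ids (1D case):

```cpp
__global__ void indexDemo(int* localIds, int* globalIds) {
    int localId  = threadIdx.x;                           // unique within this block only
    int globalId = blockIdx.x * blockDim.x + threadIdx.x; // unique across the whole grid
    localIds[globalId]  = localId;
    globalIds[globalId] = globalId;
    // gridDim.x * blockDim.x gives the total number of threads launched.
}
```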
Please note that Thread/Block/Kernel are software constructs while Lane/Warp/SM/Device are hardware constructs.
Debugging using NSight
To debug CUDA code, we need to use either cuda-gdb or Nsight; we will focus on Nsight in this tutorial.
NVIDIA Nsight integrates seamlessly with Visual Studio and makes it pretty easy to debug CUDA code. As a first step we will look at the system info provided by Nsight.
The screenshot below shows that my 2080 has 48 SMs and 8 GB of GPU RAM; the frame buffer bandwidth tells you the speed at which a copy happens on the GPU itself, i.e., device to device.

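That bandwidth figure applies to copies like the one below (a sketch; src, dst, and n are assumed to have been set up with cudaMalloc already):

```cpp
// Copy n floats between two buffers that both live in GPU memory.
// This never crosses the PCIe bus, so it runs at the frame buffer bandwidth.
cudaMemcpy(dst, src, n * sizeof(float), cudaMemcpyDeviceToDevice);
```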
To debug CUDA code, you must launch it using NVIDIA Nsight's extension (Start CUDA Debugging).
Let's look at a screenshot of the Autos view in Visual Studio.

You will see that blockDim is (128,1,1), i.e., we are creating blocks with 128 threads each, so each block will have 4 warps (128/32).
A breakpoint always shows the state of one active thread; for example, in the screenshot above, the active thread is (0,0,0) (threadIdx) in block (0,0,0) (blockIdx).
Let's now jump into Nsight's Warp Info window (Extensions -> Nsight -> Windows -> Warp Info).

The Warp Info window lets you view all warps (chunks of 32 threads) that are available to run on the GPU for the current kernel.
The ShaderInfo column shows the blockIdx & threadIdx for each warp; the threadIdx shown is that of the first thread in the warp. CTA (cooperative thread array) is the hardware-level term for a thread block.
Let's now look at the Warp Watch window.

Warp Watch helps you see the state of the various threads and the corresponding values computed at runtime.
This brings us to the end of the blog. CUDA is vast & has many other concepts, such as shared memory, atomics, and textures.