Compute Capabilities and Throughputs on NVIDIA’s GPUs

Summary: In this post, I will introduce the throughputs and compute capabilities of NVIDIA’s GPUs. The post doesn’t cover hardware details. Conclusion: It may seem like common sense that half-precision floats run faster on GPUs, as this post by Intel suggests. However, it is a different story on NVIDIA’s GPUs. For example, you may…
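Whether half precision pays off depends on the compute capability of the GPU you actually have, so a natural first step is to query it. The sketch below uses the real runtime calls cudaGetDeviceCount and cudaGetDeviceProperties (fields major, minor, name). The helpers encodeComputeCapability and supportsHalfArithmetic are hypothetical, and the 5.3 threshold is an assumption based on where the cuda_fp16.h arithmetic intrinsics become available. Being able to run FP16 arithmetic says nothing about how fast it runs; the per-capability throughput table in the CUDA Programming Guide is the place to check that.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: encode compute capability as 10*major + minor (6.1 -> 61).
int encodeComputeCapability(int major, int minor) {
    return 10 * major + minor;
}

// Hypothetical helper. Assumption: half-precision arithmetic intrinsics
// (cuda_fp16.h) require compute capability 5.3 or newer. Availability does
// NOT imply high FP16 throughput on that device.
bool supportsHalfArithmetic(int major, int minor) {
    return encodeComputeCapability(major, minor) >= 53;
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // Print name, compute capability, and whether FP16 intrinsics are usable.
        std::printf("Device %d: %s, compute capability %d.%d, half arithmetic: %s\n",
                    dev, prop.name, prop.major, prop.minor,
                    supportsHalfArithmetic(prop.major, prop.minor) ? "yes" : "no");
    }
    return 0;
}
```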

How to Debug Async Kernels or APIs in CUDA

Summary: In this post, I will introduce how to debug async kernels or async APIs in CUDA. Async operations do not block the CPU code. When we check the return values of the function calls, they may report success even though there are bugs such as "illegal memory access". On the other hand, when we find the…
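A common pattern for catching such hidden errors is to check twice: cudaGetLastError() right after the launch catches launch-configuration problems, and cudaDeviceSynchronize() waits for the kernel and returns errors raised while it ran. The sketch below is a minimal illustration under that assumption; checkCuda and badKernel are hypothetical names, and badKernel deliberately writes through a null pointer to provoke a fault.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print a CUDA error with context; returns true on success.
bool checkCuda(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Deliberately buggy kernel: every thread writes through a null pointer.
__global__ void badKernel(int* p) {
    p[threadIdx.x] = 42;
}

int main() {
    badKernel<<<1, 32>>>(nullptr);

    // The launch returns immediately; this only reports launch-time errors
    // (e.g. an invalid grid configuration), not faults inside the kernel.
    bool launched = checkCuda(cudaGetLastError(), "kernel launch");

    // Blocks the CPU until the kernel finishes and surfaces errors that
    // happened during execution, such as an illegal memory access.
    bool finished = checkCuda(cudaDeviceSynchronize(), "kernel execution");

    std::printf("launched=%d finished=%d\n", launched, finished);
    return finished ? 0 : 1;
}
```

On a typical setup the launch check passes while the synchronize check reports the illegal memory access. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 makes launches synchronous, which helps attribute an error to the kernel that caused it.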

Sync and Async in CUDA

Summary: In this post, I will introduce the sync and async behaviors in CUDA. Conclusion: The following is handy code for testing the behaviors of the CPU and streams.
__global__ void cuda_hello1(){
    clock_block(10000);
    printf("Hello World from GPU1!\n");
}
__global__ void cuda_hello2(){
    printf("Hello World from GPU2!\n");
    clock_block(10000);
}
void cpu_hello() {
    printf("hello world from cpu?\n");
}
/* hello…
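The excerpt's kernels call clock_block(), whose definition is not shown. A minimal self-contained sketch of the same kind of experiment follows, assuming a hypothetical busy_wait_cycles() helper built on clock64() in place of clock_block(), and two user-created streams. It illustrates that kernel launches return to the CPU at once and that device printf output is flushed when the host synchronizes; the exact interleaving of lines can vary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy-wait helper standing in for the post's clock_block():
// spins for roughly `cycles` SM clock cycles.
__device__ void busy_wait_cycles(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

__global__ void gpu_hello1() {
    busy_wait_cycles(10000);
    printf("Hello World from GPU1!\n");
}

__global__ void gpu_hello2() {
    printf("Hello World from GPU2!\n");
    busy_wait_cycles(10000);
}

// Launches both kernels on separate streams, prints from the CPU while they
// may still be running, then waits. Returns 0 on success.
int run_demo() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    gpu_hello1<<<1, 1, 0, s1>>>();    // returns to the CPU immediately
    gpu_hello2<<<1, 1, 0, s2>>>();    // may overlap with gpu_hello1
    printf("hello world from cpu\n"); // does not wait for either kernel

    // Waits for all streams; device printf output is flushed here.
    cudaError_t err = cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main() { return run_demo(); }
```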

Profile Applications in CUDA

Summary: In this post, I will introduce how to use nvprof to profile your CUDA applications. Details: When optimizing an application, it is good practice to dig deeper and measure how much time each kernel and each CUDA runtime API call takes. Intuition: It is not good to use any…
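To have something concrete to profile, here is a small sketch with two toy kernels (scale and add_one, hypothetical names). As a cross-check on the profiler, and as a swapped-in technique not taken from the post, it also times both kernels with CUDA events (cudaEventRecord, cudaEventSynchronize, cudaEventElapsedTime).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two toy kernels so the profiler has something to report.
__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

__global__ void add_one(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Runs both kernels once and returns their combined GPU time in milliseconds,
// measured with CUDA events, or -1.0f on error.
float run_and_time(int n) {
    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    cudaEventRecord(start);
    scale<<<blocks, threads>>>(d, n, 2.0f);
    add_one<<<blocks, threads>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until both kernels have finished

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}

int main() {
    float ms = run_and_time(1 << 20);
    std::printf("kernels took %.3f ms\n", ms);
    return ms < 0.0f ? 1 : 0;
}
```

After compiling with nvcc -o profile_demo profile_demo.cu, running nvprof ./profile_demo reports the time spent in each kernel and in each runtime API call, and nvprof --print-gpu-trace ./profile_demo lists individual GPU operations in order. Note that nvprof does not support GPUs of compute capability 8.0 and newer, where Nsight Systems takes its place.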

Install CUDA 10.1 and Driver 418

Summary: In this post, I will introduce how to install the newest CUDA and the corresponding NVIDIA driver on Ubuntu 16.04. Details: I want to use CUDA for neural network inference. However, after I compiled and ran the executable, it reported that the driver is not compatible with this version of CUDA. I have a GTX 1060 and…
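A quick way to diagnose this kind of mismatch is to compare the CUDA version the installed driver supports with the version of the runtime the program was built against. The sketch below uses the real calls cudaDriverGetVersion and cudaRuntimeGetVersion, which encode versions as 1000 * major + 10 * minor. The helpers decode_version and driver_supports_runtime are hypothetical, and the "driver must be at least as new" rule is a simplifying assumption that ignores the minor-version compatibility introduced in CUDA 11.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// CUDA encodes versions as 1000 * major + 10 * minor (e.g. 10010 -> 10.1).
void decode_version(int v, int* major, int* minor) {
    *major = v / 1000;
    *minor = (v % 1000) / 10;
}

// Simplified rule (assumption): the driver must support at least the
// runtime's CUDA version. Ignores CUDA 11+ minor-version compatibility.
bool driver_supports_runtime(int driverVersion, int runtimeVersion) {
    return driverVersion >= runtimeVersion;
}

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);    // 0 if no driver is installed
    cudaRuntimeGetVersion(&runtimeVersion);  // version this binary was built with

    int dMaj, dMin, rMaj, rMin;
    decode_version(driverVersion, &dMaj, &dMin);
    decode_version(runtimeVersion, &rMaj, &rMin);
    std::printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
                dMaj, dMin, rMaj, rMin);

    if (!driver_supports_runtime(driverVersion, runtimeVersion)) {
        std::printf("Driver is too old for this runtime; upgrade the NVIDIA driver.\n");
        return 1;
    }
    return 0;
}
```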