Compute Capabilities and Throughputs on NVIDIA’s GPUs

Summary In this post, I will introduce the throughputs and compute capabilities of NVIDIA’s GPUs. The post doesn’t contain hardware details. Conclusion It might be common sense that half-precision floats run faster on GPUs, as in this post by Intel. However, it is a different story on NVIDIA’s GPUs. For example, you may…
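Whether FP16 pays off depends on the device’s compute capability: consumer Pascal cards such as the GTX 1060 (compute capability 6.1), for instance, execute FP16 arithmetic at a far lower rate than FP32. A minimal host-side sketch for checking the compute capability of each device, assuming the CUDA runtime API is available:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // The major.minor compute capability determines native FP16 throughput.
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```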

How to debug Async Kernels or APIs in CUDA

Summary In this post, I will introduce how to debug async kernels or async APIs in CUDA. Async operations do not block CPU code. When we check the return value of the function calls, it may be SUCCESS even though there are bugs like "illegal memory access". On the other hand, when we find the…
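A common pattern for surfacing such deferred errors is to check cudaGetLastError() right after the launch and then force a synchronization so that errors raised by the kernel itself are reported; a sketch of that idea:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// kernel<<<grid, block>>>(...);          // launch returns immediately
// CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());   // surfaces errors from the kernel body
```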

Sync and Async in CUDA

Summary In this post, I will introduce the sync and async behaviors in CUDA. Conclusion The following is handy code for testing the behaviors of the CPU and streams. __global__ void cuda_hello1(){ clock_block(10000); printf("Hello World from GPU1!\n"); } __global__ void cuda_hello2(){ printf("Hello World from GPU2!\n"); clock_block(10000); } void cpu_hello() { printf("hello world from cpu?\n"); } /* hello…
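To see the ordering for yourself, here is a minimal self-contained sketch (assuming device-side printf is available) that launches a kernel on a stream and lets the CPU keep running:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gpu_hello() { printf("Hello from GPU!\n"); }

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    gpu_hello<<<1, 1, 0, stream>>>();   // launch is async: returns immediately
    printf("Hello from CPU!\n");        // typically appears first, since device
                                        // printf output is flushed at sync points
    cudaStreamSynchronize(stream);      // block until the kernel finishes
    cudaStreamDestroy(stream);
    return 0;
}
```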

Thread Pool in C++

Summary In this post, I will introduce how to build a simple thread pool in C++. Conclusion The code is from here. The thread pool only uses thread, mutex, and condition_variable. #include <thread> #include <mutex> #include <condition_variable> class ThreadPool; // our worker thread objects class Worker { public: Worker(ThreadPool &s) : pool(s) { } void…
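The same three primitives are enough for a compact pool without a separate Worker class; a sketch (count_with_pool is a hypothetical demo helper, not from the post):

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(size_t n) {
        for (size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool() {                         // drains the queue, then joins
        {
            std::lock_guard<std::mutex> lock(mtx_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto &w : workers_) w.join();
    }
    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mtx_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();   // run outside the lock so other workers can dequeue
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mtx_;
    std::condition_variable cv_;
    bool stop_ = false;
};

// Demo helper: enqueue `tasks` increments on `threads` workers, return the count.
int count_with_pool(int tasks, int threads) {
    std::atomic<int> counter{0};
    {
        ThreadPool pool(threads);
        for (int i = 0; i < tasks; ++i)
            pool.enqueue([&counter] { counter.fetch_add(1); });
    }   // ~ThreadPool waits for all queued tasks before returning
    return counter.load();
}
```

Calling count_with_pool(100, 4) returns 100 once the destructor has drained the queue.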

Mutex in C++ and Java

Summary In this post, I will introduce the correct way to use mutex in C++, compared to Java. Conclusion Try not to use std::mutex directly. Use std::unique_lock, std::lock_guard, or std::scoped_lock (since C++17) to manage locking in a more exception-safe manner. Undefined behavior occurs if (a) a mutex is destroyed while still owned by any…
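A minimal sketch of the difference (run_demo is a hypothetical helper added for illustration):

```cpp
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
int shared_value = 0;

// Preferred: the lock is released by ~lock_guard even if the body throws.
void safe_increment() {
    std::lock_guard<std::mutex> lock(m);
    ++shared_value;
}

// Error-prone: an exception between lock() and unlock() leaves m locked forever.
void risky_increment() {
    m.lock();
    ++shared_value;   // if this threw, unlock() would never run
    m.unlock();
}

// Runs `nthreads` threads of 1000 safe increments and returns the total.
int run_demo(int nthreads) {
    shared_value = 0;
    std::vector<std::thread> ts;
    for (int i = 0; i < nthreads; ++i)
        ts.emplace_back([] { for (int j = 0; j < 1000; ++j) safe_increment(); });
    for (auto &t : ts) t.join();
    return shared_value;
}
```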

Profile Applications in CUDA

Summary In this post, I will introduce how to use the tool nvprof to profile your CUDA applications. Details It is a good practice to dive deeper and see how much time each kernel or each CUDA runtime API call takes when you want to optimize your applications. Intuition It is not good to use any…
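A couple of nvprof invocations I find handy as a starting point (./my_app is a placeholder for your executable):

```shell
# Summary of time spent per kernel and per CUDA runtime API call
nvprof ./my_app

# One line per kernel launch and memcpy, in timeline order
nvprof --print-gpu-trace ./my_app
```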

Install CUDA 10.1 and Driver 418

Summary In this post, I will introduce how to install the newest CUDA and corresponding NVIDIA driver in Ubuntu 16.04. Details I want to use CUDA for neural network inference. But after I compiled the executable files and ran them, it told me the driver is not compatible with this version of CUDA. I have a GTX 1060 and…
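Before installing anything, it is worth checking what is already on the machine; per NVIDIA’s release notes, CUDA 10.1 needs driver 418.39 or newer on Linux. A quick check:

```shell
# Driver version currently loaded (CUDA 10.1 needs >= 418.39 on Linux)
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Toolkit version the compiler was built for
nvcc --version
```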

Intel Threading Building Blocks

Summary In this post, I will introduce how to solve a parallel computation task using Intel Threading Building Blocks. Problem In a deep learning platform, given an input containing several thousand images, we want to analyze the data path of a certain deep learning model. The analysis of each image is identical; for example, we…
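TBB’s tbb::parallel_for splits an index range into chunks that worker threads process independently; the same fan-out can be sketched with just the standard library (process_image here is a hypothetical stand-in for the per-image analysis, not the post’s actual code):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-image work; stands in for the model's analysis step.
int process_image(int pixel) { return pixel * 2; }

// Splits [0, n) into one contiguous chunk per thread -- the same
// range-partitioning idea behind tbb::parallel_for + tbb::blocked_range.
std::vector<int> parallel_map(const std::vector<int> &in, unsigned nthreads) {
    std::vector<int> out(in.size());
    std::vector<std::thread> workers;
    std::size_t chunk = (in.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(in.size(), lo + chunk);
        workers.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) out[i] = process_image(in[i]);
        });
    }
    for (auto &w : workers) w.join();
    return out;
}
```

Writes land at disjoint indices, so no locking is needed; TBB adds work stealing and automatic grain sizing on top of this basic pattern.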