GPU Function in C++ | Learn Tech with Rookie

📝Heterogeneous Systems

In modern accelerated computing systems, the CPU distributes computing tasks to the GPU. The GPU then runs those tasks (the CPU can keep working while the GPU runs), and finally the CPU collects the results and outputs them.

##I will cite a book here someday##

Difference between code on CPU and GPU

Here is a code segment from a .cu file:
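A minimal sketch of what such a segment might look like. The function names CPUFunction and GPUFunction are assumptions chosen for illustration. The only difference in the definitions is the `__global__` qualifier, which marks a function that runs on the GPU and is launched from CPU code.

```cuda
#include <cstdio>

// An ordinary C++ function: it runs on the CPU (the "host").
void CPUFunction() {
    printf("This function is defined to run on the CPU.\n");
}

// __global__ marks a kernel: it runs on the GPU (the "device")
// and is launched from CPU code. A kernel must return void.
__global__ void GPUFunction() {
    printf("This function is defined to run on the GPU.\n");
}
```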

Then we run a simple .cu file:
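A complete program could look roughly like this (function names are again assumptions). The kernel is launched with the `<<<blocks, threads>>>` syntax, and `cudaDeviceSynchronize()` makes the CPU wait until the GPU has finished. Without it, the program may exit before the GPU output appears.

```cuda
#include <cstdio>

// Runs on the CPU.
void CPUFunction() {
    printf("Hello from the CPU.\n");
}

// Runs on the GPU.
__global__ void GPUFunction() {
    printf("Hello from the GPU.\n");
}

int main() {
    CPUFunction();

    // Launch the kernel with 1 block containing 1 thread.
    GPUFunction<<<1, 1>>>();

    // The launch is asynchronous: block the CPU until the GPU is done.
    cudaDeviceSynchronize();
    return 0;
}
```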

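We can use nvcc to compile this, for example: nvcc -arch=sm_70 -o hello hello.cu (hello.cu and hello are placeholder file names).

-arch specifies the GPU architecture to compile for (sm_70 is the one used in the NVIDIA learning lab).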

Running kernel functions in parallel

This picture is from an official NVIDIA slide:

Each block has the same number of threads. In the picture above, there are 2 blocks with 4 threads each.

All threads of a kernel function (which we called a “GPU function” earlier) run in parallel.

But this has some side effects caused by the physical implementation of the GPU: for now, the order of the output cannot be controlled. I may talk about this in future notes.
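To see this in practice, here is a small sketch that launches 2 blocks of 4 threads and lets every thread report who it is. The kernel name whoAmI is an assumption. The order of the printed lines is not guaranteed and may differ between runs.

```cuda
#include <cstdio>

// Every thread prints the index of its block and its own index inside that block.
__global__ void whoAmI() {
    printf("block %u, thread %u\n", blockIdx.x, threadIdx.x);
}

int main() {
    // 2 blocks, 4 threads per block: 8 threads in total.
    whoAmI<<<2, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}
```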

Note that to satisfy the condition (threadIdx.x == 1023 && blockIdx.x == 255), we choose <<<256, 1024>>>, because indices start from 0 just like array elements: 256 blocks are numbered 0 to 255, and 1024 threads are numbered 0 to 1023.
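A sketch of a kernel with this condition (the kernel name printSuccess is an assumption). Only the last thread of the last block passes the test, so the message is printed exactly once.

```cuda
#include <cstdio>

// Only thread 1023 of block 255 prints; with <<<256, 1024>>> that is
// the last thread of the last block.
__global__ void printSuccess() {
    if (threadIdx.x == 1023 && blockIdx.x == 255) {
        printf("Success!\n");
    }
}

int main() {
    // 256 blocks (indices 0..255), 1024 threads per block (indices 0..1023).
    printSuccess<<<256, 1024>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

With a smaller configuration such as <<<256, 512>>>, no thread would ever have threadIdx.x == 1023, and nothing would be printed.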

Accelerating a 'for' loop
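As a starting point, here is a minimal sketch that puts a CPU loop next to a GPU version of it. The function names cpuLoop and gpuLoop and the loop length 8 are assumptions for illustration.

```cuda
#include <cstdio>

// CPU version: a single thread walks through all iterations one by one.
void cpuLoop(int n) {
    for (int i = 0; i < n; ++i) {
        printf("CPU iteration %d\n", i);
    }
}

// GPU version: there is no loop. Each thread performs the one
// "iteration" whose number equals its threadIdx.x.
__global__ void gpuLoop() {
    printf("GPU iteration %u\n", threadIdx.x);
}

int main() {
    const int n = 8;
    cpuLoop(n);

    // 1 block with n threads: thread i plays the role of iteration i.
    gpuLoop<<<1, n>>>();
    cudaDeviceSynchronize();
    return 0;
}
```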

In the code above, we achieve parallel acceleration by replacing the loop variable with threadIdx.x.

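What if we want to map a vector (such as the integers 0 to 7) onto blocks (such as 2 blocks of 4 threads each)?

In our example, we have blockDim.x = 4.

Each thread computes its global index as threadIdx.x + blockIdx.x * blockDim.x, so integer 6 = 2 + 1 * 4 is handled by thread 2 of block 1.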
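The mapping can be sketched as follows (the kernel name printGlobalIndex is an assumption). Each of the 8 threads computes a distinct global index from 0 to 7.

```cuda
#include <cstdio>

// Each thread combines its block index and thread index into one
// global index that is unique across the whole grid.
__global__ void printGlobalIndex() {
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    printf("integer %u = %u + %u * %u\n",
           idx, threadIdx.x, blockIdx.x, blockDim.x);
}

int main() {
    // 2 blocks of 4 threads cover the integers 0..7.
    printGlobalIndex<<<2, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}
```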

As we can see, the order of the output is a mess.

Memory Allocation and Deallocation

To allocate memory that both the CPU and the GPU can access through the same pointer, just replace malloc and free with cudaMallocManaged and cudaFree.

Example: double each integer in an int array.
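A minimal sketch of this example, assuming an array of 8 integers and a launch with 2 blocks of 4 threads, so that there is exactly one thread per element. The kernel name doubleElements is an assumption.

```cuda
#include <cstdio>

// Each thread doubles the one element that matches its global index.
__global__ void doubleElements(int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] *= 2;
}

int main() {
    const int n = 8;
    int *a;

    // Instead of malloc: memory reachable from both CPU and GPU.
    cudaMallocManaged(&a, n * sizeof(int));

    for (int i = 0; i < n; ++i) a[i] = i;   // initialize on the CPU

    doubleElements<<<2, 4>>>(a);            // 2 * 4 = 8 threads, one per element
    cudaDeviceSynchronize();                // wait before reading on the CPU

    for (int i = 0; i < n; ++i) printf("%d ", a[i]);
    printf("\n");                           // prints: 0 2 4 6 8 10 12 14

    cudaFree(a);                            // instead of free
    return 0;
}
```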

What if the number of elements in the vector is smaller than the total number of threads?
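One common way to handle this, shown here as a sketch rather than the only option, is to pass the element count to the kernel and let each thread check that its index is in range. The example assumes 6 elements and 4 threads per block, so 2 blocks (8 threads) are launched and the last 2 threads do nothing.

```cuda
#include <cstdio>

// Threads whose global index is past the end of the array skip the work,
// so they never touch memory outside the array.
__global__ void doubleElements(int *a, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        a[i] *= 2;
    }
}

int main() {
    const int n = 6;
    const int threadsPerBlock = 4;
    // Round up so that blocks * threadsPerBlock >= n (here: 2 blocks).
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    int *a;
    cudaMallocManaged(&a, n * sizeof(int));
    for (int i = 0; i < n; ++i) a[i] = i;

    doubleElements<<<blocks, threadsPerBlock>>>(a, n);
    cudaDeviceSynchronize();

    for (int i = 0; i < n; ++i) printf("%d ", a[i]);
    printf("\n");                           // prints: 0 2 4 6 8 10

    cudaFree(a);
    return 0;
}
```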
