
Playing with CUDA
Sun Dec 28 12:07:22 UTC 2008

I decided to take a look at CUDA, NVIDIA's technology for parallel programming on their GPUs from G80 onwards (GeForce 8x00 and later). I never had any experience with parallel programming, mostly because I always used a single-core system. Recently I wrote a couple of threaded applications, which provided a nice speedup, but the true performance beast in a modern system is the GPU. Before CUDA, if you wanted to program a GPU you had to write pixel shaders and do some hacks to send data to the card and read data back from it. With CUDA you don't need all these hacks: you program the card -almost- directly using a simple API and the C language. The C part runs on the CPU (it is actually C++-ified C, and plain C probably works only in Windows, since in the Linux version it always used the C++ backend no matter which backend language I specified) and only a subset of C is available on the GPU, although this seems to be a temporary technical limitation of today's cards rather than something that will be around forever. Since C wasn't designed for parallel programming, CUDA provides a few (but thankfully only a few) extensions to the language. You simply "mark" the functions and variables that will live on the GPU, and when you call these functions you provide some extra information about how the function will be executed. The call is then executed asynchronously on the GPU.


In general it's a nice and mostly clean environment to work in, although this cleanness has a lot of dirt beneath it: the nvcc compiler -the compiler you use to compile CUDA source code- is basically some sort of electronic duct tape that keeps everything together. It works -and that's the important part- but I can't shake the feeling that the system is fragile. The thing is, when you run nvcc it splits your code into several .c/.cpp files, some of which will run on the CPU and some on the GPU. The GPU parts are converted to ptx assembly and compiled to cubin, a virtual machine bytecode that the driver probably converts to GPU instructions. The CPU parts are compiled with an OS-specific compiler (Microsoft's compiler in Windows and GCC in Linux and probably Mac OS X) and linked to a library (cudart) which uses a so-called "driver API" to load the cubin files. Your calls to GPU functions seem to be converted to driver API calls that run these cubin files. Note that I don't mean that this design is a flaw, just that I believe it was done in a mostly hack-ish way. In any case, if you don't want to use the extended CUDA C you can use the driver API directly from any language you want. In that case, however, things become much harder because you lose the ability to mix GPU and CPU calls, you need to compile the parts that will be executed on the GPU externally, and you cannot use the higher level API at all.


Of course that's where the negative parts (if you can call them that) of the impression I've got from CUDA so far end. Some of them might come from my misunderstandings of the system (and I already had my share of those). In general CUDA is a very interesting technology because it opens the world of parallel programming to anyone - anyone who has a recent NVIDIA GPU at least. But as NVIDIA's presentations say, a GTX280 is basically a "$599 supercomputer", so if you're interested the entry price is very small (as long as you have a computer strong enough to handle the card of course). Also the price has dropped since that was written - in local shops I can find a GTX280 for 350 euros (~$490).


I began my CUDA trip by reading a very good tutorial on CUDA by Rob Farber on Dr. Dobb's. The tutorial is laid out in the way I prefer: on-screen results first, explanation second. Once I did the examples mentioned there (with a few modifications of course, and after I made a jEdit CUDA syntax highlighting file and a SCons tool for CUDA), I started reading CUDA's Programming Guide. The guide is very well written, in fact one of the best written guides I've read. I read it in a single "stroke" which took me a few hours. Although this means that I missed some details (especially near the end when I got a bit tired), I decided to do it this way so I could get an overall idea of what it is all about. Once I finished that, I decided to dive head first into writing a CUDA program. An obvious choice for a highly parallel system is a raytracer. Or at least it was obvious to me, having programmed 3D graphics for years. Anyway. After messing a lot with the reference and the CUDA forums (where really helpful people hang out!), I made the little raytracer you can see at the left - only a bit slower. The first version of the raytracer used glDrawPixels, which was slower -about 450fps- while the one at the left uses a texture.


While I was making this program, I stumbled on a problem: when I tried to add antialiasing by firing extra rays, the kernel (the part that runs on the GPU hundreds of times) stopped working! Initially I thought that I had made the kernel too big (a common problem when I was programming pixel shaders). So I removed the AA part and made a post on the CUDA forums about the raytracer, asking if kernels are something more primitive than shaders. It turned out that my problem was that I ran out of registers! You see, you program the GPU by writing a small function, the kernel. This kernel runs in a -lightweight- thread, each thread runs in a block of threads and each block runs in a grid of blocks. Each time you call a kernel function, you specify the grid size, the block size and some other optional stuff. Well, each block has a predefined set of registers - 8192 for G80, 16384 for GT200. A great number, right? That's what I thought, but I missed the part in the documentation that said that these registers are shared between the threads of the block. My initial execution configuration was 512 blocks of 512 threads each - basically one line per block and one pixel per thread. This configuration left only 32 registers per thread (16384 block registers / 512 threads) on my GT200-based GPU. Before adding the extra ray, the nvcc compiler generated code that used 30 registers per thread. After adding it, it generated code that used 40 registers per thread (I assume that before the extra ray the compiler optimized away some variable references, something it couldn't do with the second call to the ray function). So what I initially thought was an instruction count limitation was actually a register count limitation (the guide says that a kernel can have up to 2 million instructions, but I'm not sure how many ptx instructions the nvcc compiler generates per line so I hadn't ruled that case out).
Solving the issue wasn't hard: I just had to run the compiler with the --ptxas-options=-v argument to see how many registers per thread the kernel uses and modify the code so that it uses fewer than 32 registers. I did that and indeed it worked. Then I added antialiasing, color and even a second sphere to the rendering.
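
The register arithmetic above can be sketched in a few lines of plain C++ (no CUDA needed). This is only a model of the budget as described in this post: the function names are made up for illustration, and treating "fits" as "needs no more than the per-thread share" is an assumption.

```cpp
#include <cassert>

// Registers each thread gets when a block's registers are split evenly
// between its threads (the model described in the post).
int registersPerThread(int registersPerBlock, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return registersPerBlock / threadsPerBlock;
}

// Assumption: a kernel fits when its per-thread register use does not
// exceed the per-thread share computed above.
bool kernelFits(int registersPerBlock, int threadsPerBlock, int registersNeeded) {
    return registersNeeded <= registersPerThread(registersPerBlock, threadsPerBlock);
}
```

With the numbers from this post, registersPerThread(16384, 512) gives 32, so the 30-register kernel fits and the 40-register one does not.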


But so far the only feature I had used that raycasters are known for was a shadow from the light source. Raycasters are also known for their nice reflections, so I had to add reflections. An issue with CUDA is that the functions that run on the GPU can't be recursive because there is no real stack. The function arguments are passed in static shared memory and that's it. If you want recursion you have to do it manually, by using a part of the shared memory (or global memory, but that's slow) as a stack and writing a loop that pushes and pops data on this stack as needed. Fortunately adding reflections didn't need any stack, so I only made a loop that runs for a variable number of iterations (specified on the host/CPU side) and modifies the current data for the next iteration. At this point, using one iteration gave the same result as before, that is the render without reflections. Adding a second iteration made the kernel not run: I had hit the register count issue again. So I thought that I had to get rid of this issue for good and use fewer threads per block. After messing a bit with the code (and a crash because I calculated something badly and made the GPU write somewhere it shouldn't - a mistake that made my system hang), I broke the single call to the kernel into four calls with grids of 512 blocks, each block having 128 threads. With 128 threads per block I now had 128 registers per thread, a number that provided a much more comfortable environment for the kernel - and the program would also run on G80 hardware, since the 64 registers per thread available there are more than enough for what it was doing. I tried the code and finally it worked - I could see reflections! Although they weren't correct. But that wasn't a CUDA issue so I won't go into details - I just miscalculated the reflected ray directions. For some reason the miscalculated directions produced believable results for the spheres, but not for the plane. After a while, however, I found the bug and squashed it for good.
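
To illustrate the "loop instead of recursion" idea, here is a minimal sketch in plain C++. It is not the raytracer's actual code: the scene (a mirror floor at y = 0 and a mirror ceiling at y = 1), the brightness values and the way each bounce is weighted are all assumptions made up for the example. The reflected direction uses the usual formula r = d - 2(d.n)n.

```cpp
struct Vec3 { float x, y, z; };

float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Mirror direction d about the unit normal n: r = d - 2 (d . n) n
Vec3 reflect(Vec3 d, Vec3 n) {
    float k = 2.0f * dot(d, n);
    return Vec3{ d.x - k * n.x, d.y - k * n.y, d.z - k * n.z };
}

struct Hit { bool found; Vec3 point; Vec3 normal; float brightness; };

// Hypothetical scene: floor at y = 0 (brightness 0.5), ceiling at y = 1
// (brightness 1.0). Rays travelling exactly sideways hit nothing.
Hit intersect(Vec3 o, Vec3 d) {
    if (d.y == 0.0f) return Hit{ false, o, Vec3{ 0, 0, 0 }, 0.0f };
    bool down = d.y < 0.0f;
    float t = down ? (-o.y / d.y) : ((1.0f - o.y) / d.y);
    Vec3 p = Vec3{ o.x + t * d.x, o.y + t * d.y, o.z + t * d.z };
    return down ? Hit{ true, p, Vec3{ 0, 1, 0 }, 0.5f }
                : Hit{ true, p, Vec3{ 0, -1, 0 }, 1.0f };
}

// Iterative tracing: each pass shades the current hit, then overwrites
// the ray with its reflection for the next pass. No stack is needed.
float trace(Vec3 origin, Vec3 dir, int iterations, float reflectivity) {
    float color = 0.0f;
    float weight = 1.0f;
    for (int i = 0; i < iterations; ++i) {
        Hit h = intersect(origin, dir);
        if (!h.found) break;
        color += weight * h.brightness;
        weight *= reflectivity;      // each bounce contributes less
        origin = h.point;
        dir = reflect(dir, h.normal);
    }
    return color;
}
```

With a single iteration this returns only the directly visible surface, which matches the "one iteration is the render without reflections" behaviour described above.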


With this working, I decided to see how the CUDA-based raytracer would compare with a CPU-based raytracer. Initially I thought of using the CUDA emulation library. I compiled the program with emulation enabled and executed it. The results were puzzling: less than one fps, even without reflections and antialiasing. I looked in the forums and, as one would guess, people said that emulation carries a great performance penalty and is not a recommended way to benchmark the difference between CPU and GPU code. Fortunately CUDA allows you to declare that a function can be executed on both the CPU and the GPU, so I declared all functions that were to be executed on the GPU as being able to execute on the CPU too, and moved the code from the main kernel function to a different function that the kernel calls. Then I wrote a loop that emulates how the kernel function would call the other function, did some extra work here and there to get things shown on the display, etc. and voila: the CPU version of the raytracer was working perfectly, with exactly the same results as the CUDA version. The speed wasn't that bad - with antialiasing disabled and no reflections I had about 14 fps. Adding extra reflection bounces had a performance hit, but the big hits came from antialiasing. At this point I had about 3fps in the default configuration on the CPU and 437-440fps in CUDA. The speedup provided by CUDA is remarkable, although I should mention that the CPU version didn't take advantage of my quad-core since only one core was used. I believe those 3fps would increase to about 10-12fps with a multithreaded version. But even so, the difference between 12fps and 440fps is huge. And that's not all.
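
The CPU fallback described above boils down to a pair of nested loops that stand in for the grid and the block. The sketch below is plain C++ and only shows the shape of the idea: shadePixel and renderOnCpu are hypothetical names, and the "shading" is a dummy gradient rather than the real ray code. In CUDA C the shared function would carry the __host__ __device__ qualifiers so that both the kernel and the CPU loop can call it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-pixel function shared by both paths.
// In CUDA C it would be declared __host__ __device__.
unsigned char shadePixel(int x, int y) {
    return static_cast<unsigned char>((x * 16 + y) % 256);
}

// CPU loop that visits pixels the way the grid would: one outer step
// per block, one inner step per thread inside that block.
void renderOnCpu(std::vector<unsigned char>& image, int width, int height,
                 int blocks, int threadsPerBlock) {
    assert(image.size() == static_cast<std::size_t>(width) * height);
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threadsPerBlock; ++thread) {
            int index = block * threadsPerBlock + thread;
            if (index >= width * height) continue;  // skip padding threads
            image[index] = shadePixel(index % width, index / width);
        }
    }
}
```

When threadsPerBlock equals the image width, each block handles one line and each thread one pixel, which is the layout of the original 512x512 configuration.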


In the forums people told me that I could replace the four kernel calls with one. Up to that point I was thinking that the maximum number of blocks I could use was 512 - but that is not right: I had mixed up the maximum number of blocks per grid with the maximum number of threads per block! As it turns out the grid can be up to 65535x65535 blocks, and the 512 number was the maximum number of threads per block - that means a block can have up to 512 threads. So I modified the code to use 2048 blocks of 128 threads each - basically lumping the four calls into one. I saw a very slight speedup, so I thought: the more blocks the faster!!1 (yes I thought that one too). So I changed that 2048x128 to 4096x64. Hurray, another speedup! The next step was 16384x16. Oops, 304fps. What's wrong here? From what I have understood so far, when the threads in each block do the same stuff and access the memory in patterns the GPU likes, everything is executed concurrently, the flow is optimal and the world is at peace. If however this isn't the case - like if every thread has its own flow - the scheduler sorts the threads based on their flow (so threads with the same flow are grouped together) and then the threads are executed in order based on their flows. This means that the execution is going to be slower than if they were all executed in parallel. To reduce this effect you can use more blocks with fewer threads, since each block is mostly independent from the other blocks, so more threads will be executed in parallel (note: I'm not 100% sure about this, but so far this is what I've understood - if I find that I'm wrong I will add an update at the end of this entry). That explains why using more blocks with fewer threads made the raytracer faster. However when I used 16384x16 (and later 8192x32) the rendering was much slower. The reason is simple - and mentioned in the programming guide in two places: the scheduler likes 64 and its multiples.
It performs best if you use 64 threads per block or a multiple of that, and badly if you use something else. In fact it's highly recommended to use at least 64 threads. Not having too few threads has another speed benefit too: the GPU has a bunch of multiprocessors (a GTX280 has 30 of them) and each of them executes some thread blocks. Although local-to-multiprocessor memory (registers and shared memory) is considerably faster than on-PCB memory (global, local and texture memory), accessing it still needs some time. Having many threads active on each multiprocessor (the guide recommends at least 192), each doing different stuff, hides these delays because while one thread accesses the memory another might do some computations and vice versa.
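
All the configurations tried above cover the same number of threads; only the split changes. A small plain C++ sketch of that bookkeeping follows. The function names are made up, and the check simply encodes the two rules quoted in this post (at most 512 threads per block, and a multiple of 64); it says nothing about which configuration is actually fastest.

```cpp
#include <cassert>

const int kMaxThreadsPerBlock = 512;  // limit quoted in the post
const int kPreferredMultiple = 64;    // "64 and its multiples"

// Total number of threads launched by a blocks x threadsPerBlock grid.
long totalThreads(long blocks, long threadsPerBlock) {
    assert(blocks > 0 && threadsPerBlock > 0);
    return blocks * threadsPerBlock;
}

// True when the block size respects both rules of thumb from the post.
bool isRecommendedBlockSize(int threadsPerBlock) {
    return threadsPerBlock > 0 &&
           threadsPerBlock <= kMaxThreadsPerBlock &&
           threadsPerBlock % kPreferredMultiple == 0;
}
```

By this check 2048x128 and 4096x64 pass while 16384x16 and 8192x32 do not, which lines up with the slowdowns seen above.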


So after trying a few numbers, I decided to keep the 4096x64 one. It's basically the fastest and it complies with the '64 or a multiple of it' rule for the number of threads. With that configuration the raycaster now reaches ~560fps, an increase of ~120fps only by modifying some numbers! As the guide states, to find the best numbers you need to experiment. The CUDA SDK includes an Excel spreadsheet (seems to work in OpenOffice too) that you can use to calculate the best numbers, although I didn't use it. I might try it at some point later though.


Working with CUDA on a parallel system that has 240 cores is a little different from working with a single-core system. It requires a different way of thinking, and you need to develop the skill to think in terms of parallel programming if you want to take advantage of the system. A raycaster is a no-brainer for such a system, but I'm interested in applying parallel computing to other problems. Since I'm interested mostly in graphics and related algorithms it will probably be easier than -say- applying it to a problem related to databases or finance, although I'm sure there are places where parallel systems can be used there too. While currently I don't see CUDA being used in end-user applications -it requires some of the latest cards in the market from a single manufacturer who only has a 30-35% market share- and I cannot see it becoming something more than a "GLIDE for parallel systems", it is the only working, documented and tested system dedicated to parallel programming that is available on consumer systems, and basically to bedroom programmers like you and me. While the fine details probably won't be the same, once such systems become standardized through open standards -like OpenGL- and supported by many manufacturers, anyone who has used CUDA will be able to move to these systems with little effort. In fact, the newly proposed (and accepted by Khronos) OpenCL -Open Computing Language- seems to be a move in this direction and according to several people it looks a lot like CUDA. However OpenCL is still very young and not supported by anyone except Apple, who said it will be available in the next version of their Mac OS X (Apple basically designed it).
I hope Khronos will handle OpenCL better than they handled OpenGL so we can have true support on all mainstream systems for a common parallel computing language, because I'm sure Microsoft isn't going to let this slip through their hands and will make their own incompatible version, available only on Windows and leading to yet another OpenGL and Direct3D situation.


In the meantime I'm personally going to experiment more with CUDA and see what I can do with this raytracer, find out what can be done in CUDA in general and try to make a couple of other programs I have in mind that could (or could not - I'll find out) take advantage of a parallel system. Also CUDA is basically the first step towards a future graphics system where everything is programmable and you can implement your own rendering algorithms. Really, what is even more interesting about CUDA is that you have very few limitations -temporary ones, according to an NVISION presentation- and a great number of possibilities for what you can do with the system.


Some of the methods I want to test in CUDA are rendering CSG solids directly, volumetric/voxel rendering, heightmap raycasting, worlds based on intersections of inverted spheres (where the interior of a sphere is empty and the outside area is solid), etc.


You can find the raytracer I made here, although the version there may be newer than the one I describe above.


Post your comments in the forum


A New Blog
Wed Nov 19 21:51:40 UTC 2008

I decided to start a new blog at badsectoracula.com. I didn't try to do that before because badsectoracula.com doesn't use PHP, MySQL or other dynamic stuff, but offline-generated HTML pages made by a custom program I wrote in Free Pascal, so I would need to implement a blog system within that program. I avoided doing it for a while, but now I decided to sit down and modify the site generator to support a blog. Since the system is based on offline generation of the HTML files, it cannot support comments. However a substitute is available: the site's forum (which, btw, is based on Mini Forum, my own forum software written as a CGI in Free Pascal too).
Post your comments in the forum


Copyright © 2007-2010 Kostas Michalopoulos