There is one more tool that I want to cover in a little bit more detail before presenting a round-up of the best of the rest (there really are so many good ones out there that this mini-series could last for months or years if I wrote a post for each one). The last article presented the CLRProfiler, a tool to help you manage your garbage and ensure it is being collected properly. Careless garbage collection is often the cause of poor performance on the CPU – but the CPU is only half the story, and to find out what’s causing poor performance on the GPU, you will need to use PIX (available free as part of the DirectX SDK).
Those of you who downloaded and used my Kensei Dev library might have noticed this comment in the Dev Shapes code:
// TODO I have noted some performance issues with this code when drawing very large
// numbers of shapes, but have not had time to profile it and fix it up yet, sorry!
Recently I had a bit of spare time so decided to go back and revisit this section. I set up a very simple test within Pandemonium, to draw lots and lots of spheres at random positions. I discovered that drawing 1000 spheres, or about 860,000 triangles (which doesn’t seem that many to me), caused the frame rate to plummet to only about 6Hz:
Using tricks that I’ve covered and linked to previously it didn’t take long to determine that the GPU was the bottleneck. (For example, returning early out of the game’s Update method, therefore dropping CPU usage as close to zero as possible, had zero effect on the frame rate). So my next port of call was PIX itself.
PIX (originally an acronym for Performance Investigator for Xbox) is an immensely powerful tool and we’re only going to scratch the surface of it here. At the most basic level, you can think of it as a recorder for absolutely everything that happens on the GPU. You can see exactly when every single function that used the GPU was called, and exactly how long it took. You can even rebuild a frame of your game method call by method call, seeing the results rendered step by step, instead of within a sixtieth of a second.
In this case, I want to see which functions are taking so long within a frame. I therefore chose to sample a single frame, as all frames are likely to be pretty much equal in this case. (If, for example, I was seeing a generally solid frame rate with occasional stutters, I’d have had to have chosen a different option).
After starting the experiment and getting to a point where the frame rate was low, I hit F12 to capture a frame (this can take a second or two). After I’d shut down my game, PIX generated a report:
There’s quite a lot going on in this image so let’s take a look at each section in turn.
The top window shows a graphical timeline. It’s not obvious from this picture, but you’ll see it very clearly when you run PIX for yourself, that the bars on the timeline indicate time when the GPU and CPU are busy doing things. As you click along the timeline, the arrows indicate where the GPU and CPU synchronise to the same call. With some classes of performance problem, you’ll see big gaps in one or other processor – these indicate whether you are CPU or GPU bound, for example, if you are GPU bound, you’ll see gaps in the timeline for the CPU where it was waiting for the GPU to catch up. The red circle in the top right of the picture shows the range of calls within our sampled frame (which occurred about 48 seconds in) – it looks mostly empty in the screenshot, but zooming reveals more details.
The middle window shows the DirectX resources in use (remember, XNA is just a layer on top of DirectX) including pixel and vertex shaders, vertex buffers, surfaces and such like. Not of much interest to us at this point.
In the bottom right I’ve selected the Render window. This shows us a preview of the frame as it was constructed. As you advance the cursor along the timeline, this preview is updated – initially getting cleared to Cornflower Blue, then having more and more things drawn onto it. This can be invaluable for detecting overdraw, and is really interesting in its own right. One of my favourite features is the ability to “Debug This Pixel”, which shows every call that affected the colour of any given pixel in the frame. This kind of thing is very useful when investigating transparencies, occluders, quadtrees etc.
Finally, in the bottom left is a list of GPU events, in sequence. Here you can see every call made to the GPU during the sample (note how they are all Direct3D calls, as mentioned above). Using the timeline view, I was able to visually identify which function call was the most expensive. Clicking on that call in the timeline synchronised it in the events window. I’ve circled the call in question. You can see from the StartTime of each event that the call to IDirect3DDevice9::DrawPrimitiveUP took 107349677 nanoseconds, or 107 milliseconds. When you consider that a whole frame normally completes in just 17 or 33 milliseconds, this one function call taking 107ms is a massive limiting factor on my frame rate.
Using a combination of intuition, logic, common sense, and the Render window (clicking on the call previous to DrawPrimitiveUP removed all the spheres from the preview, so it’s obvious what it was drawing!) I identified the corresponding code in my XNA program:
0, s_triangle3DVertices.Count / 3 );
You may not think this tells me very much. I already knew that the Kensei.Dev rendering code was slow, that’s why I fired up PIX in the first place! In fact, this is hugely valuable information. I know exactly which line of code is causing my GPU to run like a dog with no legs.
As this is a call to DrawUserPrimitives, it seems likely that the reason for this call being so slow lies in the User part of the method name. That is to say, the Kensei.Dev code builds up an array of vertices (s_triangle3DVerticesArray) each frame, and passes that into the function. This involves copying all those 860,000 triangles from main memory into the GPU memory, and is in contrast to using a vertex buffer, which lives on the GPU. If I can find a way to use native GPU resources and avoid the User methods, I may get a substantial speed boost; on the other hand, the User methods exist for the very usage scenario I’m using here, which is of vertices that can arbitrarily change position from frame to frame and which are controlled by the CPU.
Alternatively, it was suggested on the XNA Creators forums that I may be expecting the GPU to do too much in one go, and that splitting up the calls into smaller batches may improve performance. This is somewhat contrary to my understanding of modern GPUs, which, I was led to believe, vastly prefer to perform fewer operations on larger datasets than more operations on smaller datasets; nevertheless I am far from a GPU expert so will be taking that advice, and experimenting with splitting the vertex array/buffer into smaller pieces to see if this improves matters.
There are a few more possibilities as well. I’d like to say this story has a happy ending, but it doesn’t, at least not yet – I am hopeful for the future. I am still trying to solve this problem and find how to avoid this bottleneck. However, whenever investigating performance it is absolutely essential to base your observations and lines of inquiry on hard evidence. At the beginning of this article, I knew that “something in Kensei.Dev is slow”. PIX has since revealed that “DrawUserPrimitives is taking over 100ms to draw 860,000 triangles”. This will allow me to precisely focus my efforts, and hopefully find a correct, performant fix for the problem.
PIX has an awful lot more to offer than just single-frame samples and as your game nears completion you will probably find a lot of value in it. There are a lot of bugs that simply can’t be solved any other way, and if you are doing anything remotely clever with your graphics I strongly encourage you to learn about what PIX can do for you.
“My mistress’ eyes are nothing like the sun…” – William Shakespeare