Friday, December 21, 2007

Updated OpenGL Performance Lore

Performance lore (e.g. "I did X on my app and it got faster so you should do Y") is inherently dangerous, because performance tuning without performance analysis is a bad idea. Still, I think it has a use in at least pointing out possible areas to explore.

With that in mind, some notes on X-Plane 9 performance:
  • It doesn't take a very complex pixel shader to become fill-rate bound! In version 8, whose shaders were almost like the fixed-function pipeline, X-Plane was almost always limited by the CPU and our ability to find and emit batches, even at high FSAA. With X-Plane 9, even on high-end cards, even modest-sized shaders are the limiting factor. Never underestimate the cost of doing something ten or twenty million times!
  • Overdraw is bad! As soon as you get into shaders, you become fill-rate bound, and then overdraw becomes very expensive. So using blending and glPolygonOffset to composite into the frame buffer becomes a bad idea.
  • Geometry performance: putting index buffers into VBOs gained us some speed.
  • Optimizing down the number of vertices by more aggressively using indexing gained some speed, and never hurt, even if the mesh was a bit out of order, even on older hardware.
  • Using 16-bit indices on Mac made no difference - perhaps they're not natively supported?
  • Using the "static" VBO hint instead of "stream" for non-repeating geometry gave a tiny framerate improvement in a few cases, but mostly caused massive performance failures.
A note on this last point: the driver essentially has to guess a resource usage pattern from your API calls. Many drivers use a most-recently-used (MRU) purge scheme: once you run out of VRAM, the driver throws out the thing you used most recently and replaces it. With a true FIFO, we'd guarantee the worst case: every resource would be flushed and reloaded into VRAM every frame, so MRU makes sense.

The stream vs. static VBO hint usually determines whether your VBO is placed in VRAM (static) or simply used in place via AGP-mapped memory (stream). If your total working set is larger than VRAM (so something is going to go over the bus anyway) and a VBO is used only once, stream is the best choice; save VRAM for things that are used repeatedly, like repeating meshes and textures.
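
As a rough illustration of that rule of thumb, here is a small Python sketch. The function choose_vbo_usage and its parameters are hypothetical names, not anything from X-Plane or a driver; only the two enum values are the real OpenGL constants passed as the usage argument of glBufferData. The "everything fits, so ask for static" branch is an assumption, not something measured above, and the hint remains only a hint: the driver is free to place the buffer wherever it likes.

```python
# Real OpenGL enum values for the usage hint given to glBufferData.
GL_STREAM_DRAW = 0x88E0  # specified once, used at most a few times
GL_STATIC_DRAW = 0x88E4  # specified once, used many times


def choose_vbo_usage(draws_per_upload, working_set_bytes, vram_bytes):
    """Pick a usage hint for one VBO (hypothetical heuristic).

    draws_per_upload:  how many times the VBO is drawn after being filled
    working_set_bytes: total size of everything the scene wants resident
    vram_bytes:        VRAM budget (an estimate; GL does not report this)
    """
    if working_set_bytes <= vram_bytes:
        # Assumption: if everything fits, asking for VRAM is harmless.
        return GL_STATIC_DRAW
    if draws_per_upload > 1:
        # Reused geometry (repeating meshes) earns its place in VRAM.
        return GL_STATIC_DRAW
    # Over budget and drawn once: something must cross the bus anyway,
    # so leave VRAM for the resources that are reused.
    return GL_STREAM_DRAW


if __name__ == "__main__":
    mb = 1024 * 1024
    hint = choose_vbo_usage(draws_per_upload=1,
                            working_set_bytes=600 * mb,
                            vram_bytes=256 * mb)
    print(hex(hint))
```

The sizes in the usage example are made up; the point is only that a one-shot VBO in an over-budget scene ends up with the stream hint.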

So when I tried jamming everything into VRAM, we ran out of VRAM in a big way and started to thrash. MRU works well when we're over budget by a tiny amount, but with a huge shortfall we just end up thrashing.
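
A toy simulation makes the difference between the two purge schemes concrete. This is a sketch under simplifying assumptions, not a model of any real driver: every resource is the same size, each one is touched exactly once per frame in the same order, and simulate and its parameters are made-up names.

```python
from collections import deque


def simulate(num_resources, vram_slots, frames, policy):
    """Return how many uploads to VRAM happen over `frames` frames.

    Assumes vram_slots >= 1 and that every frame touches resources
    0..num_resources-1 once each, in order.
    """
    resident = set()      # resources currently in VRAM
    load_order = deque()  # resident resources, oldest upload first (FIFO only)
    last_used = None      # resource touched most recently (MRU only)
    uploads = 0
    for _ in range(frames):
        for res in range(num_resources):
            if res not in resident:
                uploads += 1
                if len(resident) >= vram_slots:
                    if policy == "fifo":
                        victim = load_order.popleft()
                    elif policy == "mru":
                        victim = last_used
                    else:
                        raise ValueError(f"unknown policy: {policy!r}")
                    resident.remove(victim)
                resident.add(res)
                if policy == "fifo":
                    load_order.append(res)
            last_used = res
    return uploads


if __name__ == "__main__":
    for n in (5, 40):  # slightly over budget vs. hugely over budget
        for policy in ("fifo", "mru"):
            print(n, policy, simulate(n, 4, 60, policy))
```

With 5 resources and 4 slots, FIFO reloads everything every frame while MRU reloads only one or two resources per frame. With 40 resources and 4 slots, both schemes reload nearly everything, which is the thrash described above.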

None of this is really surprising; people have been telling us we'd be shader-bound for years. It just took us a while to get there.
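
To put rough numbers on "ten or twenty million times", here is a back-of-the-envelope sketch. The resolution, overdraw factor, and frame rate are illustrative assumptions, not X-Plane measurements, and the function names are hypothetical. It also ignores FSAA and any early depth rejection, so treat it as an order-of-magnitude estimate only.

```python
def fragments_per_frame(width, height, overdraw):
    """Approximate pixel shader invocations per frame.

    overdraw is the average number of times each screen pixel is shaded.
    """
    return width * height * overdraw


def fragments_per_second(width, height, overdraw, fps):
    """Approximate pixel shader invocations per second at a given frame rate."""
    return fragments_per_frame(width, height, overdraw) * fps


if __name__ == "__main__":
    # Assumed example: 1600x1200 window, each pixel shaded 6 times on average.
    print(fragments_per_frame(1600, 1200, 6))       # 11520000
    print(fragments_per_second(1600, 1200, 6, 30))  # 345600000
```

Under those assumptions, every instruction added to the shader is paid for more than eleven million times per frame, which is why overdraw hurts so much once the shader does real work.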
