20 March, 2011

Time to know your code

Ever wanted to know which parts of your codebase that keeps you from reaching the nirvana of 60 fps smoothness? Tired of not being able to pinpoint and replicate the use-case which causes your game to display strange frame time patterns on your artists' machines? Look no further.

This is the first post in a series of four where I will go through the steps of building a simple but useful profiling library for multithread/multicore games. It's not a profiler on instruction level like VTune and similar tools, but rather a library for measuring and capturing profiling data on a block/function level. It will also provide the functionality to record and analyze performance remotely through a network or by capturing the data stream to a file. The reasoning for reinventing yet another wheel is that I wanted something unobtrusive that was easy to use and had no dependencies.

In general, I want to ask fellow programmers out there who make their code available to the public to please make the effort to cut down on the dependencies. Using other libraries is all well and good, but it makes using your code so much more painful, especially when these libraries in turn have other dependencies. I've seen countless examples of useful bits and pieces of code on the net with horrible dependency chains (Boost being my pet peeve) which makes them more or less useless in my book. After all, if you released the code you probably want people to use it, so why make it more painful than it has to be?

Ok, rant over. My goals for this exercise:
  • Simple and easy to use API
  • Written in C90 for portability and ease of use in other languages
  • No external dependencies except the standard C runtime and OS provided libraries
This first part of the series will focus on solving the core issues of any profiling code, which is the problem of measuring elapsed time. All code for this post (and eventually the entire profiling library) will be released to the public domain through a github repository available at https://github.com/rampantpixels/timer_lib

Since we're working on block/function level, I decided that the high frequency counters available through QueryPerformanceCounter on Windows and clock_gettime on Linux would do. However, there's a slight problem with these functions on some faulty hardware/software that causes the value returned to jump several seconds when the thread migrates between CPUs/cores. To solve this, we can either force the thread to a certain core before calling the function (with SetThreadAffinityMask on Windows), or we can detect this case by comparing the elapsed time with a low resolution reference value which is guaranteed to give consistent results even during core migration. More about this later.

The code is written for Windows and Linux at the moment, but I will update it with MacOS X implementation as soon as I can (basically when I can get my MacBook back, it's currently on holiday at another company). If someone would like to provide implementations for other platforms, just drop me a line and I'll give you write access to the repository.

Time to pin down the API!
 void         timer_lib_initialize();  
 void         timer_lib_shutdown();  
 void         timer_initialize( timer* t );  
 void         timer_reset( timer* t );  
 deltatime_t  timer_elapsed( timer* t, int reset );  
 tick_t       timer_elapsed_ticks( timer* t, int reset );  
 tick_t       timer_ticks_per_second( timer* t );  
 tick_t       timer_current();  
 tick_t       timer_current_ticks_per_second();  
 tick_t       timer_system();  
The main entry points for the library are the initialize and shutdown functions, responsible for setting up any OS specific resources for the timing. Measuring elapsed time is mainly done through a simple timer struct
 typedef struct  
     tick_t       clock;  
     tick_t       ref;  
     tick_t       freq;  
     deltatime_t  oofreq;  
 } timer;  
It keeps the last measured timestamp, a reference value (used to detect the anomaly discussed earlier) and the frequency of the measured ticks. The timer_intialize and timer_reset will store the current timestamp which results in the reported elapsed time being reset to zero. To avoid having to deal with memory allocations, the library requires the user to allocate/free the timer objects, which follows from the mantra of keeping it simple and reducing dependencies to a minimum. Since most engines and frameworks usually have their own memory allocation wrappers, I prefer to make simple libraries like this stay away from that issue instead of having the user pass in function pointers to custom allocators.

There are two methods to query the time elapsed. timer_elapsed which returns elapsed time in seconds, and timer_elapsed_ticks which returns raw "ticks" elapsed. The timer_ticks_per_second can be used to query how many of these "ticks" there are per second.

Finally there are a couple of convenience functions, timer_current which returns a process-wide timestamp (basically an non-resettable global timer) and timer_system which returns a UNIX timestamp (number of milliseconds elapsed since the epoch).

Regarding implementation it's all pretty straightforward, except for the slight problem I mentioned earlier. Since the reported elapsed tick count can jump unexpectedly when the thread migrates between cores, I've chosen to use timeGetTime as a reference value and falling back on it in case the performance counter is messed up. This of course introduces yet another problem: The timeGetTime returns a 32 bit value which wraps around. If that happens, I just assume the performance counter is behaving during that specific call. This is of course Windows-specific, but so far I've not been able to detect a case where clock_gettime bails out.
 tick_t timer_elapsed_ticks( timer* time, int reset )  
     tick_t dt = 0;  
 #if defined( _WIN32 ) || defined( _WIN64 )  
     tick_t diff, refdiff;  
     deltatime_t timerdiff;  
     tick_t curclock = time->clock;  
     tick_t ref   = time->ref;  
     QueryPerformanceCounter( (LARGE_INTEGER*)&curclock );  
     ref = timeGetTime();  
     diff  = curclock - time->clock;  
     refdiff = ref - time->ref;  
     if( refdiff < 0 )  
         refdiff = (tick_t)( 1000.0 * diff * time->oofreq ); //Catch looping of the millisecond counter  
     timerdiff = (deltatime_t)( ( diff * time->oofreq ) - ( refdiff * 0.001 ) );  
     if( ( diff < 0 ) || ( timerdiff > 0.1 ) || ( timerdiff < -0.1 ) )  
         diff = (tick_t)( ( refdiff * 0.001 ) * time->freq ); //Performance counter messed up, transform reference to counter frequency  
     dt = diff;  
     if( reset )  
         time->ref = ref;  
     tick_t curclock;  
     struct timespec ts = { .tv_sec = 0, .tv_nsec = 0 };  
     int err = clock_gettime( CLOCK_REALTIME, &ts );  
     if( err < 0 )  
     curclock = ( (tick_t)ts.tv_sec * 1000000ULL ) + ( ts.tv_nsec / 1000ULL );  
     dt = curclock - time->clock;  
     if( reset )  
         time->clock = curclock;  
     return dt;  
This concludes the first part of my profiling library implementation. The full code is available in the github repo mentioned earlier. Any comments, rants or suggestions are welcome! The next part will be a discussion about the design and requirements of the actual profiling API.