A Performance Counters Library
for Intel/AMD Processors and Linux

Don Heller
Associate with the Scalable Computing Laboratory
Ames Laboratory, U.S. D.O.E., Iowa State University


Thesis:  All programs should be designed to measure their own performance.

Precondition:  Performance measurement software should be completely portable.

Preclusions:  Performance measurement hardware and operating system interfaces are completely non-portable.

The role of this library is to read and manipulate Intel or AMD processor hardware event counters in C under the Linux operating system.  The user interfaces have been made general and machine-independent.  There is necessarily a conflict between two desires: to allow future implementations on other systems, so that self-measured code can be portable across several languages, and to squeeze as much information as possible, as soon as possible, from the particular processor at hand.

If we view performance measurement as a scientific experiment, then use of the library in the application source code is like building an experimental apparatus, or attaching sensors to an existing apparatus.  A few tests may confirm that the program behaves according to the programmer's predictions.  But why not leave the measuring capability in the program?  There may be something useful that can be learned from more experience with the program in the field, and perhaps the program can adjust its own behavior in response to the measurements.

The simplest summary of the library design is this:

There are only a few principal functions and data types in the library.

The Intel Pentium-series processors include a 64-bit cycle counter, and two 40-bit event counters, with a list of events and additional semantics that depend on the particular processor.  The AMD Athlon processor has a 64-bit cycle counter, and four 48-bit event counters, with a different set of events and semantics similar to the Pentium Pro family.

The library abstracts these details to a type system and compile-time constants allowing (we hope) further implementations.
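The counter widths above matter whenever two raw readings are subtracted.  The following is a minimal sketch, not code from the library: the names demo_counter_mask and demo_counter_delta and the two macros are hypothetical, and the only facts taken from the text are the 40-bit and 48-bit widths.  It assumes the counter wrapped at most once between the two readings.

```c
#include <assert.h>
#include <stdint.h>

/* Event counter widths named in the text; the macro names are hypothetical. */
#define DEMO_PENTIUM_EVENT_BITS 40
#define DEMO_ATHLON_EVENT_BITS  48

/* Mask selecting the low `bits` bits of a 64-bit value (1 <= bits <= 64). */
uint64_t demo_counter_mask(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return (bits == 64) ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
}

/* Events counted between two raw readings of a `bits`-wide counter.
 * Unsigned subtraction followed by masking gives the right answer
 * even if the counter wrapped once between the readings. */
uint64_t demo_counter_delta(uint64_t start, uint64_t end, unsigned bits)
{
    return (end - start) & demo_counter_mask(bits);
}
```

For example, a 40-bit counter read at 2^40 - 10 and then at 5 has advanced by 15 events, and demo_counter_delta reports 15 rather than a huge or negative number.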

Arbitrary programs that do not use the hardware counters, directly or indirectly, can be sampled with rabbit, which runs the program as a child process alongside its own interval timer.  For example, 'rabbit sleep 60' is a simple system monitor.  rabbit multiplexes through a list of events, and can manage an output directory with data files and gnuplot scripts.  Summary reports are easily generated.
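The general shape of such a sampler can be sketched as follows.  This is an illustration only and says nothing about how rabbit itself is written: the function names are hypothetical, the sketch polls with nanosleep() instead of using an interval timer signal, and it only counts how many ticks each event slot would have received instead of touching any hardware counter.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Round-robin multiplexing: which event slot is active during a tick. */
size_t demo_event_for_tick(unsigned long tick, size_t n_events)
{
    return (size_t)(tick % n_events);
}

/* Run argv as a child process and wake up every interval_ms milliseconds
 * until it exits.  ticks_per_event[i] is incremented each time slot i
 * had its turn.  Returns the child's exit status, or -1 on error. */
int demo_run_sampled(char *const argv[], long interval_ms,
                     unsigned long ticks_per_event[], size_t n_events)
{
    if (n_events == 0 || interval_ms <= 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                     /* reached only if exec failed */
    }

    struct timespec interval;
    interval.tv_sec = interval_ms / 1000;
    interval.tv_nsec = (interval_ms % 1000) * 1000000L;

    unsigned long tick = 0;
    int status = 0;
    for (;;) {
        pid_t done = waitpid(pid, &status, WNOHANG);
        if (done == pid)
            break;                      /* child has finished */
        if (done < 0)
            return -1;
        nanosleep(&interval, NULL);
        /* A real sampler would read the counters here and then
         * select the next event in the list. */
        ticks_per_event[demo_event_for_tick(tick, n_events)]++;
        tick++;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called with the argument vector for 'sleep 60' and a one-second interval, this would take roughly sixty ticks spread evenly over the event slots, which is the sense in which sampling an idle child acts as a system monitor.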

There are, of course, some flaws and limitations in the library design and implementation, centered on the process state and the goal of clean measurements.

Hardware performance counters are defined outside the "architectural" register set, and they are not saved and restored on process context switches, either by the hardware or by [unpatched] Linux.  The measurements are therefore attached to the processor, and not to a process or thread.

It is not possible to separate the actions of a daemon, or another user, from the program under test, but it is possible to separate user code from system code according to the privilege level.  Thus, as always, testing for program development should be done with as few other processes in the system as possible.

On a dual-processor system, a process context switch could move the process or thread to the other processor, leaving the selected performance counters behind.  The current implementation makes no serious attempt to deal with dual processors.

The event counters, being rather short, are prone to overflow at high clock rates; it is no consolation to observe that other processors use even fewer bits.  On some processors, an interrupt can be generated when a counter overflows, but we do not observe this.  The user must ensure that counters are read frequently enough, and ask if the results seem reasonable.

To negotiate exclusive access to the counters, and to run some privileged instructions, the /dev/pmc device must be installed.  If every access to the hardware counters goes through the PMC library to /dev/pmc, we can guarantee clean measurements, but this constraint is not enforceable.
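How often is "frequently enough" for the overflow concern above?  A rough bound follows from the counter width and the event rate.  The helper below is a hypothetical sketch, not part of the library, and the event rate is an assumed input that the user would have to estimate for the event being counted.

```c
#include <assert.h>
#include <stdint.h>

/* Time in seconds for a `bits`-wide counter, starting from zero, to wrap
 * when the counted event occurs at a steady events_per_second.
 * Valid for 1 <= bits <= 63. */
double demo_seconds_to_overflow(unsigned bits, double events_per_second)
{
    assert(bits >= 1 && bits <= 63);
    assert(events_per_second > 0.0);
    return (double)(UINT64_C(1) << bits) / events_per_second;
}
```

Assuming an event that fires once per cycle on a 1 GHz processor, a 40-bit counter wraps after about 1100 seconds (roughly 18 minutes), while a 48-bit counter lasts about 281,000 seconds (roughly 78 hours).  Reading the counters well inside that interval keeps a single wrap detectable.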

The library's principal data types to be understood are

pmc_control_t        complete description of a measurement experiment
pmc_event_set_t      event codes for concurrent measurement
pmc_data_t           raw cycle and event counter readings
pmc_counter_t        elapsed time and accumulated event counts
                     components of pmc_data_t
                     components of pmc_counter_t

The library's principal functions are

pmc_getargs()        read command-line options
                     acquire and release the hardware counters
pmc_start()          mark the start of the experiment
pmc_select()         select the events for the hardware counters
pmc_read()           read the counters
pmc_counter_init()   initialize an accumulator
pmc_accumulate()     accumulate the counters from a time interval
pmc_print_results()  report the results
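The distinction between raw readings and accumulators, and the read-then-accumulate pattern suggested by the function list, can be illustrated with stand-in types.  Everything named demo_ below is hypothetical: the actual definitions and signatures of pmc_data_t, pmc_counter_t, pmc_counter_init() and pmc_accumulate() are not given here and may differ.  The sketch assumes two 40-bit event counters and a 64-bit cycle counter, as on the Pentium series.

```c
#include <stdint.h>

#define DEMO_NUM_EVENTS 2   /* assumed: two event counters */
#define DEMO_EVENT_BITS 40  /* assumed: 40-bit event counters */

/* Stand-in for a raw reading: cycle counter plus event counters. */
typedef struct {
    uint64_t cycles;
    uint64_t events[DEMO_NUM_EVENTS];
} demo_data_t;

/* Stand-in for an accumulator: elapsed cycles and summed event counts. */
typedef struct {
    uint64_t elapsed_cycles;
    uint64_t event_counts[DEMO_NUM_EVENTS];
} demo_counter_t;

/* Reset an accumulator to zero. */
void demo_counter_init(demo_counter_t *acc)
{
    acc->elapsed_cycles = 0;
    for (int i = 0; i < DEMO_NUM_EVENTS; i++)
        acc->event_counts[i] = 0;
}

/* Add the interval between two raw readings to an accumulator.
 * Event deltas are masked to the counter width so that a single
 * wrap between the readings is handled correctly. */
void demo_accumulate(demo_counter_t *acc,
                     const demo_data_t *start, const demo_data_t *end)
{
    const uint64_t mask = (UINT64_C(1) << DEMO_EVENT_BITS) - 1;

    acc->elapsed_cycles += end->cycles - start->cycles;
    for (int i = 0; i < DEMO_NUM_EVENTS; i++)
        acc->event_counts[i] += (end->events[i] - start->events[i]) & mask;
}
```

A program would take one raw reading before a region of interest and one after it, then accumulate the pair.  Repeating this around each execution of the region yields totals for the region alone, excluding whatever ran in between.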

For more details, take the Tour by Examples.  At various points along the tour there are philosophical and technical discussions related to the library design and implementation.

Some related sites and projects, with commentary:

Performance-Monitoring Counters Library, for Intel/AMD Processors and Linux
Author: Don Heller, dheller@cse.psu.edu