Friday, December 3, 2010

Measuring the Overhead of the Gathering Tool

Our gathering tool was designed to have a very small overhead, so it's a good idea to find a way to measure the extra load the tool introduces on the system. A large overhead would limit the usefulness of our application, since its goal is to gather information about the CPU, disk and cache while influencing the results as little as possible. The gathered information should represent the resource usage of the application/benchmark we are running in the virtualized environment.

The developed application has several phases:
  • A 'startup' phase in which the tools needed to gather the requested information are started.
  • A 'gathering' phase in which the tools gather information and store it in buffers, so that little extra I/O activity is introduced. During this phase (whose duration is specified by the user) applications/benchmarks can be executed on the virtual machines.
  • A 'stopping' phase in which the gathering tools are stopped and their output is written, unaltered, to temporary files.
  • An 'analysis' phase during which these temporary files are parsed and the required information is extracted. This information is written to an overview file that contains all the CPU, disk and cache results in a formatted manner. This phase can be executed on a separate machine to avoid stressing the monitored system.
Since we want to minimize the influence of running our tool on the results we obtain, the overhead of the 'gathering' phase is what matters most. Because this phase contributes almost exclusively to CPU utilization, we'll measure the overhead it introduces in terms of CPU usage.
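
To make the phase structure more concrete, here is a minimal sketch in Python of how such a driver could look. Everything in it is a placeholder (vmstat/iostat stand in for whatever collectors the tool actually uses), so it is an illustration of the idea rather than our implementation:

    import subprocess
    import time

    def startup():
        # Start the underlying monitoring tools; vmstat/iostat are just
        # placeholders for the collectors our gathering tool really uses.
        return [subprocess.Popen(cmd, stdout=subprocess.PIPE)
                for cmd in (["vmstat", "1"], ["iostat", "-x", "1"])]

    def gather(duration):
        # The tools buffer their output; the application/benchmark runs
        # on the virtual machines during this window.
        time.sleep(duration)

    def stop(procs, outfile="raw_output.tmp"):
        # Stop the tools and dump their unaltered output to a temporary file.
        with open(outfile, "wb") as f:
            for p in procs:
                p.terminate()
                f.write(p.communicate()[0])

    # The 'analysis' phase would then parse raw_output.tmp, possibly on another machine.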

The amount of overhead the tool introduces can be determined by running a CPU benchmark first while the system is 'idle' (our gathering tool is not running) and a second time during the 'gathering' phase of our tool. The overhead is then the change in benchmark score, which can be expressed as a percentage. To characterize the overhead more precisely we'll run the CPU benchmark several times with different loads on the system (generated using the stress tool), since it could be that our tool only introduces overhead when certain events actually occur. The remaining challenge is to find a CPU benchmark that measures how much work the system can still perform in a given amount of time, and one whose instructions match a realistic CPU load rather than a purely artificial load that just burns CPU cycles (such an artificial load could be heavily influenced by very specific system properties).
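
Expressed as a formula, the overhead is simply the relative drop in benchmark score. A small helper illustrates the idea (the numbers below are made up, not measurements):

    def overhead_percentage(idle_score, gathering_score):
        # Relative drop in benchmark score caused by running the gathering tool.
        return (idle_score - gathering_score) / idle_score * 100.0

    # Hypothetical example: 10.30 GFLOPS idle vs. 10.10 GFLOPS while gathering
    # corresponds to an overhead of roughly 1.9 %.
    print(overhead_percentage(10.30, 10.10))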

Wikipedia describes the total amount of time (t) required to execute a benchmark as t = N * C / f, where
  • N is the number of instructions actually executed. The value of N can be determined exactly by using an instruction set simulator.
  • C is the average number of cycles per instruction (CPI) for this benchmark.
  • f is the clock frequency (in cycles per second).
So, all these factors say something about the performance of the system on which the benchmark was executed. FLOPS (FLoating point OPerations per Second) is another related measure of computer performance. It represents the performance of a system when running scientific programs, since their load consists largely of floating point calculations. Note that FLOPS is also the metric used to rank supercomputers by speed.
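
As a quick sanity check of the t = N * C / f formula, here is a worked example with invented numbers:

    N = 2000000000   # instructions actually executed (2 * 10^9, made up)
    C = 1.5          # average cycles per instruction (CPI, made up)
    f = 3.0e9        # clock frequency: 3 GHz
    t = N * C / f    # = 1.0 second to execute the benchmark
    print(t)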

Some great benchmarks do exist; they combine a number of tools that test CPU performance by executing different processing-intensive tasks: data compression, video compression, artificial intelligence, fluid dynamics, speech recognition, ... These tasks are all included in the SPEC CPU2006 benchmark, which calculates a weighted average of the results of the separate tools (each tool determines the CPU speed using a different kind of load) and produces a total score that is not just influenced by certain characteristics of your system, but gives a rather objective view of its performance. This score can therefore be used to compare different systems as well. Another example of such a benchmark is MultiBench by EEMBC. But since these standard benchmarks are rather expensive (an academic license for SPEC CPU2006 costs 200 US dollars), we had a look at some freely available benchmarking tools. Most of them, however, do not support multicore CPU benchmarking, e.g. CoreMark by EEMBC, SuperPi, HINT, ... We did find a widely used benchmarking tool called LINPACK that supports multicore processors and determines the MFLOPS the system is able to achieve.

LINPACK (optimized for Intel processors) can be downloaded here, while LAPACK (Linear Algebra PACKage) can be found here. These tools were originally written in Fortran, but C versions are available as well. The LINPACK benchmark measures how fast a computer solves a dense N by N system of linear equations Ax = b; the solution is obtained by Gaussian elimination with partial pivoting. Running this benchmark on my 'idle' system gave an average of 10.2949 GFLOPS (the full output file can be found here). Running the same benchmark on my 'stressed' system (load generated with the stress tool command: stress --cpu 1 --hdd 1 --vm 1 --vm-bytes 64M) gave an average of 8.6983 GFLOPS. The full output file of this benchmark run can be found here.
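
The comparison above was done by hand; the sketch below shows how it could be automated. The binary name, input file and output parsing are assumptions (taken from Intel's LINPACK package layout) and need to be adapted to the actual build and its output format:

    import re
    import subprocess

    def run_linpack(binary="./xlinpack_xeon64", input_file="lininput_xeon64"):
        # Run the benchmark and return every floating point number in its output;
        # the real parsing depends entirely on the output layout of the binary.
        out = subprocess.Popen([binary, input_file],
                               stdout=subprocess.PIPE).communicate()[0].decode()
        return [float(x) for x in re.findall(r"\d+\.\d+", out)]

    # Generate the same load as above while re-running the benchmark:
    stress = subprocess.Popen(["stress", "--cpu", "1", "--hdd", "1",
                               "--vm", "1", "--vm-bytes", "64M"])
    stressed_results = run_linpack()
    stress.terminate()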

The next step will be to write a Python program that executes this benchmark, starts our gathering tool, and then executes the benchmark again during its 'gathering' phase.
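
A rough sketch of what that program could look like follows. The gathering tool's command line and its --duration flag are hypothetical, and benchmark_score() simply reuses the LINPACK invocation sketched earlier:

    import re
    import subprocess

    def benchmark_score(binary="./xlinpack_xeon64", input_file="lininput_xeon64"):
        # Run the CPU benchmark once and return a representative GFLOPS figure;
        # the binary name and the parsing are assumptions, as in the sketch above.
        out = subprocess.Popen([binary, input_file],
                               stdout=subprocess.PIPE).communicate()[0].decode()
        numbers = [float(x) for x in re.findall(r"\d+\.\d+", out)]
        return sum(numbers) / len(numbers)

    idle_score = benchmark_score()

    # Start the gathering tool so that it enters its 'gathering' phase;
    # the command and the --duration flag are placeholders for the real interface.
    gatherer = subprocess.Popen(["./gathering_tool", "--duration", "600"])
    gathering_score = benchmark_score()
    gatherer.wait()

    print("Overhead: %.2f %%" % ((idle_score - gathering_score) / idle_score * 100))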