
Saturday, April 2, 2011

Benchmarking XenBench using HPL

I created a Python script (see SVN or the Python source file) that first determines how many FLOPS the system reaches without our XenBench application running, to get a point of comparison. For this I used the HPL (High-Performance Linpack) benchmark, which can be found here. The script then determines the number of FLOPS we reach while our XenBench application is profiling, followed by the FLOPS we get when the curses and GUI interfaces are used. Afterwards these measurements are presented in a clear overview that states the change in performance, so that conclusions can be drawn about the amount of overhead.
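The actual script is in SVN; the sketch below only illustrates the idea. The xhpl location, the XenBench commands and the helper names are placeholders for illustration, not the real interface of our tool.

    import subprocess

    XHPL_DIR = "hpl-2.0/bin/Ubuntu"   # assumed location of the built benchmark

    def run_hpl(np=8):
        # Run xhpl through mpirun and return the average Gflops it reports.
        out = subprocess.check_output(["mpirun", "-np", str(np), "./xhpl"],
                                      cwd=XHPL_DIR).decode()
        # On my runs every result line starts with the 'WR' test variant code and
        # has the Gflops value in the last column.
        results = [float(line.split()[-1])
                   for line in out.splitlines() if line.startswith("WR")]
        return sum(results) / len(results)

    def compare(configurations):
        # Run HPL once without XenBench and once per XenBench configuration,
        # then print the change in performance against the baseline.
        baseline = run_hpl()
        print("baseline (XenBench not running): %.4f Gflops" % baseline)
        for name, command in configurations:
            xenbench = subprocess.Popen(command)   # start XenBench in this mode
            gflops = run_hpl()
            xenbench.terminate()
            change = (baseline - gflops) / baseline * 100
            print("%-20s %.4f Gflops (%.1f%% slower)" % (name, gflops, change))

    # The XenBench commands below are placeholders, not the tool's real options.
    compare([("profiling only", ["./xenbench", "--profile"]),
             ("curses interface", ["./xenbench", "--curses"]),
             ("gui interface", ["./xenbench", "--gui"])])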

Installing the HPL benchmark was pretty difficult, so a step-by-step overview is given here:
  1. Download and untar the HPL (Linpack) benchmark from the website.
  2. Make sure openmpi and the other dependencies for HPL are installed: 'sudo apt-get install gcc gcc-4.3 gfortran-4.3 make openmpi-bin libatlas-base-dev libblas-dev libopenmpi-dev'.
  3. Make sure GotoBLAS2 is installed; it can be found here. Untar the download and build the GotoBLAS2 library: './quickbuild.64bit', then copy the shared library: 'cp libgoto2.so /usr/local/lib'.
  4. Download an architecture Make file for Ubuntu here and modify its entries so that the right paths are used. Make sure to place the file in the hpl-2.0 folder.
  5. Build the HPL software: 'sudo make arch=Ubuntu'. For me, this still resulted in the following errors:
    make[2]: Entering directory `/home/kurt/Desktop/hpl-2.0/src/auxil/Ubuntu'
    Makefile:47: Make.inc: No such file or directory
    make[2]: *** No rule to make target `Make.inc'. Stop.
    and
    /usr/lib/libtbb.so not found
    The first problem was solved by removing the 'include Make.inc' lines from the different makefiles; a modified version of HPL can be downloaded here. The second problem was easily solved by installing libtbb: 'sudo apt-get install libtbb-dev'.
  6. Try building HPL again; it should work now! Go to 'hpl-2.0/bin/Ubuntu' and run the benchmark (modify HPL.dat to your preferences first): 'mpirun -np 8 xhpl'. A small sketch for extracting the reported Gflops from the output is given below.
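Each completed run shows up in the xhpl output as a result line (starting with 'WR' in my runs) whose last column is the Gflops value. Assuming the output was saved with 'mpirun -np 8 xhpl > hpl.out', a minimal Python snippet to pull those numbers out could look like this:

    import sys

    # Read a saved xhpl output file; the default name 'hpl.out' is just an assumption.
    path = sys.argv[1] if len(sys.argv) > 1 else "hpl.out"

    gflops = []
    with open(path) as f:
        for line in f:
            # Result lines look like: "WR11C2R4  35000  128  4  4  1408.31  2.023e+01"
            if line.startswith("WR"):
                gflops.append(float(line.split()[-1]))

    if gflops:
        print("runs: %d, average: %.4f Gflops" % (len(gflops), sum(gflops) / len(gflops)))
    else:
        print("no result lines found in %s" % path)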

Friday, December 3, 2010

Measure Overhead of Gathering Tool

Our gathering tool was designed to have very little overhead, so it's a good idea to find a way to measure the extra load this tool introduces on the system. A large overhead would limit the usefulness of our application, since its goal is to gather information about the CPU, disk and cache while influencing the results as little as possible. The gathered information should represent the resource usage of the application/benchmark we are running in the virtualized environment.

The developed application has several phases:
  • A 'startup' phase in which the tools needed to gather the requested information are started.
  • A 'gathering' phase in which the tools gather information and store it in buffers, so that not too much I/O activity is introduced. During this phase (whose duration is specified by the user) applications/benchmarks can be executed on the virtual machines.
  • A 'stopping' phase in which the gathering tools are stopped and their output is written to temporary files (unaltered).
  • An 'analysis' phase in which these temporary files are parsed and the required information is extracted from them. This info is written to an overview file that contains all the CPU, disk and cache results in a formatted manner. This phase can be executed on a separate machine to avoid stressing the monitored system.
Since we want to minimize the influence of running our tool on the results we obtain, it should be clear that we want to minimize the overhead of the 'gathering' phase. Since this phase contributes almost exclusively to CPU utilization, we'll measure the overhead it introduces in terms of CPU usage. A rough sketch of this phase structure is given below.
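The sketch below only shows how the phases could fit together; the commands, file names and class structure are placeholders for illustration, not our actual implementation.

    import subprocess
    import time

    class Gatherer(object):
        """Sketch of the four phases; the gathering tools are given as commands."""

        def __init__(self, duration, tool_commands):
            self.duration = duration          # length of the 'gathering' phase in seconds
            self.tool_commands = tool_commands
            self.procs = []

        def startup(self):
            # 'startup' phase: launch the gathering tools, keeping their output in pipes.
            self.procs = [subprocess.Popen(cmd, stdout=subprocess.PIPE)
                          for cmd in self.tool_commands]

        def gather(self):
            # 'gathering' phase: the tools sample CPU/disk/cache while the
            # applications/benchmarks run on the virtual machines.
            time.sleep(self.duration)

        def stop(self):
            # 'stopping' phase: stop the tools and write their raw output, unaltered,
            # to temporary files.
            for i, proc in enumerate(self.procs):
                proc.terminate()
                with open("gather_raw_%d.tmp" % i, "wb") as f:
                    f.write(proc.communicate()[0])

        def analyze(self):
            # 'analysis' phase: parse the temporary files and write one formatted
            # overview of the CPU, disk and cache results (possibly on another machine).
            pass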

The amount of overhead that the tool introduces can be obtained by running a CPU benchmark first in an 'idle' state of the system (when our gathering tool is not running) and a second time during the 'gathering' phase of our gathering tool. The overhead is represented by the change in the CPU benchmark's score (which can be expressed as a percentage). To better determine the overhead introduced by our tool, we'll run the CPU benchmark several times with different loads on our system (generated using the stress tool); this is necessary because our tool might only introduce overhead when certain events actually occur. The remaining challenge is to find a CPU benchmark that measures how much work the system is still able to perform in a certain amount of time, and that does so with CPU instructions that match a real-life CPU load rather than a purely artificial load that merely burns CPU cycles (since such a load could be highly sensitive to specific system properties).
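A small sketch of this measurement loop is shown below. The stress settings, the './gathering_tool' command and the toy run_cpu_benchmark() scoring function are placeholders; the whole point of this post is to replace that toy loop with a proper benchmark.

    import subprocess
    import time

    def run_cpu_benchmark():
        # Placeholder score: time a fixed amount of floating-point work and return
        # MFLOP/s. A real benchmark (e.g. Linpack) should replace this toy loop.
        ops = 5 * 10 ** 6
        start = time.time()
        x = 0.0
        for i in range(ops):
            x += i * 0.5
        return ops / (time.time() - start) / 1e6

    def overhead_under_load(load_cmd):
        # Run the benchmark once without and once with the gathering tool running,
        # while the given stress command generates background load.
        stress = subprocess.Popen(load_cmd) if load_cmd else None
        try:
            idle_score = run_cpu_benchmark()                   # gathering tool not running
            gatherer = subprocess.Popen(["./gathering_tool"])  # placeholder command
            gathering_score = run_cpu_benchmark()              # during the 'gathering' phase
            gatherer.terminate()
        finally:
            if stress:
                stress.terminate()
        return (idle_score - gathering_score) / idle_score * 100.0

    loads = [("no load", None),
             ("cpu load", ["stress", "--cpu", "2"]),
             ("mixed load", ["stress", "--cpu", "1", "--hdd", "1", "--vm", "1",
                             "--vm-bytes", "64M"])]
    for name, cmd in loads:
        print("%s: %.1f%% overhead" % (name, overhead_under_load(cmd)))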

Wikipedia describes the total amount of time (t) required to execute a benchmark as t = N * C / f (a small worked example follows the list below), where
  • N is the number of instructions actually executed. The value of N can be determined exactly by using an instruction set simulator.
  • f is the clock frequency (in cycles per second).
  • C is the average cycles per instruction (CPI) for this benchmark.
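As a quick worked example with made-up numbers (none of these values come from our measurements):

    # t = N * C / f with illustrative values only.
    N = 2 * 10 ** 9       # instructions actually executed
    C = 1.5               # average cycles per instruction (CPI)
    f = 3 * 10 ** 9       # clock frequency: 3 GHz

    t = N * C / f
    print("estimated execution time: %.2f s" % t)   # prints 1.00 s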
So all these factors say something about the performance of the system on which the benchmark was executed. FLOPS is another, related measure of computer performance; it stands for FLoating point OPerations per Second. This metric gives a good indication of the performance of a system when running scientific programs (since their load consists of a lot of floating-point calculations). Note that FLOPS is also the metric used to rank the speed of supercomputers.

Some great benchmarks do exist; they combine a lot of tools that test CPU performance by executing different processing-intensive tasks: data compression, video compression, artificial intelligence, fluid dynamics, speech recognition, ... These tasks are all included in the SPEC CPU2006 benchmark, which calculates a weighted average of the results of the separate tools (each tool determines the CPU speed using a different kind of load) and produces a total score that is not just influenced by certain characteristics of your system, but gives a rather objective view of its performance. So this score can be used to compare different systems as well. Another example of such a benchmark is MultiBench by EEMBC. But since these standard benchmarks are rather expensive (an academic license for SPEC CPU2006 costs 200 US dollars), we had a look at some freely available benchmarking tools. Most of them, however, do not support multicore CPU benchmarking, e.g. CoreMark by EEMBC, SuperPi, HINT, ... We did find a widely used benchmarking tool called Linpack that supports multicore processors and determines the MFLOPS the system is able to achieve.

Linpack (optimized for Intel processors) can be downloaded here, while LAPACK (the linear algebra package) can be found here. These tools were originally written in Fortran, but C versions are available as well. The Linpack benchmark measures how fast a computer solves a dense N by N system of linear equations Ax = b; the solution is obtained by Gaussian elimination with partial pivoting. Running this benchmark on my 'idle' system gave an average of 10.2949 GFLOPS (the whole output file can be found here). Running the same benchmark on my 'stressed' system (with a load generated using the stress tool command: stress --cpu 1 --hdd 1 --vm 1 --vm-bytes 64M) gave an average of 8.6983 GFLOPS, a drop of roughly 15.5%. The full output file of this benchmark run can be found here.

The next step will be to write a Python program that executes this benchmark, then starts our gathering tool and executes the benchmark again during its 'gathering' phase.