The developed application has several phases:
- A 'startup' phase in which the tools needed to gather the requested information are started
- A 'gathering' phase in which the tools gather information and store it in buffers, so that not too much I/O activity is introduced. During this phase (whose duration is specified by the user), applications/benchmarks can be executed on the virtual machines.
- A 'stopping' phase in which the gathering tools are stopped and their output is written, unaltered, to temporary files.
- An 'analysis' phase during which these temporary files are parsed and the required information is extracted from them. This information is written to an overview file that contains all the CPU, disk and cache results in a formatted manner. This phase can be executed on a separate machine to avoid stressing the monitored system.
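In rough Python terms, the flow of these phases could look like the sketch below. The tool names (mpstat/iostat), the output paths and the buffering details are placeholders, not the actual implementation:

```python
import subprocess
import time

def run_phases(duration, tools=("mpstat", "iostat")):
    """Illustrative only: start gathering tools, let them collect data in
    memory, then stop them and dump their raw output to temporary files."""
    # 'startup' phase: launch each gathering tool with its stdout kept in a
    # pipe (buffered in memory), so no disk I/O happens while gathering.
    procs = [subprocess.Popen([tool], stdout=subprocess.PIPE) for tool in tools]

    # 'gathering' phase: applications/benchmarks run on the virtual machines
    # during this user-specified interval.
    time.sleep(duration)

    # 'stopping' phase: stop the tools and write their unaltered output to
    # temporary files.
    for tool, proc in zip(tools, procs):
        proc.terminate()
        output, _ = proc.communicate()
        with open("/tmp/%s.out" % tool, "wb") as f:
            f.write(output)

    # The 'analysis' phase parses these temporary files afterwards,
    # possibly on a separate machine.
```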
The overhead the tool introduces can be determined by running a CPU benchmark twice: first while the system is 'idle' (our gathering tool is not running) and a second time during the 'gathering' phase of our tool. The overhead is then the change in benchmark score, which can be expressed as a percentage. To get a better picture of this overhead we will run the CPU benchmark several times with different loads on the system (generated using the stress tool), because it could be that our tool only introduces overhead when certain events actually occur. The remaining challenge is to find a CPU benchmark that measures how much work the system can still perform in a certain amount of time, and that does so with CPU instructions resembling a real-life CPU load rather than a purely artificial one, since such an artificial load could be subject to a large number of time-specific system properties.
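As a small illustration, the overhead percentage is simply the relative drop in benchmark score between the two runs (the scores below are made up):

```python
def overhead_percentage(idle_score, gathering_score):
    """Relative drop in benchmark score while the gathering tool is running,
    expressed as a percentage of the idle score."""
    return (idle_score - gathering_score) / idle_score * 100.0

# Made-up example scores: a drop from 10.0 to 9.0 means 10% overhead.
print(overhead_percentage(10.0, 9.0))
```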
Wikipedia describes the total amount of time (t) required to execute a benchmark as N * C / f where
- N is the number of instructions actually executed. The value of N can be determined exactly by using an instruction set simulator.
- C is the average cycles per instruction (CPI) for this benchmark.
- f is the clock frequency (in cycles per second).
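A quick sanity check of this formula with made-up numbers (illustrative values, not measurements):

```python
# t = N * C / f, with illustrative values rather than measured ones.
N = 2e9   # instructions executed by the benchmark
C = 1.5   # average cycles per instruction (CPI)
f = 3e9   # clock frequency: 3 GHz, i.e. 3e9 cycles per second

t = N * C / f
print("expected execution time: %.2f seconds" % t)  # prints 1.00
```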
Some great benchmarks do exist; they combine a number of tools that test CPU performance by executing different processing-intensive tasks: data compression, video compression, artificial intelligence, fluid dynamics, speech recognition, ... These tasks are all included in the SPEC CPU2006 benchmark, which calculates a weighted average of the results of the separate tools (each tool determines the CPU speed using a different kind of load) and produces a total score that is not just influenced by certain characteristics of your system, but gives a rather objective view of its performance. This score can therefore be used to compare different systems as well. Another example of such a benchmark is MultiBench by EEMBC. But since these standard benchmarks are rather expensive (an academic license for SPEC CPU2006 costs 200 US dollars), we had a look at some freely available benchmarking tools. Most of them, however, do not support multicore CPU benchmarking, e.g. CoreMark by EEMBC, SuperPi, HINT, ... We did find a widely used benchmarking tool called Linpack that supports multicore processors and determines the MFLOPS the system is able to achieve.
Linpack (optimized for Intel processors) can be downloaded here, while LAPACK (the linear algebra package) can be found here. These tools were originally written in Fortran, but C versions are available as well. The Linpack benchmark measures how fast a computer solves a dense N by N system of linear equations Ax = b; the solution is obtained by Gaussian elimination with partial pivoting. Running this benchmark on my 'idle' system gave an average of 10.2949 GFLOPS (the whole output file can be found here). Running the same benchmark on my 'stressed' system (with a load generated using the stress tool command: stress --cpu 1 --hdd 1 --vm 1 --vm-bytes 64M) gave an average of 8.6983 GFLOPS, a drop of roughly 15.5%. The full output file of this benchmark run can be found here.
The next step will be to write a Python program that executes this benchmark, then starts our gathering tool and executes the benchmark again during its 'gathering' phase.
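A rough outline of what that Python program might look like is given below. The benchmark binary name, the gathering tool command and the output parsing are assumptions, not the final implementation:

```python
import re
import subprocess

def run_linpack():
    """Placeholder: run the Linpack benchmark and return the average GFLOPS
    parsed from its output (binary name and output format are assumed)."""
    out = subprocess.check_output(["./xlinpack_xeon64"]).decode()
    return float(re.search(r"Average:\s*([\d.]+)", out).group(1))

# 1. Benchmark the idle system (gathering tool not running).
idle_gflops = run_linpack()

# 2. Start the gathering tool (hypothetical command) and run the benchmark
#    again during its 'gathering' phase.
gatherer = subprocess.Popen(["python", "gathering_tool.py"])
gathering_gflops = run_linpack()
gatherer.terminate()

# 3. Express the overhead as the relative drop in GFLOPS.
overhead = (idle_gflops - gathering_gflops) / idle_gflops * 100.0
print("overhead: %.1f%%" % overhead)
```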