Saturday, April 2, 2011

Benchmarking XenBench using HPL

I created a Python script (see SVN or the Python source file) that first determines how many FLOPS the system reaches without our XenBench application running, as a point of comparison. For this I used the HPL (High-Performance Linpack) benchmark, which can be found here. The script then measures the FLOPS we reach while XenBench is profiling, followed by the FLOPS we get when the curses and GUI interfaces are used. Afterwards the measurements are presented in a clear overview that states the change in performance, so that conclusions can be drawn about the amount of overhead.
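The core of the script boils down to running xhpl twice and comparing averages. Here is a minimal sketch of that logic (the actual script is on SVN; start_xenbench/stop_xenbench are hypothetical placeholders for however the profiler is controlled in the real script):

    import subprocess

    def run_hpl(np=8, xhpl="./xhpl"):
        """Run HPL once and return the Gflops values of all result rows.

        Result rows in standard HPL output start with a T/V code such as
        'WR11C2R4', and the Gflops value is the last column of the row.
        """
        proc = subprocess.Popen(["mpirun", "-np", str(np), xhpl],
                                stdout=subprocess.PIPE)
        out, _ = proc.communicate()
        return [float(line.split()[-1])
                for line in out.splitlines()
                if line.startswith(("WR", "WC"))]

    def average(values):
        return sum(values) / len(values)

    baseline = average(run_hpl())  # no monitoring running

    # start_xenbench() / stop_xenbench() are placeholders for however the
    # XenBench profiler is actually started and stopped.
    # start_xenbench()
    profiled = average(run_hpl())
    # stop_xenbench()

    change = (baseline - profiled) / baseline * 100.0
    print "baseline: %.4f Gflops, profiling: %.4f Gflops, change: %.1f%%" \
        % (baseline, profiled, change)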

Installing the HPL benchmark was pretty difficult, so a step-by-step overview is given here:
  1. Download and untar HPL Linpack Benchmark from the website.
  2. Make sure openmpi and the other dependencies for HPL are installed: 'sudo apt-get install gcc gcc-4.3 gfortran-4.3 make openmpi-bin libatlas-base-dev libblas-dev libopenmpi-dev'.
  3. Make sure GotoBLAS2 is installed; it can be found here. Untar the download and build the GotoBLAS2 software: './quickbuild.64bit'. Copy the shared library: 'cp libgoto2.so /usr/local/lib'.
  4. Download an architecture Make file for Ubuntu here, and modify its entries so the right paths are used. Make sure to place the file in the hpl-2.0 folder.
  5. Build the HPL software: 'sudo make arch=Ubuntu'. For me, this still resulted in the following errors:
    make[2]: Entering directory `/home/kurt/Desktop/hpl-2.0/src/auxil/Ubuntu'
    Makefile:47: Make.inc: No such file or directory
    make[2]: *** No rule to make target `Make.inc'. Stop.
    and
    /usr/lib/libtbb.so not found
    The first problem was solved by removing the 'include Make.inc' lines from the different make files; a modified version of hpl can be downloaded here. The second problem was easily solved by installing libtbb: 'sudo apt-get install libtbb-dev'.
  6. Try building HPL again, it should work now! Go to 'hpl-2.0/bin/Ubuntu' and run the benchmark (modify HPL.dat to your preferences first; see the example entries below): 'mpirun -np 8 xhpl'.
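For reference, these are the HPL.dat entries that matter most for a first run (the values below are illustrative, not the ones I used):

    10000        Ns      (problem size; larger N gives steadier Gflops but needs more RAM)
    128          NBs     (block size)
    2            Ps      (process grid rows)
    4            Qs      (process grid columns; P x Q must equal the -np value passed to mpirun)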

Saturday, March 19, 2011

Python XenBench Plotting GUI

After a little research I decided to look into the software I found and try and make a plotting GUI proof of concept. Matplotlib, the plotting library that I used, is described as follows:
"Matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in python scripts, the python and ipython shell (ala MATLAB or Mathematica), web application servers, and six graphical user interface toolkits."

The following steps need to be taken to install matplotlib:
  1. sudo apt-get install python-dev
  2. download numpy, which is the fundamental package needed for scientific computing with Python, here
  3. tar -xvf numpy-1.5.1.tar.gz
  4. sudo apt-get install gfortran
  5. cd numpy-1.5.1
  6. sudo python setup.py install
  7. sudo apt-get install python-matplotlib or get it here
The following steps should be taken to install general python GUI libraries:
  1. sudo apt-get install python-tk
  2. download pmw here
  3. tar -xvf Pmw.1.3.2.tar.gz
  4. cd Pmw.1.3.2
  5. cd src
  6. sudo python setup.py install
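Whether everything is installed correctly can be checked quickly with a one-liner (my own sanity check, not part of the original steps):

    python -c "import numpy, matplotlib, Tkinter, Pmw; print numpy.__version__, matplotlib.__version__"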
I created a tabbed pane view with tabs for the CPU, Cache and I/O sections of our application. I added a menubar that can be used for certain actions (such as saving a graph, or starting and stopping the real-time monitoring, ...), as well as a statusbar, and I embedded a graph in the CPU pane. I got some inspiration from snippets of code that can be found here.
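The essential embedding technique is small; here is a stripped-down sketch (widget layout and names are mine, not the actual XenBench GUI code):

    import Tkinter as Tk
    import Pmw
    from matplotlib.figure import Figure
    from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

    root = Tk.Tk()
    root.title("XenBench plotting GUI (sketch)")
    Pmw.initialise(root)

    # Tabbed pane with one tab per monitored resource.
    notebook = Pmw.NoteBook(root)
    notebook.pack(fill="both", expand=1)
    cpu_page = notebook.add("CPU")
    notebook.add("Cache")
    notebook.add("I/O")

    # Embed a matplotlib figure in the CPU tab via the TkAgg backend.
    fig = Figure(figsize=(5, 3))
    axes = fig.add_subplot(111)
    axes.plot([0, 1, 2, 3], [10, 35, 20, 40])  # dummy CPU-usage data
    canvas = FigureCanvasTkAgg(fig, master=cpu_page)
    canvas.draw()
    canvas.get_tk_widget().pack(fill="both", expand=1)

    # Simple statusbar at the bottom of the window.
    status = Tk.Label(root, text="Ready", bd=1, relief=Tk.SUNKEN, anchor=Tk.W)
    status.pack(side=Tk.BOTTOM, fill=Tk.X)

    root.mainloop()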



Note that Glade (which can be found here) could come in handy for designing the GUI. Note also that I first tried to work with GNUPlot (using gnuplot-py), but embedding its graphs was not fully supported.

PS: Last week I made a little progress with the Curses interface for the CPU utilization monitoring, more on that later.

Saturday, March 5, 2011

Proof of Concept Curses Program

I created a proof-of-concept project to illustrate how we can use curses programming in Python to provide a shell program that updates regularly. This will be useful for monitoring the CPU usage of the different VMs from within the terminal. I found inspiration here and here.
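The heart of such a program is a loop that redraws the screen at a fixed interval. A stripped-down version of the idea (the per-VM numbers here are fake; the real demo reads actual CPU usage):

    import curses
    import time

    def main(stdscr):
        curses.curs_set(0)   # hide the cursor
        stdscr.nodelay(1)    # make getch() non-blocking
        tick = 0
        while True:
            stdscr.erase()
            stdscr.addstr(0, 0, "XenBench CPU monitor (press q to quit)")
            # In the real tool these lines show per-VM CPU usage.
            stdscr.addstr(2, 0, "Domain-0 : %3d %%" % ((tick * 7) % 100))
            stdscr.addstr(3, 0, "VM1      : %3d %%" % ((tick * 3) % 100))
            stdscr.refresh()
            if stdscr.getch() == ord('q'):
                break
            time.sleep(1)    # update once per second
            tick += 1

    curses.wrapper(main)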



The demo project can be found on SVN as the "CursesTest" project or can be downloaded here.

Monday, February 28, 2011

Intermediate Presentation

During the last week I gave a presentation (together with Sam Verboven) about what I've done so far for my research project this year. I think the presentation went rather well, although I was quite nervous. You can download the ppt presentation here; a pdf version can be acquired here as well.
On Wednesday I'll start doing my weekly day at the office in uni again.

Thursday, January 6, 2011

Lack of Updates

Just a little post to explain the lack of updates in December and January. During this period, students at Belgian universities need to study for and take the exams of the courses they took during the first academic semester. So that is what I've been doing: studying and finishing other projects.
In the last two weeks of the semester, Sam and I discovered that the results of the CPU measurements were not correct. This was caused by time intervals in which no Xen trace events occurred. A couple of changes were made to the XenBaked and modified XenMon code, which made the results more reliable.
Expect more updates from me during the second semester, which starts February 14th.

Friday, December 3, 2010

Measure Overhead of Gathering Tool

Our gathering tool was designed to have very small overhead, so it's a good idea to find a way to measure the extra load this tool puts on the system. A large overhead would limit the usefulness of our application, as its goal is to gather information about the CPU, disk and cache while influencing the results as little as possible. The gathered information should represent the resource usage of the application/benchmark we are running in the virtualized environment.

The developed application has several phases:
  • A 'startup' phase in which the tools needed to gather the requested information are started
  • A 'gathering' phase in which the tools gather information and store it in buffers, so that not too much I/O activity is introduced. During this phase (whose duration is specified by the user) applications/benchmarks can be executed on the virtual machines.
  • A 'stopping' phase in which the gathering tools are stopped and their output is written, unaltered, to temporary files.
  • An 'analysis' phase during which these temporary files are parsed and the required information is extracted from them. This info is written to an overview file that contains all the CPU, disk and cache results in a formatted manner. This phase can be executed on a separate machine to avoid stressing the monitored system.
Since we want to minimize the influence of running our tool on the results we obtain, it should be clear that we want to minimize the overhead of the 'gathering' phase. Since this phase contributes almost exclusively to CPU utilization, we'll measure the overhead it introduces in terms of CPU usage.

The amount of overhead the tool introduces can be obtained by running a CPU benchmark twice: first in an 'idle' state of the system (when our gathering tool is not running) and a second time during the 'gathering' phase of our tool. The overhead is represented by the change in the CPU benchmark's score (which can be expressed as a percentage). To determine the overhead more precisely we'll run the CPU benchmark several times with different loads on the system (generated using the stress tool), because it could be that our tool only introduces overhead when certain events actually occur. The remaining challenge is to find a CPU benchmark that measures how much work the system can still perform in a given amount of time, and that does so with CPU instructions that match a real-life CPU load rather than a purely artificial load that merely burns CPU cycles (since such a load could be heavily affected by timing-specific system properties).

Wikipedia describes the total amount of time t required to execute a benchmark as t = N × C / f (a small worked example follows the list), where
  • N is the number of instructions actually executed. The value of N can be determined exactly by using an instruction set simulator.
  • f is the clock frequency (in cycles per second).
  • C is the average cycles per instruction (CPI) for this benchmark.
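To make this concrete (numbers picked purely for illustration): a benchmark that executes N = 3 × 10^9 instructions at an average of C = 2 cycles per instruction on an f = 2 GHz processor needs t = (3 × 10^9 × 2) / (2 × 10^9) = 3 seconds.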
So, all these factors say something about the performance of the system on which the benchmark is executed. FLOPS, which stands for FLoating point OPerations per Second, is another related measure of computer performance. This metric represents the performance of the system when running scientific programs (since their load consists of a lot of floating-point calculations). Note that FLOPS is also the metric used to rank supercomputers by speed.

Some great benchmarks do exist; they combine many tools that test CPU performance by executing different processing-intensive tasks: data compression, video compression, artificial intelligence, fluid dynamics, speech recognition, ... These tasks are all included in the SPEC CPU2006 benchmark, which calculates a weighted average of the results of the separate tools (each tool determines the CPU speed using a different kind of load) and produces a total score that is not influenced by just a few characteristics of your system, but gives a rather objective view of its performance. This score can therefore be used to compare different systems as well. Another example of such a benchmark is MultiBench by EEMBC. But since these standard benchmarks are rather expensive (an academic license for SPEC CPU2006 costs 200 US dollars), we had a look at some freely available benchmarking tools. Most of them, however, do not support multicore CPU benchmarking, e.g. CoreMark by EEMBC, SuperPi, HINT, ... We did find a widely used benchmarking tool called Linpack that supports multicore processors and determines the MFLOPS the system is able to achieve.

Linpack (optimized for Intel processors) can be downloaded here, while LAPACK (the linear algebra package) can be found here. These tools were originally written in Fortran, but C versions are available as well. The Linpack benchmark measures how fast a computer solves a dense N-by-N system of linear equations Ax = b; the solution is obtained by Gaussian elimination with partial pivoting. Running this benchmark on my 'idle' system gave an average of 10.2949 GFLOPS (the whole output file can be found here). Running the same benchmark on my 'stressed' system (with a load generated using the stress tool command: stress --cpu 1 --hdd 1 --vm 1 --vm-bytes 64M) gave an average of 8.6983 GFLOPS. The full output file of this benchmark run can be found here.
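Expressed as the percentage change described above, this stress load costs (10.2949 − 8.6983) / 10.2949 × 100 ≈ 15.5% of the idle score. The overhead of our own gathering tool will be determined in exactly the same way.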

The next step will be to write a Python program that executes this benchmark and then starts our gathering tool and executes the benchmark again during its 'gathering' phase.

Tuesday, November 23, 2010

XenGatherer Update

Today the XenGatherer tool got a bit more functionality; an update on this progress is given first, followed by a description of the issues that still exist. Next week we'll have a look at the overhead that running this tool causes (in particular at the CPU usage level).

The things that were added to/modified on the XenGatherer tool today:
  • First of all, some small issues that were introduced while coding at home (without being able to run everything) were fixed (e.g. the disk columns were not formatted well).
  • The overview of the CPU usage was made more readable (see snapshot under the blogpost).
  • Dom0 stats were added to the disk section of the resource usage overview. MB/s read and write columns were added to this overview as well (iostat provides a -m option to facilitate this).
  • An option to provide a file containing the VM names to monitor was added, namely "--vmfile". If this optional argument is not provided, the application now parses the 'xm list' output and uses all the VMs that are currently running (a parsing sketch is given after this list).
  • After copying the 'vmlinux' kernel file to the node (and rebooting the VMs), the oprofile functionality for monitoring cache miss and hit events worked fine. But since the generated output was rather extensive (e.g. for one domain there was a separate sample counter for the kernel, Xen, modules, ...), we now take the sum of these counters for each domain and present those values in the overview. Note that there is still an option that can be used when the full output is preferred: "--nosumcache".
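The fallback parsing mentioned in the list above is essentially the following (a sketch, not the exact XenGatherer code; it only assumes that 'xm list' prints a header line followed by one line per domain, starting with the domain name):

    import subprocess

    def running_vms():
        """Return the names of all currently running domains."""
        out = subprocess.Popen(["xm", "list"],
                               stdout=subprocess.PIPE).communicate()[0]
        # Skip the header line; the first column of each row is the name.
        return [line.split()[0]
                for line in out.splitlines()[1:] if line.strip()]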
Here the topics that need further attention are listed:
  • First of all, Dom0 stats should still be added to the cache hits/misses section.
  • The CPU stats cause problems when the domain id crosses the 32 boundary; the XenBaked code should be fixed so that this no longer causes problems.
  • Also, we should check whether the LLC_REF events indicate cache hits by themselves, or whether the cache miss events should be subtracted from that number to obtain the number of cache hits.
  • And last but not least, the source code/design should be reviewed and comments should be written.