Friday, December 3, 2010

Measure Overhead of Gathering Tool

Our gathering tool was designed to introduce very little overhead, so we need a way to measure the extra load it puts on the system. A large overhead would limit the usefulness of our application, since its goal is to gather information about the CPU, disk and cache while influencing the results as little as possible. The gathered information should represent the resource usage of the application/benchmark we are running in the virtualized environment.

The developed application has several phases (a minimal sketch of how they fit together follows the list):
  • A 'startup' phase in which the tools needed to gather the requested information are started.
  • A 'gathering' phase in which the tools gather information and store it in buffers, so that little I/O activity is introduced. During this phase (whose duration is specified by the user) applications/benchmarks can be executed on the virtual machines.
  • A 'stopping' phase in which the gathering tools are stopped and their output is written to temporary files (unaltered).
  • An 'analysis' phase during which these temporary files are parsed and the required information is extracted from them. This info is written to an overview file that contains all the CPU, disk and cache results in a formatted manner. This phase can be executed on a separate machine to avoid stressing the monitored system.
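A minimal sketch of how these phases could fit together, assuming hypothetical controller objects (one per underlying tool) that each expose start, stop, report and parse methods:

    import time

    def run_gathering(controllers, duration):
        # 'startup' phase: launch the measurement tools
        for controller in controllers:
            controller.start()
        # 'gathering' phase: the tools buffer their data while the
        # applications/benchmarks run on the virtual machines
        time.sleep(duration)
        # 'stopping' phase: stop the tools and dump their raw output
        # to temporary files, unaltered
        for controller in controllers:
            controller.stop()
            controller.report()
        # 'analysis' phase: parse the temporary files and extract the
        # required information (can run on a separate machine)
        for controller in controllers:
            controller.parse()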
Since we want to minimize the influence of running our tool on the results we obtain, it should be clear that we want to minimize the overhead of the 'gathering' phase. Since this phase contributes almost exclusively to CPU utilization, we'll measure the overhead it introduces in terms of CPU usage.

The amount of overhead that the tool introduces can be obtained by running a CPU benchmark twice: first while the system is 'idle' (our gathering tool is not running) and a second time during the 'gathering' phase of our tool. The overhead is represented by the change in the benchmark score, which can be expressed as a percentage. To determine the overhead more accurately we'll run the CPU benchmark several times with different loads on our system (generated using the stress tool), because it could be that our tool only introduces overhead when certain events actually occur. The remaining challenge is to find a CPU benchmark that measures how much work the system can still perform in a given amount of time, and that does so with CPU instructions matching a realistic CPU load rather than a purely artificial one (since such an artificial load could be sensitive to a large number of system-specific properties).

Wikipedia describes the total amount of time (t) required to execute a benchmark as t = (N * C) / f, where:
  • N is the number of instructions actually executed. The value of N can be determined exactly by using an instruction set simulator.
  • f is the clock frequency (in cycles per second).
  • C is the average cycles per instruction (CPI) for this benchmark.
So, all these factors say something about the performance of the system on which the benchmark was executed. As a quick example: a benchmark that executes N = 10^9 instructions with an average CPI of C = 2 on an f = 2 GHz processor takes t = (10^9 * 2) / (2 * 10^9) = 1 second. FLOPS is another related measure of computer performance: it stands for FLoating point OPerations per Second. This metric represents the performance of the system when running scientific programs (since their load consists of a lot of floating point calculations). Note that FLOPS is also the metric used to determine the speed of supercomputers when they are being ranked.

Some great benchmarks do exist; they combine a lot of tools that test CPU performance by executing different processing-intensive tasks: data compression, video compression, artificial intelligence, fluid dynamics, speech recognition, ... These tasks are all included in the SPEC CPU2006 benchmark, which calculates a weighted average of the results of the separate tools (each tool determines the CPU speed using a different kind of load) and produces a total score that is not just influenced by certain characteristics of your system, but gives a rather objective view of its performance. This score can therefore be used to compare different systems as well. Another example of such a benchmark is MultiBench by EEMBC. But since these standard benchmarks are rather expensive (an academic license for SPEC CPU2006 costs 200 US dollars), we had a look at some freely available benchmarking tools. Most of them, however, do not support multicore CPU benchmarking (e.g. CoreMark by EEMBC, SuperPi, HINT, ...). We did find a widely used benchmarking tool called Linpack that supports multicore processors and determines the MFLOPS the system is able to achieve.

Linpack (optimized for Intel processors) can be downloaded here, while LAPACK (the linear algebra package) can be found here. These tools were originally written in Fortran, but C versions are available as well. The Linpack benchmark measures how fast a computer solves a dense N by N system of linear equations Ax = b; the solution is obtained by Gaussian elimination with partial pivoting. Running this benchmark on my 'idle' system gave an average of 10.2949 GFLOPS (the whole output file can be found here). Running the same benchmark on my 'stressed' system (with a load generated by the stress tool command: stress --cpu 1 --hdd 1 --vm 1 --vm-bytes 64M) gave an average of 8.6983 GFLOPS, a drop of roughly 15.5%. The full output file of this benchmark run can be found here.

The next step will be to write a Python program that executes this benchmark, starts our gathering tool, and executes the benchmark again during its 'gathering' phase. A rough sketch of that program is shown below.
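The benchmark binary's path and the way the gathering tool is invoked are assumptions here, not the final design:

    import subprocess

    LINPACK = "./linpack_benchmark"  # hypothetical path to the Linpack binary

    def run_linpack():
        # Run the benchmark and return its raw output; the average
        # GFLOPS score is parsed out of this text afterwards.
        process = subprocess.Popen([LINPACK], stdout=subprocess.PIPE)
        return process.communicate()[0]

    # First run: 'idle' system, gathering tool not running.
    idle_output = run_linpack()

    # Second run: during the 'gathering' phase of our tool
    # (hypothetical invocation).
    gatherer = subprocess.Popen(["python", "xengatherer.py", "--duration", "600"])
    gathering_output = run_linpack()
    gatherer.wait()

    # The overhead is the relative drop in the benchmark score:
    #   overhead_pct = (idle_gflops - gathering_gflops) / idle_gflops * 100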

Tuesday, November 23, 2010

XenGatherer Update

Today the XenGatherer tool gained a bit more functionality; an update on this progress is given first, followed by a description of the issues that still exist. Next week we'll have a look at the overhead that running this tool causes (in particular at the CPU usage level).

The things that were added to/modified on the XenGatherer tool today:
  • First of all some small issues that were introduced while coding at home (without being able to run everything) were fixed (e.g. disk columns were not well formatted).
  • The overview of the CPU usage was made more readable (see snapshot under the blogpost).
  • Dom0 stats were added to the disk section of the resource usage overview. In this overview a MB/s read and write column was added (iostat provides a -m option to facilitate this).
  • An option to provide a file containing the VM names to monitor was added, namely "--vmfile". If this optional argument is not provided, the application now parses the 'xm list' output and uses all the VMs that are currently running (see the sketch after this list).
  • After copying the 'vmlinux' kernel file to the node (and rebooting the VMs) the oprofile functionality to monitor the occurrence of cache miss and hit events worked fine. But since the generated output was rather extensive (e.g. for one domain there were separate sample counts for the kernel, Xen, modules, ...), we now take the sum of these counters for each domain and present those values in the overview. Note that there is still an option that can be used when the full output is preferred: "--nosumcache".
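A minimal sketch of that fallback, assuming the usual 'xm list' output format (a header line followed by one line per domain, with the domain name in the first column); whether Domain-0 should be excluded depends on how the tool handles dom0:

    import subprocess

    def running_vm_names():
        # 'xm list' prints a header line followed by one line per
        # domain; the domain name is the first column.
        output = subprocess.Popen(["xm", "list"],
                                  stdout=subprocess.PIPE).communicate()[0]
        names = [line.split()[0] for line in output.splitlines()[1:]
                 if line.strip()]
        # Dom0 stats are gathered separately, so leave Domain-0 out
        # of the VM list here.
        return [name for name in names if name != "Domain-0"]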
The topics that still need attention are listed here:
  • First of all, dom0 stats should still be added to the cache hits/misses section.
  • The CPU stats cause problems when the domain id crosses the 32 boundary; the XenBaked code should be fixed so that no problems arise.
  • Also, we should check whether the LLC_REF events indicate cache hits by themselves, or whether the cache miss count should be subtracted from this number to obtain the number of cache hits.
  • And last but not least, the source code/design should be reviewed and comments should be written.

Friday, November 19, 2010

Xen Monitoring Tool (cache)

The monitoring tool now gathers information about the cache hits and misses as well, using OProfile. Through program arguments it is now possible to choose what will be monitored (e.g. --nodisk excludes the disk information from the output); a sketch of this option handling follows below. Also, the output is now formatted such that the columns are aligned and everything is more readable. An example of this output can be downloaded here.
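A sketch of how such flags could be handled with Python's optparse module (only --nodisk appears above; the other option names are illustrative):

    from optparse import OptionParser

    parser = OptionParser()
    # Each flag excludes one section from the gathered output.
    parser.add_option("--nodisk", action="store_true", default=False,
                      help="do not include disk information")
    parser.add_option("--nocpu", action="store_true", default=False,
                      help="do not include CPU information")
    parser.add_option("--nocache", action="store_true", default=False,
                      help="do not include cache hit/miss information")
    (options, args) = parser.parse_args()

    if not options.nodisk:
        # gather and include the disk information
        pass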

Friday, November 12, 2010

Xen Monitoring Tool (disk and cpu) & Xen 4.0

On Tuesday I continued working on the monitoring app: disk and cpu metrics are now gathered and written to file. To make sure the tap:aio interface worked correctly when starting VMs, I also had to install a newer Xen version (4.0). The way the application works now is by first calling the start operation on the XenBaked, IOStat and OProfile controllers; then a given command is executed on the specified VMs before going to sleep for the given duration. Afterwards the stop method of the controllers is called, followed by a report (write output to file) and parse (get the desired information from the output files) call for the different tools (see the interface sketch below). Finally this disk, cpu and cache information is written to file. Working this way ensures that IO load is only generated after the experiment is finished, since writing to file happens when report is called after the monitoring has stopped. The tool also stays lightweight because the parsing of the output and the creation of the actual output file can be performed on a separate machine. What's next? Finishing the prototype of the monitoring tool by completing the cache hits and misses information, and making sure the results are correct by running the app under certain VM workloads.
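The controller interface this flow assumes could look as follows (the class and method names are illustrative, not the actual implementation):

    class ToolController(object):
        """Common interface for the XenBaked, IOStat and OProfile wrappers."""

        def start(self):
            # Launch the underlying tool; data stays in its buffers.
            raise NotImplementedError

        def stop(self):
            # Terminate the tool; nothing has been written out yet.
            raise NotImplementedError

        def report(self):
            # Write the tool's raw output to a temporary file, unaltered.
            raise NotImplementedError

        def parse(self):
            # Extract the desired metrics from the temporary file;
            # this step can run on a separate machine.
            raise NotImplementedError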

Sunday, November 7, 2010

Patch, IOStat & Parsing

On Friday the 5th of November I finally got the patch working. It wasn't working before because I was creating the patch while the patched kernel was compiled and the original one was not (that diff resulted in an extremely large patch file).

Afterwards both Sam and I worked on the XenGatherer tool: Sam added the IOStat functionality, so it is now possible to monitor the disk usage of the virtual machines and dom0. I started parsing the output files, so our tool now generates an output file containing only the required information.

Tuesday, November 2, 2010

OProfile => XenGatherer

I extended the XenGatherer prototype such that when it is started opcontrol starts gathering information about cache misses (event LLC_MISSES):
  • opcontrol --reset
  • opcontrol --start-daemon --event=LLC_MISSES:10000 --xen=... --vmlinux=...
  • opcontrol --start
When the XenGatherer tool is stopped, opcontrol is stopped as well:
  • opcontrol --stop
  • opcontrol --shutdown
And when you choose to make a report (through the XenGatherer CLI) the retrieved information is written to file. The information gathered by OProfile is requested with the following command:
  • opreport event:LLC_MISSES
An updated version of the Python source files can be found here.
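These calls could be wrapped from Python roughly as follows (the event count comes from the commands above; the paths and the output file are parameters, not fixed values):

    import subprocess

    def opcontrol(*args):
        # Thin wrapper around the opcontrol command line tool.
        subprocess.call(["opcontrol"] + list(args))

    def start_oprofile(xen_image, vmlinux):
        opcontrol("--reset")
        opcontrol("--start-daemon", "--event=LLC_MISSES:10000",
                  "--xen=" + xen_image, "--vmlinux=" + vmlinux)
        opcontrol("--start")

    def stop_oprofile():
        opcontrol("--stop")
        opcontrol("--shutdown")

    def report_oprofile(outfile):
        # Ask opreport for the gathered LLC_MISSES samples and
        # write them to file unaltered.
        output = subprocess.Popen(["opreport", "event:LLC_MISSES"],
                                  stdout=subprocess.PIPE).communicate()[0]
        open(outfile, "w").write(output)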

Wednesday, October 27, 2010

Xen(Trace/Mon/Baked) => XenGatherer

I tried to create a first version of the instrumentation app for Xen VMs, one that already gathers the information made available by XenTrace. I started by having a look at the following sources:
  • A tutorial which explains how to use XenTrace (and xentrace_format) and XenMon (and xenbaked) found here.
  • Information about the parameters that are monitored and how the tools hierarchy (see image below) can be found here.
  • A paper titled "XenMon: QoS Monitoring and Performance Profiling Tool". This whitepaper can be downloaded here.

While exploring the above-mentioned tools I had a closer look at their implementations and decided that, for the parameters I need that XenTrace makes available, I'll keep using XenBaked, since it does not introduce a lot of overhead (its output is stored in a binary file/shared memory). I adjusted the XenMon Python script such that it does not write to file constantly (fewer I/O interrupts) while gathering information through XenBaked. The source code can be found here.

If you run the XenGatherer tool, you get a CLI with the following possibilities (a minimal sketch of this loop follows the list):
  • start: start gathering information by running xenbaked
  • stop: stop gathering information by killing the xenbaked process
  • report: get the desired data from the mmapped file (/var/run/xenq-shm) and write it to a text file
  • quit: exit the application
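The command loop could look roughly like this (assuming xenbaked is on the PATH; the report logic is left as a placeholder):

    import subprocess

    xenbaked_process = None

    def report():
        # Placeholder: read the mmapped file (/var/run/xenq-shm)
        # and write the desired data to a text file.
        pass

    def main():
        global xenbaked_process
        while True:
            command = raw_input("XenGatherer> ").strip()
            if command == "start":
                # start gathering information by running xenbaked
                xenbaked_process = subprocess.Popen(["xenbaked"])
            elif command == "stop" and xenbaked_process is not None:
                # stop gathering by killing the xenbaked process
                xenbaked_process.terminate()
                xenbaked_process = None
            elif command == "report":
                report()
            elif command == "quit":
                break

    if __name__ == "__main__":
        main()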

Saturday, October 23, 2010

Patching the Linux Kernel

During this week I managed to get the newer kernel version to work with oprofile: I had to make changes to a couple of source files to get it to compile. Then I tried to make a patch file by using the diff command, which checks the differences that exist between the nonpatched and patched kernel files. But applying this patch didn't work. So I googled once more how to create a patch file and found some things here that I probably did wrong:
  • When creating the patch, the new kernel was compiled, which explains the fact that I generated a 200 MB patch when using the root folders of the kernels.
  • "Do not ever create a patch with uneven directories", the site states; well, I guess this is what I did (a correct recipe follows below).
Next week I'll first try to get this patch working; afterwards I'll start writing a Python application that already gathers one of the parameters (learning the Python language in the meantime).

Monday, October 18, 2010

What's next

Tomorrow I'll try to get the test environment up and running (with a patched, more recent kernel version). During the week I'll also set this up on a machine at home, so that I can develop there as well. If all goes well I'll start looking into creating a little Python program that already gathers the information for one of the parameters.

Wednesday, October 13, 2010

First day & Planning

Yesterday was the first day of working on this project; the objective of the day was setting up the test environment according to a tutorial made by Sam Verboven.
An outline of this installation process:
  1. Download a pv-ops enabled kernel, patch it with the xenoprof patch, compile and install it.
  2. Download Xen 3.4.3-testing, compile and install it (also add a Grub entry for Xen).
  3. Boot to Xen and create a new Ubuntu 9 image using the xen tools (and make sure this VM runs correctly).
  4. Install (Xen)OProfile and make sure the profiling features work correctly.
An overview of the steps required to build the application:
  1. Update the test environment by installing a patched, more recent kernel version
  2. Make a Python (rapid prototyping, efficient for string operations) tool that gathers the desired parameters from the selected tools:
    • XenTrace for CPU usage, blocked, waiting and allocation
    • XenOProfile for cache hits and misses
    • SSH connection to VMs (DomU kernel) for disk block reads/writes, requests/s and average wait and service times
  3. Extend the gathering tool with a parser that actually parses and formats the gathered data
  4. Make sure running the tool introduces little overhead on the system
  5. Make the application even more black-box by finding a way to avoid fetching the IO data through SSH connections to the VMs
  6. Modify XenMon such that the data can be observed through a GUI

Friday, October 8, 2010

Performance Interference Effects in Virtual Environments

I read the paper "An Analysis of Performance Interference Effects in Virtual Environments" by Younggyun Koh, Rob Knauerhase, Paul Brett, Mic Bowman, Zhihua Wen, Calton Pu. An abstract can be found here: http://goo.gl/9pR9.
It gave a good amount of background information about the things we want to realize.

Wednesday, October 6, 2010

Introduction

Today I had a first meeting about my research project for this year; after last year, I decided to do this again with Sam Verboven as my supervisor. I look forward to working on the project and hope it will be instructive. I'll try to keep this blog up to date with the progress that is being made.

The assignment consists of building a 'black-box' application that uses information about the applications running in a virtualized environment. It should decide when to migrate the virtual machine running a certain application, namely when that VM starts to interfere too much with other VMs. When the load becomes too low, on the other hand, decisions about consolidating VMs should be made.
The input for this application consists of the following information gathered from the different VMs/the system:
- average CPU utilization
- cache hits and misses
- virtual machine switches per second
- I/O blocks per second
- disk reading and writing time per VM
This gathering of information should have a minimal impact on the system: the application should have a low resource usage and should require no (or minimal) adaptation of DomU. This information about the loads could also be important for obtaining the required input information for my thesis project (where this data is initially considered as known).

The first steps this week will be to read the paper "An Analysis of Performance Interference Effects in Virtual Environments", follow Sam's tutorial to set up a test system with the required tools, and patch the latest Xen kernel such that XenOProfile works fine.