FhGFS on ARM


FhGFS is short for Fraunhofer Parallel File System (or Fraunhofer FS) and is developed at the Fraunhofer Institute for Industrial Mathematics (ITWM) in Kaiserslautern, Germany. It can be downloaded and used free of charge from the project’s website, http://www.fhgfs.com.

FhGFS is a parallel file system developed and optimized for high-performance computing, implemented with a distributed metadata architecture for scalability.

Environment

During the multi-stream test, a constant pool of eight clients was used while the number of storage servers was scaled from two to eight.

Striping in FhGFS can be configured on a per-directory and per-file basis. Each directory has a specific stripe pattern configuration, which is inherited by new subdirectories and applied to any file created inside the directory. There are currently two parameters that can be configured for the standard RAID-0 stripe pattern: the desired number of storage targets for each file and the chunk size (or block size) of each file stripe.

Striping for this test was configured across all of the storage servers in use.
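As a rough illustration of that configuration step, the stripe pattern for a benchmark directory could be set and inspected with fhgfs-ctl along the following lines (treat this as a sketch: flag names vary between FhGFS releases, and the mount point is hypothetical):

(set the pattern): fhgfs-ctl --mode=setpattern --numtargets=8 --chunksize=512k /mnt/fhgfs/bench

(inspect the result): fhgfs-ctl --mode=getentryinfo /mnt/fhgfs/bench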

Benchmark Specification – Multi-stream throughput

In this benchmark, the total throughput of sequential read and write requests with multiple streams was measured.
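The post does not name the load generator, so here is a minimal sketch of the idea in shell: one sequential write stream per client, run in parallel, with aggregate throughput taken as total bytes moved over elapsed time (path, stream count and sizes are illustrative):

for i in $(seq 1 8); do
  dd if=/dev/zero of=/mnt/fhgfs/bench/stream$i bs=1M count=4096 &   # one 4GiB sequential stream per client
done
wait   # all streams finished; swap if/of to measure the read direction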

FhGFS MultiStream Throughput

x86 on ARM: Benchmark Results for Eltechs ExaGear Server

Eltechs ExaGear Server targets datacenters and cloud providers and enables them to further decrease TCO by running Intel software on power-efficient ARM-based servers. It is reliable, easy to use and fast.

But how fast is it? To find out, we performed a series of tests. The tests look at the overhead introduced by running x86 applications through our translator technology, with a close look at the performance implications for CPU-intensive, I/O-intensive and network-intensive workloads.

Benchmark Description

For this benchmarking exercise, running Eltechs ExaGear on Boston’s ARM-based Viridis servers, we used GeoBenchmark, freely available at http://geocomputing.narod.ru/benchmark.html. This benchmark evaluates a system’s capability to perform data processing, stressing both the CPU and the I/O subsystem. The benchmark was built for the ARM architecture and for Intel 32-bit. Results of the ARM (native) tests were compared against the Intel 32-bit tests run under Eltechs ExaGear Server.

The same Boston Viridis server was used for both sets of tests.
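Schematically, each test was therefore timed twice on the same box: the native ARM build directly on the host, and the x86 build inside the translated environment. The binary names below are illustrative, and the exagear launcher shown is an assumption about the installation rather than a documented command line:

(native, on the ARM host): time ./geobench-arm

(translated, inside the x86 guest entered by running exagear): time ./geobench-x86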

Input and Output

These tests simulate processing modules with heavy disk input & output.

During those tests, Eltechs ExaGear Server demonstrated exceptional performance: more than 90% of native on average.

Multi-CPU

These tests are designed to estimate pure SMP performance, including performance on memory-access-sensitive algorithms. They show how well applications scale in a multiprocessor, multithreaded environment.

The results clearly show that Eltechs ExaGear Server scales well and does not impair the parallelism of multi-threaded applications.


Conclusions

Eltechs ExaGear Server demonstrated excellent performance, in particular under heavy I/O tests, with nearly zero performance impact while running Intel applications on ARM-based servers. This makes ExaGear an excellent choice for running storage applications and disk- and network-intensive tasks in translation mode on ARM servers today.

CPU-intensive performance was around 50% of native. In the future, Eltechs expects to reach as high as 80% of native performance on average.

The scalability tests clearly showed that Eltechs ExaGear Server is highly scalable. It does not affect the parallelism of applications and can be used transparently with such software.

Taking into consideration the ease of use, transparency for end users and the immediate results, the Eltechs ExaGear Server provides a real solution to the problem of migrating legacy x86 applications to ARM in the datacentre.


Benchmarking Sysbench (OLTP)

Following on from the ApacheBench tests in our last round of benchmarks, we wanted to understand how well databases would perform on our Viridis platform. Typically, DBs go hand in hand with Apache instances and are used throughout enterprises in various roles. The tests we focus on today are OLTP (online transaction processing) random reads from a database.

The Setup:

  • Identical versions of Ubuntu used, 12.04 (different architecture builds!)
  • 1GB per core/hyper-threaded core of RAM
  • Single 256GB SSD per server
  • Databases created with 1,000,000 entries
  • Random Reads were performed across 100,000 entries (10% of the database)

The Sysbench Commands used:

(set up the DB): sysbench --test=oltp --mysql-table-engine=myisam --oltp-table-size=1000000 --mysql-user=root prepare

(Viridis Test): sysbench --mysql-user=root --num-threads=4 --max-requests=100000 --test=oltp --oltp-table-size=1000000 --oltp-read-only run

(Intel Server Test): sysbench --mysql-user=root --num-threads=32 --max-requests=100000 --test=oltp --oltp-table-size=1000000 --oltp-read-only run

The Results:

So, as expected, node vs node our system achieved ~18% of what the Intel server did. However, when you consider the performance per watt, or transactions per watt, the overall picture looks much better for the Viridis platform.
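For anyone reproducing the per-watt comparison, the figure is simply the transactions-per-second number that sysbench reports divided by the wall power measured during the run. A trivial sketch, with $TPS and $WATTS standing in for your own measured values:

(transactions per watt): echo "scale=2; $TPS / $WATTS" | bc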

ARM vs Atom: Phoronix Benchmarks

The team over at Phoronix have completed a comprehensive suite of tests comparing the performance of a single Boston Viridis SoC to an Intel Atom D525. The tests covered a range of applications including the NAS Parallel Benchmarks, video encoding, rendering, molecular dynamics and more. A further set of tests looked at the scaling of the ARM CPUs and compared the performance improvements between Ubuntu 12.04 and 12.10.

“Overall, a single 1.1~1.4GHz Calxeda ECX-1000 Cortex-A9 server node proved competitive against an Intel Atom D525, a x86_64 CPU that is clocked at 1.8GHz with two physical cores plus two logical cores via Hyper Threading. While the Calxeda node did nicely against the Atom D525 in a majority of the Ubuntu Linux benchmarks, the real story is the performance-per-Watt, which unfortunately can’t be easily compared in this case due to the limitations mentioned in the introduction. If there were the power numbers, the Calxeda ARM Server would likely easily win with the SoC power consumption under load averaging 4 Watts for the 1.1GHz card and just over 6 Watts for the newer 1.4GHz variant. The Atom D525 has a rated TDP by Intel of 13 Watts.”

Further details on the test environment and the compiler flags used can be found on the Phoronix pages.

Power Tests (130W for 24 Nodes)

We’ve been busy working on optimising the power draw of our product: improving airflow, tweaking low-level system settings, playing with PSUs and exhaustively programming fan speeds. Now it’s time for some real world tests, measured “at the wall”!

System used for the tests:

  • Boston Viridis with 24 nodes (6 EnergyCards)
  • 24x 256GB SSD drives
  • 4GB RAM per node (96GB in total)

Test conditions:

We ran the tests over a 30 minute period, taking results after the first 15 minutes at 1 minute intervals. The results were averaged across this period, and the power measurements were recorded on a Rohde & Schwarz HAMEG HM8115-2 power meter.
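As an aside, once the meter readings are captured (one watt value per line in a log file; power.log is a hypothetical name), the averaging step is a one-liner:

awk '{ sum += $1; n++ } END { print sum / n " W average" }' power.log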

Results:

These are some excellent results and really show what our solution is capable of when running real world workloads. Just to put this in perspective, a standard x86 dual socket server can draw anywhere up to 350W! Our 24 node configuration is roughly equivalent to a single standard low power dual socket x86 system with regards to power consumption.

Update: further evidence of the power consumption figures above from our friends at Calxeda:

Docking Throughput on Viridis

We would like to thank our friends at the HPC service at Imperial College London for working with us on the following tests and providing some great feedback:

<snip>

The test performed was one of molecular docking using the Vina code: http://vina.scripps.edu/

Docking could potentially be a pretty good fit for this type of system because it’s the sort of thing that’s often run in ensembles, so it is throughput-oriented. It’s CPU intensive, with a mix of integer and floating-point work.

On the ARM system, I compiled with the system Boost, g++ 4.6.3 and compiler flags:

-O3 -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mfloat-abi=hard -ffast-math -fpermissive
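For context, Vina normally builds via its own Makefile against Boost; a representative compile line using those flags would look roughly like the following (the source file and library list are illustrative, not the actual Vina build recipe):

g++ -O3 -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mfloat-abi=hard -ffast-math -fpermissive main.cpp -o vina -lboost_thread -lboost_program_options -lboost_filesystem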

On my x86 system (dual E5-2620, turbo enabled, HT enabled) I used the distributed vina binary.

The test model is HIV protease and a ligand from the DUD docking test set.

Vina was run with:


vina --seed 0 --size_x 59.358 --center_x 4.486 --size_y 35.873 --center_y 0.8825 --size_z 38.609 --center_z 17.8075 \
     --receptor receptor.pdbqt \
     --ligand ligand.pdbqt \
     --cpu 4

I elected to run it with 4 threads, which is not the most efficient for maximising throughput (there’s a serial component at the start of the test), but I wanted a threaded component in the test, and I’ll correct for that in the analysis by using CPU time, rather than elapsed wall.
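The user/elapsed/%CPU figures below are in the format printed by GNU time(1); each run was wrapped along the lines of the following (the exact option list is the one given above):

/usr/bin/time vina --cpu 4 [docking options as above]

Here “user” is total CPU seconds summed across all threads, “elapsed” is walltime, and %CPU is their ratio.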

Here are the timings:

ARM

1 run @4px: 2777.86 user 12:18.62 elapsed 376%CPU

For 6 TASKS: 278 minutes of CPU time (run back to back on one node: ~74 minutes of walltime)

x86

6 runs @4px: individual ave 1192.94 user 5:18.70 elapsed 374%CPU

For 6 TASKS: 119 minutes of CPU time (19.9 minutes each), 5:18m of walltime

So that’s a throughput difference of ~14x between the dual E5-2620 (24 threads) and the 4-core Viridis SoC: six tasks take ~74 minutes of walltime on the ARM node versus ~5.3 minutes running simultaneously on the x86 (73.9 / 5.3 ≈ 14).

Looking at power, an estimate of the energy required to do 6 repetitions:
Viridis = 7W * 12:18m * 6 runs ≈ 31kJ
x86 = 200W * 5:18m * 6/6 (all runs simultaneous) ≈ 64kJ

The ARM system is about twice as power efficient as the x86. It might be low power, but it takes a long time getting to the end.

What does this mean in practice? Imagine building a cluster to do nothing but run this code:

*) A Boston Viridis cluster built to the same power budget as an x86 one would
– have (200W / 7W) ≈ 28x the number of nodes
– have (28 / 14) ≈ 2x the throughput of the x86.

*) A Calxeda cluster built to match the throughput of the x86 one would
– need 14x the number of nodes
– require ~0.5x the power (14 × 7W ≈ 98W vs 200W).

*) A Calxeda cluster built to the same volume as an x86 one would have
– 36x the number of nodes (72 vs 2 per unit of rack space)
– 6x the number of cores (counting the x86’s hyper-threads as cores)
– ~1.3x higher power draw (72 × 7W vs 2 × 200W)
– ~2.6x the throughput of the x86 (36x / 14x).

To conclude:
The Boston Viridis has a ~2x energy advantage on this throughput-oriented, compute-intensive workload, but that is substantially lower than the raw *power* advantage (~28x) would suggest, because of the markedly reduced per-node performance (~1/14x) relative to the x86.

P.S: (Sell these things for Raspberry Pi prices and I’ll buy a container-load).

ApacheBench on Viridis

Our friends over at Calxeda have recently posted some interesting Apache benchmarks on the EnergyCore cards:

[John Mao of Calxeda] It’s the middle of June, which means we’re smack in the middle of tradeshow and conference season for the IT industry. We were at Computex in Taipei two weeks ago, and this week we’re participating in International Supercomputing in Hamburg, and GigaOM’s Structure conference in San Francisco. In fact, our CEO, Barry Evans, is on a panel to discuss fabric technologies and their role in the evolution of datacenters. Should be a good one!
In spite of the hectic season, it hasn’t stopped us from moving forward with what everyone is really waiting for: benchmarks! Well, I’m happy to be able to share some preliminary results of both performance and power consumption for those of you looking for more efficient web servers.