Power Tests (130w for 24 Nodes)

We’ve been busy working on optimising the power draw on our product, improving airflow, tweaking low level system settings, playing with PSUs and exhaustively programming fanspeeds..  Now it’s time for some real world tests, measured “at the wall”!

System used for the tests:

  • Boston Viridis with 24 nodes (6 energy cards)
  • 24x 256GB SSD drives
  • 4GB Ram per node (96GB in total)

Test conditions:

We ran the tests over a 30 minute period, taking results after 15 minutes at 1 minutes intervals. The results were averaged across this period of time and the power measurements were recorded on a Rohde & Schwarz HAMEG HM8115-2

Results:

These are some excellent results and really show what our solution is capable of running some real world workloads. Just to put this in perspective, a standard x86 dual socket server can run anywhere up to 350w! Our 24 node configuration is roughly equivalent to a standard low power dual socket x86 system (with regards to power consumption).

Update: Further evidence of the power consumption figures above from our friends at calxeda:

Docking Throughput on Viridis

We would like to thanks our friends at the HPC service in Imperial College London for working with on the following tests and providing some great feedback:

<snip>

The test performed was one of molecular docking using the Vina code: http://vina.scripps.edu/

Docking could potentially be a pretty good fit for this type of system because it’s the sort of thing that’s often run in ensembles, so is throughput-oriented. It’s CPU intensive, a mix of integer and fp.

On the ARM system, I compiled with the system Boost, g++ 4.6.3 and compiler flags:

-O3 -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mfloat-abi=hard -ffast-math -fpermissive

On my x86 system (dual E2620, turbo enabled, HT enabled) I used the distributed vina binary.

The test model is HIV protease and a ligand from the DUD docking test set.

Vina was run with:


vina --seed 0 --size_x 59.358 --center_x 4.486 --size_y 35.873 --center_y 0.8825 --size_z 38.609 --center_z 17.8075
--receptor receptor.pdbqt
--ligand ligand.pdbqt
--cpu 4

I elected to run it with 4 threads, which is not the most efficient for maximising throughput (there’s a serial component at the start of the test), but I wanted a threaded component in the test, and I’ll correct for that in the analysis by using CPU time, rather than elapsed wall.

Here are the timings:

ARM

1 run, @4px: 2777.86 user 12:18.62 elapsed 376%CPU

For 6 TASKS:
x86: 278 minutes of CPU time

6 runs @4px: individual ave 1192.94 5:18.70 elapsed 374%CPU

For 6 TASKS: 19.9 minutes CPU time, 5:18m of walltime

So that’s a throughput difference of ~14x between the dual E5-2620 (24t) and the 4core Viridis SoC.

Looking at power, an estimate of the energy required to do 6 repetitions:
Viridis = 7W * 12:18m * 6runs =~ 31kJ
x86 = 200W * 5:18m * 6/6 (all runs simultaneous ) =~ 64kJ

The ARM system is about twice as power efficient as the x86. It might be low power, but it takes a long time getting to the end.

What does this mean in practice? Imagine building a cluster to do nothing but run this code:

*) A Boston Viridis cluster built to the same power budget as an x86 one would
– have (200W/7W) ~ 28x the number of nodes
– a throughput (28 / 14x ) = ~2x that of the x86.

*) A calxeda cluster built to match throughput with the x86 one will
– need 14x the number of nodes
– require ~.5x the power

*) calxeda built to the same volume as an x86 one will have
– 36x # nodes (72/u / 2/u)
– 6x # cores (6 core counting HT)
– ~1.3x higher power draw ( 72*7W / 200W)
– ~2.7x the throughput of the x86. (36x/ 14x)

To conclude:
The Boston Viridis has a ~2x energy advantage on throughput compute-intensive workload, but this is substantially lower than the *power* advantage (~28x) would suggest because of markedly reduced performance (~1/14x) relative to the x86.

P.S: (Sell these things for Raspberry Pi prices and I’ll buy a container-load).

x86 running on ARM!

Today marked an important milestone in our product testing and development for our Viridis platform here at Boston. We can now officially confirm that we have run x86 binaries our on ARM based Viridis platform!

Over the last few weeks, we have been working with a group of engineers, from Eltechs, who are developing software to run x86 programs on ARM-based servers. This software could help lower one of the largest barriers to ARM SoC adoption as alternatives to Intel x86 processors in the datacentre.

Eltechs has developed a binary translator that acts as an emulator. The software currently delivers on average around 45% of native ARM performance. During our tests on the Viridis platform we observed up to 65% of native performance (6 tests were run covering a range of tests – details cannot be published at this time). We will be working with Eltechs on our Viridis platform, who believes it could reach 80% native ARM performance or greater in the future.

Of all the ARM products tested by Eltechs, we were delighted to hear our platform was received well:

The Boston server has been the fastest platform we have tested to date, Vadim Gimpelson, CEO of Eltech

We will continue to work with Eltechs in testing and validating our platform and hope to see further improvements as the software matures. In addition to our successful initial tests, we will be adding this software to the Boston ARM Wrestle program so if anyone has a particular code or application that hasn’t been ported to ARM, please get in touch with us at hpc@boston.co.uk to discuss benchmarking on our test cluster.