We would like to thank our friends at the HPC service at Imperial College London for working with us on the following tests and providing some great feedback:
The test performed was one of molecular docking using the Vina code: http://vina.scripps.edu/
Docking could potentially be a pretty good fit for this type of system because it’s the sort of thing that’s often run in ensembles, so it is throughput-oriented. It’s CPU intensive, with a mix of integer and floating-point work.
On the ARM system, I compiled with the system Boost, g++ 4.6.3 and compiler flags:
-O3 -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mfloat-abi=hard -ffast-math -fpermissive
On my x86 system (dual E5-2620, turbo enabled, HT enabled) I used the distributed vina binary.
The test model is HIV protease and a ligand from the DUD docking test set.
Vina was run with:
vina --seed 0 --size_x 59.358 --center_x 4.486 --size_y 35.873 --center_y 0.8825 --size_z 38.609 --center_z 17.8075
I elected to run it with 4 threads, which is not the most efficient for maximising throughput (there’s a serial component at the start of the test), but I wanted a threaded component in the test, and I’ll correct for that in the analysis by using CPU time rather than elapsed wall time.
Here are the timings:
Viridis:
1 run, @4px: 2777.86 user 12:18.62 elapsed 376%CPU
For 6 TASKS: 278 minutes of CPU time
x86:
6 runs @4px: individual ave 1192.94 user 5:18.70 elapsed 374%CPU
For 6 TASKS: 19.9 minutes of CPU time per run (all 6 run simultaneously), 5:18m of walltime
So that’s a throughput difference of ~14x between the dual E5-2620 (24 threads) and the 4-core Viridis SoC.
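For anyone who wants to check the ~14x figure, here is a minimal Python sketch that recomputes it from the `time` output above. The helper `to_seconds` and the variable names are my own, and it assumes the Viridis node does its six runs back to back while the x86 node does all six at once, so the x86 CPU time is taken per 4-thread run.

```python
def to_seconds(clock: str) -> float:
    """Convert 'M:SS.ss' (or 'H:MM:SS') text from `time` into seconds."""
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


RUNS = 6

# Viridis: one 4-thread run at a time, so six runs execute back to back.
viridis_user_s = 2777.86
viridis_wall_s = to_seconds("12:18.62")

# x86: six 4-thread runs execute simultaneously (assumption stated above).
x86_user_s = 1192.94  # average user time per run
x86_wall_s = to_seconds("5:18.70")

# CPU-time view: six Viridis runs vs. one x86 run's worth per 4-thread slot.
viridis_cpu_min = RUNS * viridis_user_s / 60
x86_cpu_min_per_run = x86_user_s / 60
cpu_ratio = viridis_cpu_min / x86_cpu_min_per_run

# Wall-time view: time for each node to finish all six runs.
wall_ratio = RUNS * viridis_wall_s / x86_wall_s

print(f"Viridis CPU time for {RUNS} runs: {viridis_cpu_min:.0f} min")
print(f"x86 CPU time per run:          {x86_cpu_min_per_run:.1f} min")
print(f"throughput ratio (CPU time):   {cpu_ratio:.1f}x")
print(f"throughput ratio (wall time):  {wall_ratio:.1f}x")
```

Both views land close to 14x because the two systems show nearly the same CPU utilisation (376% vs 374%).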
Looking at power, an estimate of the energy required to do 6 repetitions:
Viridis = 7W * 12:18m * 6runs =~ 31kJ
x86 = 200W * 5:18m * 6/6 (all runs simultaneous ) =~ 64kJ
The ARM system is about twice as energy efficient as the x86 for this job. It might be low power, but it takes a long time getting to the end.
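The energy estimate is just power times wall time times the number of sequential batches. A small Python sketch of that arithmetic follows; the 7 W and 200 W figures are the estimates used above, not measurements, and the function name `energy_kj` is purely illustrative.

```python
def energy_kj(power_w: float, wall_s: float, sequential_batches: int) -> float:
    """Energy in kJ for running `sequential_batches` batches one after another."""
    return power_w * wall_s * sequential_batches / 1000.0


# Viridis: 7 W (estimate), 12:18.62 per run, six runs back to back.
viridis_kj = energy_kj(7.0, 12 * 60 + 18.62, 6)

# x86: 200 W (estimate), 5:18.70 for all six runs at once, so one batch.
x86_kj = energy_kj(200.0, 5 * 60 + 18.70, 1)

print(f"Viridis: {viridis_kj:.0f} kJ")
print(f"x86:     {x86_kj:.0f} kJ")
print(f"x86 / Viridis energy: {x86_kj / viridis_kj:.1f}x")
```

Because both power figures are rough, the resulting ratio is only good to about one significant figure.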
What does this mean in practice? Imagine building a cluster to do nothing but run this code:
*) A Boston Viridis cluster built to the same power budget as an x86 one would
- have (200W/7W) ~ 28x the number of nodes
- have a throughput (28x / 14x) = ~2x that of the x86.
*) A Calxeda cluster built to match throughput with the x86 one will
- need 14x the number of nodes
- require ~0.5x the power (14 * 7W / 200W)
*) A Calxeda cluster built to the same volume as an x86 one will have
- 36x the number of nodes (72/u / 2/u)
- 6x the number of cores (36 * 4 cores vs 24 threads, counting HT)
- ~1.3x higher power draw (72 * 7W / (2 * 200W))
- ~2.6x the throughput of the x86 (36x / 14x)
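The three scenarios above can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the text: 7 W per ARM node, 200 W per x86 node, a 14x per-node throughput gap, and 72 vs 2 nodes per rack unit. All constant names are mine.

```python
ARM_POWER_W = 7.0       # per Viridis/Calxeda node (estimate from the text)
X86_POWER_W = 200.0     # per dual-socket x86 node (estimate from the text)
SLOWDOWN = 14.0         # one x86 node ~ 14 ARM nodes in throughput
ARM_NODES_PER_U = 72
X86_NODES_PER_U = 2
ARM_CORES = 4
X86_THREADS = 24        # counting HT

# Scenario 1: same power budget as one x86 node.
same_power_nodes = X86_POWER_W / ARM_POWER_W
same_power_throughput = same_power_nodes / SLOWDOWN

# Scenario 2: same throughput as one x86 node.
same_tput_nodes = SLOWDOWN
same_tput_power = same_tput_nodes * ARM_POWER_W / X86_POWER_W

# Scenario 3: same rack volume, expressed per x86 node.
same_vol_nodes = ARM_NODES_PER_U / X86_NODES_PER_U
same_vol_cores = same_vol_nodes * ARM_CORES / X86_THREADS
same_vol_power = same_vol_nodes * ARM_POWER_W / X86_POWER_W
same_vol_throughput = same_vol_nodes / SLOWDOWN

print(f"same power:      {same_power_nodes:.1f}x nodes, {same_power_throughput:.2f}x throughput")
print(f"same throughput: {same_tput_nodes:.0f}x nodes, {same_tput_power:.2f}x power")
print(f"same volume:     {same_vol_nodes:.0f}x nodes, {same_vol_cores:.0f}x cores, "
      f"{same_vol_power:.2f}x power, {same_vol_throughput:.2f}x throughput")
```

Changing `SLOWDOWN` or the power estimates shows how sensitive each conclusion is to those inputs.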
The Boston Viridis has a ~2x energy advantage on a throughput-oriented, compute-intensive workload, but this is substantially lower than the *power* advantage (~28x) would suggest because of markedly reduced performance (~1/14) relative to the x86.
P.S: (Sell these things for Raspberry Pi prices and I’ll buy a container-load).