Docking Throughput on Viridis

We would like to thank our friends at the HPC service at Imperial College London for working with us on the following tests and providing some great feedback:

<snip>

The test performed was molecular docking using the Vina code: http://vina.scripps.edu/

Docking could potentially be a pretty good fit for this type of system because it's the sort of thing that's often run in ensembles, so it's throughput-oriented. It's also CPU-intensive, with a mix of integer and floating-point work.
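As an aside, the ensemble pattern is trivially scriptable. Below is a rough Python sketch of how one might drive a batch of independent docking runs; the ligand file names and the run count are made up for illustration, and this is not how the benchmarks here were launched.

# A rough sketch of driving an ensemble of independent Vina runs.
# File names and run count are hypothetical; each task is independent of the others.
import subprocess

ligands = ["ligand_%02d.pdbqt" % i for i in range(6)]   # e.g. six ligands to dock
procs = []
for lig in ligands:
    procs.append(subprocess.Popen([
        "vina", "--seed", "0", "--cpu", "4",
        "--receptor", "receptor.pdbqt", "--ligand", lig,
        "--center_x", "4.486", "--center_y", "0.8825", "--center_z", "17.8075",
        "--size_x", "59.358", "--size_y", "35.873", "--size_z", "38.609",
        "--out", lig.replace(".pdbqt", "_out.pdbqt"),
    ]))
for p in procs:
    p.wait()   # wait for the whole ensemble to finish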

On the ARM system, I compiled with the system Boost, g++ 4.6.3 and compiler flags:

-O3 -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mfloat-abi=hard -ffast-math -fpermissive

On my x86 system (dual E5-2620, turbo enabled, HT enabled) I used the distributed Vina binary.

The test model is HIV protease and a ligand from the DUD docking test set.

Vina was run with:


vina --seed 0 --size_x 59.358 --center_x 4.486 --size_y 35.873 --center_y 0.8825 --size_z 38.609 --center_z 17.8075 \
     --receptor receptor.pdbqt \
     --ligand ligand.pdbqt \
     --cpu 4

I elected to run it with 4 threads, which is not the most efficient way to maximise throughput (there's a serial component at the start of each run), but I wanted a threaded component in the test, and I'll correct for that in the analysis by using CPU time rather than elapsed wall time.

Here are the timings:

ARM

1 run @ 4 threads: 2777.86 s user, 12:18.62 elapsed, 376% CPU
For 6 tasks (run back to back): ~278 minutes of CPU time, ~74 minutes of walltime

x86

6 runs @ 4 threads each, all simultaneous: 1192.94 s user per run on average, 5:18.70 elapsed, 374% CPU
For 6 tasks: ~119 minutes of CPU time (19.9 minutes per run), 5:18 of walltime

Six tasks therefore take ~74 minutes of walltime on the ARM node (run back to back) versus 5:18 on the x86 (all at once), so that's a throughput difference of ~14x between the dual E5-2620 (24 threads) and the 4-core Viridis SoC.
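For the record, here is the arithmetic behind that ~14x, reproduced as a few lines of Python using the measured wall times quoted above:

# Per-node throughput on this workload, from the timings above.
arm_wall_per_run_s = 12 * 60 + 18.62   # one run at a time on the 4-core ARM node
x86_wall_for_six_s = 5 * 60 + 18.70    # six runs simultaneously on the dual E5-2620

arm_tasks_per_hour = 3600.0 / arm_wall_per_run_s        # ~4.9 tasks/hour
x86_tasks_per_hour = 6 * 3600.0 / x86_wall_for_six_s    # ~67.8 tasks/hour

print("x86/ARM throughput ratio: %.1fx" % (x86_tasks_per_hour / arm_tasks_per_hour))   # ~13.9x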

Looking at power, an estimate of the energy required to do 6 repetitions:
Viridis = 7 W × 12:18 × 6 runs ≈ 31 kJ
x86 = 200 W × 5:18 × 6/6 (all runs simultaneous) ≈ 64 kJ
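The same estimate in Python, in case you want to play with the numbers; the 7 W and 200 W per-node figures are the rough estimates used above, not measurements from these runs:

# Energy to complete six docking tasks, using the rough per-node power figures above.
arm_kJ = 7   * (12 * 60 + 18.62) * 6 / 1000.0   # six runs back to back on one ~7 W node
x86_kJ = 200 * (5 * 60 + 18.70)      / 1000.0   # six runs simultaneously on one ~200 W node

print("Viridis: %.0f kJ, x86: %.0f kJ" % (arm_kJ, x86_kJ))   # ~31 kJ vs ~64 kJ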

So the ARM system is about twice as energy efficient as the x86 on this job: it might be low power, but it takes a long time to get to the end.

What does this mean in practice? Imagine building a cluster to do nothing but run this code (the arithmetic for each scenario is sketched after the list):

*) A Boston Viridis cluster built to the same power budget as an x86 one would
– have (200 W / 7 W) ≈ 28x the number of nodes
– have (28x / 14x) ≈ 2x the throughput of the x86

*) A Calxeda cluster built to match the throughput of the x86 one would
– need 14x the number of nodes
– require ~0.5x the power

*) A Calxeda cluster built into the same volume as an x86 one would have
– 36x the number of nodes (72 per U vs 2 per U)
– 6x the number of cores (counting x86 HT threads as cores)
– ~1.3x higher power draw (72 × 7 W / (2 × 200 W))
– ~2.6x the throughput of the x86 (36x / 14x)
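To make the scenario arithmetic explicit, here's a small Python sketch that reproduces the ratios above from the assumed per-node figures (7 W vs 200 W, 72 vs 2 nodes per U, and the ~14x per-node throughput gap):

# Back-of-envelope cluster scenarios from the per-node figures above.
arm_node_W, x86_node_W = 7.0, 200.0
arm_nodes_per_U, x86_nodes_per_U = 72, 2
x86_speedup = 14.0   # one x86 node has ~14x the docking throughput of one ARM node

# Same power budget
n = x86_node_W / arm_node_W                                   # ~28x the nodes
print("same power:  %4.0fx nodes, %4.1fx throughput" % (n, n / x86_speedup))

# Same throughput
print("same speed:  %4.0fx nodes, %4.2fx power" % (x86_speedup, x86_speedup * arm_node_W / x86_node_W))

# Same rack volume
n = arm_nodes_per_U / float(x86_nodes_per_U)                                # 36x the nodes
power = (arm_nodes_per_U * arm_node_W) / (x86_nodes_per_U * x86_node_W)     # ~1.3x
print("same volume: %4.0fx nodes, %4.1fx power, %4.1fx throughput" % (n, power, n / x86_speedup))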

To conclude:
The Boston Viridis has a ~2x energy advantage on this throughput-oriented, compute-intensive workload, but this is substantially lower than the *power* advantage (~28x) would suggest, because of its markedly reduced performance (~1/14x) relative to the x86.

P.S.: (Sell these things for Raspberry Pi prices and I'll buy a container-load.)