Docking Throughput on Viridis

We would like to thank our friends at the HPC service at Imperial College London for working with us on the following tests and providing some great feedback:

<snip>

The test was a molecular docking run using the Vina code: http://vina.scripps.edu/

Docking could be a pretty good fit for this type of system because it’s often run in ensembles, which makes it throughput-oriented. It’s CPU intensive, with a mix of integer and floating-point work.

On the ARM system, I compiled with the system Boost, g++ 4.6.3 and compiler flags:

-O3 -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mfloat-abi=hard -ffast-math -fpermissive

On my x86 system (dual E5-2620, turbo enabled, HT enabled) I used the distributed vina binary.

The test model is HIV protease and a ligand from the DUD docking test set.

Vina was run with:


vina --seed 0 --size_x 59.358 --center_x 4.486 --size_y 35.873 --center_y 0.8825 --size_z 38.609 --center_z 17.8075 \
  --receptor receptor.pdbqt \
  --ligand ligand.pdbqt \
  --cpu 4

I elected to run it with 4 threads, which is not the most efficient way to maximise throughput (there’s a serial component at the start of the test), but I wanted a threaded component in the test. I’ll correct for that in the analysis by using CPU time rather than elapsed wall time.
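For readers who want to reproduce the ensemble-style run, here is a minimal sketch of how several copies of the command above could be launched side by side. It assumes `vina` is on the PATH and that the input files sit in the working directory. The function names are made up for illustration, and the per-run `--out` file name is an assumption added so concurrent runs don't overwrite each other's results; it was not part of the original command.

```python
import subprocess

# Search box copied from the command line above.
BOX = {
    "size_x": 59.358, "center_x": 4.486,
    "size_y": 35.873, "center_y": 0.8825,
    "size_z": 38.609, "center_z": 17.8075,
}


def vina_command(run_index, cpu=4, seed=0):
    """Build the argument list for one docking run (hypothetical helper)."""
    cmd = ["vina", "--seed", str(seed)]
    for flag, value in BOX.items():
        cmd += [f"--{flag}", str(value)]
    cmd += ["--receptor", "receptor.pdbqt",
            "--ligand", "ligand.pdbqt",
            "--cpu", str(cpu)]
    # Assumption: give each run its own output file so runs don't collide.
    cmd += ["--out", f"run{run_index}_out.pdbqt"]
    return cmd


def run_ensemble(n_runs=6, cpu=4):
    """Start n_runs copies at once and wait for all of them to finish."""
    procs = [subprocess.Popen(vina_command(i, cpu)) for i in range(n_runs)]
    return [p.wait() for p in procs]
```

Calling `run_ensemble(6)` would correspond to the six simultaneous 4-thread runs used on the x86 node; wrapping each command in `/usr/bin/time` is one way to collect user and elapsed figures like those reported here.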

Here are the timings:

ARM

1 run, @4px: 2777.86 user 12:18.62 elapsed 376%CPU

For 6 TASKS: 278 minutes of CPU time

x86

6 runs @4px: individual ave 1192.94 user 5:18.70 elapsed 374%CPU

For 6 TASKS: 19.9 minutes CPU time per run (about 119 minutes in total), 5:18m of walltime

So that’s a throughput difference of ~14x between the dual E5-2620 (24t) and the 4core Viridis SoC: the x86 node finishes six runs in 5:18, while the SoC finishes one run in 12:18.
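As a rough cross-check of the ~14x figure, the sketch below recomputes the throughput ratio two ways from the timings quoted above. The first uses wall-clock time (six concurrent x86 runs versus one ARM run). The second uses CPU time, under the assumption that every hardware thread (24 on the x86 node, 4 on the SoC) can be kept busy. The helper and variable names are illustrative only.

```python
def to_seconds(clock):
    """Convert an 'MM:SS.ss' elapsed string to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


# Timings copied from the results above.
arm_user_s, arm_wall_s = 2777.86, to_seconds("12:18.62")  # one run on one SoC
x86_user_s, x86_wall_s = 1192.94, to_seconds("5:18.70")   # average of six concurrent runs
x86_concurrent_runs = 6

# Wall-clock view: runs completed per second on each node.
arm_rate_wall = 1 / arm_wall_s
x86_rate_wall = x86_concurrent_runs / x86_wall_s
wall_ratio = x86_rate_wall / arm_rate_wall

# CPU-time view: assume all hardware threads stay busy (an idealisation).
arm_threads, x86_threads = 4, 24
cpu_ratio = (x86_threads / x86_user_s) / (arm_threads / arm_user_s)

print(f"throughput ratio from wall time: {wall_ratio:.1f}x")
print(f"throughput ratio from CPU time:  {cpu_ratio:.1f}x")
```

Both views land close to each other because the two systems show nearly the same CPU utilisation (376% versus 374%).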

Looking at power, an estimate of the energy required to do 6 repetitions:
Viridis = 7W * 12:18m * 6runs =~ 31kJ
x86 = 200W * 5:18m * 6/6 (all runs simultaneous ) =~ 64kJ

The ARM system is about twice as energy efficient as the x86. It might be low power, but it takes a long time getting to the end.
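The energy estimate is simply power multiplied by time, which the short sketch below spells out. The 7 W and 200 W figures are the estimates used in this post, not measurements made by the script, and the variable names are illustrative.

```python
# Power estimates used in the post (watts).
VIRIDIS_W = 7.0
X86_W = 200.0

# Wall-clock times from the timings above (seconds).
arm_wall_s = 12 * 60 + 18.62   # one run on the SoC
x86_wall_s = 5 * 60 + 18.70    # six runs at once on the x86 node

runs = 6

# The SoC runs the six repetitions back to back; the x86 runs them together.
viridis_kj = VIRIDIS_W * arm_wall_s * runs / 1000
x86_kj = X86_W * x86_wall_s / 1000

print(f"Viridis: {viridis_kj:.1f} kJ")
print(f"x86:     {x86_kj:.1f} kJ")
print(f"x86 / Viridis energy: {x86_kj / viridis_kj:.2f}x")
```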

What does this mean in practice? Imagine building a cluster to do nothing but run this code:

*) A Boston Viridis cluster built to the same power budget as an x86 one would
- have (200W/7W) ~ 28x the number of nodes
- a throughput (28 / 14x ) = ~2x that of the x86.

*) A Calxeda cluster built to match throughput with the x86 one will
- need 14x the number of nodes
- require ~.5x the power

*) Calxeda built to the same volume as an x86 one will have
- 36x # nodes (72/u / 2/u)
- 6x # cores (72*4 cores / 2*24 threads per u, counting HT)
- ~1.3x higher power draw (72*7W / 2*200W)
- ~2.6x the throughput of the x86. (36x / 14x)
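The three scenarios above are back-of-envelope ratios, and the sketch below recomputes them from the figures used in this post. It assumes 7 W per SoC, 200 W per x86 node, a 14x per-node throughput gap, and a density of 72 SoCs versus 2 x86 nodes per rack unit. The constant and function names are illustrative only.

```python
ARM_NODE_W = 7.0
X86_NODE_W = 200.0
X86_VS_ARM_THROUGHPUT = 14.0   # one x86 node does the work of ~14 SoCs
ARM_NODES_PER_U = 72
X86_NODES_PER_U = 2
ARM_CORES_PER_NODE = 4
X86_THREADS_PER_NODE = 24      # dual 6-core with HT


def same_power():
    """ARM nodes and relative throughput for one x86 node's power budget."""
    nodes = X86_NODE_W / ARM_NODE_W
    return {"nodes": nodes, "throughput": nodes / X86_VS_ARM_THROUGHPUT}


def same_throughput():
    """ARM nodes and relative power needed to match one x86 node."""
    nodes = X86_VS_ARM_THROUGHPUT
    return {"nodes": nodes, "power": nodes * ARM_NODE_W / X86_NODE_W}


def same_volume():
    """Ratios per rack unit, ARM relative to x86."""
    nodes = ARM_NODES_PER_U / X86_NODES_PER_U
    return {
        "nodes": nodes,
        "cores": nodes * ARM_CORES_PER_NODE / X86_THREADS_PER_NODE,
        "power": (ARM_NODES_PER_U * ARM_NODE_W) / (X86_NODES_PER_U * X86_NODE_W),
        "throughput": nodes / X86_VS_ARM_THROUGHPUT,
    }


for name, scenario in [("same power", same_power()),
                       ("same throughput", same_throughput()),
                       ("same volume", same_volume())]:
    print(name, {key: round(value, 2) for key, value in scenario.items()})
```

Changing any of the constants (for example a different wattage estimate for the x86 node) shows how sensitive each conclusion is to the input assumptions.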

To conclude:
The Boston Viridis has a ~2x energy advantage on throughput-oriented, compute-intensive workloads, but this is substantially lower than the *power* advantage (~28x) would suggest because of markedly reduced performance (~1/14x) relative to the x86.

P.S: (Sell these things for Raspberry Pi prices and I’ll buy a container-load).

3 thoughts on “Docking Throughput on Viridis”

  1. Please advise how many Calxeda quads were benchmarked versus the dual Xeon 2620? And what frequency were the Calxeda parts? Thank you.

    • The guys at Imperial tested a single, 1.1GHz quad-core SOC against a dual Xeon E5-2620 system.

      Naturally, the dual Xeon system was faster than a single SOC but their conclusion showed the ARM architecture of the Viridis was twice as energy efficient as the x86 architecture for a given performance.

  2. Thank you for clarification of single Calxeda 1.1 GHz quad v Xeon E5 2620 2 GHz.

    Please advise on my take away from your assessment of the Imperial College tests:

    So as I understand from the benchmark result, seven 32 bit ARM 1.1 GHz quads equal the processing performance of one Xeon 2620 hexa 2.0 GHz in this Intel loaded molecular docking benchmark; http://www.lowpowerservers.com/?p=141. How likely is it that the one Calxeda quad was thrashed in this Intel loaded benchmark versus dual Xeon 2620s? And does Vina code for molecular docking require FPU? No wonder TI ARM Server HPC DSP group has an investment in Calxeda.

    Xeon 2620 sells for $410 in 1,000 unit quantities. Not taking into account added system blocks that are BSM, I/O, NIC, Calxeda silicon is then valued at $59 each ($410/7) which flies under Intel average fixed cost.

    But might those Calxeda quads running 55% the frequency of 2620 be valued at $114?

    On hexa core equal basis $171?

    With BSM, I/O, NIC $198 placing Energy Core at Intel average total cost.

    On power what is the requirement to bridge seven quads on blade?

    Is there a value message here for ARM Server SOC on embedded feature set that is not getting through?

    Appreciate your view on value consideration.

    Mike Bruzzone
    Camp Marketing
