FhGFS on ARM


FhGFS is short for Fraunhofer Parallel File System (or Fraunhofer FS) and is developed at the Fraunhofer Institute for Industrial Mathematics (ITWM) in Kaiserslautern, Germany. It can be downloaded and used free of charge from the project's website, http://www.fhgfs.com.

FhGFS is a parallel file system developed and optimized for high-performance computing. It uses a distributed metadata architecture for scalability.

Environment

During the multi-stream test, a constant eight clients were used while the number of storage servers was scaled from two to eight.

Striping in FhGFS can be configured on a per-directory and per-file basis. Each directory has its own stripe pattern configuration, which is inherited by new subdirectories and applied to any file created inside the directory. There are currently two parameters that can be configured for the standard RAID-0 stripe pattern: the desired number of storage targets for each file and the chunk size (or block size) for each file stripe.
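The two RAID-0 parameters can be illustrated with a minimal sketch. This is not FhGFS code: the function name `locate_chunk` and its parameters are hypothetical, and it assumes plain round-robin placement of fixed-size chunks across the storage targets.

```python
def locate_chunk(offset, chunk_size, num_targets):
    """Map a byte offset in a file to (target index, offset within that target).

    Hypothetical helper assuming simple round-robin RAID-0 striping;
    it is not part of any FhGFS API.
    """
    if offset < 0 or chunk_size <= 0 or num_targets <= 0:
        raise ValueError("offset must be >= 0; chunk_size and num_targets must be > 0")
    chunk_index = offset // chunk_size          # which chunk of the file
    target = chunk_index % num_targets          # chunks rotate across targets
    local_chunk = chunk_index // num_targets    # chunks already on this target
    local_offset = local_chunk * chunk_size + offset % chunk_size
    return target, local_offset
```

With a 512 KiB chunk size and four targets, for example, the first four chunks of a file land on targets 0 to 3 and the fifth chunk wraps around to target 0 again, which is why sequential streams can draw on every server at once.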

For this test, the number of storage targets per file was set to the number of storage servers in use.

Benchmark Specification – Multi-stream throughput

In this benchmark, the total throughput of sequential read and write requests with multiple streams was measured.

Figure: FhGFS multi-stream throughput

Intel® Atom® vs Calxeda® ARM®

Intel has released the latest version of its low-power CPU, the Intel Atom. The Atom was designed to take on ARM's dominance in the mobile and low-power computing market. Here at Boston Labs we have compared its performance against our own ARM-based Viridis server.

Comparison of the Intel Atom C2750 and Calxeda ARM A9

                        ARM A9    ARM A15   Intel Atom C2750
Cores                   4         4         8
Threads                 4         4         8
Clock speed (GHz)       1.1       1.5       2.4
Instruction set (bit)   32        32        64
Cache (MB)              4         4         4

Test Systems

The Intel Atom was tested in a Supermicro 5018A-TN4, which has a single Intel Atom CPU. The system was running CentOS 6.4 with the latest updates.

http://www.supermicro.com/products/system/1U/5018/SYS-5018A-TN4.cfm

The Calxeda system was a single system-on-chip node that is part of a 48-node Viridis system. The node was running Ubuntu 13.04 with the latest updates.

Benchmarks

A range of benchmarks was run to compare the overall performance of the two systems: STREAM, LMbench and CoreMark.

CoreMark is used to compare the performance of a single core of a CPU. This is an important comparison here because of the differences in clock speed and core count. STREAM measures the memory bandwidth of the CPUs across four operations: Copy, Scale, Add and Triad.

LMbench measures the CPUs' latency and bandwidth across operations on integers and floats, and over the network. The results indicate the strengths and weaknesses of each CPU and can be used to suggest the best option.
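For readers unfamiliar with the four STREAM operations, the sketch below shows what each one computes. It is an illustration only: the real benchmark is written in C and Fortran, pure Python will not produce meaningful bandwidth figures, and the function names here are our own. The byte counts assume 8-byte double-precision elements.

```python
# Bytes moved per loop iteration, assuming 8-byte doubles:
# Copy and Scale touch two arrays, Add and Triad touch three.
BYTES_PER_ITERATION = {"copy": 16, "scale": 16, "add": 24, "triad": 24}


def stream_copy(a, c):
    for i in range(len(a)):
        c[i] = a[i]


def stream_scale(b, c, scalar):
    for i in range(len(c)):
        b[i] = scalar * c[i]


def stream_add(a, b, c):
    for i in range(len(a)):
        c[i] = a[i] + b[i]


def stream_triad(a, b, c, scalar):
    for i in range(len(b)):
        a[i] = b[i] + scalar * c[i]


def bandwidth_mb_per_s(kernel, n, seconds):
    """Bandwidth in MB/s (10**6 bytes) for n elements processed in `seconds`."""
    return BYTES_PER_ITERATION[kernel] * n / seconds / 1e6
```

Because each operation does very little arithmetic per byte moved, the reported figure is dominated by how fast the memory system can feed the cores rather than by raw compute speed.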

Results

Figure: CoreMark results

The Atom core achieved just over twice the number of iterations of the Calxeda core when running CoreMark. This is to be expected, as the Atom's clock speed is 2.18 times faster. The CoreMark results reflect this, with a factor of 2.21 between the two.
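The arithmetic behind those two factors can be checked in a few lines. The clock speeds come from the comparison table and the 2.21 factor from the CoreMark result quoted above; the variable names are our own.

```python
atom_clock_ghz = 2.4   # Intel Atom C2750, from the comparison table
a9_clock_ghz = 1.1     # Calxeda ARM A9, from the comparison table

# Ratio of clock speeds: 2.4 / 1.1, roughly 2.18
clock_ratio = atom_clock_ghz / a9_clock_ghz

# Measured single-core CoreMark ratio quoted in the text
measured_ratio = 2.21

# What remains once clock speed is factored out (roughly 1.01)
per_clock_advantage = measured_ratio / clock_ratio

print(f"clock ratio: {clock_ratio:.2f}")
print(f"per-clock advantage of the Atom: {per_clock_advantage:.3f}")
```

Once clock speed is factored out, the two cores come within about one percent of each other on this benchmark, so the single-core gap is almost entirely a matter of frequency.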

Figure: STREAM results

The STREAM benchmark results are shown in the graph above. As expected, there is a noticeable increase in performance between the two generations of ARM CPUs.

The ARM CPUs have half as many cores as the Atom. Taking this together with their slower clock speeds, we would expect the Atom to produce the best performance, but the results do suggest that the ARM CPU is likely to match the Atom's performance with the next generation.

The LMbench benchmark produces a large amount of data on the CPUs, too much to show here, but these are some of the more interesting points.

The graph below shows the time taken to perform mathematical operations on floats and doubles. These are fundamental operations that are used heavily in computing, particularly in scientific computing. Their performance gives an indication of the overall performance of code running on the CPU. The add and multiply operations give the performance we would expect, with the Atom performing around four times as fast.

The divide operation takes a large amount of time in comparison to the other operations, which is why high-performance codes often try to minimise its use. The Atom appears to perform divisions only around twice as fast as the ARM A9 CPU.

The ARM A15 improves significantly on the A9. The next generation will move to a 64-bit architecture, which should allow it to compete with the Atom.

Figure: LMbench floating-point operation times, Atom vs ARM