Boost OpenFOAM Simulation Speed with Intel Xeon Phi

Xeon Phi “Knights Landing” Processor: Crafted for Massively Parallel Simulation
For high-performance simulation experts, the Intel® Xeon® series is probably the de facto standard CPU for computations nowadays. Nevertheless, if you want the best efficiency for OpenFOAM, you should not overlook an attractive alternative: the Intel® Xeon Phi™ processor family, which was first launched in 2013 as a co-processor and then updated in 2016 as a host processor bearing the code name “Knights Landing” (KNL).
As a host processor, Xeon Phi KNL behaves much like a traditional Xeon: it boots a server by itself, runs the operating system, and is compatible with most codes written for Xeon. To gain higher performance, recompilation with Xeon Phi-specific flags and modernization of the code towards the Xeon Phi architecture might be needed. Luckily, for OpenFOAM, most of this effort has already been done by Intel.
Having said that, Xeon Phi of course has its own unique features in contrast to Xeon, which are listed as follows:
  • 64 – 72 cores are built into a Xeon Phi processor in contrast to 12 – 22 cores on a Xeon “Broadwell” chip.
  • High-bandwidth memory, integrated into the Xeon Phi chip, provides 4x the memory bandwidth of the traditional DDR4 memory on the Xeon platform.
  • The vector unit of Xeon Phi is twice as wide as that of Xeon, allowing twice as many operands to be processed at a time if the code is vectorized.
  • The Intel Omni-Path 100 Gbit interconnect can be integrated into the CPU, which saves the cost of an additional PCIe adapter.
The following figure depicts the architecture of Intel Xeon Phi KNL.

Source: Intel
To make efficient use of Intel Xeon Phi processors, Fujitsu has developed the PRIMERGY CX600 M1 server platform. The 2-HU platform houses 8x PRIMERGY CX 1640 M1 compute nodes, four on the front and four on the rear side. Each node is equipped with one KNL CPU. As of April 2017, Fujitsu has completed two large installations of this server type: at the Jülich Supercomputing Center in Germany (QPACE3) and at the Joint Center for Advanced HPC near Tokyo in Japan (Oakforest-PACS). The rear side of the CX600 is shown below.

OpenFOAM on Xeon Phi
In my last article, “Drawing realistic CPU-performance expectations”, I wrote that an application’s performance can be influenced by several factors in the CPU, not only by its frequency. Now you may wonder how KNL’s strengths apply to OpenFOAM.
OpenFOAM typically scales much better across nodes than within a node, and scaling within a node is particularly poor on high-core-count Xeon CPUs. The main reason is memory bandwidth. As OpenFOAM usually consumes all the memory bandwidth of a Xeon node, its performance is determined more by the memory bandwidth, which is fixed on a single node, than by the computational power of the processor, which is proportional to the number of cores.
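The bandwidth-bound behavior described above can be illustrated with a simple roofline-style estimate, in which attainable performance is the smaller of the compute limit and the memory limit. This is only a sketch: the peak performance, bandwidth, and arithmetic-intensity values are hypothetical placeholders chosen for illustration, not measurements of any Xeon or Xeon Phi system.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline estimate: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# All three values are hypothetical placeholders, not measurements.
PEAK_GFLOPS = 1000.0        # compute roof of an imaginary node with all cores busy
DDR_BANDWIDTH_GBS = 100.0   # memory bandwidth of that imaginary node
FLOPS_PER_BYTE = 0.2        # assumed low arithmetic intensity of the solver

# A quarter of the cores already saturates the memory roof ...
few_cores = attainable_gflops(PEAK_GFLOPS / 4, DDR_BANDWIDTH_GBS, FLOPS_PER_BYTE)
# ... so using all cores on the same memory gains nothing ...
all_cores = attainable_gflops(PEAK_GFLOPS, DDR_BANDWIDTH_GBS, FLOPS_PER_BYTE)
# ... whereas 4x the memory bandwidth lifts the memory roof by 4x.
four_x_bandwidth = attainable_gflops(PEAK_GFLOPS, 4 * DDR_BANDWIDTH_GBS, FLOPS_PER_BYTE)

print(few_cores, all_cores, four_x_bandwidth)
```

Under these assumed numbers, adding cores leaves the estimate unchanged, while raising the bandwidth raises it proportionally until the compute roof is reached.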
Xeon Phi KNL, with its 16 GB of integrated high-bandwidth memory, is just the right cure for OpenFOAM’s memory-bandwidth-bound behavior. Although 16 GB may sound small, it goes a long way once the model has been decomposed across nodes. In our analysis, a 100-million-cell model, when run on 32 nodes, consumes only around 18 GB per node. When the model is larger than the high-bandwidth memory, the standard DDR4 memory is used automatically.
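The footprint figures above allow a rough back-of-the-envelope estimate of how many nodes a model needs before it fits into the high-bandwidth memory. The sketch below assumes that memory use scales linearly with the cell count and divides evenly across nodes, which ignores halo cells and fixed per-process overhead; the function name is made up for this illustration.

```python
import math

# Figures quoted in the text above.
REFERENCE_CELLS = 100e6      # cells in the analyzed model
REFERENCE_NODES = 32         # nodes used in that run
REFERENCE_GB_PER_NODE = 18.0 # observed footprint per node
HBM_GB = 16.0                # high-bandwidth memory per KNL node

# Assumption: footprint is proportional to the number of cells (decimal GB).
BYTES_PER_CELL = REFERENCE_GB_PER_NODE * 1e9 * REFERENCE_NODES / REFERENCE_CELLS


def nodes_to_fit_in_hbm(cells):
    """Smallest node count whose per-node share stays within HBM_GB."""
    total_gb = cells * BYTES_PER_CELL / 1e9
    return math.ceil(total_gb / HBM_GB)


print(BYTES_PER_CELL)                # bytes per cell implied by the quoted figures
print(nodes_to_fit_in_hbm(100e6))    # nodes for the 100-million-cell model
print(nodes_to_fit_in_hbm(88e6))     # nodes for an 88-million-cell model
```

By this estimate, the 32-node run of the 100-million-cell model slightly exceeds 16 GB per node, so a small share of its data would fall back to DDR4.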
In fact, build rules for Xeon Phi KNL have been included in OpenFOAM since the OpenCFD release v1612+ and the OpenFOAM Foundation release 4.1+. One just needs to set WM_COMPILER=GccKNL or IccKNL in the etc/bashrc file, and the KNL-specific compilation options will be applied. On top of that, one is encouraged to use the Intel-optimized GaussSeidel and symGaussSeidel smoothers to obtain higher performance. The code is freely available at
In summary, OpenFOAM’s biggest performance bottleneck is addressed by the high-bandwidth memory on Xeon Phi KNL. OpenFOAM’s source code is officially ready for Xeon Phi – just specify the right build rule, then you get the right performance.
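The build-rule switch mentioned above amounts to changing one assignment in etc/bashrc. As a minimal sketch, the helper below rewrites that assignment in the text of the file. It assumes the line has the form `export WM_COMPILER=<value>`; the function name and the sample text are made up for illustration, so check your own etc/bashrc before relying on it.

```python
import re


def set_wm_compiler(bashrc_text, compiler):
    """Return bashrc_text with the WM_COMPILER assignment set to `compiler`."""
    # Match only 'export WM_COMPILER=...', not e.g. 'export WM_COMPILER_TYPE=...'.
    pattern = re.compile(r"^([ \t]*export[ \t]+WM_COMPILER=)\S*", flags=re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + compiler, bashrc_text)
    if count == 0:
        raise ValueError("no 'export WM_COMPILER=' line found")
    return new_text


# Hypothetical excerpt standing in for the real etc/bashrc.
sample = (
    "export WM_COMPILER_TYPE=system\n"
    "export WM_COMPILER=Gcc\n"
    "export WM_ARCH_OPTION=64\n"
)
patched = set_wm_compiler(sample, "GccKNL")
print(patched)
```

After changing the real file, the environment has to be sourced again and OpenFOAM rebuilt for the KNL-specific options to take effect.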
Field measurements
After all, theories are only useful when proven by experiments. To show that Xeon Phi is indeed a better choice than Xeon for OpenFOAM, we evaluated the performance with a real-world simulation model: the motorbike case, shown below, from the standard tutorial suite of The OpenFOAM Foundation.

The original model in the tutorial has only three million cells. To simulate a more realistic workload, we refined the mesh to 88 million cells. The remaining software configuration is provided below.
  • Flow type: incompressible flow
  • OpenFOAM version: 4.1 + Intel-optimized smoothers for KNL
  • Compiler: GCC 6.1.0
  • MPI library: Intel MPI 2017 update 1
The configurations of the two hardware platforms studied, Xeon and Xeon Phi, are provided in the following table. Note that the Xeon platform is about 25% more expensive than the Xeon Phi platform in this case.
Item | Xeon | Xeon Phi
Compute node | Fujitsu PRIMERGY CX 2550 M2 | Fujitsu PRIMERGY CX 1640 M1
Processor | 2x Intel Xeon E5-2690 v4, 14 cores/CPU, 2.9 GHz | 1x Intel Xeon Phi 7210 (KNL), 64 cores/CPU, 1.3 GHz
High-bandwidth memory | none | 16 GB
DDR memory | 128 GB DDR4-2400 | 192 GB DDR4-2133
Interconnect | Intel Omni-Path | Intel Omni-Path
Peak power consumption | 394 W | 321 W
We have summarized our measurement results in the following chart. As can be seen from the height of the bars, Xeon Phi always finished the simulation significantly ahead of Xeon. On average, with Xeon Phi, you can run 30% more simulations per day than with Xeon.

But Xeon Phi is not just more powerful than Xeon; it is also superior in performance per price, performance per watt, and performance per unit of rack space (height units). We have calculated these numbers in the following chart, based on single-node measurements. Note that all numbers have been normalized against the Xeon platform.

Fujitsu’s PRIMERGY CX 1640 M1 compute node, together with Intel’s Xeon Phi processor, unleashes the performance of OpenFOAM with a massively parallel core architecture and high-bandwidth memory. The application has been optimized for Xeon Phi for maximum performance. In tests with a realistic 88-million-cell incompressible-flow simulation model, we have shown three advantages of Fujitsu’s PRIMERGY CX 1640 M1 platform with Xeon Phi:
  1. High performance per price: For the same price, one Xeon Phi node runs each simulation 1.7x faster than a 2-socket Xeon node.
  2. Energy efficient: One Xeon Phi node consumes 2/3 of the energy of a 2-socket Xeon node for every simulation.
  3. Space efficient: One Xeon Phi node takes only 1/3 of the space of a 2-socket Xeon node for every simulation.
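The first two ratios can be roughly cross-checked against the figures quoted earlier in this article. The sketch below is not the calculation behind the chart: it assumes the average 30% throughput gain also holds on a single node and that each node draws its peak power for the whole run. Space efficiency is left out because the chassis density of the Xeon platform is not stated here.

```python
# Inputs taken from the text of this article.
SPEEDUP = 1.30           # Xeon Phi throughput relative to Xeon (30% more simulations per day)
XEON_PRICE_RATIO = 1.25  # Xeon platform price relative to Xeon Phi (about 25% more expensive)
XEON_PEAK_WATT = 394.0   # peak power of the Xeon node
PHI_PEAK_WATT = 321.0    # peak power of the Xeon Phi node

# Performance per price, normalized to Xeon: throughput gain times price advantage.
perf_per_price = SPEEDUP * XEON_PRICE_RATIO

# Energy per simulation, normalized to Xeon: power ratio times runtime ratio.
# Assumption: both nodes run at peak power for the full simulation.
energy_per_simulation = (PHI_PEAK_WATT / XEON_PEAK_WATT) / SPEEDUP

print(round(perf_per_price, 2), round(energy_per_simulation, 2))
```

Under these assumptions the estimates land close to the 1.7x and 2/3 figures above; the remaining gap is expected, since the chart is based on single-node measurements rather than on the average.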
If you are interested in our Fujitsu PRIMERGY CX 1640 M1/CX 600 M1 offering for Xeon Phi, please contact our sales representative.
You can also experience the power of an HPC environment with Intel® Xeon Phi™ processors first-hand, for free. Register for access and try a system that is preloaded with a set of ready-to-use applications, or bring your own codes. Fujitsu’s HPC team will provide technical support and assistance to ensure you get the most out of your experience.
Register and experience the benefits now!



About the author

Chih-Song Kuo has five years of technical experience in the HPC field. As a Senior Sales Consultant - HPC Benchmark Specialist at ICT, he analyzes customer application workloads and optimizes performance by suggesting the most efficient hardware solution together with the most effective software run-time parameters. Prior to his current job, he worked on several of the world’s top 10 HPC systems, such as TSUBAME2.5 and JUQUEEN. He holds a master’s degree in Software Systems Engineering from RWTH Aachen University.