The Myth of Hardware Performance
Have you ever been surprised by hardware vendors who consistently claim big performance jumps for their new products? And have you ever seen the same degree of speedup on your own application workload? If not, have you wondered why? Are the figures just marketing tricks?
In fact, we as a system provider often have to explain why a new CPU does not bring the percentage of performance improvement advertised by the CPU manufacturer. The numbers you see are all factual, but they cannot be applied to every application, because applications behave so differently from each other. In this blog post, I am going to explain the reason for this inconsistency in detail.
CPU Becomes Faster in Several Ways
Thanks to Moore’s Law, transistor sizes have been shrinking year after year, allowing CPUs to contain more sophisticated circuits that bring us higher performance. This trend can be seen in the chart below. Note that the y-axis uses a log scale.
In the early days of CPU development, ever-increasing clock frequency was the main driver of higher performance. However, this approach later proved economically and technically inefficient, because dynamic power consumption increases proportionally to the operating frequency. An article by Intel explains the reason in more detail. A 10 GHz CPU would consume roughly four times the power of current CPUs running at 2.4 GHz. Since a single server already consumes about 450 watts, four times that would be 1800 watts, which is more than a grill plate consumes. Power consumption is not the only problem: removing the enormous heat generated at such power levels would be an unrealistic task.
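The scaling argument above can be checked with a few lines of Python. This is only a sketch that takes the linear power-to-frequency relationship stated in the text at face value; the function name and its parameters are made up for illustration. Because the exact ratio 10 / 2.4 is slightly above four, the result lands a little higher than the rounded figure quoted above.

```python
def scaled_power_watts(base_power_watts, base_freq_ghz, target_freq_ghz):
    """Estimate power at a new frequency, assuming power grows linearly
    with frequency (the simplification used in the text)."""
    return base_power_watts * target_freq_ghz / base_freq_ghz


# A 450 W server at 2.4 GHz, hypothetically clocked at 10 GHz.
estimate = scaled_power_watts(450, 2.4, 10)
print(f"Estimated power at 10 GHz: {estimate:.0f} W")
```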
For these reasons, chip vendors have chosen to enhance other parts of the CPU, and even to slowly reduce the frequency, in order to increase performance without raising the power budget. Unfortunately, these enhancements come in different flavors, and hence their influence is hard to understand. In the following, I will list the few factors that matter most.
Cores per CPU: If an application is well parallelized, then the higher the core count, the more work can be processed in the same amount of time. This practice, called parallel computing, has been the main driving force of high performance computing over the past 30 years.
Vector unit width: Parallel computing does not only mean using several cores at the same time, but also exploiting the “parallel” circuits inside each single core. One such approach is to process data in a vectorized way, meaning that a mathematical operator is applied to several data entries at a time. The ability to perform vector computation is often referred to as data parallelism, or “single instruction stream, multiple data streams (SIMD)”, as illustrated in the following diagram.
Traditionally, an operator in a CPU core can be applied to only one data stream. Nowadays, some CPUs can process eight sets of double-precision floating-point operands at a time. Obviously, the wider the vector unit, the more data can be computed simultaneously. However, since this capability is still rather new, not all simulation software has been adapted to take full advantage of it.
Execution ports per core: Execution ports represent another kind of parallelism: a single CPU core contains several “calculators” that can operate in parallel as long as the power budget permits. Higher-end CPUs typically have more ports, and the number of execution ports distinguishes server CPUs from normal desktop CPUs. Moreover, execution ports are classified by their capabilities, listed as follows.
Data type of operand: Integers and floating-point numbers.
Scalar vs. vector operand: Whether one arithmetic operation is applied to one or several data streams.
Arithmetic operations: Additions, subtractions, multiplications, divisions, logical operations, and their combinations. For example, a recently introduced operation called fused multiply-add (FMA) performs a × b + c in a single step. FMA benefits programs with many matrix operations.
Other circuits: These are the circuits that determine the cache size, the number of instructions that can be decoded and issued per cycle, and the number of instructions that can be reordered for the sake of higher execution efficiency.
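To make the vector unit width factor from the list above more concrete, here is a small sketch in plain Python. It does not use real SIMD hardware; it only emulates the idea by applying one "instruction" to a chunk of several data entries at once and counting how many such instructions are needed. The function name `simd_add` and the `lanes` parameter are hypothetical names chosen for this illustration.

```python
def simd_add(a, b, lanes):
    """Emulate an element-wise add where each 'instruction' handles
    up to `lanes` data entries at once. Returns (result, instructions)."""
    result = []
    instructions = 0
    for start in range(0, len(a), lanes):
        # One emulated vector instruction: add `lanes` pairs in one step.
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return result, instructions


a = [float(i) for i in range(16)]
b = [1.0] * 16

# Scalar execution: one data entry per instruction.
scalar_result, scalar_count = simd_add(a, b, lanes=1)
# Eight double-precision lanes, as on the widest vector units mentioned above.
vector_result, vector_count = simd_add(a, b, lanes=8)

print(scalar_count, vector_count)  # prints "16 2"
```

Both calls produce the same sums; only the number of emulated instructions differs, which is the effect a wider vector unit has on code that can actually be vectorized.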
LINPACK: The Standard Benchmark that Characterizes CPU Performance
Since frequency alone cannot fully reflect CPU performance, an additional metric is needed to account for all the aspects mentioned above. LINPACK, a benchmark that carries out intensive matrix operations, has been widely accepted by HPC experts for this role. Because LINPACK’s behavior is well known, it is possible to predict the benchmark’s ideal performance, measured in floating-point operations per second (FLOPS), with the formula below:
FLOPS = frequency × number of cores × number of ports capable of multiply-add × operations done in a single multiply-add × vector width / variable width
Let us take a real CPU, the Intel E5-2690v4, as an example.
Frequency = 2.6 GHz
Number of cores = 14
Number of ports capable of multiply-add = 2
Operations done in a single multiply-add = 2 (“multiply” and “add”)
Vector width = 256 bits
Variable width = 64 bits (double-precision floating point numbers)
Plugging in these numbers, we get
2.6G × 14 × 2 × 2 × 256 / 64 = 582.4 GFLOPS per CPU
This is the “peak performance (Rpeak)” quoted by almost all hardware vendors. In the HPC community, Rpeak is the de facto standard measure of CPU performance.
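The Rpeak formula can also be written as a short Python function, which makes it easy to try other CPUs. This is a minimal sketch; the function and parameter names are my own and simply mirror the terms of the formula, and the input values are the ones listed above for the Intel E5-2690v4.

```python
def rpeak_gflops(freq_ghz, cores, fma_ports, ops_per_fma,
                 vector_bits, variable_bits):
    """Theoretical peak performance in GFLOPS, following the formula in the text."""
    # Number of operands that fit into one vector register.
    lanes = vector_bits // variable_bits
    return freq_ghz * cores * fma_ports * ops_per_fma * lanes


# Values for the Intel E5-2690v4 as listed in the text.
peak = rpeak_gflops(freq_ghz=2.6, cores=14, fma_ports=2, ops_per_fma=2,
                    vector_bits=256, variable_bits=64)
print(f"Rpeak = {peak:.1f} GFLOPS per CPU")  # prints "Rpeak = 582.4 GFLOPS per CPU"
```

Every factor enters the product linearly, so doubling any single one of them (cores, ports, or vector width) doubles Rpeak on paper.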
Is Rpeak a Good Indicator of Your Application Performance?
Although there are plenty of applications that behave similarly to LINPACK, especially in the areas of molecular dynamics and quantum chromodynamics, chances are high that your application behaves differently and hence does not benefit from all the features aimed at maximizing the LINPACK score. For example, the Intel Haswell-EP processors launched in 2014 introduced fused multiply-add, which doubled the LINPACK score compared with the previous Ivy Bridge-EP processors. However, if manipulating matrices is not your application's main job, you are unlikely to see the same speedup.
The diagram below shows the relative performance of codes that support different sets of CPU features. The orange line represents LINPACK, which fully supports all performance features of each new CPU generation and therefore rises steeply. But if your code does not support vectorization, or is poorly parallelized, then the performance gain from one CPU generation to the next is much more modest, as indicated by the blue and grey lines.
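One way to reason about this gap is Amdahl's law, which is not named in the text above but captures the same idea: a new CPU feature only speeds up the fraction of the run time that can actually use it. The sketch below is an illustration under that assumption; the fractions are hypothetical examples, not measurements of any real application.

```python
def amdahl_speedup(accelerated_fraction, feature_speedup):
    """Overall speedup when only `accelerated_fraction` of the run time
    benefits from a feature that is `feature_speedup` times faster."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / feature_speedup)


# A feature that doubles throughput (for example, FMA on matrix-heavy code).
for fraction in (1.0, 0.5, 0.1):  # hypothetical shares of run time that benefit
    overall = amdahl_speedup(fraction, feature_speedup=2.0)
    print(f"{fraction:.0%} of run time accelerated -> {overall:.2f}x overall")
```

Only the code that spends all of its time in the accelerated part sees the full factor of two; with half of the run time accelerated the overall gain is about 1.33x, and with a tenth it is about 1.05x.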
Conclusion: Diagnose Your Application, or Let Us Do It for You
From this, we can conclude that Rpeak, a metric tied to the LINPACK benchmark, is not an appropriate performance indicator for all applications. It makes strong assumptions about application behavior and hence leads to inaccurate performance expectations. Although much effort has been spent on defining a new standard benchmark that represents a wider spectrum of real-world applications, this task has proven more challenging than expected. Therefore, it remains almost impossible to choose the right CPU based on a single benchmark value. The only reliable way is to examine all aspects of an application's run-time behavior and then choose the CPU that best fits that behavior.
To save our customers this headache, we at ict GmbH, Fujitsu HPC Competence Center Aachen, offer two options for selecting the most suitable CPU and setting realistic performance expectations in a hassle-free manner:
(1) PRIMEFLEX for HPC is a collection of pre-defined solutions that are optimized based on field analysis of the most popular ISV applications with a wide range of reference input cases. This option offers optimal cost-effectiveness with the fastest solution-delivery schedule.
(2) The customer-tailored solution requires the customer to submit a collection of their most representative benchmark cases, which are then executed on various platforms and analyzed with great care. The customer-tailored solution ensures the best CPU selection in terms of the ratio of application performance to price.
Please contact our sales representative if you are interested in our offers.