False Sharing
Sometimes the most innocent-looking pieces of code can hurt the overall latency or throughput of a system. In this article, we’re going to see how false sharing can turn multithreading against us.
Let’s start with a simple contract for counting. Each implementation encapsulates a collection of counters, and the `inc` method is responsible for incrementing the counter at the given `index`.
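A minimal sketch of such a contract, assuming the interface is simply called `Counter`, could look like this:

```java
// A minimal sketch of the counting contract; the interface name is an
// assumption, but the inc(int) signature matches the one used below.
public interface Counter {

    // Increments the counter at the given index.
    void inc(int index);
}
```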
Atomic Counters
Let’s start with 8 distinct counters. We’re going to increment each of these values atomically. Also, to avoid the extra memory footprint associated with `AtomicLong`s, we will use `VarHandle`s here.
The `inc(int index)` method increments the counter at the given `index`.
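The listing below is a sketch of what the whole `SimpleCounter` might look like, assuming eight `volatile long` fields named `v1` through `v8` (the names referenced later by JOL) and one `VarHandle` per field; the original class may be organized differently:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// A sketch, not the article's exact listing: eight plain long counters,
// each incremented atomically through a VarHandle.
public class SimpleCounter implements Counter {

    private volatile long v1, v2, v3, v4, v5, v6, v7, v8;

    private static final VarHandle[] HANDLES = new VarHandle[8];

    static {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            String[] names = {"v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8"};
            for (int i = 0; i < names.length; i++) {
                HANDLES[i] = lookup.findVarHandle(SimpleCounter.class, names[i], long.class);
            }
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    @Override
    public void inc(int index) {
        // Atomically adds one to the counter at the given index.
        HANDLES[index].getAndAdd(this, 1L);
    }
}
```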
First Benchmark
The following benchmark runs the `inc` method with 8 threads under heavy load. Each thread is going to call the method with a distinct `index`. Therefore, there won’t be any contention, as thread i operates on the i-th counter.
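A sketch of such a benchmark with JMH is shown below; the annotations and the per-thread index assignment via `ThreadParams` are assumptions, and the original harness may be set up differently (for instance, parameterized over the different counter implementations):

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.infra.ThreadParams;

// A sketch of the benchmark: 8 threads, each one hammering its own slot.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Threads(8)
public class CounterBenchmark {

    private final Counter counter = new SimpleCounter();

    @Benchmark
    public void inc(ThreadParams threads) {
        // Each thread gets a distinct index, so thread i always operates
        // on the i-th counter and there is no logical contention.
        counter.inc(threads.getThreadIndex() % 8);
    }
}
```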
Running this benchmark would result in something like:
On average, each `inc` call takes about 170 nanoseconds to complete.
Let’s see if we can improve this situation.
Know Your Object Layout
If we run the following:
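The snippet below is a sketch using JOL’s `ClassLayout` API and assumes the `jol-core` dependency on the classpath; the article’s exact listing may differ:

```java
import org.openjdk.jol.info.ClassLayout;

public class LayoutInspector {

    public static void main(String[] args) {
        // Prints the offset, size, and padding of every field in SimpleCounter.
        System.out.println(ClassLayout.parseClass(SimpleCounter.class).toPrintable());
    }
}
```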
Then JOL will print the object layout for the `SimpleCounter`:
We can visualize this layout as something like:
It follows the typical OOP or Ordinary Object Pointer structure in the JVM: an object header immediately followed by zero or more instance fields. The header itself consists of one mark word and one klass word, plus a 32-bit array length (only for arrays), which adds up to 12 bytes in total for a plain object like ours. Also, the JVM may add some padding for alignment purposes.

Immediately after the object header and its padding come the 8 `long` instance fields.
We can also verify this layout using the lovely `sun.misc.Unsafe`.
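The exact listing isn’t reproduced here; one way to do it is to grab the `Unsafe` singleton reflectively and print each field’s offset:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

import sun.misc.Unsafe;

public class OffsetInspector {

    public static void main(String[] args) throws Exception {
        // Unsafe can't be instantiated directly, so read its "theUnsafe" singleton.
        Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
        theUnsafe.setAccessible(true);
        Unsafe unsafe = (Unsafe) theUnsafe.get(null);

        for (Field field : SimpleCounter.class.getDeclaredFields()) {
            if (!Modifier.isStatic(field.getModifiers())) {
                // Offset of each instance field from the start of the object.
                System.out.println(field.getName() + " -> " + unsafe.objectFieldOffset(field));
            }
        }
    }
}
```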
Padding: Random Encounter
Let’s see how adding some padding between counters affects the latency. First, we isolate the counters from the object header by adding 6 `long` values before `v1`.

6 `long` values (48 bytes), plus 12 bytes of object header, plus 4 bytes of padding added by the JVM, is equal to 64 bytes (more on this magic number 64 later).

Then we’re going to separate each counter by one `long` value.
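Putting both steps together, `Padded1Counter` could be sketched as follows. The padding-field names are illustrative, and since HotSpot doesn’t promise to keep declaration order, the resulting layout should still be verified with JOL, as we do next:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class Padded1Counter implements Counter {

    // 6 longs (48 bytes) + 12-byte header + 4 bytes of JVM padding = 64 bytes,
    // so v1 starts at the beginning of a fresh 64-byte block.
    private long h1, h2, h3, h4, h5, h6;

    private volatile long v1;
    private long p1;
    private volatile long v2;
    private long p2;
    private volatile long v3;
    private long p3;
    private volatile long v4;
    private long p4;
    private volatile long v5;
    private long p5;
    private volatile long v6;
    private long p6;
    private volatile long v7;
    private long p7;
    private volatile long v8;

    private static final VarHandle[] HANDLES = new VarHandle[8];

    static {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            String[] names = {"v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8"};
            for (int i = 0; i < names.length; i++) {
                HANDLES[i] = lookup.findVarHandle(Padded1Counter.class, names[i], long.class);
            }
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    @Override
    public void inc(int index) {
        HANDLES[index].getAndAdd(this, 1L);
    }
}
```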
Here’s how the JVM lays out the `Padded1Counter` in the heap:
If we benchmark `Padded1Counter` against the `SimpleCounter`:
Quite surprisingly, the `Padded1Counter` outperforms the `SimpleCounter` in terms of latency.
Let’s add another 8 bytes between counters:
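Only the field declarations are sketched below; the `VarHandle` plumbing and `inc` would mirror `Padded1Counter`, and the padding-field names are again illustrative:

```java
// A sketch of Padded2Counter's field layout: two padding longs
// between every two counters.
public class Padded2Counter {

    private long h1, h2, h3, h4, h5, h6;   // isolates v1 from the object header

    private volatile long v1;
    private long p11, p12;
    private volatile long v2;
    private long p21, p22;
    private volatile long v3;
    private long p31, p32;
    private volatile long v4;
    private long p41, p42;
    private volatile long v5;
    private long p51, p52;
    private volatile long v6;
    private long p61, p62;
    private volatile long v7;
    private long p71, p72;
    private volatile long v8;

    // VarHandle setup and inc(int) omitted; identical in shape to Padded1Counter's.
}
```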
The memory layout turns out to be like this:
And the benchmark result is again interesting:
If we continue adding padding until we reach 7 `long` values of padding between counters, the memory layout becomes as isolated as this:

Each counter now resides in its own isolated 64-byte area. Let’s see how the benchmark looks after all this:
Performance Monitoring Counter
To instrument low-level CPU events, such as cycles, stall cycles, instructions, or memory loads/stores, one can program special hardware registers on the processors.
As it turns out, tools like perf or BPF are already using this approach to expose useful metrics. Since Linux 2.6.31, perf has been the standard Linux profiler, capable of exposing Performance Monitoring Counters, or PMCs.
We can use perf_events to see what’s going on at the CPU level when running each of those benchmarks. For instance, if we run:
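A typical shape for such a run would be to count L1 data cache events around the `SimpleCounter` benchmark; the jar name and benchmark filter below are illustrative:

```sh
# Count L1 data cache loads and misses for the whole benchmark run.
perf stat -e L1-dcache-loads,L1-dcache-load-misses \
    java -jar benchmarks.jar SimpleCounter
```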
Then perf would run the benchmark against the `SimpleCounter` and expose some CPU PMCs as follows:
The `SimpleCounter` benchmark had 1,036,004,767 (~1 billion) L1 data cache misses. If we run the same profiling with the `Padded7Counter`:
The `Padded7Counter` encountered 120,239,626 cache misses, roughly 900 million fewer than the simple approach.

What is the cause of such a difference between two almost identical implementations?
Cache Line and Coherency
CPU caches work in terms of cache lines: the cache line is the unit of transfer between the CPU caches and the main memory.
Basically, processors tend to cache a few more values in addition to the requested one. This spatial locality optimization usually improves both throughput and latency of memory access.
However, when two or more threads are competing for the same cache line, multithreading may have a counterproductive effect, as they need to coordinate the cache access with each other.
Maintaining the uniformity of shared values across different caches is known as cache coherency. There are different protocols for maintaining cache coherency, such as MESI (Modified, Exclusive, Shared, and Invalid).
Let’s see how all this works in terms of the MESI coherency protocol. Core A and Core B read different values at different memory locations, but within the same memory area. First, Core A reads the value `v1`:
As shown above, this core now has exclusive access to this particular cache line. After a while, Core B reads the `v2` value. Since `v1` and `v2` reside in the same cache line, the cache line state changes to shared, as two cores are now sharing it:
From now on, if these two cores want to read any value from this cache line, they will avoid the memory access and just read the cache line. However, this efficiency can be fragile, as one of these cores may decide to modify something in this cache line.
For the sake of argument, let’s suppose Core B decides to increment the value of `v2`:
Now Core B tags its cache line as modified and communicates this state transition to Core A. Core A, in turn, re-tags its cache line as invalid. Usually, processors will buffer such modifications in their store buffers before flushing them back to the main memory. Buffering and flushing back in batches can be a huge performance boost.
Now let’s suppose Core A decides to read any value from its invalid cache line:
Since this cache line is invalid for Core A, it has to read it again from the main memory. This forces Core B to flush its modifications back to the memory. After that, both cores will cache the same line in the shared state.
Now imagine this process happening millions of times each second. That’s the reason behind the poor performance of our simple counter: constant, inefficient memory access.
False Sharing
We saw that different cores were reading from or writing to different memory locations. Despite that, because those different values happened to be in the same cache line, changing one value induced a cache miss on another core.

This phenomenon, known as false sharing, can impose a lot of cache misses on different cores. This will, in turn, degrade the overall performance of the application.
Quite interestingly, the common solution for false sharing is another memory-speed tradeoff.
Padding Revisited
To prevent false sharing, we can add enough garbage around each contended value:
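As a sketch, assuming 64-byte cache lines and illustrative field names, padding a single contended value by hand could look like this:

```java
public class PaddedLong {

    private long p1, p2, p3, p4, p5, p6, p7;   // padding before the hot field
    private volatile long value;               // the contended value
    private long q1, q2, q3, q4, q5, q6, q7;   // padding after the hot field

    public void set(long newValue) {
        value = newValue;
    }

    public long get() {
        return value;
    }
}
```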
This way, each contended value will reside on its own cache line. Consequently, we avoid unwanted contention between different cores, as each core gains exclusive access to its cache line. However, if two cores operate on the same value, they will share the same cache line, which is reasonable.
In most modern hardware architectures, the cache line size is around 64 to 128 bytes. So, adding, say, 7 other `long` values around the actual `long` will probably isolate that `long` variable in a separate cache line.
We say probably because the memory layout of objects is a complicated subject in language runtimes such as the JVM. Therefore, we should verify the layout before adding the padding.
Anyway, we can find the cache line size on macOS by asking `sysctl`.
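For instance, the following should print the line size in bytes (typically 64):

```sh
sysctl hw.cachelinesize
```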
On Linux:
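Either of the following should report the same information:

```sh
# via glibc
getconf LEVEL1_DCACHE_LINESIZE

# or directly from sysfs
cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
```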
Please note that we should only add these sorts of paddings around contended values. Otherwise, we’re trading more memory for nothing in return!
Speaking of which, let’s see if there is a better way to avoid false sharing in Java.
@Contended
The `jdk.internal.vm.annotation.Contended` annotation (`sun.misc.Contended` if you’re on Java 8) is a hint for the JVM to isolate the annotated fields to avoid false sharing. So we can rewrite our counter as follows.
Now if we inspect the memory layout of this class, we should see lots of padding around each field:
Oops! The `@Contended` annotation is in an internal package for a good reason: by default, only JDK-internal abstractions can use this annotation. Therefore, there is no padding in the above example.
However, if we insist on shooting ourselves in the foot, we can use the `-XX:-RestrictContended` tuning flag to remove this restriction.
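For instance, re-running a JOL inspection of `ContendedCounter` (using the same approach as the `LayoutInspector` sketch above) with the restriction lifted:

```sh
# Lift the restriction so @Contended takes effect outside the JDK
# (LayoutInspector is assumed to print ContendedCounter's layout here).
java -XX:-RestrictContended LayoutInspector
```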
By default, `@Contended` adds 128 bytes of padding around each annotated field:
As shown above, this value is also configurable through the `-XX:ContendedPaddingWidth` tuning flag.
Anyway, the `@Contended` version has performance characteristics similar to the version with 7 8-byte paddings:
Conclusion
In recent years, the `@Contended` annotation has been used in JDK internals to avoid false sharing. Here are a few notable examples:
- The `Thread` class, to store the random seed
- The `ForkJoinPool` executor
- The `Exchanger` concurrent utility
- The `ConcurrentHashMap`
- And the `Striped64`, to implement counters and accumulators with high throughput (sounds familiar?)
Also, here are a few more pointers on how this annotation works:
- The `-XX:+EnableContended` tuning flag enables/disables the padding. By default, it’s enabled
- Contention groups (see the sketch after this list)
- If we annotate a whole class with `@Contended`, then the JVM adds some padding before all the fields
- Padding for fields
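As a sketch of how contention groups look (field names are illustrative), fields that share a group tag are padded away from other fields but not from each other:

```java
import jdk.internal.vm.annotation.Contended;

// Fields in the same @Contended group are isolated from everything else,
// but may share a cache line with each other. Compiling this also needs
// the --add-exports flag mentioned earlier.
public class GroupedCounters {

    @Contended("writers")
    private volatile long writes;

    @Contended("writers")
    private volatile long writeErrors;

    @Contended("readers")
    private volatile long reads;
}
```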
In this article, we also briefly talked about the memory layout of objects in the JVM. For a more detailed exploration, it’s highly recommended to check out the oops section of the JVM source code. Also, Aleksey Shipilëv has a much more in-depth article in this area.
Moreover, more examples of JOL are available as part of the project source code.
As usual, all the examples are available over on GitHub.