Sometimes the most innocent looking pieces of codes can hurt the overall latency or throughput of the system. In this article, we’re going to see how false sharing can turn multithreading against us.
Let’s start with a simple contract for counting:
Each implementation encapsulates a collection of counters. The
inc method is responsible for incrementing the counter at the given
Let’s start with 8 distinct counters:
We’re going to increment each of these values atomically. Also, to avoid the extra memory footprint associated with the
AtomicLongs, we will use
inc(int index) method increments the counter at the given
The following benchmark would run the
inc method with 8 threads under heavy load. Each thread is going to call the method with a distinct
index. Therefore, there won’t be any contention as the thread
i operates on the ith counter:
Running this benchmark would result in something like:
On average each
inc call takes about 170 nano seconds to complete.
Let’s see if we can improve this situation.
Know Your Object Layout
If we run the following:
Then JOL will print the object layout for the
We can visualize this layout as something like:
It follows the typical OOP or Ordinay Object Pointer structure in JVM: an object header immediately followed by zero or more references to instance fields. The header itself consists of one mark word, one klass word, and a 32-bit array length (only for arrays) which is 12 bytes in total. Also, JVM may add some paddings for alignment purposes.
Immediately after object header and padding, there are 8 references to those 8
long instance fields.
We can also verify this layout using the lovely
Padding: Random Encounter
Let’s see how adding some paddings between counters affect the latency. First, isolating the counters from the object header by adding 6
long values before
long values plus 12 bytes of object header plus 4 bytes of padding added by JVM is equal to 64 bytes (More on this 64 magic number later).
Then we’re going to separate each counter by one
Here’s how JVM lays out the
Padded1Counter in heap:
If we benchmark
Padded1Counter against the
Quite surprisingly, the
Padded1Counter outperforms the
SimpleCounter in terms of latency.
Let’s add another 8 bytes between counters:
The memory layout turns out to be like this:
And the benchmark result is again interesting:
If we continue to add more paddings until we reach 7 bytes of padding, the memory layout would as isolated as:
Each counter resides on its own 64 bytes isolated area. Let’s see how’s the benchmark looks like after all this:
Performance Monitoring Counter
To instrument low-level CPU events, such as cycles, stall cycles, instructions, or memory loads/stores, one can program special hardware registers on the processors.
As it turns out, tools like perf or BPF are already using this approach to expose useful metrics. As of Linux 2.6.31, perf is the standard Linux profiler capable of exposing useful Performance Monitoring Counters or PMCs.
We can use perf_events to see what’s going on at the CPU level when running each of those benchmarks. For instance, if we run:
perf would run the benchmark against the
SimpleCounter and then expose some CPU PMCs as follows:
SimpleCounter benchmark had 1,036,004,767 (~ 1 billion) L1 data cache misses. If we run the same profiling with the
Padded7Counter encountered 120,239,626 cache misses, almost 900 million fewer cache misses compared to the simple approach.
What is the cause of such a difference between almost two identical implementations?
Cache Line and Coherency
CPU caches are working in terms of Cache Lines. That is, the cache line is the unit of transfer between CPU caches and the main memory.
Basically, processors tend to cache a few more values in addition to the requested one. This spatial locality optimization usually improves both throughput and latency of memory access.
However, when two or more threads are competing for the same cache line, multithreading may have a counterproductive effect, as they need to coordinate the cache access with each other.
Maintaining the uniformity of shared values in different caches is knowon as cache coherency. There are different protocols for maintaining the cache coherency such as MESI or Modified, Exclusive, Shared, and Invalid.
Let’s see how all this works in terms of the MESI coherency protocol. Core A and B are reading different values in different memory locations but in the same memory area. First, the Core A reads the value
As shown above, this core now has
exclusive access to this particular cache line. After a while, Core B reads the
v2 value. The
v2 are residing in the same cache line. Therefore, the cache line state changes to
shared, as two cores are sharing it:
From now on, if these two cores want to read any value from this cache line, they will avoid the memory access and just read the cache line. However, this efficiency can be fragile, as one of these cores may decide to modify something in this cache line.
For the sake of argument, let’s suppose Core B decides to increment the value of
Now Core B tags its cache line as
modified and communicates this state transition with the Core A. Core A, in turn, will re-tags its cache line as
invalid. Usually, the processors will buffer such modifications in their store buffers before flushing it back to the main memory. Buffering and flushing back in batches can be a huge performance boost.
Now let’s suppose Core A decides to read any value from its invalid cache line:
Since this cache line is invalid for Core A, it should read it again from the memory. This will force Core B to flush its modifications back to the memory. After that, both cores will cache the same line in the
Now imagine this process to happen millions of times each second. That’s the reason behind the poor performance of our simple counter: constant in-efficient memory access.
We saw different cores were reading from or writing to different memory locations. Despite that, because those different values happened to be in the same cache line, changing one value induced a cache miss to another core.
This phenomenon, known as false sharing, can impose a lot of cache misses to different cores. This will, in turn, degrade the overall performance of the application.
Quite interestingly, the common solution for false sharing is another memory-speed tradeoff.
To prevent false sharing, we can add enough garbage around each contended value:
This way each contended value will reside on its own cache line. Consequently, we will avoid unwanted contention between different cores, as each core gains exclusive access to its cache line. However, if two core operates on the same value, they will share the same cache line, which is reasonable.
In most modern hardware architectures, the cache line size is around 64 to 128 bytes. So, adding, say, 7 other
long values around the actual
long will probably isolate that
long variable in a separate cache line.
We say probably because memory layout of objects is a complicated subject in language runtimes such as JVM. Therefore, we should be sure about the layout before adding the padding.
Anyway, we can find the cache line size in macOS by asking from
Please note that, we should only add these sorts of paddings around contended values. Otherwise, we’re trading some more memory for nothing special in return!
Speaking of which, let’s see if there is a better way to avoid false sharing in Java.
jdk.internal.vm.annotation.Contended annotation (
sun.misc.Contended if you’re on Java 8) is a hint for the JVM to isolate the annotated fields to avoid false sharing. So we can rewrite our counters as the following:
Now if we inspect the memory layout of this class, we should see lots of padding around each field:
@Contended annotation is in an
internal package for a good reason. By default, only JDK internal abstractions can use this annotation. Therefore, there is no padding in the above example.
However, if we’re so persistent in shooting ourselves in the foot, we can use the
-XX:-RestrictContended tuning flag to remove this restriction:
@Contended adds 128 bytes padding around each annotated field:
As shown above, this value is also configurable through
-XX:ContendedPaddingWidth tuning flag.
@Contended version has similar performace characteristics to the one with 7 8-bytes padding:
In recent years, the
@Contended annotation has been used in JDK internal stuff to avoid false sharing. Here are a few notable examples:
Threadclass to store the random seed
- And the
Striped64to implement counters and accumulators with high throughput (sounds familiar?)
Also, here are a few more pointers on how this annotation works:
-XX:+EnableContendedtuning flag to enable/disable the padding. By default, it’s enabled
- Cotention groups
- If we annotate a whole class with
@Contended, then JVM will adds some padding before all the fields
- Padding for fields
Also, in the article, we briefly talked about the memory layout of objects in JVM. For a more detailed exploration, it’s highly recommended to check out the oops section of the JVM source code. Also, Aleksey Shipilëv has a much more in-depth article in this area.
Moreover, more examples of JOL are available as part of the project source code.
As usual, all the examples are available over on GitHub.