Thread Local Randoms in Java
-- John Von Neumann
Java 7 introduced the ThreadLocalRandom
to improve the random number generation throughput in highly contended environments.
The rationale behind ThreadLocalRandom
is simple: instead of sharing one global Random
instance, each thread maintains its own version of Random
. This, in turn, reduces the contention and hence the better throughput.
Since this is such a simple idea, we should be able to roll our sleeves up and implement something like ThreadLocalRandom
with a comparable performance, right?
Let’s see.
First Attempt
In our first attempt, we’re going to use something as simple as ThreadLocal<Random>
:
In this benchmark, we’re comparing the Random
, our own ThreadLocal<Random>
and the builtin ThreadLocalRandom
:
The ThreadLocalRandom
generated around 1 billion random numbers per second, dwarfing the ThreadLocal<Random>
and plain Random
approaches by a whopping margin.
The Linear Congruential Method
By far the most popular random number generator in use today is known as Linear Congruential Pseudorandom Number Generator, introduced by D. H. Lehmer in 1949.
This algorithm starts with an initial seed value, X0 . Then generates each random variable from the previous one as following:
In the real world, it’s more practical to use just one variable to implement this. So, we should update the seed
value every time we generate a number.
Quite interestingly, the next(int)
method in the Random
class is responsible for this:
Since multiple threads can potentially update the seed
value concurrently, we need some sort of synchronization to coordinate the concurrent access. Here Java is using a Lock Free approach with the help of atomics.
Basically, each thread tries to change the seed value to a new one atomically via compareAndSet
. If a thread fails to do so, it will retry the same process until it could successfully commit the update.
When the contention is high, the number of CAS failures increases. That’s the main reason behind the poor performance of Random
in concurrent environments.
No More CAS
In implementations based on ThreadLocal
, the seed values are confined to each thread. Therefore, we don’t need atomics or any other form of synchronization for that matter as there is no shared mutable state:
If we run the same benchmark again:
MyThreadLocalRandom
has almost twice the throughput of our simple ThreadLocal<Random>
.
The compareAndSet
provides atomicity and memory ordering guarantees that we simply don’t need in a thread confined context. Since those guarantees are costly and unnecessary, removing them increases the throughput substantially.
However, we’re still behind the builtin ThreadLocalRandom
!
Removing Indirection
As it turns out, each thread doesn’t need a distinct and complete copy of Random
for itself. It just needs the latest seed
value, that’s it.
As of Java 8, these values have been added to the Thread
class itself:
Instead of using something like MyThreadLocalRandom
, each thread instance maintains its current seed
value inside the threadLocalRandomSeed
variable.
As a result, the ThreadLocalRandom
class becomes a singleton:
Every time we call the ThreadLocalRandom.current()
, it lazily initializes the threadLocalRandomSeed
and then returns that singelton:
Hashtable Lookup
With MyThreadLocalRandom
, every call to the current()
factory method, translates to a hash value computation for the ThreadLocal
instance and a lookup in the underlying hashtable.
On the contrary, with this new Java 8+ approach, all we have to do is to read the threadLocalRandomSeed
value directly and update it afterward.
Efficient Memory Access
In order to update the seed value, java.util.concurrent.ThreadLocalRandom
needs to change the threadLocalRandomSeed
state in the java.lang.Thread
class. If we make the state public
, then everybody can potentially update the threadLocalRandomSeed
, which is not that good.
We can use reflection to update the non-public state, but just because we can, does not mean that we should!
As it turns out, the ThreadLocalRandom
uses the Unsafe.putLong
native method to update the threadLocalRandomSeed
state efficiently:
Basically the putLong
method writes the r
value to some memory address relative to the current thread. The memory offset already calculated by calling another native method, i.e. Unsafe.objectFieldOffset
.
As opposed to reflection, all these methods have native implementations and are very efficient.
False Sharing
CPU caches are working in terms of Cache Lines. That is, the cache line is the unit of transfer between CPU caches and the main memory.
Basically, processors tend to cache a few other values along with the requested one. This spatial locality optimization usually improves both throughput and latency of memory access.
However, when two or more threads are competing for the same cache line, multithreading may have a counterproductive effect.
To better understand this, let’s suppose that the following variables are residing in the same cache line:
A few threads are using the tid
or thread id for some unknown purpose. Now, if we update the, say, threadLocalRandomSeed
in our thread to generate a random number, nothing bad should happen, right? It does not sound that much of a deal as some threads are reading the tid
and another one writes to a whole another memory location.
Despite what we might think, since all those values are in the same cache line, the reading threads will encounter a cache miss. And the writer needs to flush its store buffer. This phenomenon, known as false sharing, incurs a performance hit to our multithreaded application.
Padding
To avoid the false sharing issue we can add some padding around the contended values. This way each of those highly contended values will reside on their own cache line:
Cache line size is usually 64 or 128 bytes in most modern processors. In my machine, it’s 64 bytes, so I’ve added 7 more dummy long
values right after the tid
declaration.
Usually, those threadLocal*
variables will be updated in the same thread. So we’re better off just isolating the tid
:
Reader threads won’t encounter a cache miss and the writer won’t need to flush it store buffer right away, as those local variables aren’t volatile.
Contended Annotation
The jdk.internal.vm.annotation.Contended
annotation (sun.misc.Contended
if you’re on Java 8) is a hint for the JVM to isolate the annotated fields to avoid false sharing. So, now the following should make more sense:
With the help of the ContendedPaddingWidth
tuning flag, we can control the padding width.
Conclusion
Before wrapping up, the threadLocalRandomSecondarySeed
is a seed used internally by the likes of ForkJoinPool
or ConcurrentSkipListMap
. Also, the threadLocalRandomProbe
represents whether the current thread has initialized its seed or not.
In this article, we explored different tricks to optimize an RNG to be a high throughput and low latency one. Tricks like more efficient object allocation, more efficient memory access, removing unnecessary indirection, and having mechanical sympathy.
As usual, the sample codes are available at Github.