Java 7 introduced the
ThreadLocalRandom to improve the random number generation throughput in highly contended environments.
The rationale behind
ThreadLocalRandom is simple: instead of sharing one global
Random instance, each thread maintains its own version of
Random. This, in turn, reduces contention and hence improves throughput.
Since this is such a simple idea, we should be able to roll up our sleeves and implement something like
ThreadLocalRandom with comparable performance, right?
In our first attempt, we're going to use something as simple as a ThreadLocal<Random>:
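A minimal sketch of that first attempt might look like the following (the class and method names here are mine, not taken from the original benchmark):

```java
import java.util.Random;

public class ThreadLocalRandomFirstAttempt {

    // one full Random instance per thread, created lazily on first access
    private static final ThreadLocal<Random> RANDOM =
            ThreadLocal.withInitial(Random::new);

    public static int nextInt() {
        // look up this thread's own Random and delegate to it
        return RANDOM.get().nextInt();
    }
}
```

Each thread gets its own Random on first access, so there is no cross-thread contention on a single shared instance.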
In this benchmark, we're comparing the throughput of a plain
Random, our own
ThreadLocal<Random>, and the builtin
ThreadLocalRandom. The builtin
ThreadLocalRandom generated around 1 billion random numbers per second, dwarfing the
ThreadLocal<Random> and plain
Random approaches by a whopping margin.
The Linear Congruential Method
By far the most popular random number generator in use today is the linear congruential pseudorandom number generator, introduced by D. H. Lehmer in 1949.
This algorithm starts with an initial seed value, X₀. Then it generates each random number from the previous one as follows:

Xₙ₊₁ = (a · Xₙ + c) mod m

where a is the multiplier, c is the increment, and m is the modulus.
In the real world, it’s more practical to use just one variable to implement this. So, we should update the
seed value every time we generate a number.
Quite interestingly, the
next(int) method in the
Random class is responsible for this:
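In OpenJDK, the implementation looks roughly like this (slightly simplified):

```java
// java.util.Random (OpenJDK), slightly simplified
protected int next(int bits) {
    long oldseed, nextseed;
    AtomicLong seed = this.seed;
    do {
        oldseed = seed.get();
        nextseed = (oldseed * multiplier + addend) & mask;
    } while (!seed.compareAndSet(oldseed, nextseed));
    return (int) (nextseed >>> (48 - bits));
}
```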
Since multiple threads can potentially update the
seed value concurrently, we need some sort of synchronization to coordinate the concurrent access. Here, Java uses a lock-free approach with the help of atomics.
Basically, each thread tries to change the seed value to a new one atomically via
compareAndSet. If a thread fails to do so, it retries the same process until it successfully commits the update.
When the contention is high, the number of CAS failures increases. That’s the main reason behind the poor performance of
Random in concurrent environments.
No More CAS
In implementations based on
ThreadLocal, the seed values are confined to each thread. Therefore, we don’t need atomics or any other form of synchronization for that matter as there is no shared mutable state:
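Here's a sketch of that idea (the article's exact code isn't shown here, so the names and the seeding strategy are illustrative). It reuses the LCG constants of java.util.Random but updates the seed with a plain read-modify-write, since no other thread can touch it:

```java
import java.util.Random;

// A hypothetical thread-confined generator: each thread owns its seed,
// so the CAS loop from Random.next(int) is replaced by a plain update.
public class MyThreadLocalRandom extends Random {

    // same LCG constants as java.util.Random
    private static final long MULTIPLIER = 0x5DEECE66DL;
    private static final long ADDEND = 0xBL;
    private static final long MASK = (1L << 48) - 1;

    private long seed = System.nanoTime();

    private static final ThreadLocal<MyThreadLocalRandom> INSTANCE =
            ThreadLocal.withInitial(MyThreadLocalRandom::new);

    public static MyThreadLocalRandom current() {
        return INSTANCE.get();
    }

    @Override
    protected int next(int bits) {
        // no compareAndSet: the seed is confined to one thread
        seed = (seed * MULTIPLIER + ADDEND) & MASK;
        return (int) (seed >>> (48 - bits));
    }
}
```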
If we run the same benchmark again:
MyThreadLocalRandom has almost twice the throughput of our simple
ThreadLocal<Random> approach. The
compareAndSet operation provides atomicity and memory-ordering guarantees that we simply don't need in a thread-confined context. Since those guarantees are costly and unnecessary, removing them increases the throughput substantially.
However, we’re still behind the builtin ThreadLocalRandom.
As it turns out, each thread doesn’t need a distinct and complete copy of
Random for itself. It just needs the latest
seed value, that’s it.
As of Java 8, these values have been added to the
Thread class itself:
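For example, in OpenJDK's java.lang.Thread we can find fields along these lines (an annotation on these fields is omitted here; it comes up again in the false sharing discussion below):

```java
public class Thread implements Runnable {
    // ...

    /** The current seed for a ThreadLocalRandom */
    long threadLocalRandomSeed;

    /** Probe hash value; nonzero if threadLocalRandomSeed initialized */
    int threadLocalRandomProbe;

    /** Secondary seed isolated from public ThreadLocalRandom sequence */
    int threadLocalRandomSecondarySeed;

    // ...
}
```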
Instead of using something like
MyThreadLocalRandom, each thread instance maintains its current
seed value inside the Thread class itself.
As a result, the
ThreadLocalRandom class becomes a singleton:
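In OpenJDK, it's declared roughly as follows:

```java
// inside java.util.concurrent.ThreadLocalRandom (OpenJDK)
/** The common ThreadLocalRandom */
static final ThreadLocalRandom instance = new ThreadLocalRandom();
```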
Every time we call
ThreadLocalRandom.current(), it lazily initializes the
threadLocalRandomSeed and then returns that singleton:
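In OpenJDK, the factory method looks roughly like this (U is the Unsafe instance, and PROBE is the memory offset of the threadLocalRandomProbe field):

```java
public static ThreadLocalRandom current() {
    // a zero probe means this thread hasn't initialized its seed yet
    if (U.getInt(Thread.currentThread(), PROBE) == 0)
        localInit();
    return instance;
}
```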
With MyThreadLocalRandom, every call to the
current() factory method translates to a hash value computation for the
ThreadLocal instance and a lookup in the underlying hashtable.
By contrast, with this new Java 8+ approach, all we have to do is read the
threadLocalRandomSeed value directly and then update it.
Efficient Memory Access
In order to update the seed value,
java.util.concurrent.ThreadLocalRandom needs to change the
threadLocalRandomSeed state in the
java.lang.Thread class. If we make the state
public, then any code could potentially update the
threadLocalRandomSeed, which is far from ideal.
We can use reflection to update the non-public state, but just because we can, does not mean that we should!
As it turns out, the
ThreadLocalRandom uses the
Unsafe.putLong native method to update the
threadLocalRandomSeed state efficiently:
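In OpenJDK, the seed update looks roughly like this (SEED is the memory offset of the threadLocalRandomSeed field, and GAMMA is a fixed increment):

```java
final long nextSeed() {
    Thread t; long r;
    // read the per-thread seed, advance it, and write it back in place
    U.putLong(t = Thread.currentThread(), SEED,
              r = U.getLong(t, SEED) + GAMMA);
    return r;
}
```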
The putLong method writes the
r value to a memory address relative to the current thread object. That memory offset was already computed by calling another native method, i.e.
Unsafe.objectFieldOffset.
As opposed to reflection, all these methods have native implementations and are very efficient.
CPU caches work in terms of cache lines: the cache line is the unit of transfer between CPU caches and the main memory.
Basically, processors tend to cache a few other values along with the requested one. This spatial locality optimization usually improves both throughput and latency of memory access.
However, when two or more threads are competing for the same cache line, multithreading may have a counterproductive effect.
To better understand this, let’s suppose that the following variables are residing in the same cache line:
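For instance, imagine a layout along these lines in java.lang.Thread (simplified; the actual in-memory field order is ultimately up to the JVM):

```java
public class Thread implements Runnable {
    // ...
    private long tid;                      // thread id

    // without padding, these can end up on the same cache line as tid
    long threadLocalRandomSeed;
    int threadLocalRandomProbe;
    int threadLocalRandomSecondarySeed;
    // ...
}
```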
A few threads are using the
tid, or thread id, for some purpose. Now, if we update, say, the
threadLocalRandomSeed in our thread to generate a random number, nothing bad should happen, right? It doesn't sound like a big deal, as some threads are reading the
tid while another one writes to an entirely different memory location.
Despite what we might think, since all those values are in the same cache line, the reading threads will encounter a cache miss, and the writer needs to flush its store buffer. This phenomenon, known as false sharing, incurs a performance hit on our multithreaded application.
To avoid the false sharing issue, we can add some padding around the contended values. This way, each of those highly contended values will reside on its own cache line:
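A hand-rolled version of that padding might look like this (the p1..p7 names are illustrative):

```java
public class Thread implements Runnable {
    // ...
    private long tid;

    // 7 dummy longs: together with tid, they fill a whole 64-byte
    // cache line, pushing the contended seeds onto a different line
    private long p1, p2, p3, p4, p5, p6, p7;

    long threadLocalRandomSeed;
    int threadLocalRandomProbe;
    int threadLocalRandomSecondarySeed;
    // ...
}
```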
Cache line size is usually 64 or 128 bytes in most modern processors. On my machine, it's 64 bytes, so I've added 7 more dummy
long values right after the
tid definition. Usually, the
threadLocal* variables are updated in the same thread, so we're better off just isolating the
tid. This way, reader threads won't encounter a cache miss, and the writer won't need to flush its store buffer right away, as those thread-local variables aren't volatile.
The
jdk.internal.vm.annotation.Contended annotation (
sun.misc.Contended if you're on Java 8) is a hint for the JVM to isolate the annotated fields to avoid false sharing. So, now the following should make more sense:
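This is, roughly, how those fields are declared in OpenJDK's Thread class:

```java
public class Thread implements Runnable {
    // ...

    /** The current seed for a ThreadLocalRandom */
    @jdk.internal.vm.annotation.Contended("tlr")
    long threadLocalRandomSeed;

    /** Probe hash value; nonzero if threadLocalRandomSeed initialized */
    @jdk.internal.vm.annotation.Contended("tlr")
    int threadLocalRandomProbe;

    /** Secondary seed isolated from public ThreadLocalRandom sequence */
    @jdk.internal.vm.annotation.Contended("tlr")
    int threadLocalRandomSecondarySeed;

    // ...
}
```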
With the help of the
-XX:ContendedPaddingWidth tuning flag, we can control the padding width.
Before wrapping up: the
threadLocalRandomSecondarySeed is a seed used internally by the likes of
ConcurrentSkipListMap. Also, the
threadLocalRandomProbe indicates whether the current thread has initialized its seed.
In this article, we explored different tricks for optimizing an RNG for high throughput and low latency: more efficient object allocation, more efficient memory access, removing unnecessary indirection, and having mechanical sympathy.
As usual, the sample code is available on GitHub.