Java 7 introduced ThreadLocalRandom to improve random number generation throughput in highly contended environments.

The rationale behind ThreadLocalRandom is simple: instead of sharing one global Random instance, each thread maintains its own copy of Random. This, in turn, reduces contention and hence improves throughput.

Since this is such a simple idea, we should be able to roll up our sleeves and implement something like ThreadLocalRandom with comparable performance, right?

Let’s see.

## First Attempt

In our first attempt, we’re going to use something as simple as ThreadLocal<Random>:
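A minimal sketch of that first attempt might look like this (the class and method names are my own, chosen only for illustration):

```java
import java.util.Random;

public class ThreadLocalRandomFirstAttempt {

    // Each thread lazily gets, and then reuses, its own Random instance.
    private static final ThreadLocal<Random> random =
            ThreadLocal.withInitial(Random::new);

    public static Random current() {
        return random.get();
    }

    public static void main(String[] args) {
        int value = current().nextInt(100);
        System.out.println(value >= 0 && value < 100); // true
    }
}
```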

In this benchmark, we're comparing plain Random, our own ThreadLocal<Random>, and the built-in ThreadLocalRandom:
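A rigorous comparison would use a harness like JMH; as a rough, unscientific stand-in, the following self-contained sketch spins up one worker per core and counts how many numbers each approach generates in a fixed interval:

```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.IntSupplier;

public class RandomThroughputSketch {

    private static final Random shared = new Random();
    private static final ThreadLocal<Random> perThread =
            ThreadLocal.withInitial(Random::new);

    // Runs `threads` workers for ~200 ms, counting generated numbers.
    static long measure(IntSupplier generator, int threads) throws InterruptedException {
        LongAdder count = new LongAdder();
        long deadline = System.nanoTime() + 200_000_000L;
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                while (System.nanoTime() < deadline) {
                    generator.getAsInt();
                    count.increment();
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();
        return count.sum();
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println("Random:            " + measure(shared::nextInt, threads));
        System.out.println("ThreadLocal:       " + measure(() -> perThread.get().nextInt(), threads));
        System.out.println("ThreadLocalRandom: " + measure(() -> ThreadLocalRandom.current().nextInt(), threads));
    }
}
```

A single short run like this is susceptible to JIT warm-up and scheduler noise, which a proper benchmark harness irons out, so treat the absolute numbers with suspicion.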

The ThreadLocalRandom generated around 1 billion random numbers per second, dwarfing the ThreadLocal<Random> and plain Random approaches by a whopping margin.

## The Linear Congruential Method

By far the most popular random number generator in use today is known as the Linear Congruential Pseudorandom Number Generator, introduced by D. H. Lehmer in 1949.

This algorithm starts with an initial seed value, X(0). Then it derives each random number from the previous one as follows:

X(n+1) = (a * X(n) + c) mod m

Here, a is the multiplier, c is the increment, and m is the modulus. For instance, java.util.Random uses a 48-bit seed, so m = 2^48.

In the real world, it’s more practical to use just one variable to implement this. So, we should update the seed value every time we generate a number. Quite interestingly, the next(int) method in the Random class is responsible for this:
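The java.util.Random documentation actually specifies this LCG exactly (48-bit seed, multiplier 0x5DEECE66D, increment 0xB), so we can reconstruct the essence of next(int), including its CAS retry loop, with public APIs and check it against the real class:

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

public class MyRandom {

    private static final long MULTIPLIER = 0x5DEECE66DL;
    private static final long ADDEND = 0xBL;
    private static final long MASK = (1L << 48) - 1;

    private final AtomicLong seed;

    public MyRandom(long initialSeed) {
        // java.util.Random scrambles the user-supplied seed the same way
        this.seed = new AtomicLong((initialSeed ^ MULTIPLIER) & MASK);
    }

    // Mirrors the shape of java.util.Random.next(int): retry the CAS until our update wins
    protected int next(int bits) {
        long oldSeed, nextSeed;
        do {
            oldSeed = seed.get();
            nextSeed = (oldSeed * MULTIPLIER + ADDEND) & MASK;
        } while (!seed.compareAndSet(oldSeed, nextSeed));
        return (int) (nextSeed >>> (48 - bits));
    }

    public int nextInt() {
        return next(32);
    }

    public static void main(String[] args) {
        MyRandom mine = new MyRandom(42);
        Random theirs = new Random(42);
        boolean same = true;
        for (int i = 0; i < 5; i++) {
            same &= mine.nextInt() == theirs.nextInt();
        }
        System.out.println(same); // true
    }
}
```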

Since multiple threads can potentially update the seed value concurrently, we need some sort of synchronization to coordinate that access. Here, Java uses a lock-free approach with the help of atomics.

Basically, each thread tries to swap in the new seed value atomically via compareAndSet. If a thread fails to do so, it retries until it successfully commits the update.

When the contention is high, the number of CAS failures increases. That’s the main reason behind the poor performance of Random in concurrent environments.

## No More CAS

In implementations based on ThreadLocal, the seed values are confined to each thread. Therefore, we don't need atomics, or any other form of synchronization for that matter, as there is no shared mutable state:
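A sketch of such a MyThreadLocalRandom: each thread holds its own plain long seed, so the CAS loop disappears (the seeding strategy and field layout here are illustrative; the LCG constants are the ones specified for java.util.Random):

```java
public class MyThreadLocalRandom {

    private static final long MULTIPLIER = 0x5DEECE66DL;
    private static final long ADDEND = 0xBL;
    private static final long MASK = (1L << 48) - 1;

    // One generator, and hence one seed, per thread: no shared mutable state.
    private static final ThreadLocal<MyThreadLocalRandom> instances =
            ThreadLocal.withInitial(() -> new MyThreadLocalRandom(System.nanoTime()));

    private long seed; // a plain long: no AtomicLong, no CAS

    private MyThreadLocalRandom(long initialSeed) {
        this.seed = (initialSeed ^ MULTIPLIER) & MASK;
    }

    public static MyThreadLocalRandom current() {
        return instances.get();
    }

    private int next(int bits) {
        seed = (seed * MULTIPLIER + ADDEND) & MASK; // a simple write, no retry loop
        return (int) (seed >>> (48 - bits));
    }

    public int nextInt() {
        return next(32);
    }
}
```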

If we run the same benchmark again:

MyThreadLocalRandom has almost twice the throughput of our simple ThreadLocal<Random>.

The compareAndSet provides atomicity and memory-ordering guarantees that we simply don't need in a thread-confined context. Since those guarantees are costly and unnecessary here, removing them increases the throughput substantially.

However, we’re still behind the builtin ThreadLocalRandom!

## Removing Indirection

As it turns out, each thread doesn’t need a distinct and complete copy of Random for itself. It just needs the latest seed value, that’s it.

As of Java 8, these values have been added to the Thread class itself:
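We can verify this with a quick reflective look at java.lang.Thread; the field names below are the actual ones in the JDK sources:

```java
import java.lang.reflect.Field;

public class ThreadSeedFields {
    public static void main(String[] args) throws Exception {
        // These package-private fields live directly on java.lang.Thread
        for (String name : new String[] {
                "threadLocalRandomSeed",
                "threadLocalRandomProbe",
                "threadLocalRandomSecondarySeed"}) {
            Field field = Thread.class.getDeclaredField(name);
            System.out.println(name + ": " + field.getType());
        }
    }
}
```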

Instead of using something like MyThreadLocalRandom, each thread instance maintains its current seed value inside the threadLocalRandomSeed variable. As a result, the ThreadLocalRandom class becomes a singleton:

Every time we call ThreadLocalRandom.current(), it lazily initializes threadLocalRandomSeed and then returns that singleton:
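We can observe the singleton behavior directly: two different threads get the very same ThreadLocalRandom instance, and only the seed stored in each Thread differs.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicReference;

public class SingletonCheck {
    public static void main(String[] args) throws Exception {
        ThreadLocalRandom here = ThreadLocalRandom.current();

        AtomicReference<ThreadLocalRandom> there = new AtomicReference<>();
        Thread t = new Thread(() -> there.set(ThreadLocalRandom.current()));
        t.start();
        t.join();

        // Same singleton instance in both threads
        System.out.println(here == there.get()); // true
    }
}
```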

### Hashtable Lookup

With MyThreadLocalRandom, every call to the current() factory method translates to a hash value computation for the ThreadLocal instance and a lookup in the underlying hashtable.

With the new Java 8+ approach, by contrast, all we have to do is read the threadLocalRandomSeed value directly and update it afterward.

## Efficient Memory Access

In order to update the seed value, java.util.concurrent.ThreadLocalRandom needs to change the threadLocalRandomSeed state in the java.lang.Thread class. If we made the field public, then any code could update threadLocalRandomSeed, which is not good at all.

We can use reflection to update the non-public state, but just because we can, does not mean that we should!

As it turns out, the ThreadLocalRandom uses the Unsafe.putLong native method to update the threadLocalRandomSeed state efficiently:
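The JDK-internal Unsafe isn't compilable from user code, but its public (and long-deprecated) ancestor sun.misc.Unsafe exposes the same objectFieldOffset/putLong pair. Here's a toy demonstration against a field of our own class; the reflective theUnsafe hack is a well-known but unsupported trick, and recent JDKs are phasing these methods out:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeSeedDemo {

    static class Holder {
        long seed;
    }

    // Writes 42 into Holder.seed via Unsafe and reads it back normally.
    static long writeAndRead() throws Exception {
        // The classic (unsupported) back door to the Unsafe singleton
        Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
        theUnsafe.setAccessible(true);
        Unsafe unsafe = (Unsafe) theUnsafe.get(null);

        // Compute the field's offset once, then write relative to the object
        long seedOffset = unsafe.objectFieldOffset(Holder.class.getDeclaredField("seed"));
        Holder holder = new Holder();
        unsafe.putLong(holder, seedOffset, 42L);
        return holder.seed;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(writeAndRead()); // 42
    }
}
```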

Basically, the putLong method writes the r value to a memory address relative to the current thread. The memory offset was computed earlier by calling another native method, Unsafe.objectFieldOffset.

As opposed to reflection, all these methods have native implementations and are very efficient.

## False Sharing

CPU caches work in terms of cache lines: the cache line is the unit of transfer between CPU caches and the main memory.

Basically, processors tend to cache a few other values along with the requested one. This spatial locality optimization usually improves both throughput and latency of memory access.

However, when two or more threads are competing for the same cache line, multithreading may have a counterproductive effect.

To better understand this, let’s suppose that the following variables are residing in the same cache line:
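For illustration, picture a simplified Thread-like class (the field names mirror the real ones, but the adjacent layout is only an assumption; the JVM is free to reorder fields):

```java
public class TheThread {
    long tid;                            // read frequently by other threads
    long threadLocalRandomSeed;          // written on every random number
    int threadLocalRandomProbe;
    int threadLocalRandomSecondarySeed;
}
```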

A few threads are using the tid, or thread id, for some purpose. Now, if we update, say, the threadLocalRandomSeed in our thread to generate a random number, nothing bad should happen, right? It doesn't sound like much of a deal, as some threads are reading the tid while another one writes to a completely different memory location.

Despite what we might think, since all those values are in the same cache line, the reading threads will encounter a cache miss, and the writer needs to flush its store buffer. This phenomenon, known as false sharing, incurs a performance hit on our multithreaded application.

To avoid the false sharing issue we can add some padding around the contended values. This way each of those highly contended values will reside on their own cache line:
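Here's one way to sketch that padding, assuming 64-byte cache lines and 8-byte longs (the p*/q* dummies are mine, and, again, the JVM is technically free to rearrange fields):

```java
public class PaddedThread {
    long tid;
    // 7 dummy longs (56 bytes) push the next field onto another 64-byte line
    long p01, p02, p03, p04, p05, p06, p07;
    long threadLocalRandomSeed;
    long q01, q02, q03, q04, q05, q06, q07;
    int threadLocalRandomProbe;
    int threadLocalRandomSecondarySeed;
}
```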

The cache line size is usually 64 or 128 bytes in most modern processors. On my machine, it's 64 bytes, so I've added 7 more dummy long values right after the tid declaration.

Usually, those threadLocal* variables are only updated by their owning thread. So we're better off just isolating the tid:
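A sketch of that slimmer layout: seven dummy longs after tid keep it away from the writer-local fields below (again assuming 64-byte lines; dummy names are mine):

```java
public class TidIsolatedThread {
    long tid;
    // padding: keeps the writer-local fields below off tid's cache line
    long p01, p02, p03, p04, p05, p06, p07;
    // all three are read and written only by the owning thread
    long threadLocalRandomSeed;
    int threadLocalRandomProbe;
    int threadLocalRandomSecondarySeed;
}
```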

Reader threads won't encounter a cache miss, and the writer won't need to flush its store buffer right away, as those thread-local variables aren't volatile.

### Contended Annotation

The jdk.internal.vm.annotation.Contended annotation (sun.misc.Contended if you’re on Java 8) is a hint for the JVM to isolate the annotated fields to avoid false sharing. So, now the following should make more sense:
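As a sanity check, we can reflectively list the annotations on the real field; on JDKs where the Contended annotation is runtime-retained, it should show up (the exact result may vary across versions, since these are unspecified internals, so treat this as exploratory):

```java
import java.lang.annotation.Annotation;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class ContendedCheck {

    // Returns the simple names of the annotations on Thread.threadLocalRandomSeed
    static List<String> seedAnnotations() throws Exception {
        Field seed = Thread.class.getDeclaredField("threadLocalRandomSeed");
        List<String> names = new ArrayList<>();
        for (Annotation a : seed.getDeclaredAnnotations()) {
            names.add(a.annotationType().getSimpleName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(seedAnnotations());
    }
}
```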

With the help of the -XX:ContendedPaddingWidth tuning flag, we can control the padding width.

## Conclusion

Before wrapping up, it's worth mentioning that threadLocalRandomSecondarySeed is a seed used internally by the likes of ForkJoinPool and ConcurrentSkipListMap. Also, threadLocalRandomProbe indicates whether the current thread has initialized its seed.

In this article, we explored different tricks for optimizing an RNG into a high-throughput, low-latency one: more efficient object allocation, more efficient memory access, removing unnecessary indirection, and a bit of mechanical sympathy.

As usual, the sample code is available on GitHub.