On Generating Identity Hash Codes
This simple piece of code probably prints one of the most misunderstood outputs in Java:
And that output would be:
What does the part after @ represent? Is it the memory address of the object? Is it the hashcode? Both? Or neither?
Over the course of this article, we're going to see how the HotSpot JVM generates this value and what it represents. So, let's get started!
Generation Strategies
As of this writing, HotSpot JVM has a few different strategies to generate hashcodes:
As shown above, the hashcode generation strategy is determined by a mysterious yet well-named hashCode variable. Let's see where this variable comes from:
So this hashCode variable is actually an experimental tuning flag with a default value of 5:
Put simply, we can use the -XX:+UnlockExperimentalVMOptions -XX:hashCode=<i>
combination of tunables to change the strategy.
Now let’s see how each strategy works.
Park–Miller/Lehmer RNG
The first hashcode generation approach uses one of the most common random number generation strategies: a kind of linear congruential generator (LCG) known as the Lehmer RNG, or the Park–Miller RNG:
This algorithm starts with an initial seed value, $X_0$. It then generates each random value from the previous one as follows:

$$X_{k+1} = (a \cdot X_k) \bmod m$$

Here $a$ is the multiplier and $m$ is the modulus.
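To make the recurrence concrete, here is a minimal Java sketch of a Park–Miller generator. It assumes the classic "minimal standard" parameters (multiplier 16807, modulus 2^31 - 1). The class and method names are hypothetical, and this is an illustration of the technique rather than HotSpot's own code.

```java
// Minimal sketch of a Park–Miller (Lehmer) generator; not HotSpot's code.
class LehmerSketch {
    static final long A = 16807L;       // multiplier (assumed "minimal standard" value)
    static final long M = 2147483647L;  // modulus 2^31 - 1, a Mersenne prime

    // Computes X(k+1) = (A * X(k)) mod M. The product is done in long
    // arithmetic so that it cannot overflow.
    static int next(int seed) {
        return (int) ((A * seed) % M);
    }

    public static void main(String[] args) {
        int x = 1; // initial seed X(0); must lie in [1, M - 1]
        for (int i = 0; i < 3; i++) {
            x = next(x);
            System.out.println(x);
        }
    }
}
```

Each value depends only on the previous one, so whoever holds the seed controls the whole sequence. That is why the seed has to be shared and updated carefully when several threads draw from the same generator.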
Let’s take a look at the os::random() definition:
The os::random() function computes the next random number from the global _rand_seed variable and then updates that seed atomically.
When multiple threads try to change the seed at the same time, only one of them succeeds and the cmpxchg fails for the others. The losing threads retry the same operation until they succeed.
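The retry loop can be sketched in Java with an AtomicInteger standing in for the global seed. This is only an analogy to the compare-and-exchange described here: the class name, the thread counts, and the Park–Miller step with its constants are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a shared seed advanced with a compare-and-set retry loop.
class SharedSeedSketch {
    // Global seed shared by all threads (a stand-in for _rand_seed).
    static final AtomicInteger SEED = new AtomicInteger(1);

    // One Park–Miller step with assumed "minimal standard" constants.
    static int step(int seed) {
        return (int) ((16807L * seed) % 2147483647L);
    }

    static int random() {
        while (true) {
            int current = SEED.get();
            int next = step(current);
            // Only one thread wins this CAS. The others computed `next`
            // from a stale seed, so they loop and try again.
            if (SEED.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 2500; j++) {
                    random();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // Every successful CAS advances the seed by exactly one step, so
        // the final seed does not depend on how the threads interleaved.
        System.out.println(SEED.get());
    }
}
```

The more threads call random() at once, the more often compareAndSet fails and the loop repeats, which is the cost that grows under contention.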
Therefore, in the presence of high contention, the rate of CAS failures will increase, hence this comment:
Quite interestingly, this high contention happens when hashcodes are being generated for a lot of objects for the first time. This doesn’t look like a contention point from the Java perspective. However, with -XX:hashCode=0, this contention exists under the hood.
OOPs Based
The second and fifth strategies use a function of the memory address as the hashcode:
So if we use either -XX:hashCode=1 or -XX:hashCode=4, the hashcode will depend on the memory address.
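One consequence of deriving a hashcode from an address is that alignment leaks into the hash: objects are aligned in memory, so the lowest address bits are always zero. The following Java sketch illustrates this with made-up addresses, since Java code cannot observe real ones. The base address, the 16-byte object size, and the 3-bit shift are all assumptions chosen for the illustration, not values taken from HotSpot's implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Illustration with simulated addresses; not HotSpot's code.
class AddressHashSketch {
    // Counts how many of `buckets` slots are hit by `count` simulated objects.
    // `buckets` must be a power of two.
    static int bucketsUsed(int count, int buckets, boolean dropAlignmentBits) {
        Set<Long> used = new HashSet<>();
        long base = 0x1000L;    // hypothetical start of the heap
        long objectSize = 16L;  // assumed size of every object, in bytes
        for (int i = 0; i < count; i++) {
            long address = base + i * objectSize;
            // Optionally discard the 3 lowest bits, which are always zero
            // for 8-byte-aligned addresses.
            long hash = dropAlignmentBits ? (address >>> 3) : address;
            used.add(hash & (buckets - 1));
        }
        return used.size();
    }

    public static void main(String[] args) {
        System.out.println(bucketsUsed(16, 16, false)); // raw address as hash
        System.out.println(bucketsUsed(16, 16, true));  // low bits shifted away
    }
}
```

With these numbers the raw addresses all fall into a single bucket, while shifting the alignment bits away spreads them over 8 of the 16 buckets. That is better, but still far from uniform.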
Static Number
Probably the coolest, least useful, and most efficient strategy is the third one:
Which generates 1, all the time!
If we run the same snippet with -XX:+UnlockExperimentalVMOptions -XX:hashCode=2
:
Then it will print:
So the part after @ is always the hashcode, at least! I’m guessing they’re using this strategy as a benchmark baseline. It’s just a guess, though. If that’s true, then this strategy ain’t that useless after all.
Sequential Numbers
The fourth strategy is basically an auto-increment
for hashcode generation:
If we run the following code with -XX:+UnlockExperimentalVMOptions -XX:hashCode=3, we will probably see some consecutive hashcodes:
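A minimal program for this kind of experiment might look like the sketch below. The class and method names are hypothetical. It simply allocates a few objects and prints their identity hashcodes in hexadecimal.

```java
import java.util.ArrayList;
import java.util.List;

// Prints the identity hashcodes of a few freshly allocated objects.
class IdentityHashDemo {
    static List<Integer> identityHashes(int count) {
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // The hashcode is generated lazily, on this first request.
            hashes.add(System.identityHashCode(new Object()));
        }
        return hashes;
    }

    public static void main(String[] args) {
        for (int hash : identityHashes(5)) {
            System.out.println(Integer.toHexString(hash));
        }
    }
}
```

Running it with the two flags mentioned here should print values that increase by one. They will not necessarily start at 1, because the JVM may already have generated hashcodes for other objects during startup.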
Marsaglia’s Xor-Shift
As of this writing, if we pass any value greater than 4 to -XX:hashCode, this random number generator will be used:
The implementation seems a bit complicated. However, the idea is simple. Instead of using some global shared mutable state as the seed, it uses thread-specific state to generate the random number. Therefore, it should outperform both os::random() and the sequential approach under contention, as there is no need for thread synchronization.
Currently, this is the default hashcode generation strategy.
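The idea of thread-specific state can be sketched in Java with a ThreadLocal. The step function below is Marsaglia's classic 128-bit xor-shift with shift constants 11, 19, and 8. The class name, the seed values, and the use of ThreadLocalRandom to vary the first state word are assumptions for illustration, not a copy of HotSpot's code.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of Marsaglia's xor-shift with per-thread state; not HotSpot's code.
class XorShiftSketch {
    private int x, y, z, w; // 128 bits of generator state

    XorShiftSketch(int x, int y, int z, int w) {
        this.x = x;
        this.y = y;
        this.z = z;
        this.w = w;
    }

    // One xor-shift step. Java's int is signed, so the right shifts use
    // >>> to get the unsigned behavior the algorithm expects.
    int next() {
        int t = x ^ (x << 11);
        x = y;
        y = z;
        z = w;
        w = (w ^ (w >>> 19)) ^ (t ^ (t >>> 8));
        return w;
    }

    // Each thread lazily gets its own generator, so threads never touch
    // shared state and need no locks or CAS loops. Seeds are illustrative.
    private static final ThreadLocal<XorShiftSketch> PER_THREAD =
        ThreadLocal.withInitial(() -> new XorShiftSketch(
            ThreadLocalRandom.current().nextInt(),
            362436069, 521288629, 88675123));

    // Keeps only the low 31 bits, so the result is never negative.
    static int nextHash() {
        return PER_THREAD.get().next() & 0x7FFFFFFF;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println(Integer.toHexString(nextHash()));
        }
    }
}
```

Because the state lives with the thread, the cost of generating a hashcode stays the same no matter how many other threads are doing the same thing.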
Good Hashcodes
A hashcode implementation is a good one if it exhibits both uniform distribution and good performance. Let’s evaluate each strategy with respect to these parameters:
- The os::random() approach has good uniformity and randomness. However, it won’t perform that well in highly contended environments
- The memory address based approach usually won’t exhibit uniform distribution, which is very critical for hashcodes
- The one that always returns 1 is fun!
- Marsaglia’s Xor-Shift generates random numbers with good distribution and good performance
Here’s a benchmark result from Aleksey Shipilëv:
Conclusion
Just to recap, the part after @
is definitely the identity hashcode.
The hashcode itself is usually a random number but can also be a function of the memory address. The identity hashcode, in the HotSpot JVM, consumes at most 31 bits of the object header, while the memory address may be up to 64 bits (without compressed references). Therefore, the hashcode may not be equal to the memory address, even though it can be a function of it!
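This is easy to check from plain Java. Object.toString() is documented to return the class name, an @ sign, and the hexadecimal form of hashCode(), and for a plain Object that hashcode is the identity hashcode. The sketch below compares the two. The class and helper names are hypothetical.

```java
// Compares the text after '@' with the object's identity hashcode.
class ToStringSuffixCheck {
    // Returns the text that follows '@' in the default toString() output.
    static String suffixOf(Object o) {
        String text = o.toString();
        return text.substring(text.indexOf('@') + 1);
    }

    public static void main(String[] args) {
        Object o = new Object();
        int hash = System.identityHashCode(o);
        System.out.println(o);
        // True when the suffix is the hex form of the identity hashcode.
        System.out.println(suffixOf(o).equals(Integer.toHexString(hash)));
    }
}
```

The comparison holds regardless of which generation strategy is selected, because the strategy only changes how the number is produced, not where it is printed.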
Before wrapping up, it’s worth taking a look at this mailing list on the same topic.