Monday, May 03, 2010

The Implementation of Volatile in the JVM

Scanning through the code of the JVM bytecode interpreter and the JIT compiler, I was perplexed as to why the interpreter does not care to differentiate a volatile field access from a non-volatile one, while the JIT actually emits special code to deal with it.

But after thinking it through for a while, I realised why this actually makes sense. Before we find out why an interpreter does not need to care about volatile semantics, we first need to understand the idea of causality, and then understand what the volatile keyword is for.

Let's start off with a very simple snippet of code:


a = 1
v1 = a


We expect v1 to have the value 1 because we put 1 into a, and then a into v1. The assignment a = 1 necessarily orders before the operation v1 = a. This necessity is termed 'causal order'. Subsequently we use the notation a = 1 → v1 = a to denote this causal ordering.

For illustration, we can show that not all statements in a program need to follow a causal order:


a = 1
b = 2
v1 = a
v2 = b


Assuming that v1 and v2 need to be assigned the values of a and b eventually, there is a causal order such that a = 1 → v1 = a, and a causal order such that b = 2 → v2 = b. However, there is no such requirement between the statements a = 1 and b = 2. In fact, if we swap their order around, the program still executes correctly:


b = 2
a = 1
v1 = a
v2 = b


As long as we don't break causal ordering, statements that have no causal-order requirements between them have an interesting property: parallelism. We are able to break such programs into separate parts, execute them in parallel, and still satisfy the program's requirements:

CPU1          CPU2
a = 1         b = 2
v1 = a        v2 = b
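
To make the idea concrete, here is a minimal Java sketch (the class and variable names are mine, purely for illustration) that runs the two independent causal chains on separate threads; because neither chain depends on the other, the result is still deterministic:

class ParallelChains {
    static int a, b, v1, v2;

    public static void main(String[] args) throws InterruptedException {
        // each thread executes one independent causal chain
        Thread cpu1 = new Thread(() -> { a = 1; v1 = a; });
        Thread cpu2 = new Thread(() -> { b = 2; v2 = b; });

        cpu1.start();
        cpu2.start();
        cpu1.join();   // join() orders the threads' writes before the reads below
        cpu2.join();

        System.out.println("v1 = " + v1 + ", v2 = " + v2);  // always "v1 = 1, v2 = 2"
    }
}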

But with multi-processors, we can violate program determinism quite easily if we are not careful. Consider the following example:

CPU1          CPU2
a = 1         a = 2
v1 = a


In the absence of any ordering restrictions, we term the execution concurrent; it is legal for the value of v1 to be either 1 or 2. In practice this value will most likely be 1 because of cache locality, but this is not a guarantee. Even though CPU1's and CPU2's executions can happen in exactly the same quantum, we treat such concurrent executions as a special case of interleaving: either CPU1's operations happen slightly before CPU2's, or CPU2's operations happen slightly before CPU1's. Corollary: CPU1's and CPU2's operations never happen simultaneously, and so the execution can be considered functionally equivalent to that of a multi-threaded application. The resulting execution is said to have a partial ordering, i.e., the operations of CPU1 and CPU2 can be interleaved in any possible permutation.
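
The race is easy to reproduce in Java; here is a small sketch (names are hypothetical) where the final value of v1 depends entirely on how the two threads interleave:

class Race {
    static int a;

    public static void main(String[] args) throws InterruptedException {
        Thread cpu1 = new Thread(() -> {
            a = 1;
            int v1 = a;   // may observe 1 (its own write) or 2 (CPU2's write)
            System.out.println("v1 = " + v1);
        });
        Thread cpu2 = new Thread(() -> { a = 2; });

        cpu1.start();
        cpu2.start();
        cpu1.join();
        cpu2.join();
    }
}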

For compiled languages like C/C++, there is no implicit notion of concurrency; the same is true for the just-in-time compiler within Java. Having no notion of concurrency is fine and dandy on a single-CPU architecture, but not on a multi-processor system or in a language that has a notion of threads. Let's understand why by first examining a simple piece of single-threaded code:


a = 1
a = 2
v1 = a
v2 = a
if ( v1 == v2 ) {
    v3 = true
}
return v3


If causal order is upheld, then naturally a = 1 → a = 2, which implies v1 == v2 == a == 2. On a single-threaded execution that is causally ordered, the resulting execution is also totally ordered. Hence an optimising compiler may say, "hey, why don't I skip all the steps and assign v3 = true directly, since v1 == v2 always holds?". In the end, the compiler generates code that looks like this:


v3 = true;
return v3; // or simply "return true;"


Now consider a similar, but multi-threaded code:

T1                      T2
a = 1                   a = 0
v1 = a                  a = 2
v2 = a
if ( v1 == v2 ) {
    v3 = true
}
return v3


Notice that both threads T1 and T2 still follow causal order. For T1, a = 1 → v1 = a → v2 = a, which in isolation would imply v1 == v2 == 1. On T2, a = 0 → a = 2 is a trivial case of causal order. However, there is no total ordering enforced between the executions of the two threads.
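
As a sketch, the same program can be written in Java with the main thread playing the part of T1 (the names are mine); run it enough times on a multi-processor machine and v3 can come out either true or false:

class Interleaving {
    static int a;

    public static void main(String[] args) throws InterruptedException {
        Thread t2 = new Thread(() -> { a = 0; a = 2; });

        a = 1;                     // T1
        t2.start();
        int v1 = a;                // T1: races with T2's writes; may see 1, 0 or 2
        int v2 = a;                // T1: may see a different value from v1
        boolean v3 = (v1 == v2);
        t2.join();
        System.out.println("v3 = " + v3);
    }
}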

For a compiler, it is very difficult to reason about the ordering of execution across a number of threads, since each thread's execution sequence is normally examined in isolation. The reason for this limitation is simple: in order to optimise safely with respect to causality, the compiler would need to examine all executions across all threads; however, it is theoretically possible for threads to be (1) created dynamically, and (2) unbounded in number, making such an analysis infeasible.

Usually the compiler optimises on the assumption of causality safety, and the onus is on the programmer to indicate otherwise through the volatile keyword.

In the absence of the volatile keyword, the compiler is free to apply the following optimisation:

T1                                     T2
v3 = true                              // does nothing, since a is unused
return v3 // or simply "return true"


Therefore, the volatile keyword means: for all accesses of a, always emit the code for those accesses. By doing so, the compiler retains the code's original execution order and hence execution coherency.
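
The canonical illustration is a stop flag: without volatile, the JIT is entitled to read the flag once, conclude it never changes, and hoist the load out of the loop; declaring it volatile forces a real load on every access. A minimal sketch:

class Worker {
    volatile boolean running = true;  // remove 'volatile' and the loop below may never exit

    void run() {
        while (running) {
            // do work; each iteration re-reads 'running' because it is volatile
        }
    }

    void stop() {
        running = false;  // this write is guaranteed to become visible to run()
    }
}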

This is also why volatile semantics are always upheld within an interpreter.

Since an interpreter always executes instructions one at a time, none of the inference and optimisation that happens in compiled code occurs. The interpreter therefore need not enforce the volatile mechanic specially, as long as the underlying machine architecture adheres to program causality.

This behaviour should be safe on machine architectures like the x86, but on more esoteric architectures that perform aggressive instruction reordering, e.g. the Alpha and the Itanium, the interpreter may require additional memory-fencing instructions to ensure that volatile semantics are upheld.
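
For a rough picture of what those fences look like, here is a conceptual sketch using the fence intrinsics that later versions of Java (9+) expose through java.lang.invoke.VarHandle; this only illustrates the barrier placement an interpreter on a weakly ordered machine would need, and is not actual HotSpot code:

import java.lang.invoke.VarHandle;

class VolatileAccessSketch {
    static int field;  // stand-in for a volatile field

    static void volatileStore(int value) {
        VarHandle.releaseFence();  // earlier loads/stores may not move below the store
        field = value;
        VarHandle.fullFence();     // the store may not reorder with later loads
    }

    static int volatileLoad() {
        int v = field;
        VarHandle.acquireFence();  // later loads/stores may not move above the load
        return v;
    }
}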
