Rethinking Multi-Threaded Design Principles, Part 2

Harnessing the processing power of next generation multi-core processors

Wed, 2010-04-14


In Part 1 of this series, I introduced some basic principles for working in a multi-threaded

application. Kevin Farnham, in his blog, pointed out that our next challenge

would be to design an application that fully harnesses the processing speed of

next generation multi-core processors. That challenge has made me rethink the various means by which that goal can be achieved.

Laws governing a processor's speed and performance:

With concurrency, multiple threads make progress by taking turns: only one thread actually executes at any given instant. To achieve parallelism, the processor must be able to execute multiple threads at the same point in time.

Flynn's taxonomy classifies processing platforms by instruction and data streams, as follows: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD). Since modern processors fall under either SIMD or MIMD, we can write data- or task-centric applications that run in parallel.

Historically, processor clock speeds rose steadily alongside Moore's law, so there was no real need for most programs to execute multiple instructions at a time: applications simply ran faster because each new processor was faster. Today, however, processor clock speeds are no longer increasing rapidly.

To increase overall processing power, chip manufacturers began developing processors that provide instruction level parallelism: the processors evolved to have multiple processing units in them. While some processor architectures include a logical processing unit within the same space to allow Simultaneous Multi-Threading, others go a step further by implementing two or more execution units (Cores) in a processor.

To see how much performance improvement parallelism contributes, we can derive a simple equation:

Speed Enhancement = (Time needed for Sequential execution) / (Time needed for Parallel execution)

Taking the number of processor cores into account, Amdahl's Law refines this as:

Speed Enhancement = 1 / (Fs + ((1 - Fs) / N)),

where Fs is the fraction of the entire program spent running sequentially and N is the number of cores in the processor. If N is 1, we get no speedup; as N tends to infinity, (1 - Fs)/N approaches 0 and the speedup is bounded by the 1/Fs term. This leads to the conclusion that, to improve the execution speed of an application, we need to find ways or patterns to make more of the code execute in parallel.
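To make the arithmetic concrete, here is a small sketch that evaluates Amdahl's formula for a program that is 10% sequential (the class and method names are our own illustration, not from any library):

```java
public class Amdahl {

    // fs: fraction of the program that must run sequentially; n: number of cores.
    static double speedup(double fs, int n) {
        return 1.0 / (fs + (1.0 - fs) / n);
    }

    public static void main(String[] args) {
        // With Fs = 0.10, the speedup can never exceed 1/Fs = 10x,
        // no matter how many cores we throw at the problem.
        for (int n : new int[]{1, 2, 4, 8, 1024}) {
            System.out.printf("N=%d -> %.2fx%n", n, speedup(0.10, n));
        }
    }
}
```

Running this shows the curve flattening out well below 10x, which is why reducing the sequential fraction matters more than adding cores.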

Finding different ways to lock

As we refactor code into parallelized parts, it becomes clear that, one way or another, different parts of the code must contend for some resource that requires a lock. Whenever a thread tries to acquire a lock that is already held by another, it is suspended first and awakened later when the lock becomes available, spending that time doing nothing. The JVM manages all of this for you, but if the thread holding the lock blocks forever (say, a database connection goes down or an I/O operation hangs), every waiting thread stays suspended, causing resource starvation. To ameliorate the situation, we can use an explicit lock and let the client code handle acquiring and releasing it. Though this sounds appealing, implementing it correctly is daunting, since you have to manage the lock yourself; a simple mistake can make you fail miserably.

Let's revisit our previous CarBookings example:


import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

class CarBookings {

    private final ReentrantLock lock = new ReentrantLock();
    private final Set<CarBooking> bookings = new HashSet<CarBooking>();

    public boolean putCar(CarBooking booking) {
        // tryLock returns immediately instead of suspending the caller
        if (!lock.tryLock()) {
            return false; // lock held by someone else: requester decides to retry or abort
        }
        try {
            bookings.add(booking);
            return true;
        } finally {
            lock.unlock(); // always release, even if add() throws
        }
    }
}

So, when BookingRequester calls the putCar method and the lock is already held by someone else, it returns false, giving the requester better control over whether to retry or abort, without going through a Run - Suspend - Run cycle. Though this looks simple, it gets more complicated with multiple locks: it becomes very easy to forget the unlock code, leaving all threads waiting forever. So keep track of all your locks, and make sure each one is properly unlocked.
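The requester side can then implement its own retry policy. A minimal sketch of that idea (the bounded-retry policy and class name here are our own illustration, not from the article's original code):

```java
import java.util.concurrent.locks.ReentrantLock;

public class BookingRequesterDemo {
    public static void main(String[] args) {
        ReentrantLock lock = new ReentrantLock();
        boolean acquired = false;
        int attempts = 0;
        while (!acquired && attempts < 3) { // bounded retry instead of blocking forever
            acquired = lock.tryLock();      // returns immediately, never suspends
            attempts++;
        }
        if (acquired) {
            try {
                System.out.println("booking stored after " + attempts + " attempt(s)");
            } finally {
                lock.unlock();              // unlock only when we actually hold the lock
            }
        } else {
            System.out.println("aborting booking request");
        }
    }
}
```

Note the unlock sits in a finally block guarded by the acquisition check, which avoids both the forgotten-unlock bug and unlocking a lock we never held.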

Non-Blocking operation

In addition to strictly following the rules for explicit locks, another way to achieve concurrency is to perform non-blocking operations at the contention points. Compare-And-Swap (CAS) is one such operation: a field is updated with a new value only when a certain precondition is met. When multiple threads make a non-blocking call, only one succeeds at a time, but the others can choose either to keep retrying or to abort, without being suspended and awakened later. They remain in the Running state even when the update fails, unlike with a synchronized lock, where they must sit in the Suspended state until the lock becomes available. Because the non-blocking approach eliminates the extra Run-Suspend-Run transition time, it is the better choice for developing highly scalable applications.
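As a concrete example of this retry-or-abort choice, here is a small sketch using Java's AtomicInteger (the SeatCounter class is our own illustration, not from the article):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SeatCounter {

    private final AtomicInteger freeSeats = new AtomicInteger(5);

    // Non-blocking reservation: the caller is never suspended.
    public boolean reserveSeat() {
        while (true) {
            int current = freeSeats.get();
            if (current == 0) {
                return false; // precondition failed: caller may retry later or abort
            }
            if (freeSeats.compareAndSet(current, current - 1)) {
                return true;  // CAS succeeded; no lock was ever taken
            }
            // CAS failed: another thread won the race; loop and re-read the counter
        }
    }
}
```

The compareAndSet call is the CAS: it writes `current - 1` only if the field still holds `current`, so a concurrent update simply makes this thread loop once more.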

Let's implement the queue (BookingQ or AdviceQ) from the previous article.

public class BookingQ {

    private static class Node {
        CarBooking booking;
        volatile Node nextBooking;

        Node(CarBooking cb, Node next) {
            this.nextBooking = next;
            booking = cb;
        }

        // Hand-rolled CAS: set nextBooking to sn only if it still equals en.
        synchronized boolean compareAndSetNext(Node en, Node sn) {
            if (this.nextBooking == en) {
                this.nextBooking = sn;
                return true;
            }
            return false;
        }
    }

    private volatile Node head;
    private volatile Node tail;

    // Hand-rolled CAS on the head (onTail == false) or tail (onTail == true) pointer.
    private synchronized boolean compareAndSetNode(boolean onTail, Node en, Node sn) {
        if (onTail) {
            if (tail == en) { tail = sn; return true; }
        } else {
            if (head == en) { head = sn; return true; }
        }
        return false;
    }

    public BookingQ() {
        tail = head = new Node(null, null);
    }

    public void putBooking(CarBooking booking) {
        Node newBooking = new Node(booking, null);
        while (true) {
            Node currTail = tail;
            Node nextTail = currTail.nextBooking;
            if (currTail == tail) {
                if (nextTail != null) { // Line 1: another put is underway; help finish it
                    compareAndSetNode(true, currTail, nextTail);
                } else { // Line 2: tail really is the last node
                    if (currTail.compareAndSetNext(null, newBooking)) { // Line 3
                        compareAndSetNode(true, currTail, newBooking);  // Line 4
                        return;
                    }
                }
            }
        }
    }

    public CarBooking getBooking() {
        while (true) {
            Node currHead = head;
            Node currTail = tail;                 // Line 5a
            Node headNext = currHead.nextBooking; // Line 5b
            if (currHead == head) {
                if (currHead != currTail) { // Line 6: some bookings have been made
                    if (compareAndSetNode(false, currHead, headNext)) { // Line 7: advance head
                        CarBooking firstBooking = headNext.booking;
                        headNext.booking = null; // Line 8: new head becomes the dummy node
                        return firstBooking;     // Line 9
                    }
                } else { // no booking yet
                    if (headNext != null) { // Line 10: a put is underway; help advance tail
                        compareAndSetNode(true, currTail, headNext);
                    } else {
                        return null; // queue is truly empty
                    }
                }
            }
        }
    }
}

Here, the unit element of the queue is a Node that has a booking property holding the CarBooking and nextBooking to point to the next Node in queue. The BookingQ starts with
its head and tail pointer set to a Node with NULL booking and NULL nextBooking. The process of putting any node into the queue requires updating two pointers:

  • Step 1: set the next pointer of current node to the new node

  • Step 2: advance the tail pointer to the new node also

After a node is successfully added to the queue, the head's nextBooking points to the new non-NULL node, while the head node's own booking element still remains NULL.

The CAS is applied wherever there is a contention point, like updating a node in
compareAndSetNext or compareAndSetNode.

Now consider a new booking being added in putBooking by a thread T1: it finds the tail of the queue at Line 2 and performs compareAndSetNext. If, between Step 1 and Step 2, another thread T2 tries to put a booking and finds at Line 1 that another thread's work is underway (Line 3 is done but Line 4 is not, so the tail's nextBooking is no longer NULL), T2 completes T1's unfinished job (Step 2) by advancing the tail pointer to the new node. T2 then goes on to the next iteration to put its own booking.

In the case of getBooking, when the tail and head are not the same (Line 6), it advances the head pointer to the first booking node (Line 7) and returns that node's booking element (Line 9). Setting headNext.booking to NULL (Line 8) turns the new head back into a dummy starting node. Since head no longer references the previous node, the garbage collector will eventually collect it, shortening the queue. At Line 10, a situation similar to putBooking can arise: a thread reads from an apparently empty queue, but between Line 5a and Line 5b another thread puts a booking (Line 3); in that case the reader also helps by advancing the tail pointer to the new booking.
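Incidentally, the JDK's ConcurrentLinkedQueue is a production-grade implementation of this same non-blocking (Michael-Scott) queue algorithm, so the behaviour described above can also be exercised through the standard API:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class QueueDemo {
    public static void main(String[] args) {
        ConcurrentLinkedQueue<String> q = new ConcurrentLinkedQueue<String>();
        q.offer("booking-1"); // non-blocking put, analogous to putBooking
        q.offer("booking-2");
        System.out.println(q.poll()); // prints "booking-1": FIFO, like getBooking
        System.out.println(q.poll()); // prints "booking-2"
        System.out.println(q.poll()); // prints "null": empty queue, no thread suspended
    }
}
```

For real applications, prefer the library class; the hand-built version is worthwhile mainly for understanding what the library does internally.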

Note that we could perform the same CAS using Java's AtomicReference class; we have spelled it out here for clarity of understanding.
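To illustrate, here is a minimal sketch of how a hand-rolled compare-and-set collapses onto AtomicReference.compareAndSet (the Node shape mirrors the queue's, but this demo class is our own):

```java
import java.util.concurrent.atomic.AtomicReference;

public class CasDemo {

    static class Node {
        final String value;
        // AtomicReference gives a true lock-free CAS on the next pointer.
        final AtomicReference<Node> next = new AtomicReference<Node>(null);

        Node(String value) {
            this.value = value;
        }
    }

    public static void main(String[] args) {
        Node a = new Node("a");
        Node b = new Node("b");
        // Equivalent of compareAndSetNext(null, b): succeeds only if next is still null.
        boolean first = a.next.compareAndSet(null, b);  // true: expected value matched
        boolean second = a.next.compareAndSet(null, b); // false: next is no longer null
        System.out.println(first + " " + second);       // prints "true false"
    }
}
```

Unlike a synchronized helper, compareAndSet typically maps to a single hardware instruction, which is what makes the queue genuinely non-blocking.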


Referring back to these laws, it becomes clear that although modern processors provide better means to improve application performance, the program itself is where parallelism must be implemented. The more a program can be decomposed into parallelized parts, the better it performs on multicore computers. It also becomes clear that while explicit locks give you better control than the JVM's intrinsic locks, non-blocking operations enable you to achieve higher performance than any synchronized operation. In any case, parallelized programs must be designed carefully and tested thoroughly; even small oversights may result in undesirable consequences in multithreaded applications.

Dibyendu Roy has more than ten years of design and development experience in various domains including Banking and Financial Systems, Business Intelligence tools and ERP products.
