Monday, April 11, 2011

Mailbox Part 1

Actors communicate (only) via message passing. This makes the mailbox implementation critical for the overall system performance.

The mailbox is a FIFO queue, owned by an actor. Any number of other actors (writers) enqueue new messages in parallel but only the owning actor (reader) is allowed to dequeue a message. Thus, a mailbox is a Single-Reader-Many-Writer queue.
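
To make this contract concrete, here is a minimal interface sketch (the names are mine, not taken from any particular implementation):

  template<typename T>
  struct mailbox {
    void push(T* msg); // called concurrently by any number of writers
    T* pop();          // called only by the owning actor; nullptr if empty
  };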

The first (and most important) thing we need from a mailbox is speed and scalability. Today I'll present the three most promising algorithms I've tested so far. There are plenty of concurrent linked list and queue algorithms out there, but many of them require exotic atomic operations that are not available on mainstream hardware (e.g., double-width compare-and-swap [DCAS] or load-link/store-conditional).

  1. Spinlock Queue

    This implementation is based on an article by Herb Sutter, who adapted the algorithm from a paper by M. Michael and M. Scott (the second algorithm in the PDF). The original implementation uses two spinlocks: one guards the head to synchronize readers, the other guards the tail to synchronize writers. Since we don't need to synchronize readers, the tested version uses only one spinlock.
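
    A minimal sketch of this single-spinlock variant (my own reconstruction using C++11 atomics, not the code that was benchmarked; destructor and error handling are omitted):

    #include <atomic>

    template<typename T>
    class spinlock_queue {
      struct node {
        node(T* v = nullptr) : value(v), next(nullptr) { }
        T* value;
        std::atomic<node*> next; // written by writers, read by the reader
      };
      node* head_; // owned by the single reader, points to a dummy node
      node* tail_; // guarded by lock_
      std::atomic_flag lock_ = ATOMIC_FLAG_INIT;
     public:
      spinlock_queue() : head_(new node), tail_(head_) { }
      void push(T* what) { // writers synchronize via the spinlock
        node* tmp = new node(what);
        while (lock_.test_and_set(std::memory_order_acquire)) { } // spin
        tail_->next.store(tmp, std::memory_order_release);
        tail_ = tmp;
        lock_.clear(std::memory_order_release);
      }
      T* pop() { // owning actor only; no lock needed
        node* next = head_->next.load(std::memory_order_acquire);
        if (!next) return nullptr; // queue is empty
        T* result = next->value;
        delete head_; // free the old dummy
        head_ = next; // the dequeued node becomes the new dummy
        return result;
      }
    };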

  2. Lock-Free Queue

    This is a variation of the first algorithm: instead of locking the tail, we move it atomically and, on success, set the next pointer of its predecessor afterwards.

    Setting the tail first means the reader might see an empty queue even though new elements are already in the list. If a writer gets suspended right after the CAS operation, the reader has to wait for that writer to continue (and all further enqueue operations are hidden as well).

    Again, there are smarter algorithms in theory, but none of them can be implemented on today's hardware. That's mainly because you can't solve the ABA problem without some kind of DCAS operation, and you have to solve the ABA problem if you want to do something "clever" with A. This variation doesn't need to solve the ABA problem, because we're only replacing A (the tail), and it doesn't matter whether we replace the exact A we saw or a 'new' A.
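
    A sketch of this enqueue as I read the description (again my own reconstruction; tail_ is now a std::atomic<node*>, the node layout stays the same as in the sketch above):

    void push(T* what) {
      node* tmp = new node(what);
      node* old_tail = tail_.load(std::memory_order_relaxed);
      // move the tail atomically; on failure, compare_exchange_weak
      // refreshes old_tail, so we simply retry
      while (!tail_.compare_exchange_weak(old_tail, tmp,
                                          std::memory_order_acq_rel)) { }
      // on success: link the predecessor; a writer suspended right
      // before this line temporarily hides the new element(s)
      old_tail->next.store(tmp, std::memory_order_release);
    }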

  3. Cached Stack

    A queue always needs at least two (critical) writes to enqueue a new element:
    void queue::push(T* new_element) {
      atomic { // the holy grail of concurrency
        tail->next = new_element;
        tail = new_element;
      }
    }
    A stack only needs one compare-and-swap operation for push. That's the motivation for the third algorithm: the concurrent part of the queue is a stack, which makes the push operation easy, safe, and fast (ST = Stack Tail):

    [figure: the push operation, a single CAS on the stack tail (ST)]

    The tricky part is the pop operation, because the stack is LIFO ordered. An O(N) pop (always traversing to the last stack element) is unacceptable. The solution is to introduce a singly linked list as a cache (CH = Cache Head):

    [figure: the pop operation, moving stack elements into the cache (CH)]

    The average case of pop is now O(1) (as long as there are still elements in the cache), but the worst case is O(N). That makes the runtime of pop somewhat unpredictable, whereas push is extremely fast.
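
    Here is a sketch of the whole idea (assumptions of mine rather than the measured implementation; memory reclamation is simplified):

    #include <atomic>

    template<typename T>
    class cached_stack {
      struct node {
        node(T* v, node* n) : value(v), next(n) { }
        T* value;
        node* next;
      };
      std::atomic<node*> stack_; // ST, shared with all writers
      node* cache_;              // CH, owned by the single reader
     public:
      cached_stack() : stack_(nullptr), cache_(nullptr) { }
      void push(T* what) { // any writer; a single CAS
        node* tmp = new node(what, stack_.load(std::memory_order_relaxed));
        // on failure, tmp->next is refreshed to the current ST; just retry
        while (!stack_.compare_exchange_weak(tmp->next, tmp,
                                             std::memory_order_release,
                                             std::memory_order_relaxed)) { }
      }
      T* pop() { // owning actor only
        if (!cache_) {
          // worst case O(N): take the whole stack at once and reverse
          // it, which restores FIFO order in the cache
          node* e = stack_.exchange(nullptr, std::memory_order_acquire);
          while (e) {
            node* next = e->next;
            e->next = cache_;
            cache_ = e;
            e = next;
          }
        }
        if (!cache_) return nullptr; // mailbox is empty
        node* tmp = cache_;
        T* result = tmp->value;
        cache_ = tmp->next;
        delete tmp;
        return result;
      }
    };

    Since pop drains the whole stack at once, the O(N) reversal is paid at most once per batch of pushed elements, so the amortized cost per message stays constant.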


Performance

The algorithms were tested on a 2.66 GHz Intel i7 (dual-core). The x-axis shows the total number of messages in millions (each producer sends $TOTAL / $NUMBER_PRODUCERS messages).

[figure: benchmark results for the three implementations]

The next benchmark will run on a real multicore machine (>=12 cores) to test the implementations with different levels of (hardware) concurrency.

I'm still looking for new algorithms (that are available on Intel machines), although the cached stack looks pretty good in this first benchmark.
