Hi everyone! I haven't looked at the Sedna source code for years. When I finally fetched the sources from git this week (to check them on OS X Lion) I was struck with a desire to contribute to Sedna :) So here is my enhancement proposal #1.
=== ABSTRACT ===

The world of computer hardware has drastically changed. Multicore CPUs are our reality. Many software designs require tweaks or even a major overhaul to take full advantage of multiple cores. One particular design that badly needs attention is the framework for reclaiming buffer memory. I don't pretend to have the slightest idea how to pick a buffer to reclaim in a smarter way than it is currently done. However, the machinery to notify transactions that a buffer is gone, and to protect the buffer a transaction currently works with from being reclaimed, can be improved.

The problem is that the current design doesn't scale well as the number of active transactions grows. The problem is actually threefold. First, the core algorithm processes transactions sequentially, and each transaction requires multiple IPC calls to carry out the processing. It would be better if we could process transactions in parallel, or make the processing much faster so that staying sequential is acceptable. Second, the algorithm may restart from the very beginning if it discovers that the buffer it attempts to reclaim is currently in use by a transaction. I believe this is a minor problem, but it is nevertheless worth mentioning. Finally, the algorithm to process a transaction is synchronous: basically we send a message to a transaction process and wait for a reply. It would be better if we didn't have to wait for a reply at all, i.e. go asynchronous.

THE SOLUTION: move more state to shared memory and use a lock-free approach.
THE COST: one InterlockedCompareExchange or similar intrinsic in CHECKP, one conditional in CHECKP, and slightly more complex code due to the indirection required to access the state in shared memory from CHECKP. The implied overhead is expected to be low: Andrew Fomitchev measured roughly a 7% slowdown when evaluating a different design with a spinlock in CHECKP, which is very similar to the proposed design both in terms of code size increase and the use of an interlocked intrinsic.

=== CURRENT DESIGN OVERVIEW ===

If the buffer manager (BM) runs short of buffer memory it has to reclaim a buffer. The buffer eviction policy was FIFO a few years ago, but it doesn't matter at all in the scope of the current discussion. So BM picks a buffer somehow and attempts to reclaim it. Unmap_block_in_trs() does the job. If it fails to reclaim the buffer because the buffer was in use when the call was made, BM picks another buffer and retries. These steps repeat until unmap_block_in_trs() finally succeeds.

Unmap_block_in_trs() basically calls unmap_block_in_tr() for each registered transaction. Unmap_block_in_tr() wakes the VMM thread in the corresponding transaction's process (there is a semaphore for this purpose), then waits for a reply (another semaphore). The VMM thread suspends the main transaction thread and checks the current XPTR to decide whether it is ok to unmap the requested block. If the decision is positive, the VMM thread unmaps the block. Finally it resumes the main thread, passes the decision result back to unmap_block_in_tr() through shared memory, and wakes unmap_block_in_tr(). The design is slightly different on UNIX: instead of SuspendThread()/ResumeThread(), the VMM thread sends a signal to the main transaction thread and most of the code executes in the signal handler.
=== NEW DESIGN PROPOSAL ===

So what if the buffer manager could check the current XPTR directly and send unmap requests without having to wait for completion? The following discussion omits some technical details inherent to the distinct address spaces of TR and SM. For simplicity we describe the design as if everything happened in one address space. The real implementation will be somewhat more complicated.

So we need to move the current XPTR to shared memory, huh?

    struct VMMState {
        Xptr currentXPTR;
    };

    VMMState *pVMMState; /* allocated in the shared memory */

Wait! We won't be able to set pVMMState->currentXPTR atomically.

    struct VMMState {
        struct VMMStateData {
            Xptr currentXPTR;
        };
        VMMStateData *currentState;
        VMMStateData statesPool[2];
    };

Let's apply the common approach: if you need to modify a complex structure atomically, allocate and initialize a new one and then swap pointers atomically. Since there is a single thread modifying currentXPTR, two instances of VMMStateData are enough, so we use statesPool instead of malloc(). So far so good. But there is currently no way for the unmap request to step in.

    struct VMMState {
        struct VMMStateData {
            Xptr currentXPTR;
        };
        VMMStateData *currentStateOrSpecial;
        VMMStateData *currentState;
        VMMStateData statesPool[2];
        Mutex mutex;
        VMMUnmapCmd queuedUnmapCmds[42];
        bool didQueueOverflow;
    };

The currentStateOrSpecial field stores a pointer to the current state, much like currentState. However, the buffer manager can swap the current state pointer in currentStateOrSpecial with a special value to indicate that VMMState has been externally modified. CHECKP will update both currentState (regular assignment) and currentStateOrSpecial (InterlockedCompareExchange). Most of the time InterlockedCompareExchange will succeed, since the buffer manager didn't step in and currentStateOrSpecial == currentState. When the buffer manager processes VMMState it locks the mutex first and then swaps currentStateOrSpecial with a 'special' value.
The operation yields the former currentStateOrSpecial value, which we use to check whether the buffer we intend to unmap is currently in use. We then update the queuedUnmapCmds/didQueueOverflow fields and unlock the VMMState.

When InterlockedCompareExchange in CHECKP fails, it means that VMMState was modified by the buffer manager. In that case the "slow path" is executed: we lock the mutex and execute any queued unmap commands, or reset VMM completely if the queue has overflowed. We then store a pointer to the current state in currentStateOrSpecial and unlock the mutex. The "slow path" is obviously going to be a separate function (CHECKP code is inlined).

=== Q&A ===

Q1: Is it safe for the buffer manager to proceed without waiting for an unmap command to complete?

A1: There must be a CHECKP before the unmapped memory is going to be accessed. CHECKP performs all queued commands before processing. Don't write buggy code.

Q2: Is it safe for the buffer manager to access currentState data which is being concurrently modified?

A2: Sure. There are two instances of the state. The pattern is as follows: modify B, replace the pointer to A with the pointer to B; modify A, replace the pointer to B with the pointer to A. A state instance is never modified while it is accessible via the currentState pointer. Assume we have just swapped currentStateOrSpecial with 'special' and find out that the current state is A:

    A <- (currentStateOrSpecial <- SPECIAL)

A is not going to be modified since it is stored in currentState: currentState must point to B before A could be modified. But replacing A with B in currentState requires a currentStateOrSpecial update as well, which will fail since we've stored the special value there. The failure triggers the "slow path", and the slow path locks pVMMState->mutex. But the buffer manager locks the VMMState mutex before touching currentStateOrSpecial, so the slow path will block until we are done with the currentState data and have released the mutex.
=== CONCLUSION ===

The presented design enables reclaiming a buffer efficiently and it scales well with the number of active transactions. Transactions are still processed sequentially; this is a non-issue, however, since much lighter synchronization primitives are used and unmap requests are asynchronous. The expected performance degradation in single-user scenarios due to the CHECKP complications is negligible.

=== CONCERNING NON CHECKP-INTENSIVE CODE ===

The presented design assumes that CHECKP is executed quite often, so that queued unmap commands are processed in a timely fashion. Though queue overflow is non-fatal (we "only" need to reset the entire VMM), it implies non-trivial overhead. We obviously need a CHECKP-less mode to activate when, for example, waiting for a lock on an XML document to be granted or talking to a client through a socket.

The interesting property of CHECKP-less mode is that no CHECKPs are being executed, hence VMM is not being accessed, hence there is no need to check the current XPTR. There is still the need to send asynchronous unmap commands. Let's introduce a secondary thread to process asynchronous unmap commands sent in CHECKP-less mode. To avoid overcomplicating things let's assume that the thread receives commands through a pipe. The write end of the pipe is accessible both to the buffer manager (it sends asynchronous unmap commands) and to the transaction main thread (to request secondary thread shutdown and when leaving CHECKP-less mode, see below). Let's extend VMMState:

    struct VMMState {
        struct VMMStateData {
            Xptr currentXPTR;
        };
        VMMStateData *currentStateOrSpecial;
        VMMStateData *currentState;
        VMMStateData statesPool[2];
        Mutex mutex;
        VMMUnmapCmd queuedUnmapCmds[42];
        bool didQueueOverflow;
        bool isInCHECKPLessMode;
        int pipeFD;
        uint64_t lastAsyncCmdID;
    };

To enter CHECKP-less mode just set isInCHECKPLessMode. If isInCHECKPLessMode is set, the buffer manager will send asynchronous unmap commands using pipeFD instead of writing to queuedUnmapCmds.
lastAsyncCmdID records the ID of the last async command sent. When leaving CHECKP-less mode, unset isInCHECKPLessMode. To complete the transition from CHECKP-less to normal mode we must ensure that all asynchronous commands sent to the secondary thread are complete. We ask the secondary thread to notify the main thread as soon as the command with lastAsyncCmdID has been processed. This is implemented by sending a command through the same pipe the buffer manager uses.

OBVIOUS IDEA 1: Reset lastAsyncCmdID to 0 when entering CHECKP-less mode. If it is still 0 when we leave the mode, there is no need to synchronize with the secondary thread.

CLEVER IDEA 1: If there are many transactions in CHECKP-less mode, the buffer manager will slow down considerably due to the need to write to many pipes. Arrange transactions in a tree-like fashion: the buffer manager sends commands to the root node (1 write), the root node retransmits commands to its immediate child nodes, and so forth.

_______________________________________________
Sedna-discussion mailing list
Sedna-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sedna-discussion