Hi everyone! I haven't looked at the Sedna source code for years. When I finally fetched the sources from git this week (to check them on OS X Lion) I was struck with a desire to contribute to Sedna :) So here is my enhancement proposal #1.
=== ABSTRACT ===

The world of computer hardware has drastically changed. Multicore CPUs are our reality. Many software designs require tweaks or even a major overhaul to take full advantage of multiple cores. One particular design that badly needs attention is the framework for reclaiming buffer memory. I don't pretend to have the slightest idea how to pick a buffer to reclaim in a smarter way than it is currently done. However, the machinery to notify transactions that a buffer is gone, and to protect the buffer a transaction currently works with from being reclaimed, can be improved.

The problem is that the current design doesn't scale well as the number of active transactions grows. The problem is actually threefold. First, the core algorithm processes transactions sequentially, and each transaction requires multiple IPC calls to carry out the processing. It would be better if we could process transactions in parallel, or make the processing much faster so that staying sequential is acceptable. Second, the algorithm may restart from the very beginning if it discovers that the buffer it attempts to reclaim is currently in use by a transaction. I believe this is a minor problem, but it is nevertheless worth mentioning. Finally, the algorithm to process a transaction is synchronous: basically we send a message to a transaction process and wait for a reply. It would be better if we didn't have to wait for a reply at all, i.e. go asynchronous.

THE SOLUTION: move more state to shared memory and use a lock-free approach.
THE COST: one InterlockedCompareExchange or similar intrinsic in CHECKP, one conditional in CHECKP, and slightly more complex code due to the indirection required to access the state in shared memory from CHECKP. The implied overhead is expected to be low: Andrew Fomitchev measured roughly a 7% slowdown when evaluating a different design with a spinlock in CHECKP, which is very similar to the proposed design both in terms of code size increase and the use of an interlocked intrinsic.

=== CURRENT DESIGN OVERVIEW ===

If the buffer manager (BM) runs short of buffer memory it has to reclaim a buffer. The buffer eviction policy was FIFO a few years ago, but it doesn't matter at all in the scope of the current discussion. So BM picks a buffer somehow and attempts to reclaim it. Unmap_block_in_trs() does the job. If it fails to reclaim the buffer because the buffer was in use when the call was made, BM picks another buffer and retries. These steps repeat until unmap_block_in_trs() finally succeeds.

Unmap_block_in_trs() basically calls unmap_block_in_tr() for each registered transaction. Unmap_block_in_tr() wakes the VMM thread in the corresponding transaction's process (there is a semaphore for this purpose), then waits for a reply (another semaphore). The VMM thread suspends the main transaction thread and checks the current XPTR to decide whether it is ok to unmap the requested block. If the decision is positive, the VMM thread unmaps the block. Finally it resumes the main thread, passes the decision result back to unmap_block_in_tr() through shared memory, and wakes unmap_block_in_tr(). The design is slightly different on UNIX: instead of SuspendThread()/ResumeThread(), the VMM thread sends a signal to the main transaction thread and most of the code executes in the signal handler.
=== NEW DESIGN PROPOSAL ===

So what if the buffer manager could check the current XPTR directly and send unmap requests without having to wait for completion? The following discussion omits some technical details inherent to the distinct address spaces of TR and SM. For simplicity we describe the design as if everything happened in one address space. The real implementation will be somewhat more complicated.

So we need to move the current XPTR to shared memory, huh?

    struct VMMState {
        Xptr currentXPTR;
    };

    VMMState *pVMMState; /* allocated in the shared memory */

Wait! We won't be able to set pVMMState->currentXPTR atomically.

    struct VMMState {
        struct VMMStateData {
            Xptr currentXPTR;
        };
        VMMStateData *currentState;
        VMMStateData statesPool[2];
    };

Let's apply the common approach: if you need to modify a complex structure atomically, allocate and initialize a new one and then swap pointers atomically. Since there is a single thread modifying currentXPTR, two instances of VMMStateData are enough, so we use statesPool instead of malloc(). So far so good. But there is currently no way for the unmap request to step in.

    struct VMMState {
        struct VMMStateData {
            Xptr currentXPTR;
        };
        VMMStateData *currentStateOrSpecial;
        VMMStateData *currentState;
        VMMStateData statesPool[2];
        Mutex mutex;
        VMMUnmapCmd queuedUnmapCmds[42];
        bool didQueueOverflow;
    };

The currentStateOrSpecial field stores a pointer to the current state, much like currentState. However, the buffer manager can swap the current state pointer in currentStateOrSpecial with a special value to indicate that VMMState has been externally modified. CHECKP will update both currentState (regular assignment) and currentStateOrSpecial (InterlockedCompareExchange). Most of the time InterlockedCompareExchange will succeed, since the buffer manager didn't step in and currentStateOrSpecial == currentState. When the buffer manager processes VMMState it locks the mutex first and then swaps currentStateOrSpecial with a 'special' value.
The operation yields the former currentStateOrSpecial value, which we use to check whether the buffer we intend to unmap is currently in use. We then update the queuedUnmapCmds/didQueueOverflow fields and unlock the VMMState.

When InterlockedCompareExchange in CHECKP fails, it means that VMMState was modified by the buffer manager. In that case the "slow path" is executed: we lock the mutex and execute any queued unmap commands, or reset VMM completely if the queue has overflowed. We then store a pointer to the current state in currentStateOrSpecial and unlock the mutex. The "slow path" is obviously going to be a separate function (CHECKP code is inlined).

=== Q&A ===

Q1: Is it safe for the buffer manager to proceed without waiting for an unmap command to complete?

A1: There must be a CHECKP before the unmapped memory is going to be accessed. CHECKP performs all queued commands before processing. Don't write buggy code.

Q2: Is it safe for the buffer manager to access currentState data which is being concurrently modified?

A2: Sure. There are two instances of the state. The pattern is as follows: modify B, replace the pointer to A with the pointer to B; modify A, replace the pointer to B with the pointer to A. A state instance is never modified while it is accessible via the currentState pointer. Assume we have just swapped currentStateOrSpecial with 'special' and find out that the current state is A:

    A <- (currentStateOrSpecial <- SPECIAL)

A is not going to be modified since it is stored in currentState: currentState must point to B before A could be modified. But replacing A with B in currentState requires a currentStateOrSpecial update as well, which will fail since we've stored the special value there. The failure triggers the "slow path", and the slow path locks pVMMState->mutex. But the buffer manager locks the VMMState mutex before touching currentStateOrSpecial, so the slow path will block until we are done with the currentState data and have released the mutex.
=== CONCLUSION ===

The presented design enables reclaiming a buffer efficiently and it scales well with the number of active transactions. Transactions are still processed sequentially; this is a non-issue, however, since much lighter synchronization primitives are used and unmap requests are asynchronous. The expected performance degradation in single-user scenarios due to the CHECKP complications is negligible.

=== CONCERNING NON CHECKP-INTENSIVE CODE ===

The presented design assumes that CHECKP is executed quite often, so that queued unmap commands are processed in a timely fashion. Though queue overflow is non-fatal (we "only" need to reset the entire VMM), it implies non-trivial overhead. We obviously need a CHECKP-less mode to activate when, for example, waiting for a lock on an XML document to be granted or talking to a client through a socket.

The interesting property of CHECKP-less mode is that no CHECKPs are being executed, hence VMM is not being accessed, hence there is no need to check the current XPTR. There is still the need to send asynchronous unmap commands. Let's introduce a secondary thread to process asynchronous unmap commands sent in CHECKP-less mode. To avoid overcomplicating things let's assume that the thread receives commands through a pipe. The write end of the pipe is accessible both to the buffer manager (it sends asynchronous unmap commands) and to the transaction main thread (to request secondary thread shutdown and when leaving CHECKP-less mode, see below). Let's extend VMMState:

    struct VMMState {
        struct VMMStateData {
            Xptr currentXPTR;
        };
        VMMStateData *currentStateOrSpecial;
        VMMStateData *currentState;
        VMMStateData statesPool[2];
        Mutex mutex;
        VMMUnmapCmd queuedUnmapCmds[42];
        bool didQueueOverflow;
        bool isInCHECKPLessMode;
        int pipeFD;
        uint64_t lastAsyncCmdID;
    };

To enter CHECKP-less mode just set isInCHECKPLessMode. If isInCHECKPLessMode is set, the buffer manager will send asynchronous unmap commands using pipeFD instead of writing to queuedUnmapCmds.
lastAsyncCmdID records the ID of the last async command sent. When leaving CHECKP-less mode, unset isInCHECKPLessMode. To complete the transition from CHECKP-less to normal mode we must ensure that all asynchronous commands sent to the secondary thread are complete. We ask the secondary thread to notify the main thread as soon as the command with lastAsyncCmdID has been processed. This is implemented by sending a command through the same pipe the buffer manager uses.

OBVIOUS IDEA 1: Reset lastAsyncCmdID to 0 when entering CHECKP-less mode. If it is still 0 when we leave the mode, there is no need to synchronize with the secondary thread.

CLEVER IDEA 1: If there are many transactions in CHECKP-less mode, the buffer manager will slow down considerably due to the need to write to many pipes. Arrange transactions in a tree-like fashion: the buffer manager sends commands to the root node (1 write), the root node retransmits commands to its immediate child nodes, and so forth.

_______________________________________________
Sedna-discussion mailing list
Sedna-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sedna-discussion