Emmanuel! Wow!! Thank you for that most detailed and comprehensive write up of the concurrency problem you are working to solve with Mavibot. The design sounds world class. I now fully appreciate the complexity of this transaction issue and the ASCII art helped make it clear. We realize you're all doing your best to devise a robust solution for this and so we'll sit tight for now. I appreciate the time you took to explain this to us. Have fun at the conference. Best, Carlo Accorsi
-----Original Message----- From: Emmanuel Lécharny [mailto:[email protected]] Sent: Sunday, September 27, 2015 1:23 PM To: [email protected] Subject: Re: ApacheDS with Mavibot anytime soon? Le 23/09/15 21:41, [email protected] a écrit : > Hi, hi, sorry for the delayed response ! I saw your mail 3 days ago, but I forgot to answer back then, being caught in a storm of delaying events... > > We have one customer that wants to get replication implemented. We've set it > up in house and it seems to work well. > The trouble is there are 10's of thousands of users and all sorts of > concurrent activity. They're on M16 and a few times per month, we get the > 'ERR_554 double get for block' and then we need to restore. > We have an automated way of doing this now so it's not too disruptive when it > happens. Hopefully. Still I understand this is not convenient. > > My question is, would adding replication just add more concurrent activity to > the servers and lead more frequently to the ERR_554 situation? For the consumer, it will be just as if the updates were coming from a standard user. So, no, I don't think it's a real problem. > > Also he's asked to upgrade to M20 but my hunch is that for this issue it > won't matter much because we're still on JDBM? True. It's just that some bugs have been fixed in M20. > > Finally, we're eager to test out a release with the Mavibot backed. > I'm sure it's possible to build ourselves but we're trying to keep the moving > parts to a minimum and use only ApacheDS releases. I understand. Mavibot work has been delayed, and we are working on it as much as we can, but it's not enough. Kiran and I have day jobs atm that forbid us to dedicate as much time as would be necessary to complete the last part that is mandatory to get this problem fixed : the transaction system. We restarted to 'think' about the best implementation 2 weeks ago (on your free time) and we expect to be able to spend more time on it next week, as we will both be at Budapest, during the Apache Conference. Transaction support (the way we need it, ie cross-btree) is not that simple to impelment. We have a pretty clear idea on how it should work, but that mpacts the current design, as we need to woirk in memory, and avoid to copy pages that have already been modifed in the transaction boundaries. We also have to deal with the cleanup of old versions, which also means we need to implement the version manager. To give you a insight on what I'm coming for, here is a draft of my thoughts about this transaction manager (note that this is my vision, that I haven't shared yet, because I don't think it's covering all the moving parts, but still, I think it won't be that different to what we will end with ) : Transaction support ------------------- Mavibot must support transaction across many B-tres (ie, whatever the number of b-trees we are updating, we should wait before committing the changes up to the point the transaction is closed). That means we may have many updates pending in memory. In any case, if a rollback is done, the modified b-trees will remain intact, in the same state they were before the transaction started. To achive this, we need to work in a working memory. As we may have to fetch pages that are to be updated, we will face three cases : - the page is not in memory : we have to fetch them from disk, update them and store it in the working memory we can read it back if needed - the page is in the cache : we have to copy it to the working memory - the page is in the working memory : we simply update it, leaving it in the working memory. So if we have to modify the same page many times, it will be updated in the working memory, we won't need to copy it. That actually change the current Mavibot implementation, as it's copying pages no matter what. Here, if the page is coming from the working memory, we just update it. Layout ------ Disk Cache Working ___ Memory / \ .------. + + / /| .------. |\___/| .------. | .------. | + + | | | | | | |\___/| ----> | | | ----> | | | + + | |/ | |/ |\___/| .------. .------. + + \___/ When the transaction is rollbacked, we simply delete the content of the working memory. When the transaction is committed, we write all the pages in the working memory on disk, and into the cache, we update the header, and we purge the working memory. If a crash occurs during a transaction, then either the transaction is rollbacked (if the process is still running) - which is just about purging the working memory -, or there will be nothing to do if we have to restart the application, as the modified pages were in memory nad have been removed while the process crashed. OTOH, when we read, we must be sure that the B-trees are present, and that we always use the B-tree revisions that are correlated. This is critical when using LDAP, where many B-trees are updated during a sinlge write. The way to do that is to use a revision, and always use the B-trees which revision is just below or equal to this revision. Let's see with an example +-----+---+---+---+---+---+---+---+... |\ Rev| | | | | | | | | \ | | | | | | | | | \ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |B- \ | | | | | | | | |tree\| | | | | | | L | N +-----+---+---+---+---+---+---+---+... | aaa | X | | | X | X | X | | | | 1 | 1 | 1 | 4 | 5 | 6 | 6 | +-----+---+---+---+---+---+---+---+... | bbb | X | X | X | | X | X | X | | | 1 | 2 | 3 | 3 | 5 | 6 | 7 | +-----+---+---+---+---+---+---+---+... | ccc | X | | X | X | X | | X | | | 1 | 1 | 3 | 4 | 5 | 5 | 7 | +-----+---+---+---+---+---+---+---+... | ddd | X | X | | | X | | | | | 1 | 2 | 2 | 2 | 5 | 5 | 5 | +-----+---+---+---+---+---+---+---+... | eee | X | | X | X | | X | | | | 1 | 1 | 3 | 4 | 4 | 6 | 6 | +-----+---+---+---+---+---+---+---+... L : Latest revision N : New revision In this table, we have had 7 updates, and 7 revsions. As all the B-trees haven't been updated, we may have some 'holes' in the revision associated to a B-tree. Typically, here, the 'aaa' B-tree will have 4 revisions (1, 4, 5 and 6) created, because this B-tree has been updated 4 times. Doing a read on revision 7 will use the following B-trees : aaa[6], bbb[7], ccc[7], ddd[5], eee[6]. All in all, a read is associated to a global transaction revision which is used as a marker. Every B-tree update is done using the new revision. Managing on-going reads ----------------------- We maintain all the B-tree revisions associated with a read, until the read is completed, we need to get rid of the old revision when we are done. This is critical to keep the database size to a minimum. In order to do that, we must know if an older revision is still in use. We then need to keep a common data structure that is shared across threads, which will hold all the used revisions, ordered. With the previous example, assuming revisions 2, 4 and 6 are in use, we may use a linked list for that purpose : +---+ +---+ +---+ o--| 2 |<--| 4 |<--| 6 | +--2+ +--1+ +--3+ When the read using rev 4 is done, we remove it from this list : +-------------+ +---+ | +---+ | +---+ o--| 2 |<-+ X-| 4 | +-| 6 | +--2+ +--0+ +--3+ The two remaining revisions are still in use, and one of them is older. We can clean the B-tree pages associated with revision 4 if there is already a new revision for this B-tree, and for the pages that have been copied. When the read using rev 2 is done, we can clean all the B-tree pages that are associated with every revision below 6, as soon as there is a B-tree revision above the oldest one. An other important point : each revision might be used more than once. Releasing a read will decrement the count, and when it goes down to 0, we can safely remove the element from the list. We need a thread-safe data structure for that. Managing the list ----------------- let's go a bit deeper in the management of this list. It has interesting characteristics : - elements are added by a unique thread : the writter - elements are not removed unless the count becomes 0 - the count is equal to the maximum number of thread that can use the element at some point - it's not because the counter is N that N thread *are* using this element : we may eventually have no thread using it ! (but they *may* come back later...) - each element must be be protected agains concurrent update (which is a different requirement than protecting the full list) - any thread requiring an element in the list will always use the head of the least (which is the most recent) - we don't do any list traversal ! Once a thread uses an element, it remembers about its position - no user Thread is allowed to remove an element from this list. The only thread that can do that is the Cleaner Thread The last characteristic is interesting : that means we can safely depends on one single thread to do the cleaning, so there is no concurrent access on teh list while deleting elements. The deletion can be done in two ways : - from the tail of the list, if the element's counter is equal to zero (and we can iterate) - from the inner of the list, when the element's counter to be removed is zero The first way is simpler, but it holds potentially a lot of element if the oldest one is there for a long time The second way implies we can relink elements in the list, ie this is a double-linked list Let's assume we go for the first way. In any case, the list will *always* have at least one element, even if its counter is zero : head +---+ | N | revision +---+ | 0 | users counter (AtomicInteger) +---+ | |--o +---+ o--| | +---+ A thread wanting to use revision N will simply increment the counter, which is an AtomicInteger, guaranteing that we can't have a wrong update of this value. The revision will never change. At this point, using the head is quite straight-forward. * Adding an element So the writter creates a new revision. This will create a new element : new head +---+ +---+ |N+1| | N |<<--- T1, T2, T5 +---+ +---+ | 0 | | 3 | +---+ +---+ | |--o | |--o +---+ +---+ o--| | o--| | +---+ +---+ The Head should always be pointing to one single cell, and any thread should be able to get it. This is easy to do if we use an AtomicReference : new [head] +---+ +---+ |N+1| | N |<<--- T1, T2, T5 +---+ +---+ | 0 | | 3 | +---+ +---+ | |--o | |--o +---+ +---+ o--| | o--| | +---+ +---+ Swapping the head is done after we have updated the links, which is done by the writter thread : Step 1 : Attach the head to the new element new [head] +---+ +---+ |N+1| | N |<<--- T1, T2, T5 +---+ +---+ | 0 | | 3 | +---+ +---+ | |--o | |--o +---+ +---+ o--| |<----| | +---+ +---+ Step 2 : Attach the new element to the head (Here, at the same time, a new thread needs the latest revison (still N)) new [head] +---+ +---+ |N+1| | N |<<--- T1, T2, T5, T7 +---+ +---+ | 0 | | 4 | +---+ +---+ | |---->| |--o +---+ +---+ o--| |<----| | +---+ +---+ Step 3 : Swap the head [head] +---+ +---+ |N+1| | N |<<--- T1, T2, T5, T7 +---+ +---+ | 0 | | 4 | +---+ +---+ | |---->| |--o +---+ +---+ o--| |<----| | +---+ +---+ Step 4 : A new thread need the latest revision +--- T8 [head] | +---+ | +---+ |N+1|<<-+ | N |<<--- T1, T2, T5, T7 +---+ +---+ | 1 | | 4 | +---+ +---+ | |---->| |--o +---+ +---+ o--| |<----| | +---+ +---+ At this point, no other thread will *ever* be attached to revision N This can go on an on, with new revisions attached to the list by the writter thread. Now, let's consider two different use case : 1) when the oldest revision has a counter going down to zero 2) when a revision which is not the head nor the oldest that is going to zero First case, a first approach : This is the simplest. We have to remove the element from the lists, and check if its parent's counter is also zero and is not the head. If so, we iterate until we meet an element which counter is not zero and is not the head. In any case, we simply have to change its parent's next pointer, and to remove the element from the list : +--- T8 +-- T1, T2, T5, T7 [head] | | +---+ | +---+ | +---+ +---+ |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 This thread will release the element, setting teh counter to 0 +---+ +---+ +---+ +---+ | 1 | | 4 | | 0 | | 0 | Was 1 +---+ +---+ +---+ +---+ | |---->| |---->| |---->| |--o +---+ +---+ +---+ +---+ o--| |<----| |<----| |<----| | +---+ +---+ +---+ +---+ The thread that set the counter to 0 Step 1 : Detach the last element parent's next pointer +--- T8 +-- T1, T2, T5, T7 [head] | | +---+ | +---+ | +---+ +---+ |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 +---+ +---+ +---+ +---+ | 1 | | 4 | | 0 | | 0 | +---+ +---+ +---+ +---+ | |---->| |---->| |--o | |--o +---+ +---+ +---+ +---+ o--| |<----| |<----| |<----| | +---+ +---+ +---+ +---+ Step 2 : Iterate if the parent's counter is 0 +--- T8 +-- T1, T2, T5, T7 [head] | | +---+ | +---+ | +---+ +---+ |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 +---+ +---+ +---+ +---+ | 1 | | 4 | | 0 | | 0 | +---+ +---+ +---+ +---+ | |---->| |--o | |--o | |--o +---+ +---+ +---+ +---+ o--| |<----| |<----| |<----| | +---+ +---+ +---+ +---+ Step 3 : Detach the last element which counter is 0 prev pointer +--- T8 +-- T1, T2, T5, T7 [head] | | +---+ | +---+ | +---+ +---+ |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 +---+ +---+ +---+ +---+ | 1 | | 4 | | 0 | | 0 | +---+ +---+ +---+ +---+ | |---->| |--o | |--o | |--o +---+ +---+ +---+ +---+ o--| |<----| | o--| |<----| | +---+ +---+ +---+ +---+ Step 4 : We can now dispose the unlinked elements. What if a thread is freeing an element while we are already free another one ? This can be exposed using the previous sample : +--- T8 +-- T1, T2, T5, T7 [head] | | +---+ | +---+ | +---+ +---+ |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 This thread will release the element, setting teh counter to 0 +---+ +---+ +---+ +---+ | 1 | | 4 | | 1 | | 0 | Was 1 +---+ +---+ +---+ +---+ | |---->| |---->| |---->| |--o +---+ +---+ +---+ +---+ o--| |<----| |<----| |<----| | +---+ +---+ +---+ +---+ Step 0+ A thread (10) release the N-1 revison. We have two possibilities here : 1) The N-2 parent's next is not yet detached +--- T8 +-- T1, T2, T5, T7 [head] | | +---+ | +---+ | +---+ +---+ |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 This thread will release the element, setting teh counter to 0 +---+ +---+ +---+ +---+ | 1 | | 4 | | 0 | | 0 | Was 1 +---+ +---+ +---+ +---+ | |---->| |---->| |---->| |--o +---+ +---+ +---+ +---+ o--| |<----| |<----| |<----| | +---+ +---+ +---+ +---+ We just continue walking the list up to the first element with a non zero counter 2) The N-2 paren'ts next is already detached +--- T8 +-- T1, T2, T5, T7 [head] | | +-- T10 +---+ | +---+ | +---+ | +---+ |N+1|<<-+ | N |<<-+ |N-1|x--+ |N-2|x---- T9 This thread will release the element, setting teh counter to 0 +---+ +---+ +---+ +---+ | 1 | | 4 | | 0 | | 0 | Was 1 +---+ +---+ +---+ +---+ | |---->| |---->| |--o | |--o +---+ +---+ +---+ +---+ o--| |<----| |<----| |<----| | +---+ +---+ +---+ +---+ In this case, we will have 2 threads dealing with two elements to be cleaned up. This is not a problem, as T9 will now detach the N-2 element from teh N-1 one and T9 will detach the N-1 element from the list. All is safe. About the removed elements -------------------------- Once a thread has detached some elements which counters are equal to 0, what should we do with them ? As we will update the disk with the removed pages, we have to send them to teh Writter thread. This can be done by pushing teminto a deque (and more specifically a ConcurrentLinkedDeque) Another option is that the writer thread handle the cleaning of this queue when it is weaken up or periodically. In this case, we don't need to push the element sinto a deque, and the reader threads will just set the counter to 0. Second case ----------- What if we accept the removal of elements from the middle of the list ? This is way more complex, as we have to be sure we don't break the chain. One way to get rid of this constraint is to enforce the writer thread to do the job, by periodically checing the whole list (which shuld not be too costly). In this case, the reader thread never update the element's parent's next pointer, they just decrement teh counter. When this counter becomes 0, then we can remove the element. Solution ======== We will separate the threads into 2 categories : - readers (we may have many) - writer (we have only one) The readers always peek the latest revision (which is the head of the list). It then increment the counter. When teh thread does not need the revision anymore, the counter is decremented. This is the only action the reader thread will ever do on the elements and on the list, but it will always keep a reference to the element into the session. The writer thread has a lot more to do : - it is responsible for adding element at the top of the list (see upper). Once the head is swapped, every new reader will get the added element as a reference. - it also has to check in the list for the elements which counter is 0, if teh list is longer than 1 element. This check can be done after each update, or periodically (every N seconds if the number of elements is > 1, or after M additions) if tht help saving updates on disk. Referencing pages ----------------- All the B-tree revisions are kept into the BoB B-tree, so it's easy to retrieve the rootPage for a given revision. As all the pages can't be hold in memory, we sometime have to reach a page from disk or from the cache. This is done following this simple algorithm : for a given page with offset O if the cache contains an instance for O return the page else fetch the page with offset O from disk store it in the cache In our case, we just add a layer, which will check in the working memory for the instance of a page, given its offset.
