Carlos, Thanks for the help.
@Emmanuel - I know you have a busy schedule however, is there a timeline to when we can expect a fix? Currently we are working a system which might make this scenario of concurrent update and search occur more. Ashma On Thu, Mar 24, 2016 at 6:02 PM, Accorsi, Carlo <carlo.acco...@siemens.com> wrote: > HI Ashma, > We have all our directory content dumped out as an LDIF on a periodic > schedule. If the system crashes, we delete the partition folders on the > file system, restart the server and re-inject the entries from the ldif. > > If your directory is small < 1000 entries, you can just get them perhaps > all in one search result. > Otherwise use a paged result to loop through each entry and call: > > String ldif = LdifUtils.convertToLdif(entry) > > and write / append the string value to a file. This is your backup ldif > file. > > Hope this helps.. > > > > > -----Original Message----- > From: Ashma Shrestha [mailto:ashma.shres...@gmail.com] > Sent: Thursday, March 24, 2016 4:38 PM > To: users@directory.apache.org > Subject: Re: ApacheDS with Mavibot anytime soon? > > Hi Carlos, > > You have mentioned "we get the 'ERR_554 double get for block' and then we > need to restore. We have an automated way of doing this now so it's not too > disruptive when it happens." . I am running into the same ERR_554 issue, > can you let me know how you fixed it? > > Thanks, > > Ashma > > On Mon, Sep 28, 2015 at 9:15 AM, <carlo.acco...@ibs-ag.com> wrote: > > > Emmanuel! Wow!! Thank you for that most detailed and comprehensive > > write up of the concurrency problem you are working to solve with > > Mavibot. The design sounds world class. > > I now fully appreciate the complexity of this transaction issue and > > the ASCII art helped make it clear. We realize you're all doing your > > best to devise a robust solution for this and so we'll sit tight for now. > > I appreciate the time you took to explain this to us. Have fun at the > > conference. > > Best, Carlo Accorsi > > > > > > > > > > -----Original Message----- > > From: Emmanuel Lécharny [mailto:elecha...@gmail.com] > > Sent: Sunday, September 27, 2015 1:23 PM > > To: users@directory.apache.org > > Subject: Re: ApacheDS with Mavibot anytime soon? > > > > Le 23/09/15 21:41, carlo.acco...@ibs-ag.com a écrit : > > > Hi, > > > > hi, > > > > sorry for the delayed response ! I saw your mail 3 days ago, but I > > forgot to answer back then, being caught in a storm of delaying events... > > > > > > We have one customer that wants to get replication implemented. > > > We've > > set it up in house and it seems to work well. > > > The trouble is there are 10's of thousands of users and all sorts of > > concurrent activity. They're on M16 and a few times per month, we get > > the > > 'ERR_554 double get for block' and then we need to restore. > > > We have an automated way of doing this now so it's not too > > > disruptive > > when it happens. > > Hopefully. Still I understand this is not convenient. > > > > > > My question is, would adding replication just add more concurrent > > activity to the servers and lead more frequently to the ERR_554 > situation? > > For the consumer, it will be just as if the updates were coming from a > > standard user. So, no, I don't think it's a real problem. > > > > > > Also he's asked to upgrade to M20 but my hunch is that for this > > > issue it > > won't matter much because we're still on JDBM? > > > > True. It's just that some bugs have been fixed in M20. > > > > > > Finally, we're eager to test out a release with the Mavibot backed. > > > I'm sure it's possible to build ourselves but we're trying to keep > > > the > > moving parts to a minimum and use only ApacheDS releases. > > I understand. > > > > Mavibot work has been delayed, and we are working on it as much as we > > can, but it's not enough. Kiran and I have day jobs atm that forbid us > > to dedicate as much time as would be necessary to complete the last > > part that is mandatory to get this problem fixed : the transaction > > system. We restarted to 'think' about the best implementation 2 weeks > > ago (on your free time) and we expect to be able to spend more time on > > it next week, as we will both be at Budapest, during the Apache > Conference. > > > > Transaction support (the way we need it, ie cross-btree) is not that > > simple to impelment. We have a pretty clear idea on how it should > > work, but that mpacts the current design, as we need to woirk in > > memory, and avoid to copy pages that have already been modifed in the > > transaction boundaries. We also have to deal with the cleanup of old > > versions, which also means we need to implement the version manager. > > > > To give you a insight on what I'm coming for, here is a draft of my > > thoughts about this transaction manager (note that this is my vision, > > that I haven't shared yet, because I don't think it's covering all the > > moving parts, but still, I think it won't be that different to what we > > will end with ) : > > > > Transaction support > > ------------------- > > > > Mavibot must support transaction across many B-tres (ie, whatever the > > number of b-trees we are updating, we should wait before committing > > the changes up to the point the transaction is closed). That means we > > may have many updates pending in memory. In any case, if a rollback is > > done, the modified b-trees will remain intact, in the same state they > > were before the transaction started. > > > > To achive this, we need to work in a working memory. As we may have to > > fetch pages that are to be updated, we will face three cases : > > - the page is not in memory : > > we have to fetch them from disk, update them and store it in the > > working memory we can read it back if needed > > - the page is in the cache : > > we have to copy it to the working memory > > - the page is in the working memory : we simply update it, leaving it > > in the working memory. > > > > So if we have to modify the same page many times, it will be updated > > in the working memory, we won't need to copy it. > > > > That actually change the current Mavibot implementation, as it's > > copying pages no matter what. Here, if the page is coming from the > > working memory, we just update it. > > > > > > Layout > > ------ > > > > Disk Cache Working > > ___ Memory > > / \ .------. > > + + / /| .------. > > |\___/| .------. | .------. | > > + + | | | | | | > > |\___/| ----> | | | ----> | | | > > + + | |/ | |/ > > |\___/| .------. .------. > > + + > > \___/ > > > > When the transaction is rollbacked, we simply delete the content of > > the working memory. > > When the transaction is committed, we write all the pages in the > > working memory on disk, and into the cache, we update the header, and > > we purge the working memory. > > > > If a crash occurs during a transaction, then either the transaction > > is rollbacked (if the process is still running) - which is just about > > purging the working memory -, or there will be nothing to do if we > > have to restart the application, as the modified pages were in memory > > nad have been removed while the process crashed. > > > > OTOH, when we read, we must be sure that the B-trees are present, and > > that we always use the B-tree revisions that are correlated. This is > > critical when using LDAP, where many B-trees are updated during a sinlge > write. > > > > The way to do that is to use a revision, and always use the B-trees > > which revision is just below or equal to this revision. Let's see with > > an example > > > > +-----+---+---+---+---+---+---+---+... > > |\ Rev| | | | | | | | > > | \ | | | | | | | | > > | \ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 > > |B- \ | | | | | | | | > > |tree\| | | | | | | L | N > > +-----+---+---+---+---+---+---+---+... > > | aaa | X | | | X | X | X | | > > | | 1 | 1 | 1 | 4 | 5 | 6 | 6 | > > +-----+---+---+---+---+---+---+---+... > > | bbb | X | X | X | | X | X | X | > > | | 1 | 2 | 3 | 3 | 5 | 6 | 7 | > > +-----+---+---+---+---+---+---+---+... > > | ccc | X | | X | X | X | | X | > > | | 1 | 1 | 3 | 4 | 5 | 5 | 7 | > > +-----+---+---+---+---+---+---+---+... > > | ddd | X | X | | | X | | | > > | | 1 | 2 | 2 | 2 | 5 | 5 | 5 | > > +-----+---+---+---+---+---+---+---+... > > | eee | X | | X | X | | X | | > > | | 1 | 1 | 3 | 4 | 4 | 6 | 6 | > > +-----+---+---+---+---+---+---+---+... > > > > L : Latest revision > > N : New revision > > > > In this table, we have had 7 updates, and 7 revsions. As all the > > B-trees haven't been updated, we may have some 'holes' in the revision > > associated to a B-tree. Typically, here, the 'aaa' B-tree will have 4 > > revisions (1, 4, > > 5 and 6) created, because this B-tree has been updated > > 4 times. > > > > Doing a read on revision 7 will use the following B-trees : aaa[6], > > bbb[7], ccc[7], ddd[5], eee[6]. > > > > > > All in all, a read is associated to a global transaction revision > > which is used as a marker. Every B-tree update is done using the new > revision. > > > > Managing on-going reads > > ----------------------- > > > > We maintain all the B-tree revisions associated with a read, until the > > read is completed, we need to get rid of the old revision when we are > done. > > This is critical to keep the database size to a minimum. In order to > > do that, we must know if an older revision is still in use. We then > > need to keep a common data structure that is shared across threads, > > which will hold all the used revisions, ordered. With the previous > > example, assuming revisions 2, 4 and 6 are in use, we may use a linked > list for that purpose : > > > > +---+ +---+ +---+ > > o--| 2 |<--| 4 |<--| 6 | > > +--2+ +--1+ +--3+ > > > > When the read using rev 4 is done, we remove it from this list : > > +-------------+ > > +---+ | +---+ | +---+ > > o--| 2 |<-+ X-| 4 | +-| 6 | > > +--2+ +--0+ +--3+ > > > > The two remaining revisions are still in use, and one of them is older. > > We can clean the B-tree pages associated with revision 4 if there is > > already a new revision for this B-tree, and for the pages that have > > been copied. > > > > When the read using rev 2 is done, we can clean all the B-tree pages > > that are associated with every revision below 6, as soon as there is a > > B-tree revision above the oldest one. > > > > An other important point : each revision might be used more than once. > > Releasing a read will decrement the count, and when it goes down to 0, > > we can safely remove the element from the list. We need a thread-safe > > data structure for that. > > > > Managing the list > > ----------------- > > > > let's go a bit deeper in the management of this list. It has > > interesting characteristics : > > > > - elements are added by a unique thread : the writter > > - elements are not removed unless the count becomes 0 > > - the count is equal to the maximum number of thread that can use the > > element at some point > > - it's not because the counter is N that N thread *are* using this > > element > > : we may eventually have no thread using it ! (but they *may* come > > back > > later...) > > - each element must be be protected agains concurrent update (which is > > a different requirement than protecting the full list) > > - any thread requiring an element in the list will always use the head > > of the least (which is the most recent) > > - we don't do any list traversal ! Once a thread uses an element, it > > remembers about its position > > - no user Thread is allowed to remove an element from this list. The > > only thread that can do that is the Cleaner Thread > > > > > > The last characteristic is interesting : that means we can safely > > depends on one single thread to do the cleaning, so there is no > > concurrent access on teh list while deleting elements. > > > > The deletion can be done in two ways : > > - from the tail of the list, if the element's counter is equal to zero > > (and we can iterate) > > - from the inner of the list, when the element's counter to be removed > > is zero > > > > The first way is simpler, but it holds potentially a lot of element if > > the oldest one is there for a long time The second way implies we can > > relink elements in the list, ie this is a double-linked list > > > > Let's assume we go for the first way. In any case, the list will > > *always* have at least one element, even if its counter is zero : > > > > head > > +---+ > > | N | revision > > +---+ > > | 0 | users counter (AtomicInteger) > > +---+ > > | |--o > > +---+ > > o--| | > > +---+ > > > > A thread wanting to use revision N will simply increment the counter, > > which is an AtomicInteger, guaranteing that we can't have a wrong > > update of this value. The revision will never change. > > At this point, using the head is quite straight-forward. > > > > * Adding an element > > > > So the writter creates a new revision. This will create a new element : > > > > new head > > +---+ +---+ > > |N+1| | N |<<--- T1, T2, T5 > > +---+ +---+ > > | 0 | | 3 | > > +---+ +---+ > > | |--o | |--o > > +---+ +---+ > > o--| | o--| | > > +---+ +---+ > > > > The Head should always be pointing to one single cell, and any thread > > should be able to get it. This is easy to do if we use an > AtomicReference : > > > > new [head] > > +---+ +---+ > > |N+1| | N |<<--- T1, T2, T5 > > +---+ +---+ > > | 0 | | 3 | > > +---+ +---+ > > | |--o | |--o > > +---+ +---+ > > o--| | o--| | > > +---+ +---+ > > > > Swapping the head is done after we have updated the links, which is > > done by the writter thread : > > > > Step 1 : > > > > Attach the head to the new element > > > > new [head] > > +---+ +---+ > > |N+1| | N |<<--- T1, T2, T5 > > +---+ +---+ > > | 0 | | 3 | > > +---+ +---+ > > | |--o | |--o > > +---+ +---+ > > o--| |<----| | > > +---+ +---+ > > > > Step 2 : > > > > Attach the new element to the head (Here, at the same time, a new > > thread needs the latest revison (still N)) > > > > > > new [head] > > +---+ +---+ > > |N+1| | N |<<--- T1, T2, T5, T7 > > +---+ +---+ > > | 0 | | 4 | > > +---+ +---+ > > | |---->| |--o > > +---+ +---+ > > o--| |<----| | > > +---+ +---+ > > > > Step 3 : > > > > Swap the head > > > > [head] > > +---+ +---+ > > |N+1| | N |<<--- T1, T2, T5, T7 > > +---+ +---+ > > | 0 | | 4 | > > +---+ +---+ > > | |---->| |--o > > +---+ +---+ > > o--| |<----| | > > +---+ +---+ > > > > Step 4 : > > > > A new thread need the latest revision > > > > > > +--- T8 > > [head] | > > +---+ | +---+ > > |N+1|<<-+ | N |<<--- T1, T2, T5, T7 > > +---+ +---+ > > | 1 | | 4 | > > +---+ +---+ > > | |---->| |--o > > +---+ +---+ > > o--| |<----| | > > +---+ +---+ > > > > At this point, no other thread will *ever* be attached to revision N > > > > This can go on an on, with new revisions attached to the list by the > > writter thread. Now, let's consider two different use case : > > 1) when the oldest revision has a counter going down to zero > > 2) when a revision which is not the head nor the oldest that is going > > to zero > > > > First case, a first approach : > > > > This is the simplest. We have to remove the element from the lists, > > and check if its parent's counter is also zero and is not the head. If > > so, we iterate until we meet an element which counter is not zero and > > is not the head. > > > > In any case, we simply have to change its parent's next pointer, and > > to remove the element from the list : > > > > +--- T8 +-- T1, T2, T5, T7 > > [head] | | > > +---+ | +---+ | +---+ +---+ > > |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 This thread will release > > the element, setting teh counter to 0 > > +---+ +---+ +---+ +---+ > > | 1 | | 4 | | 0 | | 0 | Was 1 > > +---+ +---+ +---+ +---+ > > | |---->| |---->| |---->| |--o > > +---+ +---+ +---+ +---+ > > o--| |<----| |<----| |<----| | > > +---+ +---+ +---+ +---+ > > > > The thread that set the counter to 0 > > > > Step 1 : > > > > Detach the last element parent's next pointer > > > > +--- T8 +-- T1, T2, T5, T7 > > [head] | | > > +---+ | +---+ | +---+ +---+ > > |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 > > +---+ +---+ +---+ +---+ > > | 1 | | 4 | | 0 | | 0 | > > +---+ +---+ +---+ +---+ > > | |---->| |---->| |--o | |--o > > +---+ +---+ +---+ +---+ > > o--| |<----| |<----| |<----| | > > +---+ +---+ +---+ +---+ > > > > Step 2 : > > Iterate if the parent's counter is 0 > > > > +--- T8 +-- T1, T2, T5, T7 > > [head] | | > > +---+ | +---+ | +---+ +---+ > > |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 > > +---+ +---+ +---+ +---+ > > | 1 | | 4 | | 0 | | 0 | > > +---+ +---+ +---+ +---+ > > | |---->| |--o | |--o | |--o > > +---+ +---+ +---+ +---+ > > o--| |<----| |<----| |<----| | > > +---+ +---+ +---+ +---+ > > > > Step 3 : > > Detach the last element which counter is 0 prev pointer > > > > +--- T8 +-- T1, T2, T5, T7 > > [head] | | > > +---+ | +---+ | +---+ +---+ > > |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 > > +---+ +---+ +---+ +---+ > > | 1 | | 4 | | 0 | | 0 | > > +---+ +---+ +---+ +---+ > > | |---->| |--o | |--o | |--o > > +---+ +---+ +---+ +---+ > > o--| |<----| | o--| |<----| | > > +---+ +---+ +---+ +---+ > > > > Step 4 : > > > > We can now dispose the unlinked elements. > > > > What if a thread is freeing an element while we are already free > > another one ? This can be exposed using the previous sample : > > > > +--- T8 +-- T1, T2, T5, T7 > > [head] | | > > +---+ | +---+ | +---+ +---+ > > |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 This thread will release > > the element, setting teh counter to 0 > > +---+ +---+ +---+ +---+ > > | 1 | | 4 | | 1 | | 0 | Was 1 > > +---+ +---+ +---+ +---+ > > | |---->| |---->| |---->| |--o > > +---+ +---+ +---+ +---+ > > o--| |<----| |<----| |<----| | > > +---+ +---+ +---+ +---+ > > > > Step 0+ > > > > A thread (10) release the N-1 revison. We have two possibilities here : > > > > 1) The N-2 parent's next is not yet detached > > > > > > +--- T8 +-- T1, T2, T5, T7 > > [head] | | > > +---+ | +---+ | +---+ +---+ > > |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 This thread will release > > the element, setting teh counter to 0 > > +---+ +---+ +---+ +---+ > > | 1 | | 4 | | 0 | | 0 | Was 1 > > +---+ +---+ +---+ +---+ > > | |---->| |---->| |---->| |--o > > +---+ +---+ +---+ +---+ > > o--| |<----| |<----| |<----| | > > +---+ +---+ +---+ +---+ > > > > We just continue walking the list up to the first element with a non > > zero counter > > > > 2) The N-2 paren'ts next is already detached > > > > > > > > +--- T8 +-- T1, T2, T5, T7 > > [head] | | +-- T10 > > +---+ | +---+ | +---+ | +---+ > > |N+1|<<-+ | N |<<-+ |N-1|x--+ |N-2|x---- T9 This thread will > > release the element, setting teh counter to 0 > > +---+ +---+ +---+ +---+ > > | 1 | | 4 | | 0 | | 0 | Was 1 > > +---+ +---+ +---+ +---+ > > | |---->| |---->| |--o | |--o > > +---+ +---+ +---+ +---+ > > o--| |<----| |<----| |<----| | > > +---+ +---+ +---+ +---+ > > > > In this case, we will have 2 threads dealing with two elements to be > > cleaned up. This is not a problem, as T9 will now detach the N-2 > > element from teh N-1 one and T9 will detach the N-1 element from the > > list. All is safe. > > > > About the removed elements > > -------------------------- > > > > Once a thread has detached some elements which counters are equal to > > 0, what should we do with them ? As we will update the disk with the > > removed pages, we have to send them to teh Writter thread. This can be > > done by pushing teminto a deque (and more specifically a > > ConcurrentLinkedDeque) > > > > Another option is that the writer thread handle the cleaning of this > > queue when it is weaken up or periodically. In this case, we don't > > need to push the element sinto a deque, and the reader threads will > > just set the counter to 0. > > > > Second case > > ----------- > > > > What if we accept the removal of elements from the middle of the list ? > > This is way more complex, as we have to be sure we don't break the chain. > > One way to get rid of this constraint is to enforce the writer thread > > to do the job, by periodically checing the whole list (which shuld not > > be too costly). In this case, the reader thread never update the > > element's parent's next pointer, they just decrement teh counter. > > When this counter becomes 0, then we can remove the element. > > > > Solution > > ======== > > > > We will separate the threads into 2 categories : > > - readers (we may have many) > > - writer (we have only one) > > > > The readers always peek the latest revision (which is the head of the > > list). It then increment the counter. When teh thread does not need > > the revision anymore, the counter is decremented. This is the only > > action the reader thread will ever do on the elements and on the list, > > but it will always keep a reference to the element into the session. > > > > The writer thread has a lot more to do : > > - it is responsible for adding element at the top of the list (see > upper). > > Once the head is swapped, every new reader will get the added element > > as a reference. > > - it also has to check in the list for the elements which counter is > > 0, if teh list is longer than 1 element. This check can be done after > > each update, or periodically (every N seconds if the number of > > elements is > 1, or after M additions) if tht help saving updates on > disk. > > > > Referencing pages > > ----------------- > > > > All the B-tree revisions are kept into the BoB B-tree, so it's easy to > > retrieve the rootPage for a given revision. > > > > As all the pages can't be hold in memory, we sometime have to reach a > > page from disk or from the cache. This is done following this simple > algorithm : > > > > for a given page with offset O > > if the cache contains an instance for O > > return the page > > else > > fetch the page with offset O from disk > > store it in the cache > > > > In our case, we just add a layer, which will check in the working > > memory for the instance of a page, given its offset. > > > > > > > > > -- > Ashma Shrestha > PHP Developer > (770) 310-9456 > ashma.shres...@gmail.com > -- Ashma Shrestha PHP Developer (770) 310-9456 ashma.shres...@gmail.com