RE: ApacheDS with Mavibot anytime soon?

Carlo.Accorsi Mon, 28 Sep 2015 07:15:55 -0700

Emmanuel! Wow!! Thank you for that most detailed and comprehensive write up of 
the concurrency problem you are working to solve with Mavibot.  The design 
sounds world class.
I now fully appreciate the complexity of this transaction issue and the ASCII 
art helped make it clear. We realize you're all doing your 
best to devise a robust solution for this and so we'll sit tight for now. I 
appreciate the time you took to explain this to us. Have fun at the conference. 
Best, Carlo Accorsi

-----Original Message-----
From: Emmanuel Lécharny [mailto:[email protected]] 
Sent: Sunday, September 27, 2015 1:23 PM
To: [email protected]
Subject: Re: ApacheDS with Mavibot anytime soon?

Le 23/09/15 21:41, [email protected] a écrit :
> Hi,

hi,

sorry for the delayed response ! I saw your mail 3 days ago, but I forgot to 
answer back then, being caught in a storm of delaying events...
>
> We have one customer that wants to get replication implemented. We've set it 
> up in house and it seems to work well.
> The trouble is there are 10's of thousands of users and all sorts of 
> concurrent activity. They're on M16 and a few times per month, we get the 
> 'ERR_554 double get for block' and then we need to restore.
> We have an automated way of doing this now so it's not too disruptive when it 
> happens.
Hopefully. Still I understand this is not convenient.
>
> My question is, would adding replication just add more concurrent activity to 
> the servers and lead more frequently to the ERR_554 situation?
For the consumer, it will be just as if the updates were coming from a standard 
user. So, no, I don't think it's a real problem.
>
> Also he's asked to upgrade to M20 but my hunch is that for this issue it 
> won't matter much because we're still on JDBM?

True. It's just that some bugs have been fixed in M20.
>
> Finally, we're eager to test out a release with the Mavibot backed. 
> I'm sure it's possible to build ourselves but we're trying to keep the moving 
> parts to a minimum and use only ApacheDS releases.
I understand.

Mavibot work has been delayed, and we are working on it as much as we can, but 
it's not enough. Kiran and I have day jobs atm that forbid us to dedicate as 
much time as would be necessary to complete the last part that is mandatory to 
get this problem fixed : the transaction system. We restarted to 'think' about 
the best implementation 2 weeks ago (on your free time) and we expect to be 
able to spend more time on it next week, as we will both be at Budapest, during 
the Apache Conference.

Transaction  support (the way we need it, ie cross-btree) is not that simple to 
impelment. We have a pretty clear idea on how it should work, but that mpacts 
the current design, as we need to woirk in memory, and avoid to copy pages that 
have already been modifed in the transaction boundaries. We also have to deal 
with the cleanup of old versions, which also means we need to implement the 
version manager.

To give you a insight on what I'm coming for, here is a draft of my thoughts 
about this transaction manager (note that this is my vision, that I haven't 
shared yet, because I don't think it's covering all the moving parts, but 
still, I think it won't be that different to what we will end with ) :

Transaction support
-------------------

Mavibot must support transaction across many B-tres (ie, whatever the number of 
b-trees we are updating, we should wait before committing the changes up to the 
point the transaction is closed). That means we may have many updates pending 
in memory. In any case, if a rollback is done, the modified b-trees will remain 
intact, in the same state they were before the transaction started.

To achive this, we need to work in a working memory. As we may have to fetch 
pages that are to be updated, we will face three cases :
- the page is not in memory :
we have to fetch them from disk, update them and store it in the working memory 
we can read it back if needed
- the page is in the cache :
we have to copy it to the working memory
- the page is in the working memory : we simply update it, leaving it in the 
working memory.

So if we have to modify the same page many times, it will be updated in the 
working memory, we won't need to copy it.

That actually change the current Mavibot implementation, as it's copying pages 
no matter what. Here, if the page is coming from the working memory, we just 
update it.

Layout
------

 Disk            Cache           Working
  ___                            Memory
 /   \          .------.       
+     +        /      /|         .------.
|\___/|       .------. |       .------. |
+     +       |      | |       |      | |
|\___/| ----> |      | | ----> |      | |
+     +       |      |/        |      |/
|\___/|       .------.         .------.
+     +
 \___/

 When the transaction is rollbacked, we simply delete the content of the 
working memory.
 When the transaction is committed, we write all the pages in the working 
memory on disk, and into the cache, we update the header, and we purge the 
working memory.

 If a crash occurs during a transaction, then either the transaction is 
rollbacked (if the process is still running) - which is just about purging the 
working memory -, or there will be nothing to do if we have to restart the 
application, as the modified pages were in memory nad have been removed while 
the process crashed.

OTOH, when we read, we must be sure that the B-trees are present, and that we 
always use the B-tree revisions that are correlated. This is critical when 
using LDAP, where many B-trees are updated during a sinlge write.

The way to do that is to use a revision, and always use the B-trees which 
revision is just below or equal to this revision. Let's see with an example

 +-----+---+---+---+---+---+---+---+...
 |\ Rev|   |   |   |   |   |   |   |
 | \   |   |   |   |   |   |   |   |
 |  \  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
 |B- \ |   |   |   |   |   |   |   |
 |tree\|   |   |   |   |   |   | L | N
 +-----+---+---+---+---+---+---+---+...
 | aaa | X |   |   | X | X | X |   |
 |     | 1 | 1 | 1 | 4 | 5 | 6 | 6 |
 +-----+---+---+---+---+---+---+---+...
 | bbb | X | X | X |   | X | X | X |
 |     | 1 | 2 | 3 | 3 | 5 | 6 | 7 |
 +-----+---+---+---+---+---+---+---+...
 | ccc | X |   | X | X | X |   | X |
 |     | 1 | 1 | 3 | 4 | 5 | 5 | 7 |
 +-----+---+---+---+---+---+---+---+...
 | ddd | X | X |   |   | X |   |   |
 |     | 1 | 2 | 2 | 2 | 5 | 5 | 5 |
 +-----+---+---+---+---+---+---+---+...
 | eee | X |   | X | X |   | X |   |
 |     | 1 | 1 | 3 | 4 | 4 | 6 | 6 |
 +-----+---+---+---+---+---+---+---+...

L : Latest revision
N : New revision

In this table, we have had 7 updates, and 7 revsions. As all the B-trees 
haven't been updated, we may have some 'holes' in the revision associated to a 
B-tree. Typically, here, the 'aaa' B-tree will have 4 revisions (1, 4, 5 and 6) 
created, because this B-tree has been updated
4 times.

Doing a read on revision 7 will use the following B-trees : aaa[6], bbb[7], 
ccc[7], ddd[5], eee[6].

All in all, a read is associated to a global transaction revision which is used 
as a marker. Every B-tree update is done using the new revision.

Managing on-going reads
-----------------------

We maintain all the B-tree revisions associated with a read, until the read is 
completed, we need to get rid of the old revision when we are done. This is 
critical to keep the database size to a minimum. In order to do that, we must 
know if an older revision is still in use. We then need to keep a common data 
structure that is shared across threads, which will hold all the used 
revisions, ordered. With the previous example, assuming revisions 2, 4 and 6 
are in use, we may use a linked list for that purpose :

   +---+   +---+   +---+
o--| 2 |<--| 4 |<--| 6 |
   +--2+   +--1+   +--3+

When the read using rev 4 is done, we remove it from this list :
          +-------------+
   +---+  |    +---+    | +---+
o--| 2 |<-+  X-| 4 |    +-| 6 |
   +--2+       +--0+      +--3+

The two remaining revisions are still in use, and one of them is older.
We can clean the B-tree pages associated with revision 4 if there is already a 
new revision for this B-tree, and for the pages that have been copied.

When the read using rev 2 is done, we can clean all the B-tree pages that are 
associated with every revision below 6, as soon as there is a B-tree revision 
above the oldest one.

An other important point : each revision might be used more than once.
Releasing a read will decrement the count, and when it goes down to 0, we can 
safely remove the element from the list. We need a thread-safe data structure 
for that.

Managing the list
-----------------

let's go a bit deeper in the management of this list. It has interesting 
characteristics :

- elements are added by a unique thread : the writter
- elements are not removed unless the count becomes 0
- the count is equal to the maximum number of thread that can use the element 
at some point
- it's not because the counter is N that N thread *are* using this element : we 
may eventually have no thread using it ! (but they *may* come back later...)
- each element must be be protected agains concurrent update (which is a 
different requirement than protecting the full list)
- any thread requiring an element in the list will always use the head of the 
least (which is the most recent)
- we don't do any list traversal ! Once a thread uses an element, it remembers 
about its position
- no user Thread is allowed to remove an element from this list. The only 
thread that can do that is the Cleaner Thread

The last characteristic is interesting : that means we can safely depends on 
one single thread to do the cleaning, so there is no concurrent access on teh 
list while deleting elements.

The deletion can be done in two ways :
- from the tail of the list, if the element's counter is equal to zero (and we 
can iterate)
- from the inner of the list, when the element's counter to be removed is zero

The first way is simpler, but it holds potentially a lot of element if the 
oldest one is there for a long time The second way implies we can relink 
elements in the list, ie this is a double-linked list

Let's assume we go for the first way. In any case, the list will
*always* have at least one element, even if its counter is zero :

   head
   +---+
   | N | revision
   +---+
   | 0 | users counter (AtomicInteger)
   +---+
   |   |--o
   +---+
o--|   |
   +---+

A thread wanting to use revision N will simply increment the counter, which is 
an AtomicInteger, guaranteing that we can't have a wrong update of this value. 
The revision will never change.
At this point, using the head is quite straight-forward.

* Adding an element

So the writter creates a new revision. This will create a new element :

    new      head
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5
   +---+     +---+
   | 0 |     | 3 |
   +---+     +---+
   |   |--o  |   |--o
   +---+     +---+
o--|   |  o--|   |
   +---+     +---+

The Head should always be pointing to one single cell, and any thread should be 
able to get it. This is easy to do if we use an AtomicReference :

    new      [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5
   +---+     +---+
   | 0 |     | 3 |
   +---+     +---+
   |   |--o  |   |--o
   +---+     +---+
o--|   |  o--|   |
   +---+     +---+

Swapping the head is done after we have updated the links, which is done by the 
writter thread :

Step 1 :

Attach the head to the new element

    new      [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5
   +---+     +---+
   | 0 |     | 3 |
   +---+     +---+
   |   |--o  |   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+

Step 2 :

Attach the new element to the head (Here, at the same time, a new thread needs 
the latest revison (still N))

    new      [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5, T7
   +---+     +---+
   | 0 |     | 4 |
   +---+     +---+
   |   |---->|   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+

Step 3 :

Swap the head

   [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5, T7
   +---+     +---+
   | 0 |     | 4 |
   +---+     +---+
   |   |---->|   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+

Step 4 :

A new thread need the latest revision

           +--- T8
   [head]  |
   +---+   | +---+
   |N+1|<<-+ | N |<<--- T1, T2, T5, T7
   +---+     +---+
   | 1 |     | 4 |
   +---+     +---+
   |   |---->|   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+

At this point, no other thread will *ever* be attached to revision N

This can go on an on, with new revisions attached to the list by the writter 
thread. Now, let's consider two different use case :
1) when the oldest revision has a counter going down to zero
2) when a revision which is not the head nor the oldest that is going to zero

First case, a first approach :

This is the simplest. We have to remove the element from the lists, and check 
if its parent's counter is also zero and is not the head. If so, we iterate 
until we meet an element which counter is not zero and is not the head.

In any case, we simply have to change its parent's next pointer, and to remove 
the element from the list :

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |---->|   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

The thread that set the counter to 0

Step 1 :

Detach the last element parent's next pointer

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |--o  |   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

Step 2 :
Iterate if the parent's counter is 0

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |
   +---+     +---+     +---+     +---+
   |   |---->|   |--o  |   |--o  |   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

Step 3 :
Detach the last element which counter is 0 prev pointer

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   |     +---+     +---+
   |N+1|<<-+ | N |<<-+     |N-1|     |N-2|x---- T9
   +---+     +---+         +---+     +---+
   | 1 |     | 4 |         | 0 |     | 0 |
   +---+     +---+         +---+     +---+
   |   |---->|   |--o      |   |--o  |   |--o
   +---+     +---+         +---+     +---+
o--|   |<----|   |      o--|   |<----|   |
   +---+     +---+         +---+     +---+

Step 4 :

 We can now dispose the unlinked elements.

What if a thread is freeing an element while we are already free another one ? 
This can be exposed using the previous sample :

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 1 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |---->|   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

Step 0+

A thread (10) release the N-1 revison. We have two possibilities here :

1) The N-2 parent's next is not yet detached

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |---->|   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

We just continue walking the list up to the first element with a non zero 
counter

2) The N-2 paren'ts next is already detached

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |         +-- T10
   +---+   | +---+   | +---+   | +---+
   |N+1|<<-+ | N |<<-+ |N-1|x--+ |N-2|x---- T9 This thread will release the 
element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |--o  |   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

In this case, we will have 2 threads dealing with two elements to be cleaned 
up. This is not a problem, as T9 will now detach the N-2 element from teh N-1 
one and T9 will detach the N-1 element from the list. All is safe.

About the removed elements
--------------------------

Once a thread has detached some elements which counters are equal to 0, what 
should we do with them ? As we will update the disk with the removed pages, we 
have to send them to teh Writter thread. This can be done by pushing teminto a 
deque (and more specifically a
ConcurrentLinkedDeque)

Another option is that the writer thread handle the cleaning of this queue when 
it is weaken up or periodically. In this case, we don't need to push the 
element sinto a deque, and the reader threads will just set the counter to 0.

Second case
-----------

What if we accept the removal of elements from the middle of the list ?
This is way more complex, as we have to be sure we don't break the chain. One 
way to get rid of this constraint is to enforce the writer thread to do the 
job, by periodically checing the whole list (which shuld not be too costly). In 
this case, the reader thread never update the element's parent's next pointer, 
they just decrement teh counter.
When this counter becomes 0, then we can remove the element.

Solution
========

We will separate the threads into 2 categories :
- readers (we may have many)
- writer (we have only one)

The readers always peek the latest revision (which is the head of the list). It 
then increment the counter. When teh thread does not need the revision anymore, 
the counter is decremented. This is the only action the reader thread will ever 
do on the elements and on the list, but it will always keep a reference to the 
element into the session.

The writer thread has a lot more to do :
- it is responsible for adding element at the top of the list (see upper). Once 
the head is swapped, every new reader will get the added element as a reference.
- it also has to check in the list for the elements which counter is 0, if teh 
list is longer than 1 element. This check can be done after each update, or 
periodically (every N seconds if the number of elements is > 1, or after M 
additions) if tht help saving updates on disk.

Referencing pages
-----------------

All the B-tree revisions are kept into the BoB B-tree, so it's easy to retrieve 
the rootPage for a given revision.

As all the pages can't be hold in memory, we sometime have to reach a page from 
disk or from the cache. This is done following this simple algorithm :

  for a given page with offset O
    if the cache contains an instance for O
      return the page
    else
      fetch the page with offset O from disk
      store it in the cache

In our case, we just add a layer, which will check in the working memory for 
the instance of a page, given its offset.

RE: ApacheDS with Mavibot anytime soon?

Reply via email to