Re: ApacheDS with Mavibot anytime soon?

Emmanuel Lécharny Sun, 27 Sep 2015 10:24:58 -0700

Le 23/09/15 21:41, [email protected] a écrit :
> Hi,

hi,


sorry for the delayed response ! I saw your mail 3 days ago, but I
forgot to answer back then, being caught in a storm of delaying events...
>
> We have one customer that wants to get replication implemented. We've set it 
> up in house and it seems to work well.
> The trouble is there are 10's of thousands of users and all sorts of 
> concurrent activity. They're on M16 and a few times per month, we get the 
> 'ERR_554 double get for block' and then we need to restore.
> We have an automated way of doing this now so it's not too disruptive when it 
> happens.
Hopefully. Still I understand this is not convenient.
>
> My question is, would adding replication just add more concurrent activity to 
> the servers and lead more frequently to the ERR_554 situation?
For the consumer, it will be just as if the updates were coming from a
standard user. So, no, I don't think it's a real problem.
>
> Also he's asked to upgrade to M20 but my hunch is that for this issue it 
> won't matter much because we're still on JDBM?

True. It's just that some bugs have been fixed in M20.
>
> Finally, we're eager to test out a release with the Mavibot backed. I'm sure 
> it's possible to build ourselves but we're trying to keep the moving parts to 
> a minimum
> and use only ApacheDS releases.
I understand.

Mavibot work has been delayed, and we are working on it as much as we
can, but it's not enough. Kiran and I have day jobs atm that forbid us
to dedicate as much time as would be necessary to complete the last part
that is mandatory to get this problem fixed : the transaction system. We
restarted to 'think' about the best implementation 2 weeks ago (on your
free time) and we expect to be able to spend more time on it next week,
as we will both be at Budapest, during the Apache Conference.

Transaction  support (the way we need it, ie cross-btree) is not that
simple to impelment. We have a pretty clear idea on how it should work,
but that mpacts the current design, as we need to woirk in memory, and
avoid to copy pages that have already been modifed in the transaction
boundaries. We also have to deal with the cleanup of old versions, which
also means we need to implement the version manager.

To give you a insight on what I'm coming for, here is a draft of my
thoughts about this transaction manager (note that this is my vision,
that I haven't shared yet, because I don't think it's covering all the
moving parts, but still, I think it won't be that different to what we
will end with ) :

Transaction support
-------------------

Mavibot must support transaction across many B-tres (ie, whatever the
number of b-trees we are updating, we should wait before committing the
changes up to the point the transaction is closed). That means we may
have many updates pending in memory. In any case, if a rollback is done,
the modified b-trees will remain intact, in the same state they were
before the transaction started.

To achive this, we need to work in a working memory. As we may have to
fetch pages that are to be updated, we will face three cases :
- the page is not in memory :
we have to fetch them from disk, update them and store it in the working
memory we can read it back if needed
- the page is in the cache :
we have to copy it to the working memory
- the page is in the working memory : we simply update it, leaving it in
the working memory.

So if we have to modify the same page many times, it will be updated in
the working memory, we won't need to copy it.

That actually change the current Mavibot implementation, as it's copying
pages no matter what. Here, if the page is coming from the working
memory, we just update it.


Layout
------

 Disk            Cache           Working
  ___                            Memory
 /   \          .------.       
+     +        /      /|         .------.
|\___/|       .------. |       .------. |
+     +       |      | |       |      | |
|\___/| ----> |      | | ----> |      | |
+     +       |      |/        |      |/
|\___/|       .------.         .------.
+     +
 \___/
 
 When the transaction is rollbacked, we simply delete the content of the
working memory.
 When the transaction is committed, we write all the pages in the
working memory on disk, and into the cache, we update the header, and we
purge the working memory.
 
 If a crash occurs during a transaction, then either the transaction is
rollbacked (if the process is still running) - which is just about
purging the working memory -, or there will be nothing to do if we have
to restart the application, as the modified pages were in memory nad
have been removed while the process crashed.

OTOH, when we read, we must be sure that the B-trees are present, and
that we always use the B-tree revisions that are correlated. This is
critical when using LDAP, where many B-trees are updated during a sinlge
write.

The way to do that is to use a revision, and always use the B-trees
which revision is just below or equal to this revision. Let's see with
an example

 +-----+---+---+---+---+---+---+---+...
 |\ Rev|   |   |   |   |   |   |   |
 | \   |   |   |   |   |   |   |   |
 |  \  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
 |B- \ |   |   |   |   |   |   |   |
 |tree\|   |   |   |   |   |   | L | N
 +-----+---+---+---+---+---+---+---+...
 | aaa | X |   |   | X | X | X |   |
 |     | 1 | 1 | 1 | 4 | 5 | 6 | 6 |
 +-----+---+---+---+---+---+---+---+...
 | bbb | X | X | X |   | X | X | X |
 |     | 1 | 2 | 3 | 3 | 5 | 6 | 7 |
 +-----+---+---+---+---+---+---+---+...
 | ccc | X |   | X | X | X |   | X |
 |     | 1 | 1 | 3 | 4 | 5 | 5 | 7 |
 +-----+---+---+---+---+---+---+---+...
 | ddd | X | X |   |   | X |   |   |
 |     | 1 | 2 | 2 | 2 | 5 | 5 | 5 |
 +-----+---+---+---+---+---+---+---+...
 | eee | X |   | X | X |   | X |   |
 |     | 1 | 1 | 3 | 4 | 4 | 6 | 6 |
 +-----+---+---+---+---+---+---+---+...
 
L : Latest revision
N : New revision
 
In this table, we have had 7 updates, and 7 revsions. As all the B-trees
haven't been updated, we may have some 'holes' in the revision
associated to a B-tree. Typically, here, the 'aaa' B-tree will have 4
revisions (1, 4, 5 and 6) created, because this B-tree has been updated
4 times.
 
Doing a read on revision 7 will use the following B-trees : aaa[6],
bbb[7], ccc[7], ddd[5], eee[6].


All in all, a read is associated to a global transaction revision which
is used as a marker. Every B-tree update is done using the new revision.

Managing on-going reads
-----------------------

We maintain all the B-tree revisions associated with a read, until the
read is completed, we need to get rid of the old revision when we are
done. This is critical to keep the database size to a minimum. In order
to do that, we must know if an older revision is still in use. We then
need to keep a common data structure that is shared across threads,
which will hold all the used revisions, ordered. With the previous
example, assuming revisions 2, 4 and 6 are in use, we may use a linked
list for that purpose :

   +---+   +---+   +---+
o--| 2 |<--| 4 |<--| 6 |
   +--2+   +--1+   +--3+
  
When the read using rev 4 is done, we remove it from this list :
          +-------------+
   +---+  |    +---+    | +---+
o--| 2 |<-+  X-| 4 |    +-| 6 |
   +--2+       +--0+      +--3+
                
The two remaining revisions are still in use, and one of them is older.
We can clean the B-tree pages associated with revision 4 if there is
already a new revision for this B-tree, and for the pages that have been
copied.

When the read using rev 2 is done, we can clean all the B-tree pages
that are associated with every revision below 6, as soon as there is a
B-tree revision above the oldest one.

An other important point : each revision might be used more than once.
Releasing a read will decrement the count, and when it goes down to 0,
we can safely remove the element from the list. We need a thread-safe
data structure for that.

Managing the list
-----------------

let's go a bit deeper in the management of this list. It has interesting
characteristics :

- elements are added by a unique thread : the writter
- elements are not removed unless the count becomes 0
- the count is equal to the maximum number of thread that can use the
element at some point
- it's not because the counter is N that N thread *are* using this
element : we may eventually have no thread using it ! (but they *may*
come back later...)
- each element must be be protected agains concurrent update (which is a
different requirement than protecting the full list)
- any thread requiring an element in the list will always use the head
of the least (which is the most recent)
- we don't do any list traversal ! Once a thread uses an element, it
remembers about its position
- no user Thread is allowed to remove an element from this list. The
only thread that can do that is the Cleaner Thread


The last characteristic is interesting : that means we can safely
depends on one single thread to do the cleaning, so there is no
concurrent access on teh list while deleting elements.

The deletion can be done in two ways :
- from the tail of the list, if the element's counter is equal to zero
(and we can iterate)
- from the inner of the list, when the element's counter to be removed
is zero

The first way is simpler, but it holds potentially a lot of element if
the oldest one is there for a long time
The second way implies we can relink elements in the list, ie this is a
double-linked list

Let's assume we go for the first way. In any case, the list will
*always* have at least one element, even if its counter is zero :

   head
   +---+
   | N | revision
   +---+
   | 0 | users counter (AtomicInteger)
   +---+
   |   |--o
   +---+
o--|   |
   +---+
  
A thread wanting to use revision N will simply increment the counter,
which is an AtomicInteger, guaranteing that we can't have a wrong update
of this value. The revision will never change.
At this point, using the head is quite straight-forward.

* Adding an element

So the writter creates a new revision. This will create a new element :

    new      head
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5
   +---+     +---+
   | 0 |     | 3 |
   +---+     +---+
   |   |--o  |   |--o
   +---+     +---+
o--|   |  o--|   |
   +---+     +---+
  
The Head should always be pointing to one single cell, and any thread
should be able to get it. This is easy to do if we use an AtomicReference :

    new      [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5
   +---+     +---+
   | 0 |     | 3 |
   +---+     +---+
   |   |--o  |   |--o
   +---+     +---+
o--|   |  o--|   |
   +---+     +---+
  
Swapping the head is done after we have updated the links, which is done
by the writter thread :

Step 1 :

Attach the head to the new element

    new      [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5
   +---+     +---+
   | 0 |     | 3 |
   +---+     +---+
   |   |--o  |   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+
  
Step 2 :

Attach the new element to the head (Here, at the same time, a new thread
needs the latest revison (still N))


    new      [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5, T7
   +---+     +---+
   | 0 |     | 4 |
   +---+     +---+
   |   |---->|   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+

Step 3 :

Swap the head

   [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5, T7
   +---+     +---+
   | 0 |     | 4 |
   +---+     +---+
   |   |---->|   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+

Step 4 :

A new thread need the latest revision


           +--- T8
   [head]  |
   +---+   | +---+
   |N+1|<<-+ | N |<<--- T1, T2, T5, T7
   +---+     +---+
   | 1 |     | 4 |
   +---+     +---+
   |   |---->|   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+
 
At this point, no other thread will *ever* be attached to revision N

This can go on an on, with new revisions attached to the list by the
writter thread. Now, let's consider two different use case :
1) when the oldest revision has a counter going down to zero
2) when a revision which is not the head nor the oldest that is going to
zero

First case, a first approach :

This is the simplest. We have to remove the element from the lists, and
check if its parent's counter is also zero and is not the head. If so,
we iterate until we meet an element which counter is not zero and is not
the head.

In any case, we simply have to change its parent's next pointer, and to
remove the element from the list :

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |---->|   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

The thread that set the counter to 0

Step 1 :

Detach the last element parent's next pointer

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |--o  |   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

Step 2 :
Iterate if the parent's counter is 0

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |
   +---+     +---+     +---+     +---+
   |   |---->|   |--o  |   |--o  |   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

Step 3 :
Detach the last element which counter is 0 prev pointer

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   |     +---+     +---+
   |N+1|<<-+ | N |<<-+     |N-1|     |N-2|x---- T9
   +---+     +---+         +---+     +---+
   | 1 |     | 4 |         | 0 |     | 0 |
   +---+     +---+         +---+     +---+
   |   |---->|   |--o      |   |--o  |   |--o
   +---+     +---+         +---+     +---+
o--|   |<----|   |      o--|   |<----|   |
   +---+     +---+         +---+     +---+
 
Step 4 :

 We can now dispose the unlinked elements.

What if a thread is freeing an element while we are already free another
one ? This can be exposed using the previous sample :

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 1 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |---->|   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

Step 0+

A thread (10) release the N-1 revison. We have two possibilities here :

1) The N-2 parent's next is not yet detached


           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |---->|   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

We just continue walking the list up to the first element with a non
zero counter

2) The N-2 paren'ts next is already detached



           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |         +-- T10
   +---+   | +---+   | +---+   | +---+
   |N+1|<<-+ | N |<<-+ |N-1|x--+ |N-2|x---- T9 This thread will release
the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |--o  |   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+
  
In this case, we will have 2 threads dealing with two elements to be
cleaned up. This is not a problem, as T9 will now detach the N-2 element
from teh N-1 one and T9 will detach the N-1 element from the list. All
is safe.

About the removed elements
--------------------------

Once a thread has detached some elements which counters are equal to 0,
what should we do with them ? As we will update the disk with the
removed pages, we have to send them to teh Writter thread. This can be
done by pushing teminto a deque (and more specifically a
ConcurrentLinkedDeque)

Another option is that the writer thread handle the cleaning of this
queue when it is weaken up or periodically. In this case, we don't need
to push the element sinto a deque, and the reader threads will just set
the counter to 0.

Second case
-----------

What if we accept the removal of elements from the middle of the list ?
This is way more complex, as we have to be sure we don't break the
chain. One way to get rid of this constraint is to enforce the writer
thread to do the job, by periodically checing the whole list (which
shuld not be too costly). In this case, the reader thread never update
the element's parent's next pointer, they just decrement teh counter.
When this counter becomes 0, then we can remove the element.

Solution
========

We will separate the threads into 2 categories :
- readers (we may have many)
- writer (we have only one)

The readers always peek the latest revision (which is the head of the
list). It then increment the counter. When teh thread does not need the
revision anymore, the counter is decremented. This is the only action
the reader thread will ever do on the elements and on the list, but it
will always keep a reference to the element into the session.

The writer thread has a lot more to do :
- it is responsible for adding element at the top of the list (see
upper). Once the head is swapped, every new reader will get the added
element as a reference.
- it also has to check in the list for the elements which counter is 0,
if teh list is longer than 1 element. This check can be done after each
update, or periodically (every N seconds if the number of elements is >
1, or after M additions) if tht help saving updates on disk.
 
Referencing pages
-----------------

All the B-tree revisions are kept into the BoB B-tree, so it's easy to
retrieve the rootPage for a given revision.
 
As all the pages can't be hold in memory, we sometime have to reach a
page from disk or from the cache. This is done following this simple
algorithm :

  for a given page with offset O
    if the cache contains an instance for O
      return the page
    else
      fetch the page with offset O from disk
      store it in the cache
     
In our case, we just add a layer, which will check in the working memory
for the instance of a page, given its offset.

Re: ApacheDS with Mavibot anytime soon?

Reply via email to