svn commit: r911716 [3/3] - in /hadoop/zookeeper/trunk: ./ docs/ docs/images/ src/docs/src/documentation/content/xdocs/ src/docs/src/documentation/resources/images/

mahadev Thu, 18 Feb 2010 23:02:33 -0800

Modified: 
hadoop/zookeeper/trunk/src/docs/src/documentation/content/xdocs/bookkeeperOverview.xml
URL: 
http://svn.apache.org/viewvc/hadoop/zookeeper/trunk/src/docs/src/documentation/content/xdocs/bookkeeperOverview.xml?rev=911716&r1=911715&r2=911716&view=diff
==============================================================================
--- 
hadoop/zookeeper/trunk/src/docs/src/documentation/content/xdocs/bookkeeperOverview.xml
 (original)
+++ 
hadoop/zookeeper/trunk/src/docs/src/documentation/content/xdocs/bookkeeperOverview.xml
 Fri Feb 19 07:02:06 2010
@@ -42,14 +42,82 @@
     </abstract>
   </articleinfo>
   <section id="bk_Overview">
-    <title>BookKeeper overview</title>
+  <title>BookKeeper overview</title>
+    
+  <section id="bk_Intro">
+    <title>BookKeeper introduction</title>
+       <para>
+       BookKeeper is a replicated service to reliably log streams of records. 
In BookKeeper, 
+       servers are "bookies", log streams are "ledgers", and each unit of a 
log (aka record) is a 
+       "ledger entry". BookKeeper is designed to be reliable; bookies, the 
servers that store 
+       ledgers, can crash, corrupt data, discard data, but as long as there 
are enough bookies 
+       behaving correctly the service as a whole behaves correctly.
+       </para>
 
-    <para>This document explains basic concepts of BookKeeper. We start by 
discussing
-    the basic elements of BookKeeper, and next we discuss how they work 
together. 
+       <para>
+    The initial motivation for BookKeeper comes from the namenode of HDFS. 
Namenodes have to 
+    log operations in a reliable fashion so that recovery is possible in the 
case of crashes. 
+    We have found the applications for BookKeeper extend far beyond HDFS, 
however. Essentially, 
+    any application that requires an append storage can replace their 
implementations with
+    BookKeeper. BookKeeper has the advantage of scaling throughput with the 
number of servers. 
     </para>
+       
+       <para>
+    At a high level, a bookkeeper client receives entries from a client 
application and stores it to
+    sets of bookies, and there are a few advantages in having such a service:
+       </para>
+
+       <itemizedlist>
+    <listitem>
+    <para>
+       We can use hardware that is optimized for such a service. We currently 
believe that such a
+       system has to be optimized only for disk I/O;
+    </para>
+    </listitem>
+    
+    <listitem>
+    <para>
+       We can have a pool of servers implementing such a log system, and 
shared among a number of servers;
+    </para>
+    </listitem>
+    
+    <listitem>
+    <para>
+       We can have a higher degree of replication with such a pool, which 
makes sense if the hardware necessary for it is cheaper compared to the one the 
application uses. 
+       </para>
+       </listitem>
+       </itemizedlist>
+
+       </section>
+       
+       <section id="bk_moreDetail">
+    <title>In slightly more detail...</title>
     
+    <para> BookKeeper implements highly available logs, and it has been 
designed with write-ahead logging in mind. Besides high availability
+    due to the replicated nature of the service, it provides high throughput 
due to striping. As we write entries in a subset of bookies of an
+    ensemble and rotate writes across available quorums, we are able to 
increase throughput with the number of servers for both reads and writes. 
+    Scalability is a property that is possible to achieve in this case due to 
the use of quorums. Other replication techniques, such as 
+    state-machine replication, do not enable such a property. 
+    </para> 
+    
+       <para> An application first creates a ledger before writing to bookies 
through a local BookKeeper client instance.   
+       Upon creating a ledger, a BookKeeper client writes metadata about the 
ledger to ZooKeeper. Each ledger currently 
+       has a single writer. This writer has to execute a close ledger 
operation before any other client can read from it. 
+       If the writer of a ledger does not close a ledger properly because, for 
example, it has crashed before having the 
+       opportunity of closing the ledger, then the next client that tries to 
open a ledger executes a procedure to recover
+       it. As closing a ledger consists essentially of writing the last entry 
written to a ledger to ZooKeeper, the recovery
+       procedure simply finds the last entry written correctly and writes it 
to ZooKeeper.     
+       </para>
+               
+       <para>
+       Note that currently this recovery procedure is executed automatically 
upon trying to open a ledger and no explicit action is necessary. 
+       Although two clients may try to recover a ledger concurrently, only one 
will succeed, the first one that is able to create the close znode
+       for the ledger.
+       </para> 
+       </section>  
+          
     <section id="bk_basicComponents">
-    <title>Basic elements</title>
+    <title>Bookkeeper elements and concepts</title>
        <para> 
        BookKeeper uses four basic elements:
        </para>
@@ -87,42 +155,265 @@
     </itemizedlist>
     </section>
     
-    <section id="bk_moreDetail">
-    <title>In slightly more detail...</title>
+    <section id="bk_initialDesign">
+    <title>Bookkeeper initial design</title>
+    <para>
+    A set of bookies implements BookKeeper, and we use a quorum-based protocol 
to replicate data across the bookies. 
+    There are basically two operations to an existing ledger: read and append. 
Here is the complete API list 
+    (mode detail <ulink url="bookkeeperProgrammer.html">
+             here</ulink>):
+       </para>
+       
+       <itemizedlist>
+       <listitem>
+       <para>
+       Create ledger: creates a new empty ledger; 
+    </para>
+    </listitem>
     
-    <para> BookKeeper implements highly available logs, and it has been 
designed with write-ahead logging in mind. Besides high availability
-    due to the replicated nature of the service, it provides high throughput 
due to striping. As we write entries in a subset of bookies of an
-    ensemble and rotate writes across available quorums, we are able to 
increase throughput with the number of servers for both reads and writes. 
-    Scalability is a property that is possible to achieve in this case due to 
the use of quorums. Other replication techniques, such as 
-    state-machine replication, do not enable such a property. 
-    </para> 
+    <listitem>
+       <para>
+       Open ledger: opens an existing ledger for reading;
+    </para>
+    </listitem>
+    
+    <listitem>
+       <para>
+       Add entry: adds a record to a ledger either synchronously or 
asynchronously;
+    </para>
+    </listitem>
     
-       <para> An application first creates a ledger before writing to bookies 
through a local BookKeeper client instance. To 
-       create a ledger, an application has to specify which kind of ledger it 
wants to use: self-verifying or generic. Self-verifying
-       includes a digest on every entry, which enables a reduction on the 
degree of replication. Generic ledgers do not store a digest
-       along with entries at the cost of using more bookies.   
+    <listitem>
+       <para>
+    Read entries: reads a sequence of entries from a ledger either 
synchronously or asynchronously 
+       </para>
+    </listitem>
+       </itemizedlist>
+
+       <para>
+       There is only a single client that can write to a ledger. Once that 
ledger is closed or the client fails, 
+       no more entries can be added. (We take advantage of this behavior to 
provide our strong guarantees.) 
+       There will not be gaps in the ledger. Fingers get broken, people get 
roughed up or end up in prison when
+       books are manipulated, so there is no deleting or changing of entries.
        </para>
+
+       <figure>
+        <title>BookKeeper Overview</title>
        
-       <para> Upon creating a ledger, a BookKeeper clients writes metadata 
about the ledger to ZooKeeper. A given client first creates
-       a znode named "L" as a child of "/ledger" with the SEQUENCE flag. 
ZooKeeper consequently assigns a unique sequence number to the 
-       node, naming the node "/Lx", where x is the sequence number assigned. 
We use this sequence number as the identifier of the ledger. 
-       This identifier is necessary when opening a ledger. We also store the 
ensemble composition so that readers know which set of bookies
-       of access for a given ledger.   
+               <mediaobject>
+        <imageobject>
+            <imagedata fileref="images/bk-overview.jpg" width="3in" 
depth="3in" contentwidth="3in" contentdepth="3in" scalefit="0"/>
+        </imageobject>
+        </mediaobject>
+    </figure>
+
+       <para>
+    A simple use of BooKeeper is to implement a write-ahead transaction log. A 
server maintains an in-memory data structure
+    (with periodic snapshots for example) and logs changes to that structure 
before it applies the change. The application 
+    server creates a ledger at startup and store the ledger id and password in 
a well known place (ZooKeeper maybe). When 
+    it needs to make a change, the server adds an entry with the change 
information to a ledger and apply the change when 
+    BookKeeper adds the entry successfully. The server can even use 
asyncAddEntry to queue up many changes for high change
+    throughput. BooKeeper meticulously logs the changes in order and call the 
completion functions in order.
+       </para>
+
+       <para>
+    When the application server dies, a backup server will come online, get 
the last snapshot and then it will open the 
+    ledger of the old server and read all the entries from the time the 
snapshot was taken. (Since it doesn't know the 
+    last entry number it will use MAX_INTEGER). Once all the entries have been 
processed, it will close the ledger and 
+    start a new one for its use. 
        </para>
        
        <para>
-       Each ledger currently has a single writer. This writer has to execute a 
close ledger operation before any other client can read
-       from it. If the writer of a ledger does not close a ledger properly 
because, for example, it has crashed before having the 
-       opportunity of closing the ledger, then the next client that tries to 
open a ledger executes an procedure to recover it. As closing a ledger
-       consists essentially of writing the last entry written to a ledger to 
ZooKeeper, the recovery procedure simply finds the last entry
-       written correctly and writes it to ZooKeeper in the form of a close 
znode as a child of "/Lx", where x is the identifier of the ledger.     
+       A client library takes care of communicating with bookies and managing 
entry numbers. An entry has the following fields:
        </para>
+
+       <table frame='all'><title>Entry fields</title>
+       <tgroup cols='3' align='left' colsep='1' rowsep='1'>
+       <colspec colname='Field'/>
+       <colspec colname='Type'/>
+       <colspec colname='Description'/>
+       <colspec colnum='5' colname='c5'/>
+       <thead>
+       <row>
+       <entry>Field</entry>
+       <entry>Type</entry>
+       <entry>Description</entry>
+       </row>
+       </thead>
+       <tfoot>
+       <row>
+       <entry>Ledger number</entry>
+       <entry>long</entry>
+       <entry>The id of the ledger of this entry</entry>
+       </row>
+       <row>
+       <entry>Entry number</entry>
+       <entry>long</entry>
+       <entry>The id of this entry</entry>
+       </row>
+       </tfoot>
+       <tbody>
+       <row>
+       <entry>last confirmed (<emphasis>LC</emphasis>)</entry>
+       <entry>long</entry>
+       <entry>id of the last recorded entry</entry>
+       </row>
+       <row>
+       <entry>data</entry>
+       <entry>byte[]</entry>
+       <entry>the entry data (supplied by application)</entry>
+       </row>
+       <row>
+       <entry>authentication code</entry>
+       <entry>byte[]</entry>
+       <entry>Message authentication code that includes all other fields of 
the entry</entry>
+       </row>
        
+       </tbody>
+       </tgroup>
+       </table>
+
        <para>
-       Note that currently this recovery procedure is executed automatically 
upon trying to open a ledger and no explicit action is necessary. 
-       Although two clients may try to recover a ledger concurrently, only one 
will succeed, the first one that is able to create the close znode
-       for the ledger.
-       </para> 
-       </section>  
+       The client library generates a ledger entry. None of the fields are 
modified by the bookies and only the first three 
+       fields are interpreted by the bookies.
+       </para>
+
+       <para>
+       To add to a ledger, the client generates the entry above using the 
ledger number. The entry number will be one more 
+       than the last entry generated. The <emphasis>LC</emphasis> field 
contains the last entry that has been successfully recorded by BookKeeper. 
+       If the client writes entries one at a time, <emphasis>LC</emphasis> is 
the last entry id. But, if the client is using asyncAddEntry, there 
+       may be many entries in flight. An entry is considered recorded when 
both of the following conditions are met:
+       </para>
+
+       <itemizedlist>
+       <listitem>
+    <para>
+       the entry has been accepted by a quorum of bookies
+    </para>
+    </listitem>
+    
+    <listitem>
+    <para>
+       all entries with a lower entry id have been accepted by a quorum of 
bookies 
+       </para>
+       </listitem>
+    </itemizedlist>
+    
+    <para>
+       <emphasis>LC</emphasis> seems mysterious right now, but it is too early 
to explain how we use it; just smile and move on.
+       </para>
+       
+       <para>
+       Once all the other fields have been field in, the client generates an 
authentication code with all of the previous fields. 
+       The entry is then sent to a quorum of bookies to be recorded. Any 
failures will result in the entry being sent to a new
+       quorum of bookies.
+       </para>
+
+       <para>
+       To read, the client library initially contacts a bookie and starts 
requesting entries. If an entry is missing or 
+       invalid (a bad MAC for example), the client will make a request to a 
different bookie. By using quorum writes, 
+       as long as enough bookies are up we are guaranteed to eventually be 
able to read an entry.
+       </para>
+       
+       </section>
+
+       <section id="bk_metadata">
+    <title>Bookkeeper metadata management</title>
+
+       <para>
+       There are some meta data that needs to be made available to BookKeeper 
clients:
+       </para>
+       
+       <itemizedlist>
+       <listitem>
+       <para>
+               The available bookies;
+       </para>
+       </listitem>
+       
+       <listitem>
+       <para>
+       The list of ledgers;
+    </para>
+    </listitem>
+    
+    <listitem>
+       <para>
+       The list of bookies that have been used for a given ledger;
+    </para>
+    </listitem>
+    
+    <listitem>
+       <para>
+       The last entry of a ledger; 
+       </para>
+       </listitem>
+       </itemizedlist>
+       
+       <para>
+       We maintain this information in ZooKeeper. Bookies use ephemeral nodes 
to indicate their availability. Clients 
+       use znodes to track ledger creation and deletion and also to know the 
end of the ledger and the bookies that 
+       were used to store the ledger. Bookies also watch the ledger list so 
that they can cleanup ledgers that get deleted.
+       </para>
+       
+       </section>
+
+       <section id="bk_closingOut">
+    <title>Closing out ledgers</title>
+
+       <para>
+       The process of closing out the ledger and finding the last ledger is 
difficult due to the durability guarantees of BookKeeper:
+       </para>
+
+       <itemizedlist>
+       <listitem>
+       <para>
+       If an entry has been successfully recorded, it must be readable.
+    </para>
+    </listitem>
+    
+    <listitem>
+       <para>
+       If an entry is read once, it must always be available to be read. 
+       </para>
+       </listitem>
+       </itemizedlist>
+       
+       <para>
+       If the ledger was closed gracefully, ZooKeeper will have the last entry 
and everything will work well. But, if the 
+       BookKeeper client that was writing the ledger dies, there is some 
recovery that needs to take place.
+       </para>
+
+       <para>
+       The problematic entries are the ones at the end of the ledger. There 
can be entries in flight when a BookKeeper client 
+       dies. If the entry only gets to one bookie, the entry should not be 
readable since the entry will disappear if that bookie
+       fails. If the entry is only on one bookie, that doesn't mean that the 
entry has not been recorded successfully; the other
+       bookies that recorded the entry might have failed.
+       </para>
+       
+       <para>
+       The trick to making everything work is to have a correct idea of a last 
entry. We do it in roughly three steps:
+       </para>
+       <orderedlist>
+       <listitem>
+       <para>
+               Find the entry with the highest last recorded entry, 
<emphasis>LC</emphasis>;
+       </para>
+       </listitem>
+       
+       <listitem>
+       <para>
+               Find the highest consecutively recorded entry, 
<emphasis>LR</emphasis>;
+       </para>
+       </listitem>
+       
+       <listitem>
+       <para>
+               Make sure that all entries between <emphasis>LC</emphasis> and 
<emphasis>LR</emphasis> are on a quorum of bookies; 
+       </para>
+       </listitem>
+       
+       </orderedlist>
+    </section>
   </section>  
 </article>
\ No newline at end of file


Added: 
hadoop/zookeeper/trunk/src/docs/src/documentation/resources/images/bk-overview.jpg
URL: 
http://svn.apache.org/viewvc/hadoop/zookeeper/trunk/src/docs/src/documentation/resources/images/bk-overview.jpg?rev=911716&view=auto
==============================================================================
Binary file - no diff available.

Propchange: 
hadoop/zookeeper/trunk/src/docs/src/documentation/resources/images/bk-overview.jpg
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

svn commit: r911716 [3/3] - in /hadoop/zookeeper/trunk: ./ docs/ docs/images/ src/docs/src/documentation/content/xdocs/ src/docs/src/documentation/resources/images/

Reply via email to