Modified: hadoop/zookeeper/trunk/src/docs/src/documentation/content/xdocs/bookkeeperOverview.xml URL: http://svn.apache.org/viewvc/hadoop/zookeeper/trunk/src/docs/src/documentation/content/xdocs/bookkeeperOverview.xml?rev=911716&r1=911715&r2=911716&view=diff ============================================================================== --- hadoop/zookeeper/trunk/src/docs/src/documentation/content/xdocs/bookkeeperOverview.xml (original) +++ hadoop/zookeeper/trunk/src/docs/src/documentation/content/xdocs/bookkeeperOverview.xml Fri Feb 19 07:02:06 2010 @@ -42,14 +42,82 @@ </abstract> </articleinfo> <section id="bk_Overview"> - <title>BookKeeper overview</title> + <title>BookKeeper overview</title> + + <section id="bk_Intro"> + <title>BookKeeper introduction</title> + <para> + BookKeeper is a replicated service to reliably log streams of records. In BookKeeper, + servers are "bookies", log streams are "ledgers", and each unit of a log (aka record) is a + "ledger entry". BookKeeper is designed to be reliable; bookies, the servers that store + ledgers, can crash, corrupt data, discard data, but as long as there are enough bookies + behaving correctly the service as a whole behaves correctly. + </para> - <para>This document explains basic concepts of BookKeeper. We start by discussing - the basic elements of BookKeeper, and next we discuss how they work together. + <para> + The initial motivation for BookKeeper comes from the namenode of HDFS. Namenodes have to + log operations in a reliable fashion so that recovery is possible in the case of crashes. + We have found the applications for BookKeeper extend far beyond HDFS, however. Essentially, + any application that requires an append storage can replace their implementations with + BookKeeper. BookKeeper has the advantage of scaling throughput with the number of servers. </para> + + <para> + At a high level, a bookkeeper client receives entries from a client application and stores it to + sets of bookies, and there are a few advantages in having such a service: + </para> + + <itemizedlist> + <listitem> + <para> + We can use hardware that is optimized for such a service. We currently believe that such a + system has to be optimized only for disk I/O; + </para> + </listitem> + + <listitem> + <para> + We can have a pool of servers implementing such a log system, and shared among a number of servers; + </para> + </listitem> + + <listitem> + <para> + We can have a higher degree of replication with such a pool, which makes sense if the hardware necessary for it is cheaper compared to the one the application uses. + </para> + </listitem> + </itemizedlist> + + </section> + + <section id="bk_moreDetail"> + <title>In slightly more detail...</title> + <para> BookKeeper implements highly available logs, and it has been designed with write-ahead logging in mind. Besides high availability + due to the replicated nature of the service, it provides high throughput due to striping. As we write entries in a subset of bookies of an + ensemble and rotate writes across available quorums, we are able to increase throughput with the number of servers for both reads and writes. + Scalability is a property that is possible to achieve in this case due to the use of quorums. Other replication techniques, such as + state-machine replication, do not enable such a property. + </para> + + <para> An application first creates a ledger before writing to bookies through a local BookKeeper client instance. + Upon creating a ledger, a BookKeeper client writes metadata about the ledger to ZooKeeper. Each ledger currently + has a single writer. This writer has to execute a close ledger operation before any other client can read from it. + If the writer of a ledger does not close a ledger properly because, for example, it has crashed before having the + opportunity of closing the ledger, then the next client that tries to open a ledger executes a procedure to recover + it. As closing a ledger consists essentially of writing the last entry written to a ledger to ZooKeeper, the recovery + procedure simply finds the last entry written correctly and writes it to ZooKeeper. + </para> + + <para> + Note that currently this recovery procedure is executed automatically upon trying to open a ledger and no explicit action is necessary. + Although two clients may try to recover a ledger concurrently, only one will succeed, the first one that is able to create the close znode + for the ledger. + </para> + </section> + <section id="bk_basicComponents"> - <title>Basic elements</title> + <title>Bookkeeper elements and concepts</title> <para> BookKeeper uses four basic elements: </para> @@ -87,42 +155,265 @@ </itemizedlist> </section> - <section id="bk_moreDetail"> - <title>In slightly more detail...</title> + <section id="bk_initialDesign"> + <title>Bookkeeper initial design</title> + <para> + A set of bookies implements BookKeeper, and we use a quorum-based protocol to replicate data across the bookies. + There are basically two operations to an existing ledger: read and append. Here is the complete API list + (mode detail <ulink url="bookkeeperProgrammer.html"> + here</ulink>): + </para> + + <itemizedlist> + <listitem> + <para> + Create ledger: creates a new empty ledger; + </para> + </listitem> - <para> BookKeeper implements highly available logs, and it has been designed with write-ahead logging in mind. Besides high availability - due to the replicated nature of the service, it provides high throughput due to striping. As we write entries in a subset of bookies of an - ensemble and rotate writes across available quorums, we are able to increase throughput with the number of servers for both reads and writes. - Scalability is a property that is possible to achieve in this case due to the use of quorums. Other replication techniques, such as - state-machine replication, do not enable such a property. - </para> + <listitem> + <para> + Open ledger: opens an existing ledger for reading; + </para> + </listitem> + + <listitem> + <para> + Add entry: adds a record to a ledger either synchronously or asynchronously; + </para> + </listitem> - <para> An application first creates a ledger before writing to bookies through a local BookKeeper client instance. To - create a ledger, an application has to specify which kind of ledger it wants to use: self-verifying or generic. Self-verifying - includes a digest on every entry, which enables a reduction on the degree of replication. Generic ledgers do not store a digest - along with entries at the cost of using more bookies. + <listitem> + <para> + Read entries: reads a sequence of entries from a ledger either synchronously or asynchronously + </para> + </listitem> + </itemizedlist> + + <para> + There is only a single client that can write to a ledger. Once that ledger is closed or the client fails, + no more entries can be added. (We take advantage of this behavior to provide our strong guarantees.) + There will not be gaps in the ledger. Fingers get broken, people get roughed up or end up in prison when + books are manipulated, so there is no deleting or changing of entries. </para> + + <figure> + <title>BookKeeper Overview</title> - <para> Upon creating a ledger, a BookKeeper clients writes metadata about the ledger to ZooKeeper. A given client first creates - a znode named "L" as a child of "/ledger" with the SEQUENCE flag. ZooKeeper consequently assigns a unique sequence number to the - node, naming the node "/Lx", where x is the sequence number assigned. We use this sequence number as the identifier of the ledger. - This identifier is necessary when opening a ledger. We also store the ensemble composition so that readers know which set of bookies - of access for a given ledger. + <mediaobject> + <imageobject> + <imagedata fileref="images/bk-overview.jpg" width="3in" depth="3in" contentwidth="3in" contentdepth="3in" scalefit="0"/> + </imageobject> + </mediaobject> + </figure> + + <para> + A simple use of BooKeeper is to implement a write-ahead transaction log. A server maintains an in-memory data structure + (with periodic snapshots for example) and logs changes to that structure before it applies the change. The application + server creates a ledger at startup and store the ledger id and password in a well known place (ZooKeeper maybe). When + it needs to make a change, the server adds an entry with the change information to a ledger and apply the change when + BookKeeper adds the entry successfully. The server can even use asyncAddEntry to queue up many changes for high change + throughput. BooKeeper meticulously logs the changes in order and call the completion functions in order. + </para> + + <para> + When the application server dies, a backup server will come online, get the last snapshot and then it will open the + ledger of the old server and read all the entries from the time the snapshot was taken. (Since it doesn't know the + last entry number it will use MAX_INTEGER). Once all the entries have been processed, it will close the ledger and + start a new one for its use. </para> <para> - Each ledger currently has a single writer. This writer has to execute a close ledger operation before any other client can read - from it. If the writer of a ledger does not close a ledger properly because, for example, it has crashed before having the - opportunity of closing the ledger, then the next client that tries to open a ledger executes an procedure to recover it. As closing a ledger - consists essentially of writing the last entry written to a ledger to ZooKeeper, the recovery procedure simply finds the last entry - written correctly and writes it to ZooKeeper in the form of a close znode as a child of "/Lx", where x is the identifier of the ledger. + A client library takes care of communicating with bookies and managing entry numbers. An entry has the following fields: </para> + + <table frame='all'><title>Entry fields</title> + <tgroup cols='3' align='left' colsep='1' rowsep='1'> + <colspec colname='Field'/> + <colspec colname='Type'/> + <colspec colname='Description'/> + <colspec colnum='5' colname='c5'/> + <thead> + <row> + <entry>Field</entry> + <entry>Type</entry> + <entry>Description</entry> + </row> + </thead> + <tfoot> + <row> + <entry>Ledger number</entry> + <entry>long</entry> + <entry>The id of the ledger of this entry</entry> + </row> + <row> + <entry>Entry number</entry> + <entry>long</entry> + <entry>The id of this entry</entry> + </row> + </tfoot> + <tbody> + <row> + <entry>last confirmed (<emphasis>LC</emphasis>)</entry> + <entry>long</entry> + <entry>id of the last recorded entry</entry> + </row> + <row> + <entry>data</entry> + <entry>byte[]</entry> + <entry>the entry data (supplied by application)</entry> + </row> + <row> + <entry>authentication code</entry> + <entry>byte[]</entry> + <entry>Message authentication code that includes all other fields of the entry</entry> + </row> + </tbody> + </tgroup> + </table> + <para> - Note that currently this recovery procedure is executed automatically upon trying to open a ledger and no explicit action is necessary. - Although two clients may try to recover a ledger concurrently, only one will succeed, the first one that is able to create the close znode - for the ledger. - </para> - </section> + The client library generates a ledger entry. None of the fields are modified by the bookies and only the first three + fields are interpreted by the bookies. + </para> + + <para> + To add to a ledger, the client generates the entry above using the ledger number. The entry number will be one more + than the last entry generated. The <emphasis>LC</emphasis> field contains the last entry that has been successfully recorded by BookKeeper. + If the client writes entries one at a time, <emphasis>LC</emphasis> is the last entry id. But, if the client is using asyncAddEntry, there + may be many entries in flight. An entry is considered recorded when both of the following conditions are met: + </para> + + <itemizedlist> + <listitem> + <para> + the entry has been accepted by a quorum of bookies + </para> + </listitem> + + <listitem> + <para> + all entries with a lower entry id have been accepted by a quorum of bookies + </para> + </listitem> + </itemizedlist> + + <para> + <emphasis>LC</emphasis> seems mysterious right now, but it is too early to explain how we use it; just smile and move on. + </para> + + <para> + Once all the other fields have been field in, the client generates an authentication code with all of the previous fields. + The entry is then sent to a quorum of bookies to be recorded. Any failures will result in the entry being sent to a new + quorum of bookies. + </para> + + <para> + To read, the client library initially contacts a bookie and starts requesting entries. If an entry is missing or + invalid (a bad MAC for example), the client will make a request to a different bookie. By using quorum writes, + as long as enough bookies are up we are guaranteed to eventually be able to read an entry. + </para> + + </section> + + <section id="bk_metadata"> + <title>Bookkeeper metadata management</title> + + <para> + There are some meta data that needs to be made available to BookKeeper clients: + </para> + + <itemizedlist> + <listitem> + <para> + The available bookies; + </para> + </listitem> + + <listitem> + <para> + The list of ledgers; + </para> + </listitem> + + <listitem> + <para> + The list of bookies that have been used for a given ledger; + </para> + </listitem> + + <listitem> + <para> + The last entry of a ledger; + </para> + </listitem> + </itemizedlist> + + <para> + We maintain this information in ZooKeeper. Bookies use ephemeral nodes to indicate their availability. Clients + use znodes to track ledger creation and deletion and also to know the end of the ledger and the bookies that + were used to store the ledger. Bookies also watch the ledger list so that they can cleanup ledgers that get deleted. + </para> + + </section> + + <section id="bk_closingOut"> + <title>Closing out ledgers</title> + + <para> + The process of closing out the ledger and finding the last ledger is difficult due to the durability guarantees of BookKeeper: + </para> + + <itemizedlist> + <listitem> + <para> + If an entry has been successfully recorded, it must be readable. + </para> + </listitem> + + <listitem> + <para> + If an entry is read once, it must always be available to be read. + </para> + </listitem> + </itemizedlist> + + <para> + If the ledger was closed gracefully, ZooKeeper will have the last entry and everything will work well. But, if the + BookKeeper client that was writing the ledger dies, there is some recovery that needs to take place. + </para> + + <para> + The problematic entries are the ones at the end of the ledger. There can be entries in flight when a BookKeeper client + dies. If the entry only gets to one bookie, the entry should not be readable since the entry will disappear if that bookie + fails. If the entry is only on one bookie, that doesn't mean that the entry has not been recorded successfully; the other + bookies that recorded the entry might have failed. + </para> + + <para> + The trick to making everything work is to have a correct idea of a last entry. We do it in roughly three steps: + </para> + <orderedlist> + <listitem> + <para> + Find the entry with the highest last recorded entry, <emphasis>LC</emphasis>; + </para> + </listitem> + + <listitem> + <para> + Find the highest consecutively recorded entry, <emphasis>LR</emphasis>; + </para> + </listitem> + + <listitem> + <para> + Make sure that all entries between <emphasis>LC</emphasis> and <emphasis>LR</emphasis> are on a quorum of bookies; + </para> + </listitem> + + </orderedlist> + </section> </section> </article> \ No newline at end of file
Added: hadoop/zookeeper/trunk/src/docs/src/documentation/resources/images/bk-overview.jpg URL: http://svn.apache.org/viewvc/hadoop/zookeeper/trunk/src/docs/src/documentation/resources/images/bk-overview.jpg?rev=911716&view=auto ============================================================================== Binary file - no diff available. Propchange: hadoop/zookeeper/trunk/src/docs/src/documentation/resources/images/bk-overview.jpg ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream