Heh, nothing to be sorry about. Thanks for the feedback and for raising
these points, Kishore.

-Flavio

-----Original Message-----
From: kishore g [mailto:[email protected]] 
Sent: 09 July 2013 19:01
To: [email protected]
Subject: Re: Efficient backup and a reasonable restore of an ensemble

Sorry Flavio, I mixed two things in my previous email. When I said
checkpoint A, I meant just saving the last committed transaction id (no
snapshot will be taken). When we need to restore, we simply run the tool
to bring the data directory to that particular zxid (we truncate the txn
log after that zxid). We can then restart the server and we should get
back to that particular point.
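
As a rough illustration of that rollback step, here is a toy Python model.
It is not ZooKeeper code: the Txn record, truncate_log() and replay() are
hypothetical names, and the "data directory" is just a dict plus a list. It
also assumes the snapshot you start from was taken at or before the
checkpoint zxid; any later snapshot would have to be removed first.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Txn:
    """Hypothetical log record; zxids are strictly increasing in the log."""
    zxid: int
    path: str
    value: Optional[str]  # None models a delete


def truncate_log(log: List[Txn], checkpoint_zxid: int) -> List[Txn]:
    """Drop every transaction committed after the checkpoint zxid."""
    return [txn for txn in log if txn.zxid <= checkpoint_zxid]


def replay(snapshot: Dict[str, str], log: List[Txn]) -> Dict[str, str]:
    """Apply the log on top of a snapshot, the way a server does at startup."""
    state = dict(snapshot)
    for txn in log:
        if txn.value is None:
            state.pop(txn.path, None)
        else:
            state[txn.path] = txn.value
    return state


log = [
    Txn(1, "/a", "1"),
    Txn(2, "/b", "2"),     # checkpoint A is noted here: zxid 2
    Txn(3, "/a", "oops"),  # the change that went wrong
    Txn(4, "/b", None),
]
checkpoint_a = 2
restored = replay({}, truncate_log(log, checkpoint_a))
```

Replaying the full log yields the state that includes the bad write, while
replaying the truncated log yields the state as of checkpoint A.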


The second part is about the fuzzy snapshot. I was just trying to explain
to Sergey that it's not really fuzzy if he knows for sure that there are no
updates while the snapshot is being taken. This really depends on the use
case. For example, if all writes happen via a manually run tool, then the
snapshot should not be fuzzy.
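
To make the fuzzy case concrete, here is a small Python simulation. It is
only a sketch under simplifying assumptions, not how the server is
implemented: the tree is a flat dict, the snapshot copies one key per step,
and take_fuzzy_snapshot() and replay() are made-up names. Writes that land
while the copy is in progress may or may not end up in the snapshot,
depending on whether their key was already copied.

```python
def take_fuzzy_snapshot(tree, concurrent_writes):
    """Copy `tree` one key at a time while writes keep landing on it.

    `concurrent_writes` maps a step index to a (zxid, path, value) write
    that is applied to the live tree just before that step's key is copied.
    Returns the snapshot and the log of writes seen during the copy.
    """
    snapshot = {}
    log = []
    for step, path in enumerate(sorted(tree)):
        if step in concurrent_writes:
            zxid, wpath, value = concurrent_writes[step]
            tree[wpath] = value
            log.append((zxid, wpath, value))
        snapshot[path] = tree[path]
    return snapshot, log


def replay(snapshot, log):
    """Re-apply logged writes; setting a value twice is harmless."""
    state = dict(snapshot)
    for _zxid, path, value in log:
        state[path] = value
    return state


live = {"/a": "a0", "/b": "b0", "/c": "c0"}
writes = {
    1: (101, "/a", "a1"),  # /a was already copied: missing from the snapshot
    2: (102, "/c", "c1"),  # /c not copied yet: captured by the snapshot
}
snapshot, log = take_fuzzy_snapshot(live, writes)
recovered = replay(snapshot, log)
```

Here the snapshot holds the effect of zxid 102 but not of zxid 101, a state
the tree never actually had. Replaying the log on top of it converges to the
live state, which is why the snapshot needs its txn log. With no concurrent
writes the snapshot is simply an exact copy.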





On Tue, Jul 9, 2013 at 9:02 AM, Sergey Maslyakov <[email protected]> wrote:

> I think I am having difficulties understanding the "fuzzy" concept.
> Let's say I started to serialize DataTree into a snapshot file and it
> took 30 seconds. During these 30 seconds, the server saw 5
> transactions that updated the data. Does this mean that the snapshot
> that I get on disk at the end of the 30-second interval will have some
> of these 5 transactions? Or will it have none? Or will it have all of
> them? Or will it be inconsistent and unreadable by Zookeeper?
>
> Please help me better understand the behavior behind the "fuzzy" term.
>
> For my use case, I am perfectly fine if I get a snapshot with none of
> these 5 transactions, considering that I will pick them up next time I
> take a snapshot.
>
>
> /Sergey
>
>
> On Tue, Jul 9, 2013 at 12:08 AM, kishore g <[email protected]> wrote:
>
> > It's not really elaborate; it is very similar to what ZooKeeper does
> > when it starts up. It first reads the latest snapshot file and then
> > the transaction logs, and applies each and every transaction. What I
> > am suggesting is that, instead of applying all transactions, it stops
> > at a transaction I provide.
> >
> > Having this tool will actually simplify your task: you can go back
> > to any point in time. Think of something like this:
> >
> > checkpoint A // this can store the last zxid or timestamp from the leader
> > make changes to zk
> > // if things fail
> > stop zks
> > rollback A // run this on each zk; brings the cluster back to its previous state
> > start zks // any order should be fine
> >
> >
> > Also keep in mind that a snapshot is fuzzy only if there are writes
> > happening while the snapshot is being taken. If you are sure no writes
> > will happen while you are taking the snapshot, then you are good.
> > Experts, please correct me if this is incorrect.
> >
> > thanks,
> > Kishore G
> >
> >
> > On Mon, Jul 8, 2013 at 9:42 PM, Sergey Maslyakov <[email protected]>
> > wrote:
> >
> > > Kishore,
> > >
> > > This sounds like a very elaborate tool. I was trying to find a
> > > simplistic approach, but what Thawan said about "fuzzy snapshots"
> > > makes me a little afraid that there is no simple solution.
> > >
> > >
> > > On Mon, Jul 8, 2013 at 11:05 PM, kishore g <[email protected]> wrote:
> > >
> > > > Agreed, we already have such a tool. In fact, we use it to
> > > > reconstruct the sequence of events that led to a failure, and to
> > > > actually restore the system to a previous stable point and replay
> > > > the events. Unfortunately, this is tied closely to Helix, but it
> > > > should be easy to make this a generic tool.
> > > >
> > > > Sergey, is this something that would be useful in your case?
> > > >
> > > > Thanks,
> > > > Kishore G
> > > >
> > > >
> > > > On Mon, Jul 8, 2013 at 8:09 PM, Thawan Kooburat <[email protected]> wrote:
> > > >
> > > > > On the restore part, I think having a separate utility to
> > > > > manipulate the data/snap dir (by truncating the log/removing
> > > > > snapshots to a given zxid) would be easier than modifying the
> > > > > server.
> > > > >
> > > > >
> > > > > --
> > > > > Thawan Kooburat
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 7/8/13 6:34 PM, "kishore g" <[email protected]> wrote:
> > > > >
> > > > > >I think what we are looking at is point-in-time restore
> > > > > >functionality. How about adding a feature that says go back to
> > > > > >a specific zxid/timestamp? This way, before making any change
> > > > > >to ZooKeeper, simply note down the timestamp/zxid on the
> > > > > >leader. If things go wrong after making changes, bring down
> > > > > >the ZooKeeper servers and provide an additional parameter of a
> > > > > >zxid/timestamp while restarting. The server can go to the
> > > > > >exact point and make it current. The followers can be started
> > > > > >blank.
> > > > > >
> > > > > >
> > > > > >
> > > > > >On Mon, Jul 8, 2013 at 5:53 PM, Thawan Kooburat <[email protected]> wrote:
> > > > > >
> > > > > >> Just saw that this is the corresponding use case to the
> > > > > >> question posted in the dev list.
> > > > > >>
> > > > > >> In order to restore the data to a given point in time
> > > > > >> correctly, you need both the snapshot and the txnlog. This is
> > > > > >> because a ZooKeeper snapshot is fuzzy, and a snapshot alone
> > > > > >> may not represent a valid state of the server if there are
> > > > > >> in-flight requests.
> > > > > >>
> > > > > >> The 4lw command should cause the server to roll the log and
> > > > > >> take a snapshot, similar to the periodic snapshotting
> > > > > >> operation. Your backup script needs to grab the snapshot and
> > > > > >> the corresponding txnlog file from the data dir.
> > > > > >>
> > > > > >> To restore, just shut down all hosts, clear the data dir,
> > > > > >> copy over the snapshot and txnlog, and restart them.
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Thawan Kooburat
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On 7/8/13 3:28 PM, "Sergey Maslyakov" <[email protected]> wrote:
> > > > > >>
> > > > > >> >Thank you for your response, Flavio. I apologize, I did not
> > > > > >> >provide a clear explanation of the use case.
> > > > > >> >
> > > > > >> >This backup/restore is not intended to be tied to any write
> > > > > >> >event. Instead, it is expected to run as a periodic (daily?)
> > > > > >> >cron job on one of the servers, which is not guaranteed to be
> > > > > >> >the leader of the ensemble. There is no expectation that all
> > > > > >> >recent changes are committed and persisted to disk. The
> > > > > >> >system can sustain the loss of several hours' worth of recent
> > > > > >> >changes in the event of a restore.
> > > > > >> >
> > > > > >> >As for finding the leader dynamically and performing the
> > > > > >> >backup on it, this approach could be more difficult, as the
> > > > > >> >leader can change from time to time and I still need to fetch
> > > > > >> >the file to store it in my designated backup location. Taking
> > > > > >> >a backup on one server and picking it up from a local file
> > > > > >> >system looks less error-prone. Even if I went the fancy route
> > > > > >> >and had Zookeeper send me the serialized DataTree in response
> > > > > >> >to the 4lw, this approach would involve a lot of moving parts.
> > > > > >> >
> > > > > >> >I have already made a PoC for a new 4lw that invokes
> > > > > >> >takeSnapshot() and returns an absolute path to the snapshot
> > > > > >> >it drops on disk. I have already protected takeSnapshot()
> > > > > >> >from concurrent invocation, which is likely to corrupt the
> > > > > >> >snapshot file on disk. This approach works, but I'm thinking
> > > > > >> >of taking it one step further by providing the desired path
> > > > > >> >name as an argument to my new 4lw and having the Zookeeper
> > > > > >> >server drop the snapshot into the specified file and report
> > > > > >> >success/failure back. This way I can avoid cluttering the
> > > > > >> >data directory and interfering with what Zookeeper finds when
> > > > > >> >it scans the data directory.
> > > > > >> >
> > > > > >> >The approach of having an additional server that would take
> > > > > >> >the leadership and populate the ensemble is just a theory. I
> > > > > >> >don't see a clean way of making a quorum member the leader of
> > > > > >> >the quorum. Am I overlooking something simple?
> > > > > >> >
> > > > > >> >In backup and restore of an ensemble, the biggest unknown for
> > > > > >> >me remains populating the ensemble with the desired data. I
> > > > > >> >can think of two ways:
> > > > > >> >
> > > > > >> >1. Clear out all servers by stopping them, purge the
> > > > > >> >version-2 directories, restore a snapshot file on one server
> > > > > >> >that will be brought up first, and then bring up the rest of
> > > > > >> >the ensemble. This way I somewhat force the first server to
> > > > > >> >be the leader because it has data and it will be the only
> > > > > >> >member of a quorum with data, given the way I start the
> > > > > >> >ensemble. This looks like a hack, though.
> > > > > >> >
> > > > > >> >2. Clear out the ensemble and reload it with a dedicated
> > > > > >> >client using the provided Zookeeper API.
> > > > > >> >
> > > > > >> >With the approach of backing up an actual snapshot file,
> > > > > >> >option #1 appears to be more practical.
> > > > > >> >
> > > > > >> >I wish I could start the ensemble with a designated leader
> > > > > >> >that would bootstrap the ensemble with data and then the
> > > > > >> >ensemble would go into its normal business...
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >On Mon, Jul 8, 2013 at 4:30 PM, Flavio Junqueira
> > > > > >> ><[email protected]> wrote:
> > > > > >> >
> > > > > >> >> One bit that is still a bit confusing to me in your use
> > > > > >> >> case is whether you need to take a snapshot right after
> > > > > >> >> some event in your application. Even if you're able to tell
> > > > > >> >> ZooKeeper to take a snapshot, there is no guarantee that it
> > > > > >> >> will happen at the exact point you want it if update
> > > > > >> >> operations keep coming.
> > > > > >> >>
> > > > > >> >> If you use your four-letter word approach, then would you
> > > > > >> >> search for the leader or would you simply take a snapshot
> > > > > >> >> at any server? If it has to go through the leader so that
> > > > > >> >> you make sure to have the most recent committed state, then
> > > > > >> >> it might not be a bad idea to have an API call that tells
> > > > > >> >> the leader to take a snapshot in some directory of your
> > > > > >> >> choice. Informing you of the name of the snapshot file so
> > > > > >> >> that you can copy it sounds like an option, but perhaps it
> > > > > >> >> is not as convenient.
> > > > > >> >>
> > > > > >> >> The approach of adding another server is not very clear.
> > > > > >> >> How do you force it to be the leader? Keep in mind that if
> > > > > >> >> it crashes, then it will lose leadership.
> > > > > >> >>
> > > > > >> >> -Flavio
> > > > > >> >>
> > > > > >> >> On Jul 8, 2013, at 8:34 AM, Sergey Maslyakov <[email protected]> wrote:
> > > > > >> >>
> > > > > >> >> > It looks like the "dev" mailing list is rather inactive.
> > > > > >> >> > Over the past few days I only saw several automated
> > > > > >> >> > emails from JIRA and this is pretty much it. Contrary to
> > > > > >> >> > this, the "user" mailing list seems to be more alive and
> > > > > >> >> > more populated.
> > > > > >> >> >
> > > > > >> >> > With this in mind, please allow me to cross-post here the
> > > > > >> >> > message I sent to the "dev" list a few days ago.
> > > > > >> >> >
> > > > > >> >> >
> > > > > >> >> > Regards,
> > > > > >> >> > /Sergey
> > > > > >> >> >
> > > > > >> >> > === forwarded message begins here ===
> > > > > >> >> >
> > > > > >> >> > Hi!
> > > > > >> >> >
> > > > > >> >> > I'm facing a problem that has been raised by multiple
> > > > > >> >> > people, but none of the discussion threads seem to
> > > > > >> >> > provide a good answer. I dug into the Zookeeper source
> > > > > >> >> > code trying to come up with some possible approaches, and
> > > > > >> >> > I would like to get your input on those.
> > > > > >> >> >
> > > > > >> >> > Initial conditions:
> > > > > >> >> >
> > > > > >> >> > * I have an ensemble of five Zookeeper servers running
> > > > > >> >> > v3.4.5 code.
> > > > > >> >> > * The size of a committed snapshot file is in the
> > > > > >> >> > vicinity of 1GB.
> > > > > >> >> > * There are about 80 clients connected to the ensemble.
> > > > > >> >> > * Clients are heavily read-biased, i.e., they mostly read
> > > > > >> >> > and rarely write. I would say less than 0.1% of queries
> > > > > >> >> > modify the data.
> > > > > >> >> >
> > > > > >> >> > Problem statement:
> > > > > >> >> >
> > > > > >> >> > * Under certain conditions, I may need to revert the data
> > > > > >> >> > stored in the ensemble to an earlier state. For example,
> > > > > >> >> > one of the clients may ruin the application-level data
> > > > > >> >> > integrity and I need to perform a disaster recovery.
> > > > > >> >> >
> > > > > >> >> > Things look nice and easy if I'm dealing with a single
> > > > > >> >> > Zookeeper server. A file-level copy of the data and
> > > > > >> >> > dataLog directories should allow me to recover later by
> > > > > >> >> > stopping Zookeeper, swapping the corrupted data and
> > > > > >> >> > dataLog directories with a backup, and firing Zookeeper
> > > > > >> >> > back up.
> > > > > >> >> >
> > > > > >> >> > Now, the ensemble deployment and the leader election
> > > > > >> >> > algorithm in the quorum make things much more difficult.
> > > > > >> >> > In order to restore from a single file-level backup, I
> > > > > >> >> > need to take the whole ensemble down, wipe out the data
> > > > > >> >> > and dataLog directories on all servers, replace these
> > > > > >> >> > directories with backed-up content on one of the servers,
> > > > > >> >> > bring this server up first, and then bring up the rest of
> > > > > >> >> > the ensemble. This [somewhat] guarantees that the
> > > > > >> >> > populated Zookeeper server becomes a member of a majority
> > > > > >> >> > and populates the ensemble. This approach works, but it
> > > > > >> >> > is very involved and, thus, prone to human error.
> > > > > >> >> >
> > > > > >> >> > Based on a study of the Zookeeper source code, I am
> > > > > >> >> > considering the following alternatives, and I seek advice
> > > > > >> >> > from the Zookeeper development community as to which
> > > > > >> >> > approach looks more promising or if there is a better
> > > > > >> >> > way.
> > > > > >> >> >
> > > > > >> >> > Approach #1:
> > > > > >> >> >
> > > > > >> >> > Develop a complementary pair of utilities for export and
> > > > > >> >> > import of the data. Both utilities will act as Zookeeper
> > > > > >> >> > clients and use the existing API. The "export" utility
> > > > > >> >> > will recursively retrieve data and store it in a file.
> > > > > >> >> > The "import" utility will first purge all data from the
> > > > > >> >> > ensemble and then reload it from the file.
> > > > > >> >> >
> > > > > >> >> > This approach seems to be the simplest, and there are
> > > > > >> >> > similar tools developed already. For example, the Guano
> > > > > >> >> > Project:
> > > > > >> >> > https://github.com/d2fn/guano
> > > > > >> >> >
> > > > > >> >> > I don't like two things about it:
> > > > > >> >> > * Poor performance even on a backup for a data store of
> > > > > >> >> > my size.
> > > > > >> >> > * Possible data consistency issues due to concurrent
> > > > > >> >> > access by the export utility as well as other "normal"
> > > > > >> >> > clients.
> > > > > >> >> >
> > > > > >> >> > Approach #2:
> > > > > >> >> >
> > > > > >> >> > Add another four-letter command that would force rolling
> > > > > >> >> > up the transactions and creating a snapshot. The result
> > > > > >> >> > of this command would be a new snapshot.XXXX file on
> > > > > >> >> > disk, and the name of the file could be reported back to
> > > > > >> >> > the client as a response to the four-letter command. This
> > > > > >> >> > way, I would know which snapshot file to grab for a
> > > > > >> >> > possible future restore. But restoring from a snapshot
> > > > > >> >> > file is almost as involved as the error-prone sequence
> > > > > >> >> > described above.
> > > > > >> >> >
> > > > > >> >> > Approach #3:
> > > > > >> >> >
> > > > > >> >> > Come up with a way to temporarily add a new Zookeeper
> > > > > >> >> > server into a live ensemble that would take over (how?)
> > > > > >> >> > the leader role and push out the snapshot that it has to
> > > > > >> >> > all ensemble members upon restore. This approach could be
> > > > > >> >> > difficult and error-prone to implement because it will
> > > > > >> >> > require hacking the existing election algorithm to
> > > > > >> >> > designate a leader.
> > > > > >> >> >
> > > > > >> >> > So, which of the approaches do you think works best for
> > > > > >> >> > an ensemble and for a database size of about 1GB?
> > > > > >> >> >
> > > > > >> >> >
> > > > > >> >> > Any advice will be highly appreciated!
> > > > > >> >> > /Sergey
> > > > > >> >>
> > > > > >> >>
> > > > > >>
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
