question about ZK robustness

2010-11-20 Thread Ted Dunning
I was just asked a very cogent question of the form "how do you know" and
would like somebody who knows better than I do to confirm or deny my
response.  The only part that I am absolutely sure of is the part at the end
where I say "No doubt I have omitted something".  With an edit from Ben,
this probably should become a wiki page.

Here is the conversation.  Please mark my errors or elisions if you can.

 You have to educate me in how ZK does data-integrity checking to avoid
> propagating accidental data-corruption (e.g., from an ext3 bug, faulty drive,
> etc.). We might have to augment ZK to add that ability if it doesn't already.
>

There are several mechanisms: at the application API layer, in ZK internals,
and in the snapshot and transaction log formats.

At the application layer, all updates are atomic and each update replaces the
entire contents of the znode.  If you provide a version number with the update
instead of -1, then your update will only succeed if the current version of
the znode matches that version.
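
As a concrete illustration (a minimal sketch against the standard Java
client; the retry loop and names are mine, not part of the question), the
versioned-update idiom looks like this:

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class VersionedUpdate {
    // Optimistic update: succeeds only if nobody changed the znode
    // between our read and our write.
    public static void update(ZooKeeper zk, String path, byte[] newData)
            throws KeeperException, InterruptedException {
        while (true) {
            Stat stat = new Stat();
            zk.getData(path, false, stat);          // read the current version
            try {
                zk.setData(path, newData, stat.getVersion());
                return;
            } catch (KeeperException.BadVersionException e) {
                // Someone else won the race; re-read and try again.
            }
        }
    }
}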

All updates are strictly serialized so there are no race conditions on
updates.  This makes lots of things simpler inside of ZK.

All updates go through the ZK master and are only committed when replicated
to a quorum of the cluster.  The quorum is ceil((n+1)/2) where n is the
configured number of servers (e.g., 2 of 3 servers, or 3 of 5).  Committed
means that the update has been flushed to the transaction log on disk.  The
replication of updates is from master memory to slave memory rather than
from master memory to disk and then to the slave.

All logs and snapshots have application-level CRCs and don't depend on disk
ECC for correctness.

At cluster start, logs are examined to determine the latest transaction that
was committed correctly.  At least a quorum must have the latest update
because of the update semantics, and the strong CRCs prevent a partially
written transaction from being read as acceptable.  I don't know exactly how
the last update id is stored.

Reads can give stale data for short periods of time, but will always give a
coherently updated picture of what the master knew at some point in the
past.  If a client connection is not lost, then the client's view of the last
update id is monotonically increasing.  If a client loses a connection and
reconnects, it is conceivable that it will see an
earlier version of the universe, but this is exceedingly unlikely because
the time required to reconnect is typically longer than how far out of date
any ZK cluster member can typically be.  You can cause this if you connect
to a ZK cluster member, partition the cluster so that your connected server
is in the majority, update, and then connect to a cluster member that is
separated from the master of the ZK cluster.  You can always do a sync to
force the server you are using to catch up to the master (at least for the
moment).  The use of sync will force a monotonic view of time regardless of
any connection/disconnection/reconnection scenario.
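
To make the sync idea concrete, here is a minimal sketch (error handling
reduced to the bare minimum; names are mine) of sync-then-read with the
Java client:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class SyncThenRead {
    // sync() asks the server we are connected to to catch up with the
    // master; the getData() that follows then sees at least everything
    // committed before the sync was issued.
    public static byte[] freshRead(ZooKeeper zk, String path)
            throws KeeperException, InterruptedException {
        final CountDownLatch synced = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();
        return zk.getData(path, false, null);
    }
}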

Since all updates must go through the cluster master, updates cannot happen
in the cluster split brain scenarios and the strong serialization guarantees
for all updates are maintained.

The only data loss scenarios I have heard of in about 3 years of watching
all ZK mailing list traffic and all bug reports have had to do with the
theoretical possibility of problems due to disk controllers that lie about
when data is persisted.  I have seen cluster failures where memory is
exhausted, but I can't remember any that caused loss of data.  There was one
bug where somehow a cluster member remembered a transaction id that was one
past the last transaction that had been committed to the logs.  This was a
very strange coincidence of very aggressive operator error and a bug which
has been fixed.  The cluster refused to restart in this case, but the fix
was simply to delete the log on the confused machine and restart the
cluster.  No data was lost.

I have, no doubt, omitted something.


Re: Persistent watch stream?

2010-11-12 Thread Ted Dunning
Persistent watches were omitted from ZK on purpose because of the perceived
danger of not having a load-shedding mechanism.

Note that when you get a notification, the query you do to get the next
state typically sets the next watch.  This guarantees that you don't lose
anything, but it may mean that you effectively get notified of multiple
changes in one notification.
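
A minimal sketch of that watch/re-read loop (class and method names are
mine; a real client would also handle connection events):

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class RewatchingReader implements Watcher {
    private final ZooKeeper zk;
    private final String path;

    public RewatchingReader(ZooKeeper zk, String path) throws Exception {
        this.zk = zk;
        this.path = path;
        read();     // prime the first read and set the first watch
    }

    public void process(WatchedEvent event) {
        try {
            read(); // each notification triggers a read that re-arms the watch
        } catch (Exception e) {
            // connection loss etc. handled elsewhere in real code
        }
    }

    private void read() throws Exception {
        // Passing 'this' re-registers the watch atomically with the read,
        // so intervening changes are coalesced rather than lost.
        byte[] state = zk.getData(path, this, null);
        handle(state);
    }

    private void handle(byte[] state) { /* application logic */ }
}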

If you really, really want all notifications in order, then what you are
really looking for is a kind of distributed transaction log.  For small
applications, you can implement this by writing logs into ZK znodes.  Your
clients should remember where they were and, as they are notified, read
the current file to catch up.  This has the downside that to update a file
you have to completely rewrite it, which makes it inconvenient to put a bunch
of stuff into a single chunk of the log.  You would also need a watcher on
the directory to notify you when new log files are created.  Aside from the
slightly quadratic update problem, this does what you need.

You can also check out BookKeeper, which is a more scalable distributed
transaction logging system.  It addresses the complications of (mis)using
ZK as a log system and attempts to give you a reliable and robust
transaction log.

On Fri, Nov 12, 2010 at 7:16 PM, Chang Song  wrote:

>
> Hi.
>
> It's been a couple of weeks since we started integrating zookeeper into our
> app.
> Great piece of software. Cut down LOC by more than half.
> Thank you for open sourcing Zookeeper.
>
> I have a couple of questions, though.
> I was wondering if you have any plans to support a persistent watch stream
> feature?
>
> One-time watch is very inconvenient for us.
> We need to get a stream of notifications in order. We are talking thousands
> of clients.
> Since notifications can happen in bulk, we need to set a watch first, and
> once we get a callback,
> we need to check periodically what happens to the children of the watched
> nodes.
>
> You can consider this as an automatic server-side watch registration
> feature: I'd call this sticky watch.
> I think it is easier to implement this as a sticky watch.
>
> Another question is whether it is possible to coalesce a predefined number
> of changes for watch callbacks.
> We have observed that if there are many changes to children nodes, clients
> get different numbers of
> messages in bulk. This forces the architecture of our application from
> event-driven to polling for a while.
> Thus pretty much the same reasoning behind the first question.
>
> Any comments welcome.
>
> Thank you.
>
> Chang
>
>
>
>


Re: Running cluster behind load balancer

2010-11-03 Thread Ted Dunning
DNS round-robin works as well.
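
Either way, the stock Java client already takes a list of servers and picks
among them at random, so an explicit list and a round-robin DNS name behave
much the same.  A sketch (hostnames and timeout are made up):

import java.io.IOException;
import org.apache.zookeeper.ZooKeeper;

public class Connect {
    public static ZooKeeper connect() throws IOException {
        return new ZooKeeper(
            "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181",
            30000,                                  // session timeout in ms
            event -> { /* connection state events arrive here */ });
    }
}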

On Wed, Nov 3, 2010 at 3:45 PM, Benjamin Reed  wrote:

> it would have to be a TCP-based load balancer to work with ZooKeeper
> clients, but other than that it should work really well. The clients will be
> doing heartbeats so the TCP connections will be long-lived. The client
> library does random connection load balancing anyway.
>
> ben
>
> On 11/03/2010 12:19 PM, Luka Stojanovic wrote:
>
>> What would be the expected behavior if a three-node cluster is put behind
>> a load balancer? It would ease deployment because all clients would be
>> configured to target zookeeper.example.com regardless of actual cluster
>> configuration, but I have the impression that the client-server connection
>> is stateful and that jumping randomly from server to server could bring
>> strange behavior.
>>
>>  Cheers,
>>
>> --
>> Luka Stojanovic
>> lu...@vast.com
>> Platform Engineering
>>
>
>


Re: Client seeing wrong data on nodeDataChanged

2010-10-28 Thread Ted Dunning
On Thu, Oct 28, 2010 at 9:56 PM, Stack  wrote:

> On Thu, Oct 28, 2010 at 7:32 PM, Ted Dunning 
> wrote:
> > Client 2 is not guaranteed to see X if it doesn't get to asking before
> the
> > value has been updated to Y.
> >
> Right, but I wouldn't expect the watch to be triggered twice with value Y.
>

It may not have been.  It may have been triggered by the change to X, but by
the time the client got around to looking, the value was already Y.  The
trigger and the value are not connected.


Re: Client seeing wrong data on nodeDataChanged

2010-10-28 Thread Ted Dunning
Client 2 is not guaranteed to see X if it doesn't get to asking before the
value has been updated to Y.

On Thu, Oct 28, 2010 at 2:39 PM, Stack  wrote:

> Client 2 is also watching the znode.  It gets notified three times:
> two nodeDataChanged events (only) and a nodeDeleted event.  I'd expect
> 3 nodeDataChanged events but understand a client might skip states.
> The problem is that when client 2 looks at the data in the znode on
> nodeDataChanged, for both cases the data is Y.  Not X and then Y, but
> Y both times.  This is unexpected.
>


Re: znode recovery automatically?

2010-10-21 Thread Ted Dunning
On Thu, Oct 21, 2010 at 9:08 AM, Sean Bigdatafun
wrote:

> Can a lost znode be recovered automatically? Say, in a 3-znode Zookeeper
> cluster, the cluster gets into a critical state if a znode is lost. If I
> bring that lost znode back into running, can it rejoin the quorum?
>

Yes.

Note that you have a terminological confusion here.

A znode is a Zookeeper data object roughly equivalent to a file.  You don't
mean this, I think.

A Zookeeper server or node is a computer that is running a copy of
Zookeeper.  You mean this.

The answer is, yes, the node can rejoin.  Moreover, a new node can join and
you can progressively update the configuration
for the existing nodes one at a time.


> If it can, then that means my zookeeper cluster can run forever if I can
> somehow take care of my znodes (say, running a watchdog);


Correct.  You can even keep your cluster running across minor upgrades by
using the rolling restart
idea from above.  Likewise, if you decide to patch and reboot the machines
involved, you can do it
one machine at a time and avoid any downtime for the ZK cluster itself.


> if it cannot, then that means my zookeeper cluster will need to get
> restarted after a long
> period of time (because you will lose the total zookeeper cluster after
> losing two znodes -- and that can happen if there is no mechanism for a
> znode
> to rejoin)
>

This is not a worry.

It is common for ZK clusters to last > 1 year.  Because of the rolling
restart trick, the uptime for
the ZK cluster can easily exceed the uptime of any single machine.

In some ways, a ZK cluster is a bit like the story of Abe Lincoln's ax.
 (See http://en.wikipedia.org/wiki/Ship_of_Theseus or
http://www.nytimes.com/books/first/m/mansfield-ax.html)


Re: Retrying sequential znode creation

2010-10-20 Thread Ted Dunning
These corner cases are relatively rare, I would think (I personally keep
logs around for days or longer).

Would it be possible to get a partial solution in place that invokes the
current behavior if logs aren't available?

On Wed, Oct 20, 2010 at 10:42 AM, Patrick Hunt  wrote:

> Hi Ted, Mahadev is in the best position to comment (he looked at it last)
> but iirc when we started looking into implementing this we immediately ran
> into so big questions. One was what to do if the logs had been cleaned up
> and the individual transactions no longer available. This could be overcome
> by changes wrt cleanup, log rotation, etc... There was another more
> bulletproof option, essentially to keep all the changes in memory that
> might
> be necessary to implement 22, however this might mean a significant
> increase
> in mem requirements and general bookkeeping. It turned out (again correct
> me
> if I'm wrong) that more thought was going to be necessary, esp around
> ensuring correct operation in any/all special cases.
>
> Patrick
>
> On Wed, Oct 13, 2010 at 12:49 PM, Ted Dunning 
> wrote:
>
> > Patrick,
> >
> > What are these hurdles?  The last comment on ZK-22 was last winter.  Back
> > then, it didn't sound like
> > it was going to be that hard.
> >
> > On Wed, Oct 13, 2010 at 12:08 PM, Patrick Hunt  wrote:
> >
> > > 22 would help with this issue
> > > https://issues.apache.org/jira/browse/ZOOKEEPER-22
> > > however there are some real hurdles to implementing 22 successfully.
> > >
> >
>


Re: Testing zookeeper outside the source distribution?

2010-10-18 Thread Ted Dunning
Generally, I think a better way to do this is to use a standard mock object
framework.  Then you don't have to fake up an interface.

But the original poster probably has a need to do integration tests more
than unit tests.  In such tests, they need to test against a real ZK to make
sure that their assumptions about the semantics of ZK are valid.
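
As a sketch of the mock-framework approach (assuming Mockito and JUnit,
which nobody in this thread actually named; the path and data are invented):

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.apache.zookeeper.ZooKeeper;
import org.junit.Assert;
import org.junit.Test;

public class ZkMockTest {
    @Test
    public void mocksZooKeeperDirectly() throws Exception {
        // Mock the concrete ZooKeeper class; no hand-rolled interface needed.
        ZooKeeper zk = mock(ZooKeeper.class);
        when(zk.getData("/config", false, null)).thenReturn("on".getBytes());

        // In a real test the mock would be handed to the code under test.
        Assert.assertArrayEquals("on".getBytes(),
                                 zk.getData("/config", false, null));
    }
}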

On Mon, Oct 18, 2010 at 8:53 AM, David Rosenstrauch wrote:

> Consequently, the way I write my code for ZooKeeper is against a more
> generic interface that provides operations for open, close, getData, and
> setData.  When unit testing, I substitute in a "dummy" implementation that
> just stores data in memory (i.e., a HashMap); when running live code I use
> an implementation that talks to ZooKeeper.
>


Re: Retrying sequential znode creation

2010-10-13 Thread Ted Dunning
Patrick,

What are these hurdles?  The last comment on ZK-22 was last winter.  Back
then, it didn't sound like
it was going to be that hard.

On Wed, Oct 13, 2010 at 12:08 PM, Patrick Hunt  wrote:

> 22 would help with this issue
> https://issues.apache.org/jira/browse/ZOOKEEPER-22
> however there are some real hurdles to implementing 22 successfully.
>


Re: Retrying sequential znode creation

2010-10-12 Thread Ted Dunning
Yes.  This is indeed a problem.  I generally try to avoid sequential nodes
unless they are ephemeral.  If I get an error on
creation, I generally have to either tear down the connection (losing all
other ephemeral nodes in the process) or scan through
all live nodes trying to determine whether mine got created.  Neither is a
very acceptable answer, so I try to avoid the problem.

Your UUID answer is one option.  At least you know what file got created (or
not) and with good naming you can pretty much guarantee no collisions.  You
don't have to scan all children since you can simply check for the existence
of the file of interest.
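
If you keep the sequence suffix, a single scan for the embedded token after
a ConnectionLoss recovers the answer.  A sketch (naming scheme invented;
real code would bound the retries):

import java.util.List;
import java.util.UUID;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class IdempotentSequentialCreate {
    public static String create(ZooKeeper zk, String parent, byte[] data)
            throws KeeperException, InterruptedException {
        String token = UUID.randomUUID().toString();
        String prefix = parent + "/task-" + token + "-";
        while (true) {
            try {
                return zk.create(prefix, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                                 CreateMode.PERSISTENT_SEQUENTIAL);
            } catch (KeeperException.ConnectionLossException e) {
                // The create may or may not have landed; look for our token.
                List<String> children = zk.getChildren(parent, false);
                for (String child : children) {
                    if (child.contains(token)) {
                        return parent + "/" + child;    // it did land
                    }
                }
                // Not found, so it is safe to retry the create.
            }
        }
    }
}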

There was a JIRA filed that was supposed to take care of this problem, but I
don't know the state of play there.

On Tue, Oct 12, 2010 at 12:11 PM, Vishal K  wrote:

> Hi,
>
> What is the best approach to have an idempotent create() operation for a
> sequential node?
>
> Suppose a client is trying to create a sequential node and it gets a
> ConnectionLoss KeeperException, it cannot know for sure whether the request
> succeeded or not. If in the meantime, the client's session is
> re-established, the client would like to create a sequential znode again.
> However, the client needs to know if its earlier request has succeeded or
> not. If it did, then the client does not need to retry. To my understanding
> ZooKeeper does not provide this feature. Can someone confirm this?
>
> External to ZooKeeper, the client can either set a unique UUID in the path
> to the create call or write the UUID as part of its data. Before retrying,
> it can read back all the children of the parent znode and go through the
> list to determine if its earlier request had succeeded. This doesn't sound
> that appealing to me.
>
> I am guessing this is a common problem that many would have faced. Can
> folks
> give feedback on what their approach was?
>
> Thanks.
> -Vishal
>


Re: Membership using ZK

2010-10-12 Thread Ted Dunning
Yes.  You should get that event.

You should also debug why you are getting disconnected in the first place.
 This is often a symptom of something really bad that is happening on your
client side, such as very long GCs.  If these are unavoidable, then you need
to adjust the timeouts with ZK to reflect reality.  Another possibility is
that your network connections are dropping or that your application is
freezing for a non-GC reason.  Any of these problems is something you
should address.

Of course, the connection loss event should be handled correctly as well,
since honest-to-god disconnects can happen.
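
A sketch of that handling (host list and paths are invented; real code
would add backoff): a Disconnected event is survivable because the client
library reconnects and the ephemeral lives as long as the session, so only
Expired forces a new handle and re-registration:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class Membership implements Watcher {
    private volatile ZooKeeper zk;
    private final String path;

    public Membership(String memberName) throws Exception {
        this.path = "/Membership/" + memberName;
        this.zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, this);
    }

    public void process(WatchedEvent event) {
        try {
            switch (event.getState()) {
                case SyncConnected:
                    // New or re-established session: make sure we are listed.
                    if (zk.exists(path, false) == null) {
                        zk.create(path, new byte[0],
                                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                                  CreateMode.EPHEMERAL);
                    }
                    break;
                case Expired:
                    // The old handle is useless; start a fresh session.
                    zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181",
                                       30000, this);
                    break;
                default:
                    break;  // Disconnected: just wait for the reconnect
            }
        } catch (Exception e) {
            // real code retries with backoff
        }
    }
}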

On Tue, Oct 12, 2010 at 10:57 AM, Avinash Lakshman <
avinash.laksh...@gmail.com> wrote:

> Would my watcher get invoked on this ConnectionLoss event? If so I am
> thinking I will check for KeeperState.Disconnected and reset my state. Is
> my
> understanding correct? Please advice.
>
> Thanks
> Avinash
>
> On Tue, Oct 12, 2010 at 10:45 AM, Benjamin Reed 
> wrote:
>
> >  ZooKeeper considers a client dead when it hasn't heard from that client
> > during the timeout period. clients make sure to communicate with
> ZooKeeper
> > at least once in 1/3 the timeout period. if the client doesn't hear from
> > ZooKeeper in 2/3 the timeout period, the client will issue a
> ConnectionLoss
> > event and cause outstanding requests to fail with a ConnectionLoss.
> >
> > So, if ZooKeeper decides a process is dead, the process will get a
> > ConnectionLoss event. Once ZooKeeper decides that a client is dead, if
> the
> > client reconnects, the client will get a SessionExpired. Once a session
> is
> > expired, the expired handle will become useless, so no new requests, no
> > watches, etc.
> >
> > The bottom line is if your process gets a session expired event, you need
> > to treat that session as expired and recover by creating a new zookeeper
> handle
> > (possibly by restarting the process) and set up your state again.
> >
> > ben
> >
> >
> > On 10/12/2010 09:54 AM, Avinash Lakshman wrote:
> >
> >> This is what I have going:
> >>
> >> I have a bunch of 200 nodes come up and create an ephemeral entry under a
> >> znode named /Membership. When nodes are detected dead, the znode
> >> associated with the dead node under /Membership is deleted and a watch
> >> notification is delivered to the rest of the members. Now there are
> >> circumstances where a node A is deemed dead while the process is still up
> >> and running on A. It is a false detection which I probably need to deal
> >> with. How do I deal with this situation? Over time false detections
> >> delete all the entries underneath the /Membership znode even though all
> >> processes are up and running.
> >>
> >> So my questions are:
> >> Would the watches be pushed out to the node that is falsely deemed dead?
> >> If so, I can have that process recreate the ephemeral znode underneath
> >> /Membership.
> >> If a node leaves a watch and then truly crashes, when it comes back up
> >> would it get watches it missed during the interim period? In any case,
> >> how do watches behave in the event of false/true failure detection?
> >>
> >> Thanks
> >> A
> >>
> >
> >
>


Re: is zookeeper suitable for my application?

2010-10-08 Thread Ted Dunning
ZK provides all of the coordination you need for this problem, but you
should store your data elsewhere.

Any key/value store with decent read-write speed will suffice.  Memcache
would be reasonable for that if you don't need persistence in the presence
of failure.  Voldemort would be another alternative if you do need
persistence.

CouchDB is generally pretty slow as a key/value store and is probably not a
very good option for this.

On Fri, Oct 8, 2010 at 3:31 AM, Li Li  wrote:

>it seems that a znode's size should be less than 1 MB, but I will need to
> save large data files in a znode. Is there any solution for this?
>thank you.
>


Re: Changing configuration

2010-10-07 Thread Ted Dunning
Restart all clients ... eventually.  No need for a grand hurry unless your
ZK servers are very busy.

On Thu, Oct 7, 2010 at 2:29 PM, Adam Lazur  wrote:

> Restart all clients too.
>
> .laz
>
> Patrick Hunt (ph...@apache.org) said:
> > You probably want to do a "rolling restart", this is preferable over
> > "restart the cluster" as the service will not go down.
> > http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A6
> >
> > Patrick
> >
> > On Wed, Oct 6, 2010 at 9:49 PM, Avinash Lakshman <
> avinash.laksh...@gmail.com
> > > wrote:
> >
> > > Suppose I have a 3 node ZK cluster composed of machines A, B and C. Now
> for
> > > whatever reason I lose C forever and the machine needs to be replaced.
> How
> > > do I handle this situation? Update the config with D in place of C and
> > > restart the cluster? Also if I am interested in read just the ZAB
> portions
> > > which packages should I be looking at?
> > >
> > > Cheers
> > > A
> > >
>


Re: Zookeeper on 60+Gb mem

2010-10-05 Thread Ted Dunning
That would be an interesting experiment although it is way outside normal
usage as a coordination store.

I have used ZK as a session store for PHP with OK results.  I never
implemented an expiration mechanism so things
had to be cleared out manually sometimes.  It worked pretty well until
things filled up.

On Tue, Oct 5, 2010 at 11:03 AM, Maarten Koopmans wrote:

> Hi,
>
> I just wondered: has anybody ever run zookeeper "to the max" on a 68GB
> quadruple extra large high memory EC2 instance? With, say, 60GB allocated or
> so?
>
> Because EC2 with EBS is a nice way to grow your zookeeper cluster (data on
> the EBS volumes, upgrade as your memory utilization grows) - I just
> wonder what the limits are there, or if I am going where angels fear to
> tread...
>
> --Maarten


Re: ZK compatability

2010-09-30 Thread Ted Dunning
Looking forward, I don't think that anybody has even proposed anything that
would require a major release yet.

That should mean that you have quite a bit of lifetime ahead on the 3.x
family.  Moreover, it is a cinch to bet that
even when a 4.0 is released, it is unlikely to have enough killer features
to drive wholesale adoption right away.

That means that there will be a 3.x bug-fix branch for quite a while even
after 4.x versions come out.

ZK has the least operations overhead of any software I have ever deployed
into a system.  The worst problem is
that you have to document procedures because you don't have to touch ZK
often enough to remember them
accurately.

On Thu, Sep 30, 2010 at 10:29 AM, Patrick Hunt  wrote:

> Historically major releases can have non-backward-compatible changes.
> However if you look back through the release history you'll see that the
> last time that happened was Oct 2008, when we moved the project from
> SourceForge to Apache.
>
> Patrick
>
> On Tue, Sep 28, 2010 at 11:37 AM, Jun Rao  wrote:
>
> > What about major releases going forward? Thanks,
>


Re: Expiring session... timeout of 600000ms exceeded

2010-09-21 Thread Ted Dunning
Generally, best practice for crawlers is that no process runs for more than
an hour or five.  All crawler processes update
a central state store with their progress, but they exit when they reach a
time limit knowing that somebody else will
take up the work where they leave off.  This avoids a multitude of ills.

On Tue, Sep 21, 2010 at 11:53 AM, Tim Robertson
wrote:

> > On the topic of your application, why you are using processes instead of
> > threads?  With threads, you can get your memory overhead down to 10's of
> > kilobytes as opposed to 10's of megabytes.
>
> I am just prototyping scaling out many processes and potentially
> across multiple machines.  Our live crawler runs in a single JVM, but
> some of these crawlers take 4-6 weeks, so long running processes block
> others, so I was looking at alternatives - our live crawler also uses
> DOM based XML parsing so hitting memory limits - SAX would address
> this.  Also we want to be able to deploy patches to the crawlers
> without interrupting those long running jobs if possible.


Re: Expiring session... timeout of 600000ms exceeded

2010-09-21 Thread Ted Dunning
To answer your last question first, no you don't have to do anything
explicit to keep the ZK connection alive.  It is maintained by a dedicated
thread.  You do have to keep your java program responsive and ZK problems
like this almost always indicate that you have a problem with your program
checking out for extended periods of time.

My strong guess is that you have something evil happening with your java
process that is actually causing this delay.

Since you have tiny memory, it probably isn't GC.  Since you have a bunch of
processes, swap and process wakeup delays seem plausible.  What is the load
average on your box?

On the topic of your application, why you are using processes instead of
threads?  With threads, you can get your memory overhead down to 10's of
kilobytes as opposed to 10's of megabytes.

Also, why not use something like Bixo so you don't have to prototype a
threaded crawler?

On Tue, Sep 21, 2010 at 8:24 AM, Tim Robertson wrote:

> Hi all,
>
> I am seeing a lot of my clients being kicked out after the 10 minute
> negotiated timeout is exceeded.
> My clients are each a JVM (around 100 running on a machine) which are
> doing web crawling of specific endpoints and handling the response XML
> - so they do wait around for 3-4 minutes on HTTP timeouts, but
> certainly not 10 mins.
> I am just prototyping right now on a 2xquad core mac pro with 12GB
> memory, and the 100 child processes only get -Xmx64m and I don't see
> my machine exhausted.
>
> Do my clients need to do anything in order to initiate keep alive
> heart beats or should this be automatic (I thought the ticktime would
> dictate this)?
>
> # my conf is:
> tickTime=2000
> dataDir=/Volumes/Data/zookeeper
> clientPort=2181
> maxClientCnxns=1
> minSessionTimeout=4000
> maxSessionTimeout=80
>
> Thanks for any pointers to this newbie,
> Tim
>


Re: possible bug in zookeeper ?

2010-09-14 Thread Ted Dunning
Also try the four-letter commands (e.g., ruok, stat, or cons, sent to the
client port with nc or telnet) against each server.

On Tue, Sep 14, 2010 at 9:20 AM, Patrick Hunt  wrote:

> That is unusual. I don't recall anyone reporting a similar issue, and
> looking at the code I don't see any issues off hand. Can you try the
> following?
>
> 1) on that particular zk client machine resolve the hosts
> zook1/zook2/zook3,
> what ip addresses does this resolve to? (try dig)
> 2) try running the client using the 3.3.1 jar file (just replace the jar on
> the client), it includes more log4j information, turn on DEBUG or TRACE
> logging
>
> Patrick
>
> On Tue, Sep 14, 2010 at 8:44 AM, Yatir Ben Shlomo  >wrote:
>
> > zook1:2181,zook2:2181,zook3:2181
> >
> >
> > -Original Message-
> > From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> > Sent: Tuesday, September 14, 2010 4:11 PM
> > To: zookeeper-user@hadoop.apache.org
> > Subject: Re: possible bug in zookeeper ?
> >
> > What was the list of servers that was given originally to open the
> > connection to ZK?
> >
> > On Tue, Sep 14, 2010 at 6:15 AM, Yatir Ben Shlomo  > >wrote:
> >
> > > Hi I am using solrCloud which uses an ensemble of 3 zookeeper
> instances.
> > >
> > > I am performing survivability  tests:
> > > Taking one of the zookeeper instances down I would expect the client to
> > use
> > > a different zookeeper server instance.
> > >
> > > But as you can see in the below logs attached
> > > Depending on which instance I choose to take down (in my case,  the
> last
> > > one in the list of zookeeper servers)
> > > the client is constantly insisting on the same zookeeper server
> > (Attempting
> > > connection to server zook3/192.168.252.78:2181)
> > > and not switching to a different one.
> > > The problem seems to arise from ClientCnxn.java.
> > > Does anyone have an idea on this?
> > >
> > > SolrCloud is currently using zookeeper-3.2.2.jar.
> > > Is this a known bug that was fixed in later versions? (3.3.1)
> > >
> > > Thanks in advance,
> > > Yatir
> > >
> > >
> > > Logs:
> > >
> > > Sep 14, 2010 9:02:20 AM org.apache.log4j.Category warn
> > > WARNING: Ignoring exception during shutdown input
> > > java.nio.channels.ClosedChannelException
> > >at
> > > sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638)
> > >at
> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
> > >at
> > >
> >
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(zookeeper:ClientCnxn.java):999)
> > >at
> > >
> >
> org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):970)
> > > Sep 14, 2010 9:02:20 AM org.apache.log4j.Category warn
> > > WARNING: Ignoring exception during shutdown output
> > > java.nio.channels.ClosedChannelException
> > >at
> > > sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:649)
> > >at
> sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
> > >at
> > >
> >
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(zookeeper:ClientCnxn.java):1004)
> > >at
> > >
> >
> org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):970)
> > > Sep 14, 2010 9:02:22 AM org.apache.log4j.Category info
> > > INFO: Attempting connection to server zook3/192.168.252.78:2181
> > > Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn
> > > WARNING: Exception closing session 0x32b105244a20001 to
> > > sun.nio.ch.selectionkeyi...@3ca58cbf
> > > java.net.ConnectException: Connection refused
> > >at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native
> Method)
> > >at
> > sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
> > >at
> > > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> > >at
> > >
> >
> org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):933)
> > > Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn
> > > WARNING: Ignoring exception during shutdown input
> > > java.nio.channels.ClosedChannelException
> > >at
> > > sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638)
> > >at
> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
> > >at
> > >
> &

Re: possible bug in zookeeper ?

2010-09-14 Thread Ted Dunning
And that you can connect from every client to every server?

On Tue, Sep 14, 2010 at 9:07 AM, Mahadev Konar wrote:

> Hi yatir,
>  Can you confirm that zook1, zook2 can be resolved via nslookup from the
> client machine?
>
> We havent seen a bug like this. It would be great to nail this down.
>
> Thanks
> mahadev
>
>
> On 9/14/10 8:44 AM, "Yatir Ben Shlomo"  wrote:
>
> > zook1:2181,zook2:2181,zook3:2181
> >
> >
> > -Original Message-
> > From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> > Sent: Tuesday, September 14, 2010 4:11 PM
> > To: zookeeper-user@hadoop.apache.org
> > Subject: Re: possible bug in zookeeper ?
> >
> > What was the list of servers that was given originally to open the
> > connection to ZK?
> >
> > On Tue, Sep 14, 2010 at 6:15 AM, Yatir Ben Shlomo  >wrote:
> >
> >> Hi I am using solrCloud which uses an ensemble of 3 zookeeper instances.
> >>
> >> I am performing survivability  tests:
> >> Taking one of the zookeeper instances down I would expect the client to
> use
> >> a different zookeeper server instance.
> >>
> >> But as you can see in the below logs attached
> >> Depending on which instance I choose to take down (in my case,  the last
> >> one in the list of zookeeper servers)
>> the client is constantly insisting on the same zookeeper server
> (Attempting
>> connection to server zook3/192.168.252.78:2181)
>> and not switching to a different one.
>> The problem seems to arise from ClientCnxn.java.
>> Does anyone have an idea on this?
>>
>> SolrCloud is currently using zookeeper-3.2.2.jar.
>> Is this a known bug that was fixed in later versions? (3.3.1)
> >>
> >> Thanks in advance,
> >> Yatir
> >>
> >>
> >> Logs:
> >>
> >> Sep 14, 2010 9:02:20 AM org.apache.log4j.Category warn
> >> WARNING: Ignoring exception during shutdown input
> >> java.nio.channels.ClosedChannelException
> >>at
> >> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638)
> >>at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
> >>at
> >>
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(zookeeper:ClientCnxn.java)
> >> :999)
> >>at
> >>
>
> org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):970>>
> )
> >> Sep 14, 2010 9:02:20 AM org.apache.log4j.Category warn
> >> WARNING: Ignoring exception during shutdown output
> >> java.nio.channels.ClosedChannelException
> >>at
> >> sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:649)
> >>at
> sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
> >>at
> >>
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(zookeeper:ClientCnxn.java)
> >> :1004)
> >>at
> >>
>
> org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):970>>
> )
> >> Sep 14, 2010 9:02:22 AM org.apache.log4j.Category info
> >> INFO: Attempting connection to server zook3/192.168.252.78:2181
> >> Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn
> >> WARNING: Exception closing session 0x32b105244a20001 to
> >> sun.nio.ch.selectionkeyi...@3ca58cbf
> >> java.net.ConnectException: Connection refused
> >>at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native
> Method)
> >>at
> sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
> >>at
> >> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> >>at
> >>
>
> org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):933>>
> )
> >> Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn
> >> WARNING: Ignoring exception during shutdown input
> >> java.nio.channels.ClosedChannelException
> >>at
> >> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638)
> >>at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
> >>at
> >>
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(zookeeper:ClientCnxn.java)
> >> :999)
> >>at
> >>
>
> org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):970>>
> )
> >> Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn
> >> WARNING: Ignoring exceptio

Re: possible bug in zookeeper ?

2010-09-14 Thread Ted Dunning
What was the list of servers that was given originally to open the
connection to ZK?

On Tue, Sep 14, 2010 at 6:15 AM, Yatir Ben Shlomo wrote:

> Hi I am using solrCloud which uses an ensemble of 3 zookeeper instances.
>
> I am performing survivability  tests:
> Taking one of the zookeeper instances down I would expect the client to use
> a different zookeeper server instance.
>
> But as you can see in the below logs attached
> Depending on which instance I choose to take down (in my case,  the last
> one in the list of zookeeper servers)
> the client is constantly insisting on the same zookeeper server (Attempting
> connection to server zook3/192.168.252.78:2181)
> and not switching to a different one.
> The problem seems to arise from ClientCnxn.java.
> Does anyone have an idea on this?
>
> SolrCloud is currently using zookeeper-3.2.2.jar.
> Is this a known bug that was fixed in later versions? (3.3.1)
>
> Thanks in advance,
> Yatir
>
>
> Logs:
>
> Sep 14, 2010 9:02:20 AM org.apache.log4j.Category warn
> WARNING: Ignoring exception during shutdown input
> java.nio.channels.ClosedChannelException
>at
> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638)
>at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>at
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(zookeeper:ClientCnxn.java):999)
>at
> org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):970)
> Sep 14, 2010 9:02:20 AM org.apache.log4j.Category warn
> WARNING: Ignoring exception during shutdown output
> java.nio.channels.ClosedChannelException
>at
> sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:649)
>at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
>at
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(zookeeper:ClientCnxn.java):1004)
>at
> org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):970)
> Sep 14, 2010 9:02:22 AM org.apache.log4j.Category info
> INFO: Attempting connection to server zook3/192.168.252.78:2181
> Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn
> WARNING: Exception closing session 0x32b105244a20001 to
> sun.nio.ch.selectionkeyi...@3ca58cbf
> java.net.ConnectException: Connection refused
>at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method)
>at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
>at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>at
> org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):933)
> Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn
> WARNING: Ignoring exception during shutdown input
> java.nio.channels.ClosedChannelException
>at
> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638)
>at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>at
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(zookeeper:ClientCnxn.java):999)
>at
> org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):970)
> Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn
> WARNING: Ignoring exception during shutdown output
> java.nio.channels.ClosedChannelException
>at
> sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:649)
>at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
>at
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(zookeeper:ClientCnxn.java):1004)
>at
> org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):970)
> Sep 14, 2010 9:02:22 AM org.apache.log4j.Category info
> INFO: Attempting connection to server zook3/192.168.252.78:2181
> Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn
> WARNING: Exception closing session 0x32b105244a2 to
> sun.nio.ch.selectionkeyi...@3960f81b
> java.net.ConnectException: Connection refused
>at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method)
>at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
>at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>at
> org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):933)
> Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn
> WARNING: Ignoring exception during shutdown input
> java.nio.channels.ClosedChannelException
>at
> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638)
>at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>at
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(zookeeper:ClientCnxn.java):999)
>at
> org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):970)
> Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn
> WARNING: Ignoring exception during shutdown output
> java.nio.channels.ClosedChannelException
>at
> sun.nio.ch.SocketChannelImpl.shutdownOutp

Re: closing session on socket close vs waiting for timeout

2010-09-10 Thread Ted Dunning
A switch failure could do that, I think.

On Fri, Sep 10, 2010 at 1:49 PM, Fournier, Camille F. [Tech] <
camille.fourn...@gs.com> wrote:

> I am not a networking expert, but in my experience I've seen network
> glitches that cause sockets to appear to be live that are actually dead, but
> not vice-versa. Can you tell me what would cause a socket closure with
> otherwise alive client and server?


Re: closing session on socket close vs waiting for timeout

2010-09-10 Thread Ted Dunning
They would only get expired sessions if they don't reconnect to another
server within a relatively short timeout (at least according to my original
idea... I haven't looked at Camille's suggestion carefully enough to see if
she meant that).

As I see it, the server that loses the client should propagate a close to
the leader when the client disappears.  The leader should note the time and
prepare to expire that session.  When the client reconnects somewhere, that
connection should be propagated to the leader who should cancel the timeout
and switch back to the long timeout accorded to heartbeats from the client.

On Fri, Sep 10, 2010 at 1:01 PM, Benjamin Reed  wrote:

> the thing that worries me about this functionality in general is that
> network anomalies can cause a whole raft of sessions to get expired in this
> way. for example, you have 3 servers with load spread well; there is a
> networking glitch that cause clients to abandon a server; suddenly 1/3 of
> your clients will get expired sessions.
>
>


Re: Understanding ZooKeeper data file management and LogFormatter

2010-09-08 Thread Ted Dunning
Due to issues in my fingers and brain.

On Wed, Sep 8, 2010 at 1:20 PM, Vishal K  wrote:

> Thanks Ted.  Did you have to unwind the cluster due to data consistency
> issues or due to issues at the application?
>
> On Wed, Sep 8, 2010 at 4:06 PM, Ted Dunning  wrote:
>
> > I have used old snapshot files exactly once when I deleted a bunch of
> > server
> > state trying to unwind a tangled
> > cluster.
> >
> > I keep a few around just for backup purposes.
> >
> > On Wed, Sep 8, 2010 at 12:01 PM, Vishal K  wrote:
> >
> > > Hi All,
> > >
> > > Can you please share your experience regarding ZK snapshot retention
> and
> > > recovery policies?
> > >
> > > We have an application where we never need to rollback (i.e., revert
> back
> > > to
> > > a previous state by using old snapshots). Given this, I am trying to
> > > understand under what circumstances would we ever need to use old ZK
> > > snapshots. I understand a lot of these decisions depend on the
> > application
> > > and amount of redundancy used at every level (e.g,. RAID level where
> the
> > > snapshots are stored etc) in the product. To simplify the discussion, I
> > > would like to rule out any application characteristics and focus mainly
> > on
> > > data consistency.
> > >
> > > - Assuming that we have a 3 node cluster I am trying to figure out when
> > > would I really need to use old snapshot files. With 3 nodes we already
> > have
> > > at least 2 servers with consistent database. If I loose files on one of
> > the
> > > servers, I can use files from the other. In fact, ZK server join will
> > take
> > > care of this. I can remove files from a faulty node and reboot that
> node.
> > > The faulty node will sync with the leader.
> > >
> > > - The old files will be useful if the current snapshot and/or log files
> > are
> > > lost or corrupted on all 3 servers. If the loss is due to a disaster
> > (case
> > > where we lose all 3 servers), one would have to keep the snapshots on
> > some
> > > external storage to recover. However, if the current snapshot file is
> > > corrupted on all 3 servers, then the most likely cause would be a bug
> in
> > > ZK.
> > > In which case, how can I trust the consistency of the old snapshots?
> > >
> > > - Given a set of snapshots and log files, how can I verify the
> > correctness
> > > of these files? Example, if one of the intermediate snapshot file is
> > > corrupt.
> > >
> > > - The Admin's guide says "Using older log and snapshot files, you can
> > look
> > > at the previous state of ZooKeeper servers and even restore that state.
> > The
> > > LogFormatter class allows an administrator to look at the transactions
> in
> > a
> > > log." * *Is there a tool that does this for the admin?  The
> LogFormatter
> > > only displays the transactions in the log file.
> > >
> > > - Has anyone ever had to play with the snapshot files in production?
> > >
> > > Thanks in advance.
> > >
> > > Regards,
> > > -Vishal
> > >
> >
>


Re: closing session on socket close vs waiting for timeout

2010-09-08 Thread Ted Dunning
To get it to work in a cluster, what would be necessary?

A new message to the leader to describe connection loss?

On Wed, Sep 8, 2010 at 1:03 PM, Benjamin Reed  wrote:

> unfortunately, that only works on the standalone server.
>
> ben
>
> On 09/08/2010 12:52 PM, Fournier, Camille F. [Tech] wrote:
>
>> This would be the ideal solution to this problem I think.
>> Poking around the (3.3) code to figure out how hard it would be to
>> implement, I figure one way to do it would be to modify the session timeout
>> to the min session timeout and touch the connection before calling close
>> when you get certain exceptions in NIOServerCnxn.doIO. I did this (removing
>> the code in touch session that returns if the tickTime is greater than the
>> expire time) and it worked (in the standalone server anyway). Interesting
>> solution, or total hack that will not work beyond most basic test case?
>>
>> C
>>
>> (forgive lack of actual code in this email)
>>
>> -Original Message-
>> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>> Sent: Tuesday, September 07, 2010 1:11 PM
>> To: zookeeper-user@hadoop.apache.org
>> Cc: Benjamin Reed
>> Subject: Re: closing session on socket close vs waiting for timeout
>>
>> This really is, just as Ben says a problem of false positives and false
>> negatives in detecting session
>> expiration.
>>
>> On the other hand, the current algorithm isn't really using all the
>> information available.  The current algorithm is
>> using time since the last client-initiated heartbeat.  The new proposal is
>> somewhat worse in that it proposes to use
>> just the boolean "has-TCP-disconnect-happened".
>>
>> Perhaps it would be better to use multiple features in order to decrease
>> both false positives and false negatives.
>>
>> For instance, I could imagine that we use the following features:
>>
>> - time since last client heartbeat or disconnect or reconnect
>>
>> - what was the last event? (a heartbeat or a disconnect or a reconnect)
>>
>> Then the expiration algorithm could use a relatively long time since last
>> heartbeat and a relatively short time since last disconnect to mark a
>> session as expired.
>>
>> Wouldn't this avoid expiration during GC and cluster partition and cause
>> expiration quickly after a client disconnect?
>>
>>
>> On Mon, Sep 6, 2010 at 11:26 PM, Patrick Hunt  wrote:
>>
>>
>>
>>> That's a good point, however with suitable documentation, warnings and
>>> such
>>> it seems like a reasonable feature to provide for those users who require
>>> it. Used in moderation it seems fine to me. Perhaps we also make it
>>> configurable at the server level for those administrators/ops who don't
>>> want
>>> to deal with it (disable the feature entirely, or only enable on
>>> particular
>>> servers, etc...).
>>>
>>> Patrick
>>>
>>> On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reed
>>>  wrote:
>>>
>>>
>>>
>>>> if this mechanism were used very often, we would get a huge number of
>>>> session expirations when a server fails. you are trading fast error
>>>> detection for the ability to tolerate temporary network and server
>>>>
>>>>
>>> outages.
>>>
>>>
>>>> to be honest this seems like something that in theory sounds like it
>>>> will
>>>> work in practice, but once deployed we start getting session expirations
>>>>
>>>>
>>> for
>>>
>>>
>>>> cases that we really do not want or expect.
>>>>
>>>> ben
>>>>
>>>>
>>>> On 09/01/2010 12:47 PM, Patrick Hunt wrote:
>>>>
>>>>
>>>>
>>>>> Ben, in this case the session would be tied directly to the connection,
>>>>> we'd explicitly deny session re-establishment for this session type (so
>>>>> 4 would fail). Would that address your concern, others?
>>>>>
>>>>> Patrick
>>>>>
>>>>> On 09/01/2010 10:03 AM, Benjamin Reed wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> i'm a bit skeptical that this is going to work out properly. a server
>>>>>> may receive a socket reset even though the client is still alive:
>>>>>>
>>>>>> 1) client sends a req

Re: Understanding ZooKeeper data file management and LogFormatter

2010-09-08 Thread Ted Dunning
I have used old snapshot files exactly once when I deleted a bunch of server
state trying to unwind a tangled
cluster.

I keep a few around just for backup purposes.

On Wed, Sep 8, 2010 at 12:01 PM, Vishal K  wrote:

> Hi All,
>
> Can you please share your experience regarding ZK snapshot retention and
> recovery policies?
>
> We have an application where we never need to rollback (i.e., revert back
> to
> a previous state by using old snapshots). Given this, I am trying to
> understand under what circumstances would we ever need to use old ZK
> snapshots. I understand a lot of these decisions depend on the application
> and amount of redundancy used at every level (e.g,. RAID level where the
> snapshots are stored etc) in the product. To simplify the discussion, I
> would like to rule out any application characteristics and focus mainly on
> data consistency.
>
> - Assuming that we have a 3 node cluster I am trying to figure out when
> would I really need to use old snapshot files. With 3 nodes we already have
> at least 2 servers with a consistent database. If I lose files on one of the
> servers, I can use files from the other. In fact, ZK server join will take
> care of this. I can remove files from a faulty node and reboot that node.
> The faulty node will sync with the leader.
>
> - The old files will be useful if the current snapshot and/or log files are
> lost or corrupted on all 3 servers. If the loss is due to a disaster (case
> where we lose all 3 servers), one would have to keep the snapshots on some
> external storage to recover. However, if the current snapshot file is
> corrupted on all 3 servers, then the most likely cause would be a bug in
> ZK.
> In which case, how can I trust the consistency of the old snapshots?
>
> - Given a set of snapshots and log files, how can I verify the correctness
> of these files? Example, if one of the intermediate snapshot file is
> corrupt.
>
> - The Admin's guide says "Using older log and snapshot files, you can look
> at the previous state of ZooKeeper servers and even restore that state. The
> LogFormatter class allows an administrator to look at the transactions in a
> log." * *Is there a tool that does this for the admin?  The LogFormatter
> only displays the transactions in the log file.
>
> - Has anyone ever had to play with the snapshot files in production?
>
> Thanks in advance.
>
> Regards,
> -Vishal
>


Re: closing session on socket close vs waiting for timeout

2010-09-07 Thread Ted Dunning
This really is, just as Ben says a problem of false positives and false
negatives in detecting session
expiration.

On the other hand, the current algorithm isn't really using all the
information available.  The current algorithm is
using time since the last client-initiated heartbeat.  The new proposal is
somewhat worse in that it proposes to use
just the boolean "has-TCP-disconnect-happened".

Perhaps it would be better to use multiple features in order to decrease
both false positives and false negatives.

For instance, I could imagine that we use the following features:

- time since last client heartbeat or disconnect or reconnect

- what was the last event? (a heartbeat or a disconnect or a reconnect)

Then the expiration algorithm could use a relatively long time since last
heartbeat and a relatively short time since last disconnect to mark a
session as expired.

Wouldn't this avoid expiration during GC and cluster partition and cause
expiration quickly after a client disconnect?


On Mon, Sep 6, 2010 at 11:26 PM, Patrick Hunt  wrote:

> That's a good point, however with suitable documentation, warnings and such
> it seems like a reasonable feature to provide for those users who require
> it. Used in moderation it seems fine to me. Perhaps we also make it
> configurable at the server level for those administrators/ops who don't
> want
> to deal with it (disable the feature entirely, or only enable on particular
> servers, etc...).
>
> Patrick
>
> On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reed  wrote:
>
> > if this mechanism were used very often, we would get a huge number of
> > session expirations when a server fails. you are trading fast error
> > detection for the ability to tolerate temporary network and server
> outages.
> >
> > to be honest this seems like something that in theory sounds like it will
> > work in practice, but once deployed we start getting session expirations
> for
> > cases that we really do not want or expect.
> >
> > ben
> >
> >
> > On 09/01/2010 12:47 PM, Patrick Hunt wrote:
> >
> >> Ben, in this case the session would be tied directly to the connection,
> >> we'd explicitly deny session re-establishment for this session type (so
> >> 4 would fail). Would that address your concern, others?
> >>
> >> Patrick
> >>
> >> On 09/01/2010 10:03 AM, Benjamin Reed wrote:
> >>
> >>
> >>> i'm a bit skeptical that this is going to work out properly. a server
> >>> may receive a socket reset even though the client is still alive:
> >>>
> >>> 1) client sends a request to a server
> >>> 2) client is partitioned from the server
> >>> 3) server starts trying to send response
> >>> 4) client reconnects to a different server
> >>> 5) partition heals
> >>> 6) server gets a reset from client
> >>>
> >>> at step 6 i don't think you want to delete the ephemeral nodes.
> >>>
> >>> ben
> >>>
> >>> On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:
> >>>
> >>>
> >>>> Yes that's right. Which network issues can cause the socket to close
> >>>> without the initiating process closing the socket? In my limited
> >>>> experience in this area network issues were more prone to leave dead
> >>>> sockets open rather than vice versa so I don't know what to look out
> >>>> for.
> >>>>
> >>>> Thanks,
> >>>> Camille
> >>>>
> >>>> -Original Message-
> >>>> From: Dave Wright [mailto:wrig...@gmail.com]
> >>>> Sent: Tuesday, August 31, 2010 1:14 PM
> >>>> To: zookeeper-user@hadoop.apache.org
> >>>> Subject: Re: closing session on socket close vs waiting for timeout
> >>>>
> >>>> I think he's saying that if the socket closes because of a crash (i.e.
> >>>> not a
> >>>> normal zookeeper close request) then the session stays alive until the
> >>>> session timeout, which is of course true since ZK allows reconnection
> >>>> and
> >>>> resumption of the session in case of disconnect due to network issues.
> >>>>
> >>>> -Dave Wright
> >>>>
> >>>> On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning
> >>>> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> That doesn't sound right to me.
> >>>>>
> >>>>> Is there a Zookeeper expert in the house?
> >>>>>
> >>>>> On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech]<
> >>>>> camille.fourn...@gs.com>  wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>> I foolishly did not investigate the ZK code closely enough and it
> >>>>>> seems
> >>>>>> that closing the socket still waits for the session timeout to
> >>>>>> remove the
> >>>>>> session.
> >>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
> >
>


Re: election recipe

2010-09-02 Thread Ted Dunning
You are correct that this simpler recipe will work for smaller populations
and correct that the complications are to avoid the herd effect.
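
For the record, a sketch of the simple scheme (paths invented; fine for a
handful of candidates, while the recipe's sequence nodes avoid the herd):

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleElection {
    public static void runForever(ZooKeeper zk, Runnable leaderWork)
            throws Exception {
        while (true) {
            try {
                zk.create("/election", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                leaderWork.run();   // we won; lead until we exit or crash
            } catch (KeeperException.NodeExistsException e) {
                // Someone else leads; wait for the ephemeral to vanish.
                CountDownLatch changed = new CountDownLatch(1);
                if (zk.exists("/election",
                              event -> changed.countDown()) != null) {
                    changed.await();
                }
                // If exists() returned null, the leader died already; loop.
            }
        }
    }
}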



On Thu, Sep 2, 2010 at 12:55 PM, Eric van Orsouw
wrote:

> Hi there,
>
>
>
> I would like to use zookeeper to implement an election scheme.
>
> There is a recipe on the homepage, but it is relatively complex.
>
> I was wondering what was wrong with the following pseudo code;
>
>
>
> forever {
>
>zookeeper.create -e /election 
>
>if creation succeeded then {
>
>// do the leader thing
>
>} else {
>
>// wait for change in /election using watcher mechanism
>
>}
>
> }
>
>
>
> My assumption is that the recipe is more elaborate to eliminate the
> flood of requests if the leader falls away.
>
> But if there are only a handful of leader-candidates, then that should not
> be a problem.
>
>
>
> Is this correct, or am I missing out on something.
>
>
>
> Thanks,
>
> Eric
>
>
>
>


Re: closing session on socket close vs waiting for timeout

2010-08-31 Thread Ted Dunning
That doesn't sound right to me.

Is there a Zookeeper expert in the house?

On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech] <
camille.fourn...@gs.com> wrote:

> I foolishly did not investigate the ZK code closely enough and it seems
> that closing the socket still waits for the session timeout to remove the
> session.


Re: Exception causing close of session

2010-08-27 Thread Ted Dunning
Patrick,

Can you clarify what reset means?  It doesn't mean just restart, does it?

On Thu, Aug 26, 2010 at 5:05 PM, Patrick Hunt  wrote:

> > Client has seen zxid 0xfa4 our last zxid is 0x42
>
> Someone reset the zk server database without restarting the clients. As a
> result the client is "forward" in time relative to the cluster.
>
> Patrick
>
>
> On 08/26/2010 04:03 PM, Ted Yu wrote:
>
>> Hi,
>> zookeeper-3.2.2 is used out of HBase 0.20.5
>>
>> Linux sjc1-.com 2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64
>> x86_64 x86_64 GNU/Linux
>>
>> In hbase-hadoop-zookeeper-sjc1-cml-grid00.log, I see a lot of the
>> following:
>>
>> 2010-08-26 22:58:01,930 INFO org.apache.zookeeper.server.NIOServerCnxn:
>> closing session:0x0 NIOServerCnxn:
>> java.nio.channels.SocketChannel[connected
>> local=/10.201.9.40:2181 remote=/10.201.9.22:63316]
>> 2010-08-26 22:58:02,097 INFO org.apache.zookeeper.server.NIOServerCnxn:
>> Connected to /10.201.9.22:63317 lastZxid 4004
>> 2010-08-26 22:58:02,097 WARN org.apache.zookeeper.server.NIOServerCnxn:
>> Client has seen zxid 0xfa4 our last zxid is 0x42
>> 2010-08-26 22:58:02,097 WARN org.apache.zookeeper.server.NIOServerCnxn:
>> Exception causing close of session 0x0 due to java.io.IOException: Client
>> has seen zxid 0xfa4 our last zxid is 0x42
>>
>> If you can shed some thought on root cause, that would be great.
>>
>>


Re: What roles do "even" nodes play in the ensemble

2010-08-25 Thread Ted Dunning
Just use 3 nodes.  Life will be better.

You can configure the fourth node and bring it online in the event that one
of the first three fails.  Then you can re-configure and restart each of
the others one at a time.  This gives you flexibility because you have 4
nodes, but doesn't decrease your reliability the way that using a four node
cluster would.  If you need to do maintenance on one node, just configure
that node out as if it had failed.
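A hypothetical zoo.cfg sketch of that arrangement (the hostnames are
invented): three live servers, with the fourth kept in a comment until it is
needed:

    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/zookeeper
    clientPort=2181
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888
    # Cold spare: uncomment (and restart each server in turn) if one of
    # the three above fails or needs maintenance.
    #server.4=zk4.example.com:2888:3888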

On Wed, Aug 25, 2010 at 4:26 PM, Dave Wright  wrote:

> You can certainly serve more reads with a 4th node, but I'm not sure
> what you mean by "it won't have a voting role". It still participates
> in voting for leaders as do all non-observers regardless of whether it
> is an even or odd number. With zookeeper there is no voting on each
> transaction, only leader changes.
>
> -Dave Wright
>
> On Wed, Aug 25, 2010 at 6:22 PM, Todd Nine 
> wrote:
> > Do I get any read performance increase (similar to an observer) since
> > the node will not have a voting role?
> >
> >
>


Re: Non Hadoop scheduling frameworks

2010-08-23 Thread Ted Dunning
These are pretty easy to solve with ZK.  Ephemerality, exclusive create,
atomic update and file versions allow you to implement most of the semantics
you need.

I don't know of any recipes available for this, but they would be worthy
additions to ZK.
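As one example of those semantics, a node could claim a pending job by
exclusively creating an ephemeral znode; if the owner dies before finishing,
the claim evaporates and another node can take over.  A rough sketch (the
/jobs layout and path names are invented for illustration):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    boolean tryClaimJob(ZooKeeper zk, String jobId) throws Exception {
        try {
            // Exclusive create: exactly one node can own the job, and the
            // ephemeral owner znode disappears if our session dies.
            zk.create("/jobs/" + jobId + "/owner", new byte[0],
                      Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            return false;   // someone else already claimed it
        }
    }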

On Mon, Aug 23, 2010 at 11:33 PM, Todd Nine  wrote:

> Solving UC1 and UC2 via zookeeper or some other framework if one is
> recommended.  We don't run Hadoop, just ZK and Cassandra as we don't have a
> need for map/reduce.  I'm searching for any existing framework that can
> perform standard time based scheduling in a distributed environment.  As I
> said earlier, Quartz is the closest model to what we're looking for, but it
> can't be used in a distributed parallel environment.  Any suggestions for a
> system that could accomplish this would be helpful.
>
> Thanks,
> Todd
>
> On 24 August 2010 11:27, Mahadev Konar  wrote:
>
> > Hi Todd,
> >  Just to be clear, are you looking at solving UC1 and UC2 via zookeeper?
> Or
> > is this a broader question for scheduling on cassandra nodes? For the
> latter
> > this probably isn't the right mailing list.
> >
> > Thanks
> > mahadev
> >
> >
> > On 8/23/10 4:02 PM, "Todd Nine"  wrote:
> >
> > Hi all,
> >  We're using Zookeeper for Leader Election and system monitoring.  We're
> > also using it for synchronizing our cluster wide jobs with  barriers.
> >  We're
> > running into an issue where we now have a single job, but each node can
> > fire
> > the job independently of others with different criteria in the job.  In
> the
> > event of a system failure, another node in our application cluster will
> > need
> > to fire this Job.  I've used quartz previously (we're running Java 6),
> but
> > it simply isn't designed for the use case we have.  I found this article
> on
> > cloudera.
> >
> > http://www.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/
> >
> >
> > I've looked at both plugins, but they require hadoop.  We're not
> currently
> > running hadoop, we only have Cassandra.  Here are the 2 basic use cases
> we
> > need to support.
> >
> > UC1: Synchronized Jobs
> > 1. A job is fired across all nodes
> > 2. The nodes wait until the barrier is entered by all participants
> > 3. The nodes process the data and leave
> > 4. On all nodes leaving the barrier, the Leader node marks the job as
> > complete.
> >
> >
> > UC2: Multiple Jobs per Node
> > 1. A Job is scheduled for a future time on a specific node (usually the
> > same
> > node that's creating the trigger)
> > 2. A Trigger can be overwritten and cancelled without the job firing
> > 3. In the event of a node failure, the Leader will take all pending jobs
> > from the failed node, and partition them across the remaining nodes.
> >
> >
> > Any input would be greatly appreciated.
> >
> > Thanks,
> > Todd
> >
> >
>


Re: Parent nodes & multi-step transactions

2010-08-23 Thread Ted Dunning
My own opinion is that lots of these structure sorts of problems are solved
by putting the structure into a single znode.  Atomic creation and update
come for free at that point and we can even make the node ephemeral which we
can't really do if there are children.

It is tempting to use children and grand-children in ZK when this is needed,
but it is surprisingly useful to avoid this.

Take Katta as an example.  This is a sharded query system.  The master
knows about shards that need to be handled by nodes.  Nodes come on-line and
advertise their existence.  The master assigns shards to nodes.  The nodes
download the shards and advertise that they are handling those shards.  The
master has to handle node failures and recoveries.

The natural representation is to have a node signal that it is
handling a particular shard by creating an ephemeral file under a per-shard
directory.  This is nice because node failures cause automagical update of
the data.  The dual is also natural ... we can create shard files under node
directories.  That dual is a serious mistake, however, and it is much better
to put all the dual information in a single node file that the node itself
creates.  This allows ephemerality to maintain a correct view for us.

There are other places where this idea works well.  One such thing is a
queue of tasks.  The queue itself can be represented as several files that
contain lots of tasks instead of keeping each task in a separate file.

This doesn't eliminate all desire for transactions, but it gets rid of LOTs
of them.
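As a concrete sketch of the Katta-style dual described above (names are
invented), the node serializes its whole shard list into one ephemeral znode
instead of maintaining per-shard children:

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    void advertiseShards(ZooKeeper zk, String nodeName, List<String> shards)
            throws Exception {
        // Pack the whole structure into one znode so creation and update
        // stay atomic, and ephemerality cleans up after a dead node.
        StringBuilder sb = new StringBuilder();
        for (String shard : shards) {
            if (sb.length() > 0) sb.append(',');
            sb.append(shard);
        }
        zk.create("/nodes/" + nodeName, sb.toString().getBytes("UTF-8"),
                  Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }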


On Tue, Aug 24, 2010 at 12:31 AM, Dave Wright  wrote:

> For my $0.02, I really think it would be nice if ZK supported
> "lightweight transactions". By that, I simply mean that a batch of
> create/update/delete requests could be submitted in a single request,
> and be processed atomically (if any of the requests would fail, none
> are applied).
> I know transactions have been discussed before and discarded as adding
> too much complexity, but I think a simple version of transactions
> would be extremely helpful. A significant portion of our code is
> cleanup/workarounds for the inability to make several updates
> atomically. Should the time allow for me to work on any single
> feature, that's probably the one I would pick, although I'm concerned
> that there would be resistance to accepting upstream.
>
> -Dave Wright
>
> On Mon, Aug 23, 2010 at 6:51 PM, Gustavo Niemeyer 
> wrote:
> > Hi Mahadev,
> >
> >>  Usually the paradigm I like to suggest is to have something like
> >>
> >> /A/init
> >>
> >> Every client watches for the existence of this node and this node is
> only
> >> created after /A has been initialized with the creation of /A/C or other
> >> stuff.
> >>
> >> Would that work for you?
> >
> > Yeah, this is what I referred to as "liveness nodes" in my prior
> > ramblings, but I'm a bit sad about the amount of boilerplate work that
> > will have to be done to put use something like this.  It feels like as
> > the size of the problem increases, it might become a bit hard to keep
> > the whole picture in mind.
> >
> > Here is a slightly more realistic example (still significantly
> > reduced), to give you an idea of the problem size:
> >
> > /services/wordpress/settings
> > /services/wordpress/units/wordpress-0/agent-connected
> > /services/wordpress/units/wordpress-1
> > /machines/machine-0/agent-connected
> > /machines/machine-0/units/wordpress-1
> > /machines/machine-1/units/wordpress-0
> >
> > There are quite a few dynamic nodes here which are created and
> > initialized on demand.  If we use these liveness nodes, we'll have to
> > not only set watches in several places, but also have some kind of
> > recovering daemon to heal a half-created state, and also filter
> > user-oriented feedback to avoid showing nodes which may be dead.  All
> > of that would be avoided if there was a way to have multi-step atomic
> > actions.  I'm almost pondering about a journal-like system on top of
> > the basic API, to avoid having to deal with this manually.
> >
> > --
> > Gustavo Niemeyer
> > http://niemeyer.net
> > http://niemeyer.net/blog
> > http://niemeyer.net/twitter
> >
>


Re: Session expiration caused by time change

2010-08-20 Thread Ted Dunning
Mocking the time via a utility was my thought.  Mocking the System class
itself is scary.


Sent from my iPhone

On Aug 20, 2010, at 1:18 PM, Benjamin Reed  wrote:

i put up a patch that should address the problem. now i need to  
write a test case. the only way i can think of is to change the call  
to System.currentTimeMillis to a utility class that calls  
System.currentTimeMillis that i can mock for testing. any better  
ideas?


ben

On 08/19/2010 03:53 PM, Ted Dunning wrote:

Put in a four letter command that will put the server to sleep for 15
seconds!

:-)

On Thu, Aug 19, 2010 at 3:51 PM, Benjamin Reed wrote:



i'm updating ZOOKEEPER-366 with this discussion and try to get a  
patch out.

Qing (or anyone else, can you reproduce it pretty easily?)






Re: Session expiration caused by time change

2010-08-19 Thread Ted Dunning
Put in a four letter command that will put the server to sleep for 15
seconds!

:-)

On Thu, Aug 19, 2010 at 3:51 PM, Benjamin Reed  wrote:

> i'm updating ZOOKEEPER-366 with this discussion and try to get a patch out.
> Qing (or anyone else, can you reproduce it pretty easily?)
>


Re: Session expiration caused by time change

2010-08-19 Thread Ted Dunning
Ben's approach is really simpler.  The client already sends keep-alive
messages and we know that
some have gone missing or a time shift has happened.  Those two
possibilities are cleanly distinguished
by Ben's suggestion of comparing current time to the bucket expiration.  If
current time is significantly after
the bucket expiration, we know something strange happened and can reschedule
the next few buckets.

As Ben mentioned, this has a cleanly bounded maximum error and is very, very
simple.  He didn't mention
that it doesn't require any more information than is already known and
doesn't require any machine interaction.

On Thu, Aug 19, 2010 at 3:16 PM, Vishal K  wrote:

>
> On Thu, Aug 19, 2010 at 5:33 PM, Benjamin Reed 
> wrote:
>
> > if we can't rely on the clock, we cannot say things like "if ... for 5
> > seconds".
> >
> >
> "if ... for 5 seconds" indicates the timeout give by the socket library.
> After the timeout we can verify that the timeout received was not a side
> effect of time jump by looking at the number of ping attempts.
>
>
>
> > also, clients connect to servers, not visa-versa, so we cannot say things
> > like "server can attempt to reconnect".
> >
>
> In the scenario described below, wouldn't it be ok for the server to just
> send a ping request to see if the client is really dead?
>


Re: ZK monitoring

2010-08-19 Thread Ted Dunning
It would be nice if it took a list of servers and verified that they all
thought that they were part of the same cluster.
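A rough Java sketch of that check, parsing the 'stat' four-letter command as
Andrei suggests below (host and port come from the server list):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.Socket;

    // Returns the "Mode:" line from a server's 'stat' output
    // (leader / follower / standalone), or "unknown".
    static String serverMode(String host, int port) throws Exception {
        Socket sock = new Socket(host, port);
        try {
            OutputStream out = sock.getOutputStream();
            out.write("stat".getBytes());
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(sock.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("Mode:")) {
                    return line.substring("Mode:".length()).trim();
                }
            }
            return "unknown";
        } finally {
            sock.close();
        }
    }

Exactly one "leader" across the listed servers is necessary but not quite
sufficient evidence of one cluster; comparing zxids from the same output
would tighten the check.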

On Thu, Aug 19, 2010 at 1:46 PM, Patrick Hunt  wrote:

> Maybe we should have a contrib pkg for utilities such as this? I could see
> a python script that, given 1 server (might require addl 4letter words but
> this would be useful regardless), could collect such information from the
> cluster. Create a JIRA?
>
> Patrick
>
> On 08/17/2010 12:14 PM, Andrei Savu wrote:
>
>> It's not possible. You need to query all the servers in order to know
>> who is the current leader.
>>
>> It should be pretty simple to implement this by parsing the output
>> from the 'stat' 4-letter command.
>>
>> On Tue, Aug 17, 2010 at 9:50 PM, Jun Rao  wrote:
>>
>>> Hi,
>>>
>>> Is there a way to see the current leader and a list of followers from a
>>> single node in the ZK quorum? It seems that ZK monitoring (JMX, 4-letter
>>> commands) only provides info local to a node.
>>>
>>> Thanks,
>>>
>>> Jun
>>>
>>>
>>
>>
>> -- Andrei Savu
>>
>


Re: Session expiration caused by time change

2010-08-19 Thread Ted Dunning
Nice (modulo inverting the < in your text).

Option 2 seems very simple.  That always attracts me.
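A minimal sketch of option 2, reusing the field names from Ben's loop below
(expireBucket() is a placeholder for the removal block):

    synchronized void expirationLoop() throws InterruptedException {
        while (running) {
            long currentTime = System.currentTimeMillis();
            if (nextExpirationTime > currentTime) {
                this.wait(nextExpirationTime - currentTime);
                continue;
            }
            // Always pause at least half a tick before expiring a bucket.
            // A forward clock jump then drains the overdue buckets at a
            // bounded rate (sessions expire at worst twice as fast)
            // instead of all at once.
            this.wait(expirationInterval / 2);
            expireBucket(nextExpirationTime);
            nextExpirationTime += expirationInterval;
        }
    }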

On Thu, Aug 19, 2010 at 9:19 AM, Benjamin Reed  wrote:

> yes, you are right. we could do this. it turns out that the expiration code
> is very simple:
>
>while (running) {
>currentTime = System.currentTimeMillis();
>if (nextExpirationTime > currentTime) {
>this.wait(nextExpirationTime - currentTime);
>continue;
>}
>SessionSet set;
>set = sessionSets.remove(nextExpirationTime);
>if (set != null) {
>for (SessionImpl s : set.sessions) {
>sessionsById.remove(s.sessionId); expirer.expire(s);
>}
>}
>nextExpirationTime += expirationInterval;
>}
>
> so we can detect a jump very easily: if nextExpirationTime > currentTime,
> we have jumped ahead in time.
>
> now the question is, what do we do with this information?
>
> option 1) we could figure out the jump (nextExpirationTime-currentTime is a
> good estimate) and move all of the sessions forward by that amount.
> option 2) we could converge on the time by having a policy to always wait
> at least a half a tick time.
>
> there probably are other options as well. i kind of like option 2. worst
> case is it will make the sessions expire in half the time that they should,
> but this shouldn't be too much of a problem since clients send a ping if
> they are idle for 1/3 of their session timeout.
>
> ben
>
>
> On 08/19/2010 08:39 AM, Ted Dunning wrote:
>
>> True.  But it knows that there has been a jump.
>>
>> Quiet time can be distinguished from clock shift by assuming that members
>> of
>> the cluster
>> don't all jump at the same time.
>>
>> I would imagine that a "recent clock jump" estimate could be kept and
>> buckets that would
>> otherwise expire due to such a jump could be given a bit of a second lease
>> on life, delaying
>> all of their expiration.  Since time-outs are relatively short, the server
>> would be able to forget
>> about the bump very shortly.
>>
>> On Thu, Aug 19, 2010 at 8:22 AM, Benjamin Reed
>>  wrote:
>>
>>
>>
>>> if we try to use network messages to detect and correct the situation, it
>>> seems like we would recreate the problem we are having with ntp, since
>>> that
>>> is exactly what it does.
>>>
>>>
>>>
>>
>


Re: Session expiration caused by time change

2010-08-19 Thread Ted Dunning
True.  But it knows that there has been a jump.

Quiet time can be distinguished from clock shift by assuming that members of
the cluster
don't all jump at the same time.

I would imagine that a "recent clock jump" estimate could be kept and
buckets that would
otherwise expire due to such a jump could be given a bit of a second lease
on life, delaying
all of their expiration.  Since time-outs are relatively short, the server
would be able to forget
about the bump very shortly.

On Thu, Aug 19, 2010 at 8:22 AM, Benjamin Reed  wrote:

> if we try to use network messages to detect and correct the situation, it
> seems like we would recreate the problem we are having with ntp, since that
> is exactly what it does.
>


Re: Session expiration caused by time change

2010-08-19 Thread Ted Dunning
a) that only provides monotonic time, not smooth time

b) that is C, the server is Java

Could be hard to get the benefit we need.
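The closest Java analogue is System.nanoTime(), which is monotonic on
modern JVMs (typically backed by CLOCK_MONOTONIC on Linux) but, as noted,
only monotonic rather than smooth, and unrelated to wall-clock time.  A
trivial sketch of measuring an interval with it:

    public class Elapsed {
        public static void main(String[] args) throws InterruptedException {
            long start = System.nanoTime();   // immune to wall-clock jumps
            Thread.sleep(100);
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            System.out.println("elapsed ~" + elapsedMs + " ms");
        }
    }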

On Thu, Aug 19, 2010 at 8:27 AM, Martin Waite  wrote:

> Hi,
>
> I'm not sure if you mean the timers I was on about earlier.  If so,
> http://linux.die.net/man/3/clock_gettime
>
> Sufficiently recent versions of GNU libc and the Linux kernel support the
> following clocks:
>
> ...
> *CLOCK_MONOTONIC* Clock that cannot be set and represents monotonic time
> since some unspecified starting point. Although re-reading that now, I
> might
> have applied wishful thinking to my interpretation.
>
> regards,
> Martin
>
>
> On 19 August 2010 16:13, Benjamin Reed  wrote:
>
> > do you have a pointer to those timers?
> >
> > thanx
> > ben
> >
> >
> > On 08/18/2010 11:58 PM, Martin Waite wrote:
> >
> >>  On Linux, I believe that there is a class of timers
> >> provided that is immune to this, but I doubt that there is a platform
> >> independent way of coping with this.
> >>
> >>
> >
> >
>


Re: Session expiration caused by time change

2010-08-19 Thread Ted Dunning
Another option would be for the cluster to compare times and note when one
member seems to be lagging.  Restoration of that
lag would then be less remarkable.

I believe that the pattern of these problems is a slow slippage behind and a
sudden jump forward.

On Thu, Aug 19, 2010 at 7:51 AM, Vishal K  wrote:

> Hi,
>
> I remember Ben had opened a jira for clock jumps earlier:
> https://issues.apache.org/jira/browse/ZOOKEEPER-366. It is not uncommon to
> have clocks jump forward in virtualized environments.
>
> It is desirable to modify ZooKeeper to handle this situation (as much as
> possible) internally. It would need to be done for both client - server
> connections and server - server connections. One obvious solution is to
> retry a few times (send ping) after getting a timeout. Another way is to
> count the number of pings that have been sent after receiving the timeout.
> If the number of pings does not match the expected number (say 5 ping
> attempts should have finished for a 5 sec timeout), then wait till all the pings are
> finished. In effect do not completely rely on the clock. Any comments?
>
> -Vishal
>
> On Thu, Aug 19, 2010 at 3:52 AM, Qing Yan  wrote:
>
> > Oh.. our servers are also running in a virtualized environment.
> >
> > On Thu, Aug 19, 2010 at 2:58 PM, Martin Waite 
> wrote:
> >
> > > Hi,
> > >
> > > I have tripped over similar problems testing Red Hat Cluster in
> > virtualised
> > > environments.  I don't know whether recent linux kernels have improved
> > > their
> > > interaction with VMWare, but in our environments clock drift caused by
> > lost
> > > ticks can be substantial, requiring NTP to sometimes jump the clock
> > rather
> > > than control acceleration.   In one of our internal production rigs,
> the
> > > local NTP servers themselves were virtualised - causing absolute mayhem
> > > when
> > > heavy loads hit the other guests on the same physical hosts.
> > >
> > > The effect on RHCS (v2.0) is quite dramatic.  A forward jump in time by
> > 10
> > > seconds always causes a member to prematurely time-out on a network
> read,
> > > causing the member to drop out and trigger a cluster reconfiguration.
> > > Apparently NTP is integrated with RHCS version 3, but I don't know what
> > is
> > > meant by that.
> > >
> > > I guess this post is not entirely relevant to ZK, but I am just making
> > the
> > > point that virtualisation (of NTP servers and or clients) can cause
> > > repeated
> > > premature timeouts.  On Linux, I believe that there is a class of
> timers
> > > provided that is immune to this, but I doubt that there is a platform
> > > independent way of coping with this.
> > >
> > > My 2p.
> > >
> > > regards,
> > > Martin
> > >
> > > On 18 August 2010 16:53, Patrick Hunt  wrote:
> > >
> > > > Do you expect the time to be "wrong" frequently? If ntp is running it
> > > > should never get out of sync more than a small amount. As long as
> this
> > is
> > > > less than ~your timeout you should be fine.
> > > >
> > > > Patrick
> > > >
> > > >
> > > > On 08/18/2010 01:04 AM, Qing Yan wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >>The testcase is fairly simple. We have a client which connects to
> > ZK,
> > > >> registers an ephemeral node and watches on it. Now change the client
> > > >> machine's time - session killed..
> > > >>
> > > >>Here is the log:
> > > >>
> > > >> *2010-08-18 04:24:57,782 INFO
> > > >> com.taobao.timetunnel2.cluster.service.AgentService: Host name
> > > >> kgbtest1.corp.alimama.com
> > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > > >> environment:zookeeper.version=3.2.2-888565, built on 12/08/2009
> 21:51
> > > GMT
> > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > > >> environment:host.name=kgbtest1.corp.alimama.com
> > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > > >> environment:java.version=1.6.0_13
> > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > > >> environment:java.vendor=Sun Microsystems Inc.
> > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > > >> environment:java.home=/usr/java/jdk1.6.0_13/jre
> > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > > >>
> > > >>
> > >
> >
> environment:java.class.path=/home/admin/TimeTunnel2/cluster/bin/../conf/agent/:/home/admin/TimeTunnel2/cluster/bin/../lib/slf4j-log4j12-1.5.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/slf4j-api-1.5.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/timetunnel2-cluster-0.0.1-SNAPSHOT.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/zookeeper-3.2.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/log4j-1.2.14.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/gson-1.4.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/zk-recipes.jar
> > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > > >>
> > > >>
> > >
> >
> environment:java.library.path=/usr/java/jdk1.6.0_13/jre/lib/amd64/s

Re: Zookeeper stops

2010-08-19 Thread Ted Dunning
Also, /tmp is not a great place to keep things that are intended for
persistence.

On Thu, Aug 19, 2010 at 7:34 AM, Mahadev Konar wrote:

> Hi Wim,
>  It mostly looks like that zookeeper is not able to create files on the
> /tmp filesystem. Is there is a space shortage or is it possible the file is
> being deleted as its being written to?
>
> Sometimes admins have a crontab on /tmp that cleans up the /tmp filesystem.
>
> Thanks
> mahadev
>
>
> On 8/19/10 1:15 AM, "Wim Jongman"  wrote:
>
> Hi,
>
> I have a zookeeper server running that can sometimes run for days and then
> quits:
>
> Is there somebody with a clue to the problem?
>
> I am running 64 bit Ubuntu with
>
> java version "1.6.0_18"
> OpenJDK Runtime Environment (IcedTea6 1.8) (6b18-1.8-0ubuntu1)
> OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
>
> Zookeeper 3.3.0
>
> The log below has some context before it shows the fatal error. Our
> component.id=40676 indicates that it is the 40676th time that I ask ZK to
> publish this information. It has been seen to go up to half a million
> before
> stopping.
>
> Regards,
>
> Wim
>
> ZooDiscovery> Service Unpublished: Aug 18, 2010 11:17:28 PM.
> ServiceInfo[uri=osgiservices://
>
> 188.40.116.87:3282/svc_19q0FmlQF0wEwjSl6SpUTJRlV5g=;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=osgiservices://188.40.116.87:3282/svc_19q0FmlQF0wEwjSl6SpUTJRlV5g=;full=_osgiservices._tcp.default._i...@osgiservices://188.40.116.87:3282/svc_19q0FmlQF0wEwjSl6SpUTJRlV5g=];priority=0;weight=0;props=ServiceProperties[{ecf.rsvc.ns=ecf.namespace.generic.remoteservice
> ,
>
> osgi.remote.service.interfaces=org.eclipse.ecf.services.quotes.QuoteService,
> ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, ecf.rsvc.id
> =org.eclipse.ecf.discovery.serviceproperties$bytearraywrap...@68a1e081,
> component.name=Star Wars Quotes Service, ecf.sp.ect=ecf.generic.server,
> component.id=40676,
>
> ecf.sp.cid=org.eclipse.ecf.discovery.serviceproperties$bytearraywrap...@5b9a6ad1
> }]]
> ZooDiscovery> Service Published: Aug 18, 2010 11:17:29 PM.
> ServiceInfo[uri=osgiservices://
>
> 188.40.116.87:3282/svc_u2GpWmF3YKSlTauWcwOMsDgiBxs=;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=osgiservices://188.40.116.87:3282/svc_u2GpWmF3YKSlTauWcwOMsDgiBxs=;full=_osgiservices._tcp.default._i...@osgiservices://188.40.116.87:3282/svc_u2GpWmF3YKSlTauWcwOMsDgiBxs=];priority=0;weight=0;props=ServiceProperties[{ecf.rsvc.ns=ecf.namespace.generic.remoteservice
> ,
>
> osgi.remote.service.interfaces=org.eclipse.ecf.services.quotes.QuoteService,
> ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, ecf.rsvc.id
> =org.eclipse.ecf.discovery.serviceproperties$bytearraywrap...@71bfa0a4,
> component.name=Eclipse Twitter, ecf.sp.ect=ecf.generic.server,
> component.id=40677,
>
> ecf.sp.cid=org.eclipse.ecf.discovery.serviceproperties$bytearraywrap...@5bcba953
> }]]
> [log;+0200 2010.08.18
>
> 23:17:29:545;INFO;org.eclipse.ecf.remoteservice;org.eclipse.core.runtime.Status[plugin=org.eclipse.ecf.remoteservice;code=0;message=No
> async remote service interface found with
> name=org.eclipse.ecf.services.quotes.QuoteServiceAsync for proxy service
>
> class=org.eclipse.ecf.services.quotes.QuoteService;severity2;exception=null;children=[]]]
> 2010-08-18 23:17:37,057 - FATAL [Snapshot Thread:zookeeperser...@262] -
> Severe unrecoverable error, exiting
> java.io.FileNotFoundException: /tmp/zookeeperData/version-2/snapshot.13e2e
> (No such file or directory)
>at java.io.FileOutputStream.open(Native Method)
>at java.io.FileOutputStream.<init>(FileOutputStream.java:209)
>at java.io.FileOutputStream.<init>(FileOutputStream.java:160)
>at
>
> org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:224)
>at
>
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:211)
>at
>
> org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:260)
>at
>
> org.apache.zookeeper.server.SyncRequestProcessor$1.run(SyncRequestProcessor.java:120)
> ZooDiscovery> Service Unpublished: Aug 18, 2010 11:17:37 PM.
> ServiceInfo[uri=osgiservices://
>
> 188.40.116.87:3282/svc_u2GpWmF3YKSlTauWcwOMsDgiBxs=;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=osgiservices://188.40.116.87:3282/svc_u2GpWmF3YKSlTauWcwOMsDgiBxs=;full=_osgiservices._tcp.default._i...@osgiservices://188.40.116.87:3282/svc_u2GpWmF3YKSlTauWcwOMsDgiBxs=];priority=0;weight=0;props=ServiceProperties[{ecf.rsvc.ns=ecf.namespace.generic.remoteservice
> ,
>
> osgi.remote.service.interfaces=org.eclipse.ecf.services.quotes.QuoteService,
> ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, ecf.rsvc.id
> =org.eclipse.ecf.discovery.serviceproperties$bytearraywrap...@71bfa0a4,
> component.name=Eclipse Twitter, ecf.sp.ect=ecf.generic.server,
> component.id=40677,
>
> ecf.sp.cid=org.eclipse.ecf.discovery.serviceproperties$bytearraywrap...@5bcba953

Re: Session expiration caused by time change

2010-08-19 Thread Ted Dunning
You can always increase your timeouts a bit.

On Thu, Aug 19, 2010 at 12:52 AM, Qing Yan  wrote:

> Oh.. our servers are also running in a virtualized environment.
>
> On Thu, Aug 19, 2010 at 2:58 PM, Martin Waite  wrote:
>
> > Hi,
> >
> > I have tripped over similar problems testing Red Hat Cluster in
> virtualised
> > environments.  I don't know whether recent linux kernels have improved
> > their
> > interaction with VMWare, but in our environments clock drift caused by
> lost
> > ticks can be substantial, requiring NTP to sometimes jump the clock
> rather
> > than control acceleration.   In one of our internal production rigs, the
> > local NTP servers themselves were virtualised - causing absolute mayhem
> > when
> > heavy loads hit the other guests on the same physical hosts.
> >
> > The effect on RHCS (v2.0) is quite dramatic.  A forward jump in time by
> 10
> > seconds always causes a member to prematurely time-out on a network read,
> > causing the member to drop out and trigger a cluster reconfiguration.
> > Apparently NTP is integrated with RHCS version 3, but I don't know what
> is
> > meant by that.
> >
> > I guess this post is not entirely relevant to ZK, but I am just making
> the
> > point that virtualisation (of NTP servers and or clients) can cause
> > repeated
> > premature timeouts.  On Linux, I believe that there is a class of timers
> > provided that is immune to this, but I doubt that there is a platform
> > independent way of coping with this.
> >
> > My 2p.
> >
> > regards,
> > Martin
> >
> > On 18 August 2010 16:53, Patrick Hunt  wrote:
> >
> > > Do you expect the time to be "wrong" frequently? If ntp is running it
> > > should never get out of sync more than a small amount. As long as this
> is
> > > less than ~your timeout you should be fine.
> > >
> > > Patrick
> > >
> > >
> > > On 08/18/2010 01:04 AM, Qing Yan wrote:
> > >
> > >> Hi,
> > >>
> > >>The testcase is fairly simple. We have a client which connects to
> ZK,
> > >> registers an ephemeral node and watches on it. Now change the client
> > >> machine's time - session killed..
> > >>
> > >>Here is the log:
> > >>
> > >> *2010-08-18 04:24:57,782 INFO
> > >> com.taobao.timetunnel2.cluster.service.AgentService: Host name
> > >> kgbtest1.corp.alimama.com
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51
> > GMT
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:host.name=kgbtest1.corp.alimama.com
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:java.version=1.6.0_13
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:java.vendor=Sun Microsystems Inc.
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:java.home=/usr/java/jdk1.6.0_13/jre
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >>
> > >>
> >
> environment:java.class.path=/home/admin/TimeTunnel2/cluster/bin/../conf/agent/:/home/admin/TimeTunnel2/cluster/bin/../lib/slf4j-log4j12-1.5.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/slf4j-api-1.5.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/timetunnel2-cluster-0.0.1-SNAPSHOT.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/zookeeper-3.2.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/log4j-1.2.14.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/gson-1.4.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/zk-recipes.jar
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >>
> > >>
> >
> environment:java.library.path=/usr/java/jdk1.6.0_13/jre/lib/amd64/server:/usr/java/jdk1.6.0_13/jre/lib/amd64:/usr/java/jdk1.6.0_13/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:java.io.tmpdir=/tmp
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:java.compiler=
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:os.name=Linux
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:os.arch=amd64
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:os.version=2.6.18-164.el5
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:user.name=admin
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:user.home=/home/admin
> > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:user.dir=/home/admin/TimeTunnel2/cluster/log
> > >> 2010-08-18 04:24:57,790 INFO org.apache.zookeeper.ZooKeeper:
> Initiating
> > >> client connection, connectString=xentest10-vm5.corp.alimama.com:2181,
> > >> xentest10-vm6.corp.alimama.com:2181,
> xentest10-vm9.c

Re: Session expiration caused by time change

2010-08-18 Thread Ted Dunning
If NTP is changing your time by more than a few milliseconds then you have
other problems (big ones).

On Wed, Aug 18, 2010 at 1:04 AM, Qing Yan  wrote:

> I guess ZK might rely on timestamp to  keep sessions alive, but we have
> NTP daemon running so machine time can get changed
> automatically, is there a conflict?
>


Re: Weird ephemeral node issue

2010-08-17 Thread Ted Dunning
Uncharacteristically, I think that Ben's comments could use a little bit of
amplification.

First, ZK is designed with certain guarantees in mind and almost all
operational characteristics flow logically from these guarantees.

The guarantee that Ben mentioned here in passing is that if a client gets
session expiration, it is *guaranteed* that the ephemerals have been cleaned
up.  This guarantee is what drives the notification of session expiration
after reconnection since while the client is disconnected, it cannot know if
the cluster is operating correctly or not and thus cannot know if the
ephemerals have been cleaned up yet.  The only way to have certain knowledge
that the cluster has cleaned up the ephemerals is to get back in touch with
an operating cluster.

The client is not completely in the dark.  As Ben implied, it can know that
the cluster is unavailable (it got a ConnectionLoss event, after all).
 While the cluster is unavailable and before it gets a session expiration
notification, the client can go into safe mode.

The moral of this story is that to get the most out of ZK, it is best to
adopt the same guarantee based design process that drove ZK in the first
place.  The first step is that you have to decide what guarantees that you
want to provide and then work from ZK's guarantees to get to yours.

In the classic leader-election use of ZK, the key guarantee that we want is:

- the number of leaders is less than or equal to 1

Note that you can't guarantee that the number == 1, because other stuff
could happen.  This has nothing to do with ZK.

The pertinent ZK guarantees are:

- an ephemeral file can only be created by a single session

- deletion of an ephemeral file due to loss of client connection will occur
after the client gets a connection loss

- deletion of an ephemeral file will precede delivery of a session
expiration event to the owner

Phrased in terms of CSP-like constructs, the client has events BecomeMaster,
EnterSafeMode, ExitSafeMode, RelinquishMaster and Crash that must occur
according to this grammar:

client := (
    (BecomeMaster; (EnterSafeMode; ExitSafeMode)*; EnterSafeMode?; RelinquishMaster)
  | (BecomeMaster; (EnterSafeMode; ExitSafeMode)*; EnterSafeMode?; Crash)
  | Crash
  )*

To get the guarantees that we want, we can require the client to only do
BecomeMaster after it creates an ephemeral file and require it to either
Crash, RelinquishMaster or EnterSafeMode before that ephemeral file is
deleted.  The only way that we can do that is to immediately do
EnterSafeMode on connection loss and then do RelinquishMaster on session
expiration or ExitSafeMode on connection restored.  It is involved, but you
can actually do a proof of correctness from this that shows that your
guarantee will be honored even in the presence of ZK or the client crashing
or being partitioned.
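A skeletal sketch of that discipline in a client Watcher (the safe-mode
methods are placeholders for application-specific behavior):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;

    public class MasterWatcher implements Watcher {
        public void process(WatchedEvent event) {
            switch (event.getState()) {
            case Disconnected:  // limbo: we may or may not still be master
                enterSafeMode();
                break;
            case SyncConnected: // session survived the outage
                exitSafeMode();
                break;
            case Expired:       // ephemerals are already cleaned up
                relinquishMaster();
                break;
            default:
                break;
            }
        }
        private void enterSafeMode() { /* e.g. serve reads, reject writes */ }
        private void exitSafeMode() { /* resume normal operation */ }
        private void relinquishMaster() { /* stop leading; rejoin election */ }
    }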



On Tue, Aug 17, 2010 at 9:26 AM, Benjamin Reed  wrote:

> there are two things to keep in mind when thinking about this issue:
>
> 1) if a zk client is disconnected from the cluster, the client is
> essentially in limbo. because the client cannot talk to a server it cannot
> know if its session is still alive. it also cannot close its session.
>
> 2) the client only finds out about session expiration events when the
> client reconnects to the cluster. if zk tells a client that its session is
> expired, the ephemerals that correspond to that session will already be
> cleaned up.
>
> one of the main design points about zk is that zk only gives correct
> information. if zk cannot give correct information, it basically says "i
> don't know". connection loss exceptions and disconnected states are
> basically "i don't know".
>
> generally applications we design go into a "safe" mode, meaning they may
> serve reads but reject changes, when disconnected from zk and only kill
> themselves when they find out their session has expired.
>
> ben
>
> ps - session information is replicated to all zk servers, so if a leader
> dies, all replicas know the sessions that are currently active and their
> timeouts.
>
> On 08/16/2010 09:03 PM, Ted Dunning wrote:
>
>> Ben or somebody else will have to repeat some of the detailed logic for
>> this, but it has
>> to do with the fact that you can't be sure what has happened during the
>> network partition.
>> One possibility is the one you describe, but another is that the partition
>> happened because
>> a majority of the ZK cluster lost power and you can't see the remaining
>> nodes.  Those nodes
>> will continue to serve any files in a read-only fashion.  If the partition
>> involves you losing
>> contact with the entire cluster at the same time a partition of the
>> cluster
>> into a quorum and
>> a minority happens, then your ephemeral files

Re: A question about Watcher

2010-08-16 Thread Ted Dunning
Almost never.  There was a bug a while back that could have conceivably
caused that under rare circumstances, but I don't know of any current
mechanism for this lossage that you are asking about.

On Mon, Aug 16, 2010 at 6:34 PM, Qian Ye  wrote:

> My question is, if the master failed, does that means some session
> information will definitely be lost?
>


Re: Weird ephemeral node issue

2010-08-16 Thread Ted Dunning
Ben or somebody else will have to repeat some of the detailed logic for
this, but it has
to do with the fact that you can't be sure what has happened during the
network partition.
One possibility is the one you describe, but another is that the partition
happened because
a majority of the ZK cluster lost power and you can't see the remaining
nodes.  Those nodes
will continue to serve any files in a read-only fashion.  If the partition
involves you losing
contact with the entire cluster at the same time a partition of the cluster
into a quorum and
a minority happens, then your ephemeral files could continue to exist at
least until the breach
in the cluster itself is healed.

Suffice it to say that there are only a few strategies that leave you with a
coherent picture
of the universe.  Importantly, you shouldn't assume that the ephemerals will
disappear at
the same time as the session expiration event is delivered.

On Mon, Aug 16, 2010 at 8:31 PM, Qing Yan  wrote:

> Ouch, is this the current ZK behavior? This is unexpected, if the
> client get partitioned from ZK cluster, he should
> get notified and take some action(e.g. commit suicide) otherwise how
> to tell a ephemeral node is really
> up or down? Zombie can create synchronization nightmares..
>
>
>
> On Mon, Aug 16, 2010 at 7:22 PM, Dave Wright  wrote:
> > Another possible cause for this that I ran into recently with the c
> client -
> > you don't get the session expired notification until you are reconnected
> to
> > the quorum and it informs you the session is lost.  If you get
> disconnected
> > and can't reconnect you won't get the notification.  Personally I think
> the
> > client api should track the session expiration time locally and
> information
> > you once it's expired.
> >
> > On Aug 16, 2010 2:09 AM, "Qing Yan"  wrote:
> >
> > Hi Ted,
> >
> >  Do you mean GC problem can prevent delivery of SESSION EXPIRE event?
> > Hum...so you have met this problem before?
> > I didn't see any OOM though, will look into it more.
> >
> >
> > On Mon, Aug 16, 2010 at 12:46 PM, Ted Dunning 
> wrote:
> >> I am assuming that y...
> >
>


Re: A question about Watcher

2010-08-16 Thread Ted Dunning
I should correct this.  The watchers will deliver a session expiration
event, but since the connection is closed at that point no further
events will be delivered and the cluster will remove them.  This is as good
as the watchers disappearing.

On Mon, Aug 16, 2010 at 9:20 AM, Ted Dunning  wrote:

> The other is session expiration.  Watchers do not survive this.  This
> happens when a client does not provide timely
> evidence that it is alive and is marked as having disappeared by the
> cluster.
>


Re: A question about Watcher

2010-08-16 Thread Ted Dunning
There are two different concepts.  One is connection loss.  Watchers survive
this and the client automatically connects
to another member of the ZK cluster.

The other is session expiration.  Watchers do not survive this.  This
happens when a client does not provide timely
evidence that it is alive and is marked as having disappeared by the
cluster.

On Mon, Aug 16, 2010 at 9:04 AM, Qian Ye  wrote:

> Hi all:
>
> Will the watchers of a client be losed when the client disconnects from a
> Zookeeper server? It is said at
>
> http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkWatchesthat
> "
> *When a client reconnects, any previously registered watches will be
> reregistered and triggered if needed. In general this all occurs
> transparently.*" It means that we need not to do any extra things about
> watchers if a client disconnected from Zookeeper server A, and reconnect to
> Zookeeper server B, doesn't it? Or I should reregistered all the watchers
> if
> this kind of reconnection happened?
>
> thx~
> --
> With Regards!
>
> Ye, Qian
>


Re: Weird ephemeral node issue

2010-08-16 Thread Ted Dunning
No.  I meant that GC can cause your client to appear to be unresponsive
until the session expires.

Can you post some ZK server logs?  And some client GC logs?

On Sun, Aug 15, 2010 at 11:08 PM, Qing Yan  wrote:

> Hi Ted,
>
>  Do you mean GC problem can prevent delivery of SESSION EXPIRE event?
> Hum...so you have met this problem before?
> I didn't see any OOM though, will look into it more.
>
> On Mon, Aug 16, 2010 at 12:46 PM, Ted Dunning 
> wrote:
> > I am assuming that you are using ZK from java.
> >
> > Very likely you are having GC problems.
> >
> > Turn on verbose GC logging and see what is happening.  You may also want
> to
> > change the session timeout values.
> >
> > It is very common for the use of ZK to highlight problems that you didn't
> > know that you had.
> >
> > On Sun, Aug 15, 2010 at 8:51 PM, Qing Yan  wrote:
> >
> >> We started using ZK in production recently and ran into some problems.
> >> The use case is simple: we have a central
> >> monitor that checks the ephemeral nodes created by distributed apps; if
> >> a node disappears, the corresponding app will get restarted. Each app
> >> will also handle SESSION_EXPIRE by shutting itself down...
> >>
> >> What's happening now is that sometimes the central monitor will try to
> >> restart the app while the app runs fine and sees no sign
> >> of SESSION_EXPIRED. Any clue what's going on here?
> >>
> >> Thanks
> >>
> >
>


Re: Weird ephemeral node issue

2010-08-15 Thread Ted Dunning
I am assuming that you are using ZK from java.

Very likely you are having GC problems.

Turn on verbose GC logging and see what is happening.  You may also want to
change the session timeout values.

It is very common for the use of ZK to highlight problems that you didn't
know that you had.
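For example, something like this (standard HotSpot flags; the log path and
class name are arbitrary):

    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -Xloggc:/var/log/zk-client-gc.log MyApp

Pauses that approach the session timeout will show up as long collections in
that log.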

On Sun, Aug 15, 2010 at 8:51 PM, Qing Yan  wrote:

> We started using ZK in production recently and ran into some problems.
> The use case is simple: we have a central
> monitor that checks the ephemeral nodes created by distributed apps; if
> a node disappears, the corresponding app will get restarted. Each app
> will also handle SESSION_EXPIRE by shutting itself down...
>
> What's happening now is that sometimes the central monitor will try to
> restart the app while the app runs fine and sees no sign
> of SESSION_EXPIRED. Any clue what's going on here?
>
> Thanks
>


Re: How to handle "Node does not exist" error?

2010-08-12 Thread Ted Dunning
On Thu, Aug 12, 2010 at 4:57 PM, Dr Hao He  wrote:

> hi, Ted,
>
> I am a little bit confused here.  So, is the node inconsistency problem
> that Vishal and I have seen here most likely caused by configurations or
> embedding?
>
> If it is the former, I'd appreciate if you can point out where those silly
> mistakes have been made and the correct way to embed ZK.
>

I think it is likely due to misconfiguration, but I don't know what the
issue is exactly.  I think that another poster suggested that you ape the
normal ZK startup process more closely.  That sounds good but it may be
incompatible with your goals of integrating all configuration into a single
XML file and not using the normal ZK configuration process.

Your thought about forking ZK is a good one since there are calls to
System.exit() that could wreak havoc.



> Although I agree with your comments about the architectural issues that
> embedding may lead to and we are aware of those,  I do not agree that
> embedding will always lead to those issues.


I agree that embedding won't always lead to those issues and your
application is a reasonable counter-example.  As is common, I think that the
exception proves the rule since your system is really just another way to
launch an independent ZK cluster rather than an example of ZK being embedded
into an application.


Re: How to handle "Node does not exist" error?

2010-08-12 Thread Ted Dunning
I am not saying that the API shouldn't support embedded ZK.

I am just saying that it is almost always a bad idea.  It isn't that I am
asking you to not do it, it is just that I am describing the experience I
have had and that I have seen others have.  In a nutshell, embedding leads
to problems and it isn't hard to see why.

On Thu, Aug 12, 2010 at 3:02 PM, Vishal K  wrote:

> 2. With respect to Ted's point about backward compatibility, I would
> suggest
> to take an approach of having an API to support embedded ZK instead of
> asking users to not embed ZK.
>


Re: How to handle "Node does not exist" error?

2010-08-12 Thread Ted Dunning
It doesn't.

But running a ZK cluster that is incorrectly configured can cause this
problem and configuring ZK using setters is likely to be subject to changes
in what configuration is needed.  Thus, your style of code is more subject
to decay over time than is nice.

The rest of my comments detail *other* reasons why embedding a coordination
layer in the code being coordinated is a bad idea.

On Thu, Aug 12, 2010 at 6:33 AM, Vishal K  wrote:

> Hi Ted,
>
> Can you explain why running ZK in embedded mode can cause znode
> inconsistencies?
> Thanks.
>
> -Vishal
>
> On Thu, Aug 12, 2010 at 12:01 AM, Ted Dunning 
> wrote:
>
> > Try running the server in non-embedded mode.
> >
> > Also, you are assuming that you know everything about how to configure
> the
> > quorumPeer.  That is going to change and your code will break at that
> time.
> >  If you use a non-embedded cluster, this won't be a problem and you will
> be
> > able to upgrade ZK version without having to restart your service.
> >
> > My own opinion is that running an embedded ZK is a serious architectural
> > error.  Since I don't know your particular situation, it might be
> > different,
> > but there is an inherent contradiction involved in running a coordination
> > layer as part of the thing being coordinated.  Whatever your software
> does,
> > it isn't what ZK does.  As such, it is better to factor out the ZK
> > functionality and make it completely stable.  That gives you a much
> simpler
> > world and will make it easier for you to trouble shoot your system.  The
> > simple fact that you can't take down your service without affecting the
> > reliability of your ZK layer makes this a very bad idea.
> >
> > The problems you are having now are only a preview of what this
> > architectural error leads to.  There will be more problems and many of
> them
> > are likely to be more subtle and lead to service interruptions and lots
> of
> > wasted time.
> >
> > On Wed, Aug 11, 2010 at 8:49 PM, Dr Hao He  wrote:
> >
> > > hi, Ted and Mahadev,
> > >
> > >
> > > Here are some more details about my setup:
> > >
> > > I run zookeeper in the embedded mode with the following code:
> > >
> > >quorumPeer = new QuorumPeer();
> > >
> > >  quorumPeer.setClientPort(getClientPort());
> > >quorumPeer.setTxnFactory(new
> > > FileTxnSnapLog(new File(getDataLogDir()), new File(getDataDir())));
> > >
> > >  quorumPeer.setQuorumPeers(getServers());
> > >
> > >  quorumPeer.setElectionType(getElectionAlg());
> > >
> > >  quorumPeer.setMyid(getServerId());
> > >
> > >  quorumPeer.setTickTime(getTickTime());
> > >
> > >  quorumPeer.setInitLimit(getInitLimit());
> > >
> > >  quorumPeer.setSyncLimit(getSyncLimit());
> > >
> > >  quorumPeer.setQuorumVerifier(getQuorumVerifier());
> > >
> > >  quorumPeer.setCnxnFactory(cnxnFactory);
> > >quorumPeer.start();
> > >
> > >
> > > The configuration values are read from the following XML document for
> > > server 1:
> > >
> > >  > > serverId="1">
> > >  
> > >  
> > >  
> > > 
> > >
> > >
> > > The other servers have the same configurations except their ids being
> > > changed to 2 and 3.
> > >
> > > The error occurred on server 3 when I batch loaded some messages to
> > server
> > > 1.  However, this error does not always happen.  I am not sure exactly
> > what
> > > trigged this error yet.
> > >
> > > I also performed the "stat" operation on one of the "No exit" node and
> > got:
> > >
> > > stat
> > > /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg001583
> > > Exception in thread "main" java.lang.NullPointerException
> > >at
> > > org.apache.zookeeper.ZooKeeperMain.printStat(ZooKeeperMain.java:129)
> > >at
> > > org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:715)
> > >at
> > > org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:579)
> > >at
> > > org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:351)
> > >at
> org.apache.zook

Re: How to handle "Node does not exist" error?

2010-08-11 Thread Ted Dunning
0002935, msg002933, msg002140,
> msg001937,
> >> msg002143, msg002520, msg002522, msg002429,
> msg002524,
> >> msg002920, msg002035, msg0000001561, msg002134,
> msg002138,
> >> msg002925, msg002151, msg002287, msg002555,
> msg002010,
> >> msg002002, msg002290, msg001537, msg002005,
> msg002147,
> >> msg002145, msg002698, msg001592, msg001810,
> msg002690,
> >> msg002691, msg001911, msg001910, msg002693,
> msg001812,
> >> msg001817, msg001547, msg002012, msg002015,
> msg002941,
> >> msg001688, msg002018, msg002684, msg002944,
> msg001540,
> >> msg002686, msg001541, msg002946, msg002688,
> msg001584,
> >> msg002948]
> >>
> >> [zk: localhost:2181(CONNECTED) 7] delete
> >> /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg002948
> >> Node does not exist:
> >> /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg002948
> >>
> >> When I performed the same operations on another node, none of those
> nodes
> >> existed.
> >>
> >>
> >> Dr Hao He
> >>
> >> XPE - the truly SOA platform
> >>
> >> h...@softtouchit.com
> >> http://softtouchit.com
> >> http://itunes.com/apps/Scanmobile
> >>
> >> On 11/08/2010, at 4:38 PM, Ted Dunning wrote:
> >>
> >>> Can you provide some more information?  The output of some of the four
> >>> letter commands and a transcript of what you are doing would be very
> >>> helpful.
> >>>
> >>> Also, there is no way for znodes to exist on one node of a properly
> >>> operating ZK cluster and not on either of the other two.  Something has
> to
> >>> be wrong and I would vote for operator error (not to cast aspersions,
> it is
> >>> just that humans like you and *me* make more errors than ZK does).
> >>>
> >>> On Tue, Aug 10, 2010 at 11:32 PM, Dr Hao He 
> wrote:
> >>>
> >>>> hi, All,
> >>>>
> >>>> I have a 3-host cluster running ZooKeeper 3.2.2.  On one of the hosts,
> >>>> there are a number of nodes that I can "get" and "ls" using zkCli.sh .
> >>>> However, when I tried to "delete" any of them, I got "Node does not
> exist"
> >>>> error.Those nodes do not exist on the other two hosts.
> >>>>
> >>>> Any idea how we should handle this type of errors and what might have
> >>>> caused this problem?
> >>>>
> >>>> Dr Hao He
> >>>>
> >>>> XPE - the truly SOA platform
> >>>>
> >>>> h...@softtouchit.com
> >>>> http://softtouchit.com
> >>>> http://itunes.com/apps/Scanmobile
> >>>>
> >>>>
> >>
> >>
> >
> >
>
>


Re: Sequence Number Generation With Zookeeper

2010-08-11 Thread Ted Dunning
Can't happen.

In a network partition, the side without a quorum can't update the file
version.
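The version-based scheme under discussion, as a minimal sketch (the counter
path is arbitrary): every successful setData bumps the znode version by
exactly one, and the update only commits once a quorum acknowledges it, so a
partitioned minority can never hand out a number.

    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Returns the next sequence number, derived from the znode version.
    static int nextSequence(ZooKeeper zk, String counterPath)
            throws Exception {
        // Version -1 means "any version": we only want the atomic bump.
        Stat stat = zk.setData(counterPath, new byte[0], -1);
        return stat.getVersion();
    }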

On Wed, Aug 11, 2010 at 3:11 PM, Adam Rosien  wrote:

> What happens during a network partition and different clients are
> incrementing "different" counters, and then the partition goes away?
> Won't (potentially) the same sequence value be given out to two
> clients?
>
> .. Adam
>
> On Thu, Aug 5, 2010 at 5:38 PM, Jonathan Holloway
>  wrote:
> > Hi Ted,
> >
> > Thanks for the comments.
> >
> > I might have overlooked something here, but is it also possible to do the
> > following:
> >
> > 1. Create a PERSISTENT node
> > 2. Have multiple clients set the data on the node, e.g.  Stat stat =
> > zookeeper.setData(SEQUENCE, ArrayUtils.EMPTY_BYTE_ARRAY, -1);
> > 3. Use the version number from stat.getVersion() as the sequence
> (obviously
> > I'm limited to Integer.MAX_VALUE)
> >
> > Are there any weird race conditions involved here which would mean that a
> > client would receive the wrong Stat object back?
> >
> > Many thanks again,
> > Jon.
> >
> > On 5 August 2010 16:09, Ted Dunning  wrote:
> >
> >> (b)
> >>
> >> BUT:
> >>
> >> Sequential numbering is a special case of "now".  In large diameters,
> now
> >> gets very expensive.  This is a special case of that assertion.  If
> there
> >> is
> >> a way to get away from this presumption of the need for sequential
> >> numbering, you will be miles better off.
> >>
> >> HOWEVER:
> >>
> >> ZK can do better than you suggest.  Incrementing a counter does involve
> >> potential contention, but you will very likely be able to get to pretty
> >> high
> >> rates before the optimistic locking begins to fail.  If you code your
> >> update
> >> with a few tries at full speed followed by some form of retry back-off,
> you
> >> should get pretty close to the best possible performance.
> >>
> >> You might also try building a lock with an ephemeral file before
> updating
> >> the counter.  I would expect that this will be slower than the back-off
> >> option if only because involves more transactions in ZK.  IF you wanted
> to
> >> get too complicated for your own good, you could have a secondary
> strategy
> >> flag that is only sampled by all clients every few seconds and is
> updated
> >> whenever a client needs to back-off more than say 5 steps.  If this flag
> >> has
> >> been updated recently, then clients should switch to the locking
> protocol.
> >>  You might even have several locks so that you don't exclude all other
> >> updaters, merely thin them out a bit.  This flagged strategy would run
> as
> >> fast as optimistic locking as long as optimistic locking is fast and
> then
> >> would limit the total number of transactions needed under very high
> load.
> >>
> >> On Thu, Aug 5, 2010 at 3:31 PM, Jonathan Holloway <
> >> jonathan.hollo...@gmail.com> wrote:
> >>
> >> > My [attempts] so far involve:
> >> > a) Creating a node with PERSISTENT_SEQUENTIAL then deleting it - this
> >> gives
> >> > me the monotonically increasing number, but the sequence number isn't
> >> > contiguous
> >> > b) Storing the sequence number in the data portion of a persistent
> node -
> >> > then updating this (using the version number - aka optimistic
> locking).
> >> >  The
> >> > problem with this is that under high load I'm assuming there'll be a
> lot
> >> of
> >> > contention and hence failures with regards to updates.
> >> >
> >>
> >
>


Re: How to handle "Node does not exist" error?

2010-08-11 Thread Ted Dunning
, msg001540, msg002686, msg001541,  
msg002946, msg002688, msg001584, msg002948]


[zk: localhost:2181(CONNECTED) 7] delete /xpe/queues/ 
3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg002948
Node does not exist: /xpe/queues/ 
3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg002948


When I performed the same operations on another node, none of those  
nodes existed.



Dr Hao He

XPE - the truly SOA platform

h...@softtouchit.com
http://softtouchit.com
http://itunes.com/apps/Scanmobile

On 11/08/2010, at 4:38 PM, Ted Dunning wrote:

Can you provide some more information?  The output of some of the  
four

letter commands and a transcript of what you are doing would be very
helpful.

Also, there is no way for znodes to exist on one node of a properly
operating ZK cluster and not on either of the other two.  Something  
has to
be wrong and I would vote for operator error (not to cast  
aspersions, it is

just that humans like you and *me* make more errors than ZK does).

On Tue, Aug 10, 2010 at 11:32 PM, Dr Hao He   
wrote:



hi, All,

I have a 3-host cluster running ZooKeeper 3.2.2.  On one of the  
hosts,
there are a number of nodes that I can "get" and "ls" using  
zkCli.sh .
However, when I tried to "delete" any of them, I got "Node does  
not exist"

error.Those nodes do not exist on the other two hosts.

Any idea how we should handle this type of errors and what might  
have

caused this problem?

Dr Hao He

XPE - the truly SOA platform

h...@softtouchit.com
http://softtouchit.com
http://itunes.com/apps/Scanmobile






Re: How to handle "Node does not exist" error?

2010-08-10 Thread Ted Dunning
Can you provide some more information?  The output of some of the four
letter commands and a transcript of what you are doing would be very
helpful.

Also, there is no way for znodes to exist on one node of a properly
operating ZK cluster and not on either of the other two.  Something has to
be wrong and I would vote for operator error (not to cast aspersions, it is
just that humans like you and *me* make more errors than ZK does).

On Tue, Aug 10, 2010 at 11:32 PM, Dr Hao He  wrote:

> hi, All,
>
> I have a 3-host cluster running ZooKeeper 3.2.2.  On one of the hosts,
> there are a number of nodes that I can "get" and "ls" using zkCli.sh .
>  However, when I tried to "delete" any of them, I got "Node does not exist"
> error.  Those nodes do not exist on the other two hosts.
>
> Any idea how we should handle this type of errors and what might have
> caused this problem?
>
> Dr Hao He
>
> XPE - the truly SOA platform
>
> h...@softtouchit.com
> http://softtouchit.com
> http://itunes.com/apps/Scanmobile
>
>


Re: Sequence Number Generation With Zookeeper

2010-08-06 Thread Ted Dunning
Tell him that we will all look over your code so he gets immediate free
consulting.

On Fri, Aug 6, 2010 at 7:39 PM, David Rosenstrauch wrote:

> I'll run it by my boss next week.
>
> DR
>
>
> On 08/06/2010 07:30 PM, Mahadev Konar wrote:
>
>> Hi David,
>>  I think it would be really useful. It would be very helpful for someone
>> looking for generating unique tokens/generation ids (I can think of
>> plenty
>> of applications for this).
>>
>> Please do consider contributing it back to the community!
>>
>> Thanks
>> mahadev
>>
>>
>> On 8/6/10 7:10 AM, "David Rosenstrauch"  wrote:
>>
>>  Perhaps.  I'd have to ask my boss for permission to release the code.
>>>
>>> Is this something that would be interesting/useful to other people?  If
>>> so, I can ask about it.
>>>
>>> DR
>>>
>>> On 08/05/2010 11:02 PM, Jonathan Holloway wrote:
>>>
 Hi David,

 We did discuss potentially doing this as well.  It would be nice to get
 some
 recipes for Zookeeper done for this area, if people think it's useful.
  Were
 you thinking of submitting this back as a recipe? If not, I could
 potentially work on such a recipe instead.

 Many thanks,
 Jon.


  I just ran into this exact situation, and handled it like so:
>
> I wrote a library that uses the option (b) you described above.  Only
> instead of requesting a single sequence number, you request a block of
> them
> at a time from Zookeeper, and then locally use them up one by one from
> the
> block you retrieved.  Retrieving by block (e.g., by blocks of 1 at
> a
> time) eliminates the contention issue.
>
> Then, if you're finished assigning ID's from that block, but still have
> a
> bunch of ID's left in the block, the library has another function to
> "push
> back" the unused ID's.  They'll then get pulled again in the next block
> retrieval.
>
> We don't actually have this code running in production yet, so I can't
> vouch for how well it works.  But the design was reviewed and given the
> thumbs up by the core developers on the team, and the implementation
> passes
> all my unit tests.
>
> HTH.  Feel free to email back with specific questions if you'd like
> more
> details.
>
> DR
>



Re: Sequence Number Generation With Zookeeper

2010-08-05 Thread Ted Dunning
Sounds right to me.  Much simpler as well.
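
For concreteness, here is a minimal sketch of the idea quoted below, assuming
the Java client and a pre-created persistent znode (the path and class name
are illustrative, not part of Jon's code):

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class VersionSequence {
        private static final String SEQUENCE = "/seq";  // pre-created persistent znode
        private final ZooKeeper zk;

        public VersionSequence(ZooKeeper zk) { this.zk = zk; }

        // Every successful setData bumps the znode version by exactly one, and
        // ZK serializes all updates, so each caller gets a distinct number.
        public int next() throws KeeperException, InterruptedException {
            Stat stat = zk.setData(SEQUENCE, new byte[0], -1);  // -1 = unconditional
            return stat.getVersion();
        }
    }

One caveat worth noting: on a connection loss the client may not learn whether
its setData applied, so a number can be consumed without the caller ever
seeing it.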

On Thu, Aug 5, 2010 at 5:38 PM, Jonathan Holloway <
jonathan.hollo...@gmail.com> wrote:

> Hi Ted,
>
> Thanks for the comments.
>
> I might have overlooked something here, but is it also possible to do the
> following:
>
> 1. Create a PERSISTENT node
> 2. Have multiple clients set the data on the node, e.g.  Stat stat =
> zookeeper.setData(SEQUENCE, ArrayUtils.EMPTY_BYTE_ARRAY, -1);
> 3. Use the version number from stat.getVersion() as the sequence (obviously
> I'm limited to Integer.MAX_VALUE)
>
> Are there any weird race conditions involved here which would mean that a
> client would receive the wrong Stat object back?
>
> Many thanks again,
> Jon.
>
> On 5 August 2010 16:09, Ted Dunning  wrote:
>
> > (b)
> >
> > BUT:
> >
> > Sequential numbering is a special case of "now".  In large diameters, now
> > gets very expensive.  This is a special case of that assertion.  If there
> > is
> > a way to get away from this presumption of the need for sequential
> > numbering, you will be miles better off.
> >
> > HOWEVER:
> >
> > ZK can do better than you suggest.  Incrementing a counter does involve
> > potential contention, but you will very likely be able to get to pretty
> > high
> > rates before the optimistic locking begins to fail.  If you code your
> > update
> > with a few tries at full speed followed by some form of retry back-off,
> you
> > should get pretty close to the best possible performance.
> >
> > You might also try building a lock with an ephemeral file before updating
> > the counter.  I would expect that this will be slower than the back-off
> > option if only because it involves more transactions in ZK.  If you wanted
> to
> > get too complicated for your own good, you could have a secondary
> strategy
> > flag that is only sampled by all clients every few seconds and is updated
> > whenever a client needs to back-off more than say 5 steps.  If this flag
> > has
> > been updated recently, then clients should switch to the locking
> protocol.
> >  You might even have several locks so that you don't exclude all other
> > updaters, merely thin them out a bit.  This flagged strategy would run as
> > fast as optimistic locking as long as optimistic locking is fast and then
> > would limit the total number of transactions needed under very high load.
> >
> > On Thu, Aug 5, 2010 at 3:31 PM, Jonathan Holloway <
> > jonathan.hollo...@gmail.com> wrote:
> >
> > > My options so far involve:
> > > a) Creating a node with PERSISTENT_SEQUENTIAL then deleting it - this
> > gives
> > > me the monotonically increasing number, but the sequence number isn't
> > > contiguous
> > > b) Storing the sequence number in the data portion of a persistent node
> -
> > > then updating this (using the version number - aka optimistic locking).
> > >  The
> > > problem with this is that under high load I'm assuming there'll be a
> lot
> > of
> > > contention and hence failures with regards to updates.
> > >
> >
>


Re: Sequence Number Generation With Zookeeper

2010-08-05 Thread Ted Dunning
(b)

BUT:

Sequential numbering is a special case of "now".  In large diameters, now
gets very expensive.  This is a special case of that assertion.  If there is
a way to get away from this presumption of the need for sequential
numbering, you will be miles better off.

HOWEVER:

ZK can do better than you suggest.  Incrementing a counter does involve
potential contention, but you will very likely be able to get to pretty high
rates before the optimistic locking begins to fail.  If you code your update
with a few tries at full speed followed by some form of retry back-off, you
should get pretty close to the best possible performance.

You might also try building a lock with an ephemeral file before updating
the counter.  I would expect that this will be slower than the back-off
option if only because it involves more transactions in ZK.  If you wanted to
get too complicated for your own good, you could have a secondary strategy
flag that is only sampled by all clients every few seconds and is updated
whenever a client needs to back-off more than say 5 steps.  If this flag has
been updated recently, then clients should switch to the locking protocol.
 You might even have several locks so that you don't exclude all other
updaters, merely thin them out a bit.  This flagged strategy would run as
fast as optimistic locking as long as optimistic locking is fast and then
would limit the total number of transactions needed under very high load.
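
To make the try-fast-then-back-off idea concrete, here is a hedged sketch
against the Java client (the long encoding and the retry constants are
arbitrary illustrative choices, not a vetted recipe):

    import java.nio.ByteBuffer;
    import java.util.Random;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class Counter {
        private final ZooKeeper zk;
        private final Random random = new Random();

        public Counter(ZooKeeper zk) { this.zk = zk; }

        // Optimistic increment: read the value and version, write back
        // conditionally, retry a few times at full speed, then back off.
        public long increment(String path) throws KeeperException, InterruptedException {
            for (int attempt = 0; ; attempt++) {
                Stat stat = new Stat();
                long value = ByteBuffer.wrap(zk.getData(path, false, stat)).getLong();
                try {
                    byte[] next = ByteBuffer.allocate(8).putLong(value + 1).array();
                    zk.setData(path, next, stat.getVersion());  // fails if another client won
                    return value + 1;
                } catch (KeeperException.BadVersionException contended) {
                    if (attempt >= 3) {  // a few tries at full speed, then randomized back-off
                        Thread.sleep(random.nextInt(10 << Math.min(attempt, 8)));
                    }
                }
            }
        }
    }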

On Thu, Aug 5, 2010 at 3:31 PM, Jonathan Holloway <
jonathan.hollo...@gmail.com> wrote:

> My options so far involve:
> a) Creating a node with PERSISTENT_SEQUENTIAL then deleting it - this gives
> me the monotonically increasing number, but the sequence number isn't
> contiguous
> b) Storing the sequence number in the data portion of a persistent node -
> then updating this (using the version number - aka optimistic locking).
>  The
> problem with this is that under high load I'm assuming there'll be a lot of
> contention and hence failures with regards to updates.
>


Re: Too many "KeeperErrorCode = Session moved" messages

2010-08-05 Thread Ted Dunning
I can't comment much on this, except that this is a very odd usage pattern.

First, it isn't so unusual, but I find it a particularly bad practice to
embed ZK into your application.  The problem is that you lose a lot of the
virtues of ZK in terms of coordination if ZK goes down with your
application.  In a nutshell, what good is a coordination layer if it isn't
relatively permanent.  For instance, one important use of a coordination
layer is to avoid multiple invocations of an expensive component on a
machine.  You can't do that unless you share a ZK cluster between all
invocations of the component.  Similarly, restarting your application is much
more common than restarting ZK, but by connecting the two of these, you
again lose any ability to make configuration persistent and you lose the
ability to restart one piece of your application without restarting your ZK
at the same time.  This coupling between restarts of very different service
components is a very bad idea.  Better to have simple components that serve
simple ends.  ZK is relatively simple, very stable and does one job well.
 Why mess with that?

Secondly, why in the world are you connecting to the local ZK server?  Why
not to the cluster at large?  By connecting to only a single server you lose
all the benefits of high availability in the ZK layer because the client
can't fail-over to other servers.  Likewise, by using the local loopback
address, you make it much harder to understand your server logs. The amount
of data moved to and from a ZK cluster is typically relatively small so
there is no significant benefit to keeping the traffic local to a single
machine.

Thirdly, I suspect that associated with your somewhat idiosyncratic
architecture is some slightly odd ZK configuration.  Could you post your
configuration files?  Your log files make it sound like the cluster might be
confused about itself.

On Thu, Aug 5, 2010 at 1:20 PM, Vishal K  wrote:

>
> I am seeing a lot of these messages in our application. I would like to
> know
> if I am doing something wrong or this is a ZK bug.
>
> Setup:
> - Server environment:zookeeper.version=3.3.0-925362
> - 3 node cluster
> - Each node has few clients that connect to the local server using
> 127.0.0.1
> as the host IP.
> - The application first forms a ZK cluster. Once the ZK cluster is formed,
> each node establishes sessions with the local ZK server. The clients do not know
> about remote servers, so sessions are always with the local server.
>
> As soon as ZK clients connected to their respective follower, the ZK leader
> starts spitting the following messages:


Re: Using watcher for being notified of children addition/removal

2010-08-02 Thread Ted Dunning
Another option besides Steve's excellent one would be to keep something like
1000 nodes in your list per znode.  Many update patterns will give you the
same number of updates, but the ZK transactions that result (getChildren,
read znode) will likely be more efficient, especially the getChildren call.

Remember, it is not a requirement that you have a one-to-one mapping between
your in-memory objects and in-zookeeper znodes.  If that works, fine.  If
not, feel free to be creative.

On Mon, Aug 2, 2010 at 7:45 AM, Steve Gury
wrote:

> Is there any recipe that would provide this feature (or a work around) ?
>


Re: node symlinks

2010-07-26 Thread Ted Dunning
I think it only mostly disappears.  If a user puts 1K files up and is placed
on a ZK cluster with 30K free slots then everything is good.  But if that
user adds 40K files, you have to split or migrate that user.  I think that the
easy answer is to have more than one location to look for a user's files.

On Mon, Jul 26, 2010 at 1:44 PM, Maarten Koopmans wrote:

> Also, the copy-on-new-cluster cost disappears in this scenario (bursts are
> handled better).


Re: node symlinks

2010-07-26 Thread Ted Dunning
So ZK is going to act like a file meta-data store and the number of files
might scale to a very large number.

For me, 5 billion files sounds like a large number and this seems to imply
ZK storage of 50-500GB.  If you assume 8GB usable space per machine, a fully
scaled system would require 6-60 ZK clusters.  If you start with 1 cluster
and scale by a factor of four at each expansion step, this will require 4
expansions.

I think that the easy way is to simply hash your file names to pick a
cluster.  You should have a central facility (ZK of course) that maintains a
history of hash seeds that have been used for cluster configurations
that still have live files.  The process for expansion would be:

a) bring up the new clusters.

b) add a new hash seed/number of clusters.  All new files will be created
according to this new scheme.  Old files will still be in their old places.

c) start a scan of all file meta-data records on the old clusters to move
them to where they should live in the current hashing.  When this scan
finishes, you can retire the old hash seed.  Since each ZK would only
contain at most a few hundred million entries, you should be able to
complete this scan in a day or so even if you are only scanning at a rate of
a thousand entries per second.

Since the scans of the old cluster might take quite a while and you might
even have two expansions before a scan is done, finding a file will consist
of probing current and old but still potentially active locations.  This is
the cost of the move-after-expansion strategy, but it can be hard to build
consistent systems without this old/new hash idea.  Normally I recommend
micro-sharding to avoid one-by-one object motion, but that wouldn't really
work with a ZK base.
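
A hedged sketch of the lookup side of this probing scheme (the seed-history
representation, the hash function, and the fetch hook are my own illustrative
inventions, not part of any existing recipe):

    import java.util.List;

    // Probes the current hashing first, then older still-live hashings.
    // fetchFromCluster() stands in for however the application reads a
    // given ZK cluster; it returns null when the name is not found there.
    public abstract class ShardedLookup {
        public static class HashConfig {
            public final long seed;
            public final int clusterCount;
            public HashConfig(long seed, int clusterCount) {
                this.seed = seed;
                this.clusterCount = clusterCount;
            }
            public int clusterFor(String name) {
                long h = seed;  // simple seeded hash, illustrative only
                for (int i = 0; i < name.length(); i++) {
                    h = h * 31 + name.charAt(i);
                }
                int idx = (int) (h % clusterCount);
                return idx < 0 ? idx + clusterCount : idx;  // keep the bucket non-negative
            }
        }

        private final List<HashConfig> configs;  // newest first; old seeds retired after scans

        protected ShardedLookup(List<HashConfig> configs) { this.configs = configs; }

        public byte[] find(String name) throws Exception {
            for (HashConfig c : configs) {
                byte[] meta = fetchFromCluster(c.clusterFor(name), name);
                if (meta != null) return meta;  // found under this (possibly old) hashing
            }
            return null;  // not present anywhere
        }

        protected abstract byte[] fetchFromCluster(int cluster, String name) throws Exception;
    }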

A more conventional approach would be to use Voldemort or Cassandra.
 Voldemort especially has some very nice expansion/resharding capabilities
and is very fast.  It wouldn't necessarily give you the guarantees of ZK,
but it is a pretty effective solution that avoids you having to implement
the scaling of the storage layer.

Also, the more you can store meta-data for multiple files in a single Znode,
the better off you will be in terms of memory efficiency.



On Mon, Jul 26, 2010 at 9:27 AM, Maarten Koopmans wrote:

>
> Hi Mahadev,
>
> My use is mapping a flat object store (like S3) to a filesystem and opening
> it up via WebDAV. So Zookeeper mirror the filesystem (each node corresponds
> to a collection or a file), and is used for locking and provides the pointer
> to the actual data object in e.g. S3
>
> A "symlink" could just be dialected in the ZK node - my tree traversal can
> recurses and can be made cluster aware. That way, I don't need a special
> central table.
>
> Does this clarify? The # nodes might grow rapidly with more users, and I
> need to grow between users and filesystems.
>
> Best, Maarten
>
> On 07/26/2010 06:12 PM, Mahadev Konar wrote:
>
>> HI Maarteen,
>>   Can you elaborate on your use case of ZooKeeper? We currently don't have
>> any symlinks feature in zookeeper. The only way to do it for you would be
>> a
>> client side hash/lookup table that buckets data to different zookeeper
>> servers.
>>
>> Or you could also store this hash/lookup table in one of the zookeeper
>> clusters. This lookup table can then be cached on the client side after
>> reading it once from zookeeper servers.
>>
>> Thanks
>> mahadev
>>
>>
>> On 7/24/10 2:39 PM, "Maarten Koopmans"  wrote:
>>
>>> Yes, I thought about Cassandra or Voldemort, but I need ZK's guarantees
>>> as it will provide the file system hierarchy to a flat object store so I
>>> need locking primitives and consistency. Doing that on top of Voldemort
>>> will give me a scalable version of ZK, but just slower. Might as well
>>> find a way to scale across ZK clusters.
>>>
>>> Also, I want to be able to add clusters as the number of nodes grows.
>>> Note that the #nodes will grow with the #users of the system, so the
>>> clusters can grow sequentially, hence the symlink idea.
>>>
>>> --Maarten
>>>
>>> On 07/24/2010 11:12 PM, Ted Dunning wrote:
>>>
>>>> Depending on your application, it might be good to simply hash the node
>>>> name
>>>> to decide which ZK cluster to put it on.
>>>>
>>>> Also, a scalable key value store like Voldemort or Cassandra might be
>>>> more
>>>> appropriate for your application.  Unless you need the hard-core
>>>> guarantees
>>>> of ZK, they can be better for large scale storage.
>>>>
>>>> On Sat, Jul 24, 2010 at 7:30 AM, Maarten Koopmans wrote:

Re: node symlinks

2010-07-24 Thread Ted Dunning
Depending on what a user needs to see, you can also have parallel structures
and select a cluster based on user number.

Your insistence on guarantees is worrisome, though.  As much as I like ZK, I
like getting rid of hard consistency requirements even more.  As I tend to
put it, the cost of "NOW" increases very rapidly with diameter of the "NOW"
that you are buying.  If you can avoid buying anything but very small "NOW"s
you will be much, much better off.

On Sat, Jul 24, 2010 at 2:39 PM, Maarten Koopmans wrote:

> Also, I want to be able to add clusters as the number of nodes grows. Note
> that the #nodes will grow with the #users of the system, so the clusters can
> grow sequentially, hence the symlink idea.


Re: node symlinks

2010-07-24 Thread Ted Dunning
Depending on your application, it might be good to simply hash the node name
to decide which ZK cluster to put it on.

Also, a scalable key value store like Voldemort or Cassandra might be more
appropriate for your application.  Unless you need the hard-core guarantees
of ZK, they can be better for large scale storage.

On Sat, Jul 24, 2010 at 7:30 AM, Maarten Koopmans wrote:

> Hi,
>
> I have a number of nodes that will grow larger than one cluster can hold,
> so I am looking for a way to efficiently stack clusters. One way is to have
> a zookeeper node "symlink" to another cluster.
>
> Has anybody ever done that and some tips, or alternative approaches?
> Currently I use Scala, and traverse zookeeper trees by proper tail
> recursion, so adapting the tail recursion to process "symlinks" would be my
> approach.
>
> Bst, Maarten
>


Re: Adding observers

2010-07-21 Thread Ted Dunning
It is really simpler than you can imagine.  Something like this should be
plenty sufficient.

   for h in $ZK_HOSTS    # ZK_HOSTS: whitespace-separated list of server hostnames
   do
      ssh "$h" "$ZK_HOME/bin/zkServer.sh restart"
      sleep 5
   done

This is just something I typed in, not something I checked.  It is intended
to give you the
idea.  I will leave it to you to fix my silly errors.  :-)

Note that you probably don't need to do this to the observers since they
don't need to know about other
observers.

On Wed, Jul 21, 2010 at 10:48 AM, Avinash Lakshman <
avinash.laksh...@gmail.com> wrote:

> Any example scripts for the rolling restart technique that anyone would be
> kind enough to share?
>
>


Re: Adding observers

2010-07-21 Thread Ted Dunning
If you have an efficient way to grab the disk state from an observer, this
will, indeed, make starting a new observer less expensive to the cluster.
 In practice, this isn't a big deal since the ZK snapshot is bounded by
memory size and transferring a few GB across the network isn't all that
painful as a one-time cost.

On Wed, Jul 21, 2010 at 10:35 AM, Avinash Lakshman <
avinash.laksh...@gmail.com> wrote:

> (1) If I snapshot the data on other observer machines will I be able to
> bootstrap new observers with it? Given that writes are like a one time
> thing.
>


Re: Adding observers

2010-07-21 Thread Ted Dunning
On Wed, Jul 21, 2010 at 10:30 AM, Avinash Lakshman <
avinash.laksh...@gmail.com> wrote:

>
> (1) Is it possible to increase the number of observers in the cluster
> dynamically?
>

Not quite, but practically speaking you can do something just as good.

In general, pretty much any ZK configuration change can be done without
service interruption by using a rolling restart.


> (2) How many observers can I add given that I will seldom write into the
> cluster but will have a lot of reads coming into the system? Can I run a
> cluster with say 100 observers?
>

Others will give more authoritative answers, but I am pretty sure that the
limitation on the number of observers is strictly related to write rate x
number of observers.  This is related to the fact that writes need to come
from the current master.  It isn't hard to imagine how to write a reflector
that watches for all changes and writes these to a secondary cluster.  That
would essentially eliminate the limit on number of observers.  Something
like that may already be possible within the current system (I couldn't say
since I haven't looked into observers that much).


Re: ZK recovery questions

2010-07-21 Thread Ted Dunning
My own experiments in my own environment where ZK is being used purely for
coordination at a fairly low transaction rate (tens to hundreds of ops per
second, mostly status updates) made me feel that disk throughput would only
be detectable as an issue for pretty massively abused ZK applications.  The
impact of disk writing is surprisingly small even for pretty high throughput
cases and for moderate or low throughput, it is just not detectable.

Those seem to share a lot with the applications that could benefit from
being able to start new servers efficiently from a disk snapshot and log, and
having the ability to restart the entire cluster with previous state.

On Wed, Jul 21, 2010 at 9:28 AM, Benjamin Reed  wrote:

> i did a benchmark a while back to see the effect of turning off the disk.
> (it wasn't as big as you would think.) i had to modify the code. there is an
> option to turn off the sync in the config that will get you most of the
> performance you would get by turning off the disk entirely.
>
> ben
>
> On 07/20/2010 11:01 PM, Ashwin Jayaprakash wrote:
>
>> I did try a quick test on Windows (yes, some of us use Windows :)
>>
>> I thought simply changing the "dataDir" to the "/dev/null" equivalent on
>> Windows would do the trick. It didn't work. It looks like a Java issue
>> because I noticed inconsistencies in the File API regarding this. I wrote
>> about it here -
>> http://javaforu.blogspot.com/2010/07/devnull-on-windows.html
>>
>> BTW the Windows equivalent is "nul".
>>
>> This is the error I got on Windows (below). The mkdirs() returns false. As
>> noted on my blog, it returns true for some cases.
>>
>> 2010-07-20 22:25:47,851 - FATAL [main:zookeeperserverm...@62] -
>> Unexpected
>> exception, exiting abnormally
>> java.io.IOException: Unable to create data directory nul:\version-2
>> at
>>
>> org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:79)
>> at
>>
>> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:102)
>> at
>>
>> org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:85)
>> at
>>
>> org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:51)
>> at
>>
>> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:108)
>> at
>>
>> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:76)
>>
>>
>> Ashwin.
>>
>>
>
>


Re: getChildren() when the number of children is very large

2010-07-21 Thread Ted Dunning
On Tue, Jul 20, 2010 at 8:47 PM, André Oriani  wrote:

> Ted, just to clarify. By file you mean znode, right ?


Yes.


> So you are advising me
> to try an atomic append to znode's by first calling getData and then trying
> to conditionally set the data by using the version information obtained in
> the previous step ?
>

Exactly.


Re: getChildren() when the number of children is very large

2010-07-20 Thread Ted Dunning
Creating a new znode for each update isn't really necessary.  Just create a
file that will contain all of the updates for the next snapshot and do
atomic updates to add to the list of updates belonging to that snapshot.
 When you complete the snapshot, you will create a new file.  After a time
you can delete the old snapshot lists since they are now redundant.  This
will leave only a few snapshot files in your directory and getChildren will
be fast.  Getting the contents of the file will give you a list of
transactions to apply and when you are done with those, you can get the file
again to get any new ones before considering yourself to be up to date.  The
snapshot file doesn't need to contain the updates themselves, but instead
can contain pointers to other znodes that would actually contain the
updates.
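
For illustration, a minimal sketch of the versioned atomic append this scheme
relies on, assuming the Java client (the newline encoding and the class name
are arbitrary choices):

    import java.nio.charset.Charset;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class SnapshotList {
        private static final Charset UTF8 = Charset.forName("UTF-8");
        private final ZooKeeper zk;

        public SnapshotList(ZooKeeper zk) { this.zk = zk; }

        // Atomically append one pointer (e.g., the path of a znode holding an
        // update) to the current snapshot-list znode.  A BadVersion failure
        // means another writer got in between our read and our write: retry.
        public void appendPointer(String listPath, String pointer)
                throws KeeperException, InterruptedException {
            while (true) {
                Stat stat = new Stat();
                byte[] old = zk.getData(listPath, false, stat);
                String updated = old.length == 0
                        ? pointer
                        : new String(old, UTF8) + "\n" + pointer;
                try {
                    zk.setData(listPath, updated.getBytes(UTF8), stat.getVersion());
                    return;
                } catch (KeeperException.BadVersionException raced) {
                    // lost the race; re-read the list and append again
                }
            }
        }
    }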

I think that the tendency to use file creation as the basic atomic operation
is a holdover from days when we used filesystems that way.  With ZK, file
updates are ordered, atomic and you know that you updated the right version
which makes many uses of directory updates much less natural.

On Tue, Jul 20, 2010 at 7:26 PM, André Oriani <
ra078...@students.ic.unicamp.br> wrote:

> Hi folks,
>
> I was considering using Zookeeper to implement a replication protocol due
> the global order guarantee. In my case, operations are logged by creating
> persistent sequential znodes. Knowing the name of last applied znode,
> backups can identify pending operations and apply them in order. Because I
> want to allow backups to join the system at any time, I will not delete a
> znode before a checkpoint. Thus,  I can ending up with thousand of child
> nodes and consequently ZooKeeper.getChildren() calls might be very
> consuming
> since a huge list of node will be returned.
>
> I thought of using another znode to store the last created znode. So if the
> last applied znode was op-11 and last created znode was op-14, I would try
> to read op-12 and op-13. However, in order to protect against partial
> failure, I have to encode some extra information ( I am using
> -)  in the name of znodes. Thus it is
> not possible to predict their names (they'll be op- string>-). Consequently , I will have to call
> getChildren() anyway.
>
> Has somebody faced the same issue ?  Has anybody found a better solution ?
>  I was thinking of extending ZooKeeper code to have some kind of indexed
> access to child znodes, but I don`t know how easy/clever is that.
>
> Thanks,
> André
>


Re: Logger hierarchies in ZK?

2010-07-20 Thread Ted Dunning
It is pretty easy to keep configuration files in general in ZK and reload
them on change.  Very handy some days!
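
A minimal sketch of that pattern, assuming the Java client (the apply() hook
is hypothetical and application-specific; watches are one-shot, so each
triggered read also re-registers the watch):

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class ConfigWatcher implements Watcher {
        private final ZooKeeper zk;
        private final String path;

        public ConfigWatcher(ZooKeeper zk, String path) {
            this.zk = zk;
            this.path = path;
        }

        // Read the config znode and register this watcher for the next change.
        public byte[] load() throws KeeperException, InterruptedException {
            return zk.getData(path, this, null);
        }

        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    apply(load());  // re-read the config and re-watch
                } catch (Exception e) {
                    // real code needs retry/back-off here
                }
            }
        }

        private void apply(byte[] config) {
            // application-specific: e.g., feed the bytes to the logging system
        }
    }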

On Tue, Jul 20, 2010 at 5:38 PM,  wrote:

> Has anyone experimented with storing logger hierarchies in ZK? I'm looking
> for a mechanism to dynamically change logger settings across a cluster of
> daemons. An app that connects to all servers via JMX would solve the
> problem; but we have a number of subsystems that do not run on the JVM so
> JMX is not a complete solution. Thanks.
>


Re: ZK recovery questions

2010-07-19 Thread Ted Dunning
They don't auto-detect.

What is usually done is that the configurations on all the servers are
changed and they are re-started one at a time.

On Mon, Jul 19, 2010 at 8:35 PM, Ashwin Jayaprakash <
ashwin.jayaprak...@gmail.com> wrote:

> So, what happens
> when a new replacement server has to be brought in on a different
> IP/hostname? Do the older clients autodetect the new server or is this even
> supported? I suppose not.
>


Re: ZK recovery questions

2010-07-18 Thread Ted Dunning
On Sun, Jul 18, 2010 at 3:34 PM, Ashwin Jayaprakash <
ashwin.jayaprak...@gmail.com> wrote:

>
>   - If 1 out of 3 servers crashes and the log files are unrecoverable, how
>   do we provision a replacement server?
>

Just start it and it will download a snapshot from the other servers.


>
>- If the server log is recoverable but provisioning takes a long time,
>   then what happens if the old log file is far behind the current state?


If a server is very far behind, it will download a snapshot as if it knows
nothing.  This rarely takes long.


>  - If there was a temporary glitch (n/w or GC) and the replica to which
>  the client is connected breaks away from the quorum does the client
> get
>  notified? Does it stop processing client requests? Does it rejoin the
>  cluster without manual intervention?
>

Failures like this are normally invisible to the client.


>   - Do the servers really have to run with file based persistence? I saw
>   that someone wanted this in-memory mode for unit testing (ZK-694),
>   but there are cases where only a transient ZK service is needed. Most
>   enterprise systems have replicated Databases anyway. So, the fear of data
>   loss is minimal. If ZK logs are the only means of recovery, then this
> might
>   be harder to implement
>

ZK is not a replacement for your database and it is really, really nice to
be able to stop it and start it again.  Disk persistence helps with this
enormously.

  promising. Plain ZK API is a bit overwhelming :)
>

In practice, it is really pretty simple.  Try it out.


Re: cleanup ZK takes 40-60 seconds

2010-07-16 Thread Ted Dunning
I can't comment on the cleanup time, but I can suggest that it is normally
not a very good idea to embed Zookeeper in your application.  If your
application really is distributed, then having ZK survive the demise of any
particular instance is a really nice thing.  If ZK goes away with your
application then you lose a lot of the power of having a reliable and
independent coordination service.

I have, as they say, been there and done that.  It was not a happy
experience.

You know what you are doing much more than I possibly could, so embedding ZK
might actually make sense.  I really don't think so, though.

On Fri, Jul 16, 2010 at 6:28 PM, Vishal K  wrote:

> However, I am not sure why the cleanup should take such a long time. Can
> anyone comment on this?
>


Re: Achieving quorum with only half of the nodes

2010-07-15 Thread Ted Dunning
A small rack mounted UPS doesn't require a full-scale rebuild of
infrastructure and would get you through almost all power fail scenarios.
 If you have 5 ZK servers, put 3 on one power source and give one of them
the UPS.  Then put the other 2 on the second power source.  If power
source A fails, you keep 2+1 servers and if power source B fails, you keep
3.

If you can stand manual intervention during an emergency, you might be able
to devise a recovery scenario.  You can put 3 and 2 on A and B as above
without a UPS.  If B fails, you are fine.  If A fails, you can go in and
reconfigure the remaining nodes to only consider themselves as the quorum.
 This is not a good plan in general because ZK is reliable enough that
people forget about it.  That means it won't be top of mind in a disaster
response.

On Thu, Jul 15, 2010 at 10:30 AM, Sergei Babovich
wrote:

> Three power sources obviously would solve the problem. Unfortunately at
> this moment it does not seem to be feasible (we will need to rebuild the
> whole existing infrastructure).


Re: Achieving quorum with only half of the nodes

2010-07-14 Thread Ted Dunning
On Wed, Jul 14, 2010 at 2:16 PM, Sergei Babovich
wrote:

> Yep... I see. This is a problem. Any better idea?
>

I think that the production of slightly elaborate quorum rules to handle
specific failure modes isn't a reasonable thing.  What you need to do in
conjunction is to estimate likelihoods of classes of failure modes and
convince yourself that you have decreased the overall failure probability.


> As an alternative option we could probably consider running single ZK node
> on EC2 - only in order to handle this specific case. Does it make sense to
> you? Is it feasible? Would it result in considerable performance impact due
> to network latency? I hope that at least in theory since quorum can be
> reached without ack from EC2 node performance impact might be manageable.
>

What about just putting a UPS on one machine in each of the two power supply
groups?

You are probably correct, though, that this outlier machine would almost
never matter to speed except when half of your machines have failed.


Re: Regarding Leader election and the limit on number of clients without performance degradation

2010-07-12 Thread Ted Dunning
Having 16 clients all wake up and ping ZK is an extremely light load.  The
warning on the recipes page had more to do with the situation where
thousands of nodes wake up at the same time.

On Mon, Jul 12, 2010 at 1:30 PM, Srikanth Bondalapati: <
sbondalap...@tagged.com> wrote:

> Hi,
>
> I am using ZooKeeper service for leader election and group management. I
> have read in the site (
>
> http://hadoop.apache.org/zookeeper/docs/r3.2.2/recipes.html#sc_leaderElection
> )
> under the "LeaderElection" section that, if all the clients try to access
> the getChildren() when trying to become a leader, it causes a bottleneck on
> the server. But, I wanted to execute getChildren() method on all the
> clients
> that have seen a change on the parent's ZNode. So, could you please tell
> what could be the maximum number of clients that can be used without any
> performance drop on the server, when all the clients try to execute
> getChildren() method? Currently, I intend to use 16 clients cluster, and
> the
> data on each of the ZNodes is very less (say < 500 bytes).
>
> Anxiously waiting for your reply,
> Thanks & Regards,
> Srikanth.
>


Re: Frequent SessionTimeoutException[Client] - CancelledKeyException[Server]

2010-07-07 Thread Ted Dunning
What is your garbage collection situation?  This sounds like your server has
stalled.

What is your transaction rate?  Average size?

On Wed, Jul 7, 2010 at 12:00 PM, Lakshman  wrote:

> Hi Everyone,
>
> We are using zookeeper 3.3.1. And more frequently we are hitting
> CancelledKeyException after startup of application.
> Average response time is less than 50 milliseconds. But the last request
> sent is not getting any response for 20 seconds, so it's timing out.
>
> When analyzed, we found some possible problem with CommitRequestProcessor.
>
> Following are the series of steps happening.
>
> Client has sent some request[exists, setData, etc.]
> Server received the packet completely. That is submitted for processing.
> [nextPending]
> Client has sent some ping requests after that.
> Server has received the ping request as well and that is also queued up.
> Client is timing out as it didn't get any response from server.
> This is because ping requests are also getting queued up into
> queuedRequests.
> It's waiting for a committedRequest for the current nextPending operation.
>
> As per my understanding pings request from client need not be queued up and
> can be processed immediately.
>
> Please throw some pointers on this issue to me & do correct me if I went
> wrong.
> --
> Thanks & Regards
> Laxman
>
>
>


Re: Guaranteed message delivery until session timeout?

2010-06-30 Thread Ted Dunning
I think that you are correct, but a real ZK person should answer this.

On Wed, Jun 30, 2010 at 4:48 PM, Bryan Thompson  wrote:

> For example, if a client registers a watch, and a state change which would
> trigger that watch occurs _after_ the client has successfuly registered the
> watch with the zookeeper quorum, is it possible that the client would not
> observe the watch trigger due to communication failure, etc., even while the
> clients session remains valid?  It sounds like the answer is "no" per the
> timeliness guarantee.  Is that correct?
>
>


Re: Guaranteed message delivery until session timeout?

2010-06-30 Thread Ted Dunning
Yes.  That is true.  In particular, your link to a server (or the server
itself) can fail causing your client to switch to a different ZK server and
retry there.  This can and often does happen without you knowing.

On Wed, Jun 30, 2010 at 4:48 PM, Bryan Thompson  wrote:

> With regard to timeliness: "The client's view of the system is
> guaranteed to be up-to-date within a certain time bound. (On the order of
> tens of seconds.) Either system changes will be seen by a client within this
> bound, or the client will detect a service outage."
>
> This seems to imply that there are retries for transient communication
> failures.  Is that true?
>


Re: Guaranteed message delivery until session timeout?

2010-06-30 Thread Ted Dunning
Also this:

Once an update has been applied, it will persist from that time forward
until a client overwrites the update. This guarantee has two corollaries:
If a client gets a successful return code, the update will have been
applied. On some failures (communication errors, timeouts, etc) the client
will not know if the update has applied or not. We take steps to minimize
the failures, but the guarantee is only present with successful return
codes. (This is called the *monotonicity condition* in Paxos.)
Any updates that are seen by the client, through a read request or
successful update, will never be rolled back when recovering from server
failures.


I think that the clear implications here are:

a) if you get a successful return code and no session expiration, your
ephemeral file is there

b) if the ephemeral file is created, you might not get the successful
return code (due to connection loss), but the ephemeral file might continue
to exist (because connection loss != session loss)

c) if you get a failure return code, your ephemeral file was not created
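
To make (b) concrete, here is a hedged sketch of one way a client might cope
with connection loss around an ephemeral create, assuming the Java client and
that no other client ever creates the same path (session expiration surfaces
separately as SessionExpiredException and must be handled by the caller):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class EphemeralCreator {
        private final ZooKeeper zk;

        public EphemeralCreator(ZooKeeper zk) { this.zk = zk; }

        // On connection loss the create may or may not have been applied, so
        // after the client library reconnects we check whether the node exists.
        public void createEphemeral(String path, byte[] data)
                throws KeeperException, InterruptedException {
            while (true) {
                try {
                    zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                    return;                                      // success: the node is ours
                } catch (KeeperException.NodeExistsException e) {
                    return;                                      // an earlier attempt actually worked
                } catch (KeeperException.ConnectionLossException e) {
                    if (zk.exists(path, false) != null) return;  // it was created after all
                    // otherwise the create really failed; fall through and retry
                }
            }
        }
    }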

On Wed, Jun 30, 2010 at 4:33 PM, Patrick Hunt  wrote:

> in particular see "timeliness"
> http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkGuarantees
>


Re: Guaranteed message delivery until session timeout?

2010-06-30 Thread Ted Dunning
Isn't this the same question that you sent this morning?

On Wed, Jun 30, 2010 at 3:36 PM, Bryan Thompson  wrote:

> Hello,
>
> I am wondering what guarantees (if any) zookeeper provides for reliable
> messaging for operation return codes up to a session timeout.  Basically, I
> would like to know whether a zookeeper client can rely on observing the
> return code for a successful operation which creates an ephemeral (or
> ephemeral sequential) znode -or- have a guarantee that its session was timed
> out and the ephemeral znode destroyed.  That is, does zookeeper provide
> guaranteed delivery of the operation return code unless the session is
> invalidated by a timeout?
>
> Thanks,
> Bryan
>


Re: Guaranteed message delivery until session timeout?

2010-06-30 Thread Ted Dunning
Which API are you talking about?  C?

I think that the difference between connection loss and session expiration
might mess you up slightly in your disjunction here.

On Wed, Jun 30, 2010 at 7:45 AM, Bryan Thompson  wrote:

> Hello,
>
> I am wondering what guarantees (if any) zookeeper provides for reliable
> messaging for operation return codes up to a session timeout.  Basically, I
> would like to know whether a zookeeper client can rely on observing the
> return code for a successful operation which creates an ephemeral (or
> ephemeral sequential) znode -or- have a guarantee that its session was timed
> out and the ephemeral znode destroyed.  That is, does zookeeper provide
> guaranteed delivery of the operation return code unless the session is
> invalidated by a timeout?
>
> Thanks,
> Bryan
>


Re: Receive timed out error while starting zookeeper server

2010-06-27 Thread Ted Dunning
Are you sure that you understand that there really isn't a good concept of a
master and slave in zookeeper (at least not by default)?

Are you actually starting servers on all of your machines in your cluster?

On Sat, Jun 26, 2010 at 6:53 AM, Peeyush Kumar  wrote:

> I have a 6 node cluster (5 slaves and 1 master). I am trying to
> start the zookeeper server on the cluster. When I issue this command:
> $ java -cp zookeeper.jar:lib/log4j-1.2.15.jar:conf \
> org.apache.zookeeper.server.quorum.QuorumPeerMain zoo.cfg
> I get the following error:
> 2010-06-26 18:09:17,468 - INFO  [main:quorumpeercon...@80] - Reading
> configuration from: conf/zoo.cfg
> 2010-06-26 18:09:17,483 - INFO  [main:quorumpeercon...@232] - Defaulting
> to
> majority quorums
> 2010-06-26 18:09:17,545 - INFO  [main:quorumpeerm...@118] - Starting
> quorum
> peer
> 2010-06-26 18:09:17,585 - INFO  [QuorumPeer:/0.0.0.0:2179:quorump...@514]
> -
> LOOKING
> 2010-06-26 18:09:17,589 - INFO  [QuorumPeer:/0.0.0.0:2179
> :leaderelect...@154]
> - Server address: master.cf.net/192.168.1.1:2180
>
> 2010-06-26 18:09:17,589 - INFO  [QuorumPeer:/0.0.0.0:2179
> :leaderelect...@154]
> - Server address: slave01.cf.net/192.168.1.2:2180
>
> 2010-06-26 18:09:17,792 - WARN  [QuorumPeer:/0.0.0.0:2179
> :leaderelect...@194]
> - Ignoring exception while looking for
> leader
>


Re: Switching zookeeper servers when one server is down

2010-06-25 Thread Ted Dunning
Yes.  A client should reconnect to another server if that server goes down.

If you could post a few more details, it would help the community enormously
in debugging your problems.

Notably:

- which client are you using?  (C, Java, ...)

- do you have any client or server logs you could post?  Client code?

- have you verified that the client was told about all of the servers in the cluster
when it connected?

On Fri, Jun 25, 2010 at 8:43 AM, Colin Goodheart-Smithe <
colin.goodheartsmi...@detica.com> wrote:

> We have a system using zookeeper 3.0.1 with a Quorum of 3 servers.  When
> we shutdown one of the servers the other two stay active as expected but
> the clients which were connected to the shutdown server do not attempt
> to connect to a different zookeeper server.  I was under the impression
> that if a client could not connect to a server it would try a different
> server and iterate this process until it regained connection to a
> server.  Is this correct?
>


Re: Free Software Solution to continuously load a large number of feeds with several servers?

2010-06-19 Thread Ted Dunning
You don't say what you mean by feed.  The bixo system might be helpful to
you.  http://bixolabs.com/

On Fri, Jun 18, 2010 at 11:01 AM, Thomas Koch  wrote:

> http://stackoverflow.com/questions/3072042/free-software-solution-to-
> continuously-load-a-large-number-of-feeds-with-several
>
> I need a system that schedules and conducts the loading of a large number
> of
> Feeds. The scheduling should consider priority values for feeds provided by
> me
> and the history of past publish frequency of the feed. Later the system
> should
> make use of pubsub where available.
> Currently I'm planning to implement my own system based on HBase and
> ZooKeeper. If there isn't any free software solution by now, then I'd
> propose
> at work to develop our solution as Free Software.
>
> Thank you for any hints,
>
> Thomas Koch, http://www.koch.ro
>


Re: Debugging help for SessionExpiredException

2010-06-15 Thread Ted Dunning
As usual, the ZK team provides the best feedback.

I would be bold enough to ask what kind of ec2 instances you are running on.
 Small instances are small chunks of larger machines and are sometimes
subject to competition for resources from the other tenants.

On Tue, Jun 15, 2010 at 12:30 PM, Patrick Hunt  wrote:

> 3) under-provisioned virtual machines (ie vmware)
>
> ...
>
> Given that you've ruled out the gc (most common), disk utilization would be
> the next thing to check.
>


Re: Debugging help for SessionExpiredException

2010-06-15 Thread Ted Dunning
Jordan,

Good step to get this info.

I have to ask, did you have your disconnect problem last night as well?
 (just checking)

What does the stat command on ZK give you for each server?

On Tue, Jun 15, 2010 at 10:33 AM, Jordan Zimmerman <
jzimmer...@proofpoint.com> wrote:

> More on this...
>
> I ran last night with verbose GC on our client. I analyzed the GC log in
> gchisto and 99% of the GCs are 1 or 2 ms. The longest gc is 30 ms. On the
> Zookeeper server side, the longest gc is 130 ms. So, I submit, GC is not the
> problem. NOTE we're running on Amazon EC2.
>
>


Re: Debugging help for SessionExpiredException

2010-06-10 Thread Ted Dunning
Uh the options I was recommending were for your CLIENT.  You should have
similar settings on ZK, but it is your client that is likely to be pausing.

On Thu, Jun 10, 2010 at 4:08 PM, Jordan Zimmerman  wrote:

> The thing is, this is a test instance (on AWS/EC2) that isn't getting a lot
> of traffic. i.e. 1 zookeeper instance that we're testing with.
>
> On Jun 10, 2010, at 4:06 PM, Ted Dunning wrote:
>
> > Possibly.
> >
> > I have seen GC times of > 4 minutes on some large processes.  Better to
> set
> > the GC parameters so you don't get long pauses.
> >
> > On http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting it mentions
> using
> > the "-XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC" options.  I
> recommend
> > adding
> >
> >-XX:+UseParNewGC
> >-XX:+CMSParallelRemarkEnabled
> >-XX:+DisableExplicitGC
> >
> > You may want to tune the actual parameters of the GC itself.  These
> should
> > not be used in general, but might be helpful for certain kinds of
> servers:
> >
> >-XX:MaxTenuringThreshold=6
> >-XX:SurvivorRatio=6
> >-XX:CMSInitiatingOccupancyFraction=60
> >-XX:+UseCMSInitiatingOccupancyOnly
> >
> > Finally, you should always add options for lots of GC diagnostics:
> >
> >-XX:+PrintGCDetails
> >-XX:+PrintGCTimeStamps
> >-XX:+PrintTenuringDistribution
> >
> > On Thu, Jun 10, 2010 at 3:49 PM, Jordan Zimmerman <
> jzimmer...@proofpoint.com
> >> wrote:
> >
> >> If I set my session timeout very high (1 minute) this shouldn't happen,
> >> right?
> >>
>
>


Re: Debugging help for SessionExpiredException

2010-06-10 Thread Ted Dunning
Possibly.

I have seen GC times of > 4 minutes on some large processes.  Better to set
the GC parameters so you don't get long pauses.

On http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting it mentions using
the "-XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC" options.  I recommend
adding

-XX:+UseParNewGC
-XX:+CMSParallelRemarkEnabled
-XX:+DisableExplicitGC

You may want to tune the actual parameters of the GC itself.  These should
not be used in general, but might be helpful for certain kinds of servers:

-XX:MaxTenuringThreshold=6
-XX:SurvivorRatio=6
-XX:CMSInitiatingOccupancyFraction=60
-XX:+UseCMSInitiatingOccupancyOnly

Finally, you should always add options for lots of GC diagnostics:

-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution

On Thu, Jun 10, 2010 at 3:49 PM, Jordan Zimmerman  wrote:

> If I set my session timeout very high (1 minute) this shouldn't happen,
> right?
>


Re: Debugging help for SessionExpiredException

2010-06-09 Thread Ted Dunning
This can depend on which kind of instance you invoke as well.  The smallest
instances disappear for short periods of time and that can lead to
surprises.

On Wed, Jun 9, 2010 at 3:35 PM, Lei Zhang  wrote:

> On EC2 (still CentOS as guest OS), we consistently run into zk session
> expire issue when our cluster is under heavy load. I am planning to raise
> scheduling priority of zk server, but haven't done testing.
>


Re: Simulating failures?

2010-06-04 Thread Ted Dunning
I use mock objects to create a simulated ZK object.

Alternatively, you may be able to sub-class and delegate all ZK calls.  That
would let you inject faults.
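
For instance, a minimal sketch of the sub-class-and-inject idea (the fault
rate and the choice of getData as the injection point are illustrative):

    import java.io.IOException;
    import java.util.Random;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Wraps the real client and randomly injects connection-loss failures
    // so the application's retry paths get exercised in ordinary tests.
    public class FaultInjectingZooKeeper extends ZooKeeper {
        private final Random random = new Random();
        private volatile double faultRate = 0.1;  // fail 10% of reads, for illustration

        public FaultInjectingZooKeeper(String connectString, int sessionTimeout, Watcher watcher)
                throws IOException {
            super(connectString, sessionTimeout, watcher);
        }

        @Override
        public byte[] getData(String path, boolean watch, Stat stat)
                throws KeeperException, InterruptedException {
            if (random.nextDouble() < faultRate) {
                throw new KeeperException.ConnectionLossException();
            }
            return super.getData(path, watch, stat);
        }
    }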

On Fri, Jun 4, 2010 at 11:28 AM, Stephen Green wrote:

> Is there any way to inject failures into the ZK client so that I can
> test without having to randomly kill servers/clients?
>


Re: zookeeper crash

2010-06-02 Thread Ted Dunning
I knew Patrick would remember to add an important detail.

On Wed, Jun 2, 2010 at 11:49 AM, Patrick Hunt  wrote:

> As Ted suggested you can remove the datadir -- *only on the effected
> server* -- and then restart it.


Re: zookeeper crash

2010-06-02 Thread Ted Dunning
This looks a bit like a small bobble we had when upgrading a bit ago.

I THINK that the answer here is to mind-wipe the misbehaving node and have
it resynch from scratch from the other nodes.

Wait for confirmation from somebody real.

On Wed, Jun 2, 2010 at 11:11 AM, Charity Majors wrote:

> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an
> attempt to get away from a client bug that was crashing my backend services.
>
> Unfortunately, this morning I had a server crash, and it brought down my
> entire cluster.  I don't have the logs leading up to the crash, because --
> argghffbuggle -- log4j wasn't set up correctly.  But I restarted all three
> nodes, and nodes two and three came back up and formed a quorum.
>
> Node one, meanwhile, does this:
>
> 2010-06-02 17:04:56,446 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
> 2010-06-02 17:04:56,446 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot
> /services/zookeeper/data/zookeeper/version-2/snapshot.a0045
> 2010-06-02 17:04:56,476 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election.
> My id =  1, Proposed zxid = 47244640287
> 2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification:
> 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
> 2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification:
> 3, 38654707048, 3, 1, LOOKING, LEADING, 3
> 2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification:
> 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
> 2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
> 2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server
> with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir
> /services/zookeeper/data/zookeeper/version-2 snapdir
> /services/zookeeper/data/zookeeper/version-2
> 2010-06-02 17:04:56,486 - FATAL
> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less
> than our epoch b
> 2010-06-02 17:04:56,486 - WARN
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following
> the leader
> java.io.IOException: Error: Epoch of leader is lower
>   at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>   at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
> 2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called
> java.lang.Exception: shutdown Follower
>   at
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>   at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>
>
>
> All I can find is this,
> http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html,
> which implies that this state should never happen.
>
> Any suggestions?  If it happens again, I'll just have to roll everything
> back to 3.2.1 and live with the client crashes.
>
>
>
>
>


Re: Locking and Partial Failure

2010-05-31 Thread Ted Dunning
Isn't this a special case of
https://issues.apache.org/jira/browse/ZOOKEEPER-22 ?

Is there any progress on this?
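
For reference, a hedged sketch of the token-based recovery Patrick suggests
below (the UUID token format and the recovery scan are illustrative, not the
shipped recipe):

    import java.util.UUID;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class RecoverableLockNode {
        private final ZooKeeper zk;

        public RecoverableLockNode(ZooKeeper zk) { this.zk = zk; }

        // Embed a unique per-request token in the lock node name so that a
        // create() whose response was lost can be found again after reconnect.
        public String createLockNode(String lockDir)
                throws KeeperException, InterruptedException {
            String token = UUID.randomUUID().toString();
            String prefix = lockDir + "/lock-" + token + "-";
            while (true) {
                try {
                    return zk.create(prefix, new byte[0],
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
                } catch (KeeperException.ConnectionLossException e) {
                    // the create may have been applied; look for our token
                    for (String child : zk.getChildren(lockDir, false)) {
                        if (child.contains(token)) {
                            return lockDir + "/" + child;  // orphan found, adopt it
                        }
                    }
                    // not there, so the create really failed; try again
                }
            }
        }
    }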

On Mon, May 31, 2010 at 12:34 PM, Patrick Hunt  wrote:

> Hi Charles, any luck with this? Re the issues you found with the recipes
> please enter a JIRA, it would be good to address the problem(s) you found.
> https://issues.apache.org/jira/browse/ZOOKEEPER
>
> re use of session/thread id, might you use some sort of unique token that's
> dynamically assigned to the thread making a request on the shared session?
> The calling code could then be identified by that token in recovery cases.
>
> Patrick
>
> On 05/28/2010 08:28 AM, Charles Gordon wrote:
>
>> Hello,
>>
>> I am new to using Zookeeper and I have a quick question about the locking
>> recipe that can be found here:
>>
>>
>> http://hadoop.apache.org/zookeeper/docs/r3.1.2/recipes.html#sc_recipes_Locks
>>
>> It appears to me that there is a flaw in this algorithm related to partial
>> failure, and I am curious to know how to fix it.
>>
>> The algorithm follows these steps:
>>
>>  1. Call "create()" with a pathname like
>> "/some/path/to/parent/child-lock-".
>>  2. Call "getChildren()" on the lock node without the watch flag set.
>>  3. If the path created in step (1) has the lowest sequence number, you
>> are
>> the master (skip the next steps).
>>  4. Otherwise, call "exists()" with the watch flag set on the child with
>> the
>> next lowest sequence number.
>>  5. If "exists()" returns false, go to step (2), otherwise wait for a
>> notification from the path, then go to step (2).
>>
>> The scenario that seems to be faulty is a partial failure in step (1).
>> Assume that my client program follows step (1) and calls "create()".
>> Assume
>> that the call succeeds on the Zookeeper server, but there is a
>> ConnectionLoss event right as the server sends the response (e.g., a
>> network
>> partition, some dropped packets, the ZK server goes down, etc). Assume
>> further that the client immediately reconnects, so the session is not
>> timed
>> out. At this point there is a child node that was created by my client,
>> but
>> that my client does not know about (since it never received the response).
>> Since my client doesn't know about the child, it won't know to watch the
>> previous child to it, and it also won't know to delete it. That means all
>> clients using that lock will fail to make progress as soon as the orphaned
>> child is the lowest sequence number. This state will continue until my
>> client closes it's session (which may be a while if I have a long lived
>> session, as I would like to have). Correctness is maintained here, but
>> live-ness is not.
>>
>> The only good solution I have found for this problem is to establish a new
>> session with Zookeeper before acquiring a lock, and to close that session
>> immediately upon any connection loss in step (1). If everything works, the
>> session could be re-used, but you'd need to guarantee that the session was
>> closed if there was a failure during creation of the child node. Are there
>> other good solutions?
>>
>> I looked at the sample code that comes with the Zookeeper distribution
>> (I'm
>> using 3.2.2 right now), and it uses the current session ID as part of the
>> child node name. Then, if there is a failure during creation, it tries to
>> look up the child using that session ID. This isn't really helpful in the
>> environment I'm using, where a single session could be shared by multiple
>> threads, any of which could request a lock (so I can't uniquely identify a
>> lock by session ID). I could use thread ID, but then I run the risk of a
>> thread being reused and getting the wrong lock. In any case, there is also
>> the risk that a second failure prevents me from looking up the lock after
>> a
>> connection loss, so I'm right back to an orphaned lock child, as above. I
>> could, presumably, be careful enough with try/catch logic to prevent even
>> that case, but it makes for pretty bug-prone code. Also, as a side note,
>> that code appears to be sorting the child nodes by the session ID first,
>> then the sequence number, which could cause locks to be ordered
>> incorrectly.
>>
>> Thanks for any help you can provide!
>>
>> Charles Gordon
>>
>>


Re: Zookeeper, Maven and dependencies on javax jar files

2010-05-24 Thread Ted Dunning
Same version I use.

On Mon, May 24, 2010 at 2:51 PM, Jack Orenstein  wrote:

> Ted Dunning wrote:
>
>> Which version of maven do you have?
>>
>
> 2.2.1.


Re: Zookeeper, Maven and dependencies on javax jar files

2010-05-24 Thread Ted Dunning
The only one that I think is important is the jmx which enables monitoring
of the servers.

On Mon, May 24, 2010 at 2:51 PM, Jack Orenstein  wrote:

> This at least gets me through the build/install phase. My usage of
> zookeeper is pretty minimal right now -- just one a single node. What
> features of zookeeper depend on the excluded jar files?
>
> Thanks very much for your quick response.
>


Re: Zookeeper, Maven and dependencies on javax jar files

2010-05-24 Thread Ted Dunning
Which version of maven do you have?

I have heard some versions don't follow redirects well.  You can try
deleting these defective files in your local repository under .m2 and try
again.  You may need to try with a newer maven to get things right.

Another option is to explicitly remove those dependencies since they are
optional anyway.  This sort of trick is commonly necessary with log4j.  Try
something like this (suitably adjusted, of course):

   
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.15</version>
      <exclusions>
        <exclusion>
          <groupId>com.sun.jmx</groupId>
          <artifactId>jmxri</artifactId>
        </exclusion>
        <exclusion>
          <groupId>com.sun.jdmk</groupId>
          <artifactId>jmxtools</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.jms</groupId>
          <artifactId>jms</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.mail</groupId>
          <artifactId>mail</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
See here for possible related issue:
http://jira.codehaus.org/browse/MREPOSITORY-20?page=com.atlassian.streams.streams-jira-plugin:activity-stream-issue-tab

On Mon, May 24, 2010 at 1:35 PM, Jack Orenstein  wrote:

> I'm working on a project in the maven framework and have added a dependency
> on zookeeper. When I try to install:
>
>mvn clean install -Dmaven.test.skip=true
>...
>[INFO] Compilation failure
>
>error: error reading
> /home/jao/.m2/repository/javax/jms/jms/1.1/jms-1.1.jar; error in opening zip
> file
>error: error reading
> /home/jao/.m2/repository/com/sun/jdmk/jmxtools/1.2.1/jmxtools-1.2.1.jar;
> error in opening zip file
>error: error reading
> /home/jao/.m2/repository/com/sun/jmx/jmxri/1.2.1/jmxri-1.2.1.jar; error in
> opening zip file
>
> The named jar files contain some HTML (!), e.g.
>
>
>
>    <html><head><title>301 Moved Permanently</title></head><body>
>    <h1>Moved Permanently</h1>
>    <p>The document has moved <a
>    href="http://download.java.net/maven/1/javax.jms/jars/jms-1.1.jar">here</a>.</p>
>    <address>Apache Server at maven-repository.dev.java.net Port 443</address>
>    </body></html>
>
>
> Has anyone succeeded in following this advice and getting it to work?
>
> I found this maven page:
> http://maven.apache.org/guides/mini/guide-coping-with-sun-jars.html, but
> just getting the jar files from Oracle is a pain.
>
> Any advice on how to get zookeeper and maven to coexist would be welcome.
> Thanks.
>
> Jack
>

