ZooKeeper approved by Apache Board as TLP!

2010-11-22 Thread Patrick Hunt
We are now officially an Apache TLP! http://bit.ly/9czN2x As part of the process for moving out from under Hadoop and into full TLP status we need to work through the following: http://incubator.apache.org/guides/graduation.html#new-project-hand-over If you are involved with the project, esp on

Re: number of clients/watchers

2010-11-18 Thread Patrick Hunt
Camille, that's a very good question. Largest cluster I've heard about is 10k sessions. Jeremy - largest I've ever tested was a 3 server cluster with ~500 sessions. Each session created 10k znodes (100 bytes each znode) and set 5 watches on each. So 5 million znodes and 25 million watches. I then
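
The totals quoted in this message follow directly from the per-session figures. A minimal sketch of the arithmetic, using only the numbers from the message (the raw payload figure is derived here and ignores any per-znode overhead on the server, which the message does not state):

```python
# Figures taken from the message: ~500 sessions, 10k znodes per session,
# 100 bytes per znode, 5 watches per znode.
sessions = 500
znodes_per_session = 10_000
znode_bytes = 100
watches_per_znode = 5

total_znodes = sessions * znodes_per_session      # 5,000,000
total_watches = total_znodes * watches_per_znode  # 25,000,000
raw_payload_bytes = total_znodes * znode_bytes    # data bytes only, no overhead

print(total_znodes, total_watches, raw_payload_bytes)
```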

Re: number of clients/watchers

2010-11-18 Thread Patrick Hunt
: Thanks Patrick - it's really nice to have those numbers and test harness basis. We're still in architecture mode so some of the details are still in flux, but I think this gives us an idea. Thanks very much. On Nov 18, 2010, at 11:51 AM, Patrick Hunt wrote: Camille, that's a very good

Re: number of clients/watchers

2010-11-18 Thread Patrick Hunt
connections across that this won't happen. I think maybe there's a JIRA out to deal with this issue, not sure what the status is. C -Original Message- From: Patrick Hunt [mailto:ph...@apache.org] Sent: Thursday, November 18, 2010 2:57 PM To: zookeeper-user@hadoop.apache.org Subject: Re

Re: Verifying Changes

2010-11-10 Thread Patrick Hunt
Perhaps something similar to what Ben detailed here? (rendezvous) http://developer.yahoo.com/blogs/hadoop/posts/2009/05/using_zookeeper_to_tame_system/ Change the key, add child znode(s) that's deleted by the notified client(s) once it's read the changed value. Some details need to be worked out

Re: Key factors for production readiness of Hedwig

2010-11-10 Thread Patrick Hunt
On Wed, Nov 10, 2010 at 10:58 AM, Erwin Tam e...@yahoo-inc.com wrote: 1. Ops tools including monitoring and administration. Command port (4 letter words) for monitoring has worked extremely well for zk. Whatever you do put the command port on a separate port, and make it a full fledged feature

[Discussion] Some proposed logging (log4j) JIRAs

2010-11-09 Thread Patrick Hunt
I wanted to highlight a couple recent JIRAs that may have impact on users (api consumers AND admins of the service) in the 3.4 timeframe. If you want to weigh in please comment on the respective jira: 1) proposal to move to slf4j (remove/replace log4j)

Re: Running cluster behind load balancer

2010-11-04 Thread Patrick Hunt
Hi Chang, thanks for the insights, if you have a few minutes would you mind updating the FAQ with some of this detail? http://wiki.apache.org/hadoop/ZooKeeper/FAQ Thanks! Patrick On Thu, Nov 4, 2010 at 6:27 AM, Chang Song tru64...@me.com wrote: Sorry. I made a mistake on retry timeout in load

Re: JUnit tests do not produce logs if the JVM crashes

2010-11-04 Thread Patrick Hunt
In addition to what Mahadev suggested you can also change the log4j.properties to log to a file rather than the CONSOLE. Although that just redirects the logs, if there is some output to stdout/stderr then junit buffering is still in play. Patrick On Thu, Nov 4, 2010 at 8:15 AM, Mahadev Konar

Re: Running cluster behind load balancer

2010-11-04 Thread Patrick Hunt
resolving to all the server addresses will probably work just as well as most DNS-based load balancers. ben On 11/04/2010 08:26 AM, Patrick Hunt wrote: Hi Chang, thanks for the insights, if you have a few minutes would you mind updating the FAQ with some of this detail? http://wiki.apache.org

Re: question about watcher

2010-11-03 Thread Patrick Hunt
...). Patrick On Wed, Nov 3, 2010 at 1:13 AM, Qian Ye yeqian@gmail.com wrote: thanks Patrick, I want to know all watches set by all clients. I would open a jira and write some design think about it later. On Tue, Nov 2, 2010 at 11:53 PM, Patrick Hunt ph...@apache.org wrote: Hi Qian Ye

Re: question about watcher

2010-11-02 Thread Patrick Hunt
Hi Qian Ye, yes you should open a JIRA for this. If you want to work on a patch we could advise you. One thing not clear to me, are you interested in just the watches set by the particular client, or all watches set by all clients? The first should be relatively easy to get, the second would be

Re: Getting a node exists code on a sequence create

2010-11-01 Thread Patrick Hunt
Hi Jeremy, this sounds like a bug to me, I don't think you should be getting the nodeexists when the sequence flag is set. Looking at the code briefly we use the parent's cversion (incremented each time the child list is changed, added/removed). Did you see this error each time you called

Re: Setting the heap size

2010-11-01 Thread Patrick Hunt
- thanks Patrick! On Thu, Oct 28, 2010 at 6:13 PM, Patrick Hunt ph...@apache.org wrote: Tim, one other thing you might want to be aware of: http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_supervision Patrick On Thu, Oct 28, 2010 at 9:11 AM, Patrick Hunt ph...@apache.org

Re: Setting the heap size

2010-10-28 Thread Patrick Hunt
On Thu, Oct 28, 2010 at 2:52 AM, Tim Robertson timrobertson...@gmail.com wrote: We are setting up a small Hadoop 13 node cluster running 1 HDFS master, 9 region servers for HBase and 3 map reduce nodes, and are just installing zookeeper to perform the HBase coordination and to manage a few

Re: Setting the heap size

2010-10-28 Thread Patrick Hunt
Tim, one other thing you might want to be aware of: http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_supervision Patrick On Thu, Oct 28, 2010 at 9:11 AM, Patrick Hunt ph...@apache.org wrote: On Thu, Oct 28, 2010 at 2:52 AM, Tim Robertson timrobertson...@gmail.com wrote

Re: Retrying sequential znode creation

2010-10-25 Thread Patrick Hunt
they would still have to code for this corner case. Patrick On Wed, Oct 20, 2010 at 10:42 AM, Patrick Hunt phu...@gmail.com wrote: Hi Ted, Mahadev is in the best position to comment (he looked at it last) but iirc when we started looking into implementing this we immediately ran into so big

Re: Reading znodes directly from snapshot and log files

2010-10-25 Thread Patrick Hunt
Sounds like a useful utility, the closest that I know of is this: http://hadoop.apache.org/zookeeper/docs/current/api/org/apache/zookeeper/server/LogFormatter.html but it just dumps the txn log. Seems like it would be cool to be able to open a shell on the datadir and query it (separate from

Re: Stale value for read request

2010-10-25 Thread Patrick Hunt
On Sat, Oct 23, 2010 at 9:03 PM, jingguo yao yaojing...@gmail.com wrote: Read requests are handled locally at each Zookeeper server. So it is possible for a read request to return a stale value even though a more recent update to the same znode has been committed. Does this statement still

Re: Unusual exception

2010-10-20 Thread Patrick Hunt
EOS means that the client closed the connection (from the point of view of the server). The server then tries to clean up by closing the socket explicitly, in some cases that results in the debug messages you see subsequently. EndOfStreamException: Unable to read additional data from client sessionid

Re: zxid integer overflow

2010-10-20 Thread Patrick Hunt
I'm not aware of sustained 1k/sec, Ben might know how long the 20k/sec test runs for (and for how long that rate is sustained). You'd definitely want to tune the GC, GC related pauses would be the biggest obstacle for this (assuming you are using a dedicated log device for the transaction logs).
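
For a rough sense of scale on the overflow question, here is a small sketch. It assumes the usual zxid layout (a 64-bit value with the leader epoch in the high 32 bits and a per-epoch counter in the low 32 bits) and asks how long a sustained write rate would take to exhaust the counter half. The helper names are illustrative only, and the rates are the ones mentioned in the message.

```python
COUNTER_BITS = 32  # assumption: low 32 bits of the zxid are the per-epoch counter


def split_zxid(zxid):
    """Return (epoch, counter) for a 64-bit zxid under the assumed layout."""
    return zxid >> COUNTER_BITS, zxid & ((1 << COUNTER_BITS) - 1)


def days_until_counter_wraps(writes_per_second):
    """Days of sustained writes needed to use up the 32-bit counter."""
    seconds = (1 << COUNTER_BITS) / writes_per_second
    return seconds / 86_400


for rate in (1_000, 20_000):
    print(rate, "writes/sec ->", round(days_until_counter_wraps(rate), 1), "days")
```

Under these assumptions the counter only matters for a leader that stays in place for the whole period, since a new epoch starts the counter again.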

Re: Testing zookeeper outside the source distribution?

2010-10-18 Thread Patrick Hunt
You might check out a tool I built a while back to be used by operations teams deploying ZooKeeper: http://bit.ly/a6tGVJ It's really two tools actually, a smoketester and a latency tester, both of which are important to verify when deploying a new cluster. Patrick On Mon, Oct 18, 2010 at 9:50

Re: What does this mean?

2010-10-13 Thread Patrick Hunt
On Mon, Oct 11, 2010 at 4:16 PM, Avinash Lakshman avinash.laksh...@gmail.com wrote: tickTime = 2000, initLimit = 3000 and the data is around 11GB this is log + snapshot. So if I need to add a new observer can I transfer state from the ensemble manually before starting it? If so which files do
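
Since initLimit is expressed in ticks rather than milliseconds, the two settings quoted above multiply together. A minimal sketch of what those values work out to (the function name is illustrative only):

```python
def init_timeout_ms(tick_time_ms, init_limit_ticks):
    """Time allowed for initial sync with the leader: initLimit ticks of tickTime ms each."""
    return tick_time_ms * init_limit_ticks


timeout_ms = init_timeout_ms(tick_time_ms=2000, init_limit_ticks=3000)
print(timeout_ms, "ms =", timeout_ms / 60_000, "minutes")
```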

Re: Retrying sequential znode creation

2010-10-13 Thread Patrick Hunt
On Wed, Oct 13, 2010 at 5:58 AM, Vishal K vishalm...@gmail.com wrote: However, gets trickier because there is no explicit way (to my knowledge) to get CreateMode for a znode. As a result, we cannot tell whether a node is sequential or not. Sequentials are really just regular znodes with
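
To illustrate the point that a sequential znode is an ordinary znode with a counter appended to its name, the sketch below splits such a name into prefix and sequence number. It assumes the counter is the zero-padded 10-digit suffix the server appends; the function name and the example names are hypothetical. A match is only a naming hint: a regular znode can be created with the same kind of name, so this does not recover the CreateMode.

```python
import re

# Assumption: the sequence counter is a zero-padded 10-digit suffix.
_SEQUENCE_SUFFIX = re.compile(r"^(.*?)(\d{10})$")


def split_sequence(name):
    """Return (prefix, sequence_number), or None if the name has no 10-digit suffix."""
    match = _SEQUENCE_SUFFIX.match(name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


children = ["lock-0000000012", "lock-0000000003", "lock-0000000007"]
ordered = sorted(children, key=lambda n: split_sequence(n)[1])
print(ordered)
```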

Re: Changing configuration

2010-10-07 Thread Patrick Hunt
You probably want to do a rolling restart; this is preferable to restarting the whole cluster, as the service will not go down. http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A6 Patrick On Wed, Oct 6, 2010 at 9:49 PM, Avinash Lakshman avinash.laksh...@gmail.com

Re: snapshots

2010-10-07 Thread Patrick Hunt
Simplified: when a server comes back up it checks its local snaps/logs to reconstruct as much of the current state as possible. It then checks with the leader to see how far behind it is, at which point it either gets a diff or gets a full snapshot (from the leader) depending on how far behind it

Re: znode inconsistencies across ZooKeeper servers

2010-10-07 Thread Patrick Hunt
= 0 ephemeralOwner = 0x2b7ce57ce4 dataLength = 54 numChildren = 0 Thanks for your help. -Vishal On Wed, Oct 6, 2010 at 4:45 PM, Patrick Hunt ph...@apache.org wrote: Vishal the attachment seems to be getting removed by the list daemon (I don't have it), can you create a JIRA

Re: Too many connections

2010-10-06 Thread Patrick Hunt
On Tue, Oct 5, 2010 at 10:23 AM, Avinash Lakshman avinash.laksh...@gmail.com wrote: So shouldn't all servers in another DC just have one session? So even if I have 50 observers in another DC that should be 50 sessions established since the IP doesn't change correct? Am I missing something?

Re: znode inconsistencies across ZooKeeper servers

2010-10-06 Thread Patrick Hunt
Vishal the attachment seems to be getting removed by the list daemon (I don't have it), can you create a JIRA and attach? Also this is a good question for the ppl on zookeeper-user. (ccing) You are aware that ephemeral znodes are tied to the session? And that sessions only expire after the

Re: Zookeeper on 60+Gb mem

2010-10-05 Thread Patrick Hunt
Tuning GC is going to be critical, otw all the sessions will timeout (and potentially expire) during GC pauses. Patrick On Tue, Oct 5, 2010 at 1:18 PM, Maarten Koopmans maar...@vrijheid.net wrote: Yes, and syncing after a crash will be interesting as well. Off note; I am running it with a 6GB

Re: ZK compatibility

2010-09-30 Thread Patrick Hunt
wrote: What about major releases going forward? Thanks, Jun On Mon, Sep 27, 2010 at 10:32 PM, Patrick Hunt ph...@apache.org wrote: In general yes, minor and bug fix releases are fully backward compatible. Patrick On Sun, Sep 26, 2010 at 9:11 PM, Jun Rao jun...@gmail.com wrote

Re: c client 0 state?

2010-09-28 Thread Patrick Hunt
Seems like a bug to me. Please enter a JIRA (if you haven't already). Thanks, Patrick On Fri, Sep 17, 2010 at 9:10 AM, Michael Xu mx2...@gmail.com wrote: Hi everyone in the c client api: Is it normal for zoo_state() to return zero (not one of the valid state consts) when it is handling

Re: zkfuse

2010-09-27 Thread Patrick Hunt
Sounds like you have an old version of autoconf, try upgrading, see similar issue here: http://www.mail-archive.com/thrift-u...@incubator.apache.org/msg00673.html Patrick 2010/9/24 俊贤 junx...@taobao.com Hi mahadev, My

Re: processResults

2010-09-27 Thread Patrick Hunt
I believe what the author is trying to say is that if the getdata were to fail (such as the example you give) the watch set as part of the original call will fire, and this will notify the client that the node was deleted. (call to process(event)) Patrick On Mon, Sep 27, 2010 at 6:56 PM, Milind

Re: possible bug in zookeeper ?

2010-09-14 Thread Patrick Hunt
That is unusual. I don't recall anyone reporting a similar issue, and looking at the code I don't see any issues off hand. Can you try the following? 1) on that particular zk client machine resolve the hosts zook1/zook2/zook3, what ip addresses does this resolve to? (try dig) 2) try running the

Re: Spew after call to close

2010-09-08 Thread Patrick Hunt
No worries, let us know if something else pops up. Patrick On Tue, Sep 7, 2010 at 3:10 PM, Stack st...@duboce.net wrote: Nevermind. I figured it. It was an hbase issue. We were leaking a client reference. Sorry for the noise, St.Ack On Sat, Sep 4, 2010 at 10:58 AM, Stack

Re: getting created child on NodeChildrenChanged event

2010-09-07 Thread Patrick Hunt
It is good to keep things simple, but we have seen some requests related to the client api for children use cases that seem reasonable. In particular the issue of handling large numbers of children efficiently is currently a problem (queue say). We've seen proposals on this before, just no one's

Re: election recipe

2010-09-07 Thread Patrick Hunt
Hi Andrei, the answer may not be as simple as that. In the case of passive leader you might want to just wait till you're reconnected before taking any action. Connection loss indicates that you aren't currently connected to a server, it doesn't mean that you've lost leadership (if you get expired

Re: closing session on socket close vs waiting for timeout

2010-09-07 Thread Patrick Hunt
On 09/01/2010 12:47 PM, Patrick Hunt wrote: Ben, in this case the session would be tied directly to the connection, we'd explicitly deny session re-establishment for this session type (so 4 would fail). Would that address your concern, others? Patrick On 09/01/2010 10:03 AM, Benjamin Reed

Re: closing session on socket close vs waiting for timeout

2010-09-01 Thread Patrick Hunt
Ben, in this case the session would be tied directly to the connection, we'd explicitly deny session re-establishment for this session type (so 4 would fail). Would that address your concern, others? Patrick On 09/01/2010 10:03 AM, Benjamin Reed wrote: i'm a bit skeptical that this is going

Re: Logs and in memory operations

2010-08-31 Thread Patrick Hunt
On Mon, Aug 30, 2010 at 1:11 PM, Avinash Lakshman avinash.laksh...@gmail.com wrote: From my understanding when a znode is updated/created a write happens into the local transaction logs and then some in-memory data structure is updated to serve the future reads. Where in the source code can

Re: Zookeeper shell

2010-08-31 Thread Patrick Hunt
Depending on your classpath setup: java org.apache.zookeeper.ZooKeeperMain -server 127.0.0.1:2181 if jline jar is in your classpath (included in the zk release distribution) you'll get history, auto-complete and such. Patrick On 08/31/2010 03:08 PM, Michi Mutsuzaki wrote: Hello, I'm

Re: IllegalArgumentException exception : Path cannot be null

2010-08-30 Thread Patrick Hunt
The client (solr in this case) is passing a null path to the ZooKeeper.getChildren(path, ... ) call. java.lang.IllegalArgumentException: Path cannot be null at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45) at

Re: Receiving create events for self with synchronous create

2010-08-30 Thread Patrick Hunt
On line 64 are you ensuring that the ZooKeeper session is active before executing that sequence? zookeeper = new ZooKeeper(...) is async - it returns before you're actually connected to the server (you get notified of this in your watcher). If you execute this sequence quickly enough your
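
The usual way to handle an asynchronous constructor like this is to block on a latch that the watcher releases once the connected event arrives. The sketch below shows that pattern only. `FakeAsyncClient` is a hypothetical stand-in that simulates a delayed connection; it is not the ZooKeeper client API, and the "SyncConnected" string simply mirrors the name of the connected state.

```python
import threading


class FakeAsyncClient:
    """Hypothetical stand-in: the constructor returns before the session is connected."""

    def __init__(self, watcher, connect_delay=0.05):
        # Deliver the "connected" notification later, from another thread.
        self._timer = threading.Timer(connect_delay, watcher, args=("SyncConnected",))
        self._timer.daemon = True
        self._timer.start()


def connect_and_wait(timeout=5.0):
    """Create the client, then block until the watcher reports the session is connected."""
    connected = threading.Event()

    def watcher(state):
        if state == "SyncConnected":
            connected.set()

    client = FakeAsyncClient(watcher)
    if not connected.wait(timeout):
        raise TimeoutError("session was not established in time")
    return client


client = connect_and_wait()
print("connected, safe to issue requests")
```

With the real Java client the same idea is typically expressed with a CountDownLatch released from the Watcher's process method.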

Re: Exception causing close of session

2010-08-30 Thread Patrick Hunt
it? On Thu, Aug 26, 2010 at 5:05 PM, Patrick Hunt ph...@apache.org wrote: "Client has seen zxid 0xfa4 our last zxid is 0x42": someone reset the zk server database without restarting the clients. As a result the client is forward in time relative to the cluster. Patrick On 08/26/2010 04

Re: Exception causing close of session

2010-08-26 Thread Patrick Hunt
"Client has seen zxid 0xfa4 our last zxid is 0x42": someone reset the zk server database without restarting the clients. As a result the client is forward in time relative to the cluster. Patrick On 08/26/2010 04:03 PM, Ted Yu wrote: Hi, zookeeper-3.2.2 is used out of HBase 0.20.5 Linux

Re: Zookeeper stops

2010-08-19 Thread Patrick Hunt
+1 on that Ted. I frequently see this issue crop up as "I just rebooted my server and lost all my data ..." -- many OSes will clean up tmp on reboot. :-) Patrick On 08/19/2010 07:43 AM, Ted Dunning wrote: Also, /tmp is not a great place to keep things that are intended for persistence. On Thu,

Re: Zookeeper stops

2010-08-19 Thread Patrick Hunt
No. You configure it in the server configuration file. Patrick On 08/19/2010 01:19 PM, Wim Jongman wrote: Hi, But zk does default to /tmp? Regards, Wim On Thursday, August 19, 2010, Patrick Hunt ph...@apache.org wrote: +1 on that Ted. I frequently see this issue crop up as I just

Re: ZK monitoring

2010-08-19 Thread Patrick Hunt
Maybe we should have a contrib pkg for utilities such as this? I could see a python script that, given 1 server (might require addl 4letter words but this would be useful regardless), could collect such information from the cluster. Create a JIRA? Patrick On 08/17/2010 12:14 PM, Andrei Savu

Re: A question about Watcher

2010-08-17 Thread Patrick Hunt
All servers keep a copy - so you can shutdown the zk service entirely (all servers) and restart it and the sessions are maintained. Patrick On 08/16/2010 06:34 PM, Qian Ye wrote: Thx Mahadev and Benjamin, it seems that I've got some misunderstanding about the client. I will check it out.

Re: How to handle Node does not exist error?

2010-08-16 Thread Patrick Hunt
Try using the logs, stat command or JMX to verify that each ZK server is indeed a leader/follower as expected. You should have one leader and n-1 followers. Verify that you don't have any standalone servers (this is the most frequent error I see - misconfiguration of a server such that it
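
One way to run the stat check across servers is to send the four-letter word over the client port and look at the reported mode. The sketch below is an illustration under assumptions: the helper names are made up here, and it assumes the stat reply contains a line of the form "Mode: leader", "Mode: follower" or "Mode: standalone". The host names and port in the usage comment are placeholders.

```python
import socket


def four_letter_word(host, port, word, timeout=5.0):
    """Send a four-letter command (e.g. 'stat') and return the server's full reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def server_mode(stat_output):
    """Return the value of the 'Mode:' line from stat output, or None if absent."""
    for line in stat_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


# Example usage (placeholder hosts, not run here):
# for host in ("zook1", "zook2", "zook3"):
#     print(host, server_mode(four_letter_word(host, 2181, "stat")))
```

In a healthy ensemble this should report exactly one leader, with the rest followers; any server reporting standalone points to the misconfiguration described above.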

Re: client failure detection in ZK

2010-08-16 Thread Patrick Hunt
The session timeout is used for this: http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions Patrick On 08/16/2010 01:47 PM, Jun Rao wrote: Hi, What config parameters in ZK determine how soon a failed client is detected? Thanks, Jun

Re: Backing up zk data files

2010-08-12 Thread Patrick Hunt
On 08/11/2010 06:49 PM, Adam Rosien wrote: http://hadoop.apache.org/zookeeper/docs/r3.3.1/zookeeperAdmin.html#sc_dataFileManagement says that one can copy the contents of the data directory and use it on another machine. The example states the other instance is not in the server list; what

Re: zookeeper seems to hang

2010-08-12 Thread Patrick Hunt
Great bug report Ted, the stack trace in particular is very useful. It looks like a timing bug where the client is not shutting down cleanly on the close call. I reviewed the code in question but nothing pops out at me. Also the logs just show us shutting down, nothing else from zk in there.

Re: Clarification on async calls in a cluster

2010-08-11 Thread Patrick Hunt
On 08/11/2010 03:25 PM, Jordan Zimmerman wrote: If I use an async version of a call in a cluster (ensemble) what happens if the server I'm connected to goes down? Does ZK transparently resubmit the call to the next server in the cluster and call my async callback or is there something I need to

Re: Sequence Number Generation With Zookeeper

2010-08-10 Thread Patrick Hunt
Great! Basic details are here (create a jira, attach a patch, click submit and someone will review and help you get it into a state which we can commit). Probably you'd put your code into src/recipes or src/contrib (recipes sounds reasonable).

Re: Too many KeeperErrorCode = Session moved messages

2010-08-08 Thread Patrick Hunt
I suspect this is a bug with the sync call and session moved (the code path for sync is a bit special). Please enter a JIRA for this. Thanks. Patrick On 08/05/2010 01:20 PM, Vishal K wrote: Hi All, I am seeing a lot of these messages in our application. I would like to know if I am doing

Re: Using watcher for being notified of children addition/removal

2010-08-02 Thread Patrick Hunt
You may want to consider adding a distributed queue to your use of ZK. As was mentioned previously, watches don't notify you of every change, just that a change was made. For example multiple changes may be visible when you get the notification. A distributed queue would allow you to log
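
Because a child watch only says that the list changed, the client has to re-read the children and work out what happened itself, and several changes may show up at once. A minimal sketch of that comparison, with illustrative names and sample data:

```python
def diff_children(before, after):
    """Return (added, removed) child names between two getChildren results."""
    before_set, after_set = set(before), set(after)
    return sorted(after_set - before_set), sorted(before_set - after_set)


# One notification, but three changes are visible by the time the list is re-read.
previous = ["worker-1", "worker-2", "worker-3"]
current = ["worker-2", "worker-4", "worker-5"]
added, removed = diff_children(previous, current)
print("added:", added, "removed:", removed)
```

Note that this only sees the net effect: a child that was created and deleted between the two reads never appears, which is why a queue is suggested when every change has to be observed.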

Re: JMX error while starting ZooKeeper

2010-07-19 Thread Patrick Hunt
On 07/19/2010 05:04 PM, Rakesh Aggarwal wrote: javax.management.MBeanServer; was not found Sounds like you are missing rt.jar for some reason (contains that class). Try running java -verbose -version and see what jars are being picked up, I see a number of lines containing: ...

Re: Errors with Python bindings

2010-07-16 Thread Patrick Hunt
Hi Rich, the version string looks useful to have, thanks! Would you mind submitting this via jira? Do a svn diff (looks like you did already), create a jira and attach the diff, then click submit link on the jira. We'll review and work on getting it into a future release.

Re: total # of zknodes

2010-07-15 Thread Patrick Hunt
I've done some tests with ~600 clients creating 5 million znodes (size 100 bytes iirc) and 25 million watches. I was using 8 GB of memory for this, however --- in this scenario it's critical that you tune the GC, in particular you need to turn on CMS and incremental GC options. Otw when the GC

Re: Suggested way to simulate client session expiration in unit tests?

2010-07-06 Thread Patrick Hunt
If you want to simulate expiration use the example I sent. http://github.com/phunt/zkexamples Another option is to use a mock. Patrick On 07/06/2010 05:42 PM, Jeremy Davis wrote: Thanks! That seems to work, but it is approximately the same as zooKeeper.close() in that there is no

Re: Zookeeper outage recap questions

2010-07-01 Thread Patrick Hunt
Hi Travis, as Flavio suggested would be great to get the logs. A few questions: 1) how did you eventually recover, restart the zk servers? 2) was the cluster losing quorum during this time? leader re-election? 3) Any chance this could have been initially triggered by a long GC pause on one

Re: Guaranteed message delivery until session timeout?

2010-06-30 Thread Patrick Hunt
On 06/30/2010 09:37 AM, Ted Dunning wrote: Which API are you talking about? C? I think that the difference between connection loss and session expiration might mess you up slightly in your disjunction here. On Wed, Jun 30, 2010 at 7:45 AM, Bryan Thompson br...@systap.com wrote: I am

Re: Receive timed out error while starting zookeeper server

2010-06-27 Thread Patrick Hunt
On 06/26/2010 06:53 AM, Peeyush Kumar wrote: I have a 6 node cluster (5 slaves and 1 master). I am trying to You typically want an odd number given that zk works by majority (even is fine, but not optimal). So 5 would be great (7 is a bit of overkill). 3 is fine too, but 5 allows
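
The preference for odd ensemble sizes comes straight from the majority rule: an even-sized ensemble needs a larger quorum without tolerating any more failures. A short sketch of the arithmetic (function names are illustrative):

```python
def quorum_size(servers):
    """Smallest strict majority of the ensemble."""
    return servers // 2 + 1


def tolerated_failures(servers):
    """How many servers can be down while a majority is still available."""
    return servers - quorum_size(servers)


for n in (3, 4, 5, 6, 7):
    print(n, "servers: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```

Six servers tolerate the same two failures as five, which is why the even size is workable but not optimal.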

Re: Re: Starting zookeeper in replicated mode

2010-06-22 Thread Patrick Hunt
There are 3 ports that need to be opened: 1) the client port (btw client and servers); 2/3) the quorum and election ports - only btw servers. You are setting these three ports in your config file (clientport defaults to 2181 iirc, unless you override). Patrick On 06/22/2010 06:17 AM, Erik Test
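
To show where the three ports live in the server configuration, the sketch below renders a small config in which `clientPort` carries the client port and each `server.N=host:quorumPort:electionPort` line carries the two server-to-server ports. The host names, data directory, and the tickTime/initLimit/syncLimit values are placeholders chosen for illustration, and 2888/3888 are only the conventional example ports, not values from this thread.

```python
def render_zoo_cfg(hosts, client_port=2181, quorum_port=2888,
                   election_port=3888, data_dir="/var/lib/zookeeper"):
    """Render a minimal replicated-mode config; all values here are placeholders."""
    lines = [
        "tickTime=2000",
        "initLimit=10",
        "syncLimit=5",
        f"dataDir={data_dir}",
        f"clientPort={client_port}",  # 1) client <-> server traffic
    ]
    for server_id, host in enumerate(hosts, start=1):
        # 2) quorum port and 3) election port: server <-> server traffic only
        lines.append(f"server.{server_id}={host}:{quorum_port}:{election_port}")
    return "\n".join(lines) + "\n"


cfg = render_zoo_cfg(["zook1", "zook2", "zook3"])
print(cfg)
```

A firewall between the servers therefore has to allow the quorum and election ports in both directions, while clients only need to reach the client port.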

Re: Free Software Solution to continuously load a large number of feeds with several servers?

2010-06-18 Thread Patrick Hunt
I've seen a number of these built as proprietary solutions using ZooKeeper. It would be great to see something open sourced. HBase/ZK seems like a good fit. You might also consider ZooKeeper/BookKeeper. Patrick On 06/18/2010 11:01 AM, Thomas Koch wrote:

Re: zookeeper crash

2010-06-16 Thread Patrick Hunt
it. On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote: Hi Charity, unfortunately this is a known issue not specific to 3.3 that we are working to address. See this thread for some background: http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html I've raised the JIRA

Re: Debugging help for SessionExpiredException

2010-06-15 Thread Patrick Hunt
I'm not very experienced personally with running zk on ec2 smalls, Ted usually has the ec2 related insight. Given these boxes are not loaded or lightly loaded, and you've ruled out gc/swap, the only thing I can think of is that something is going on under the covers at the vm level that's

Re: zookeeper watch triggered multiple times on same event

2010-06-15 Thread Patrick Hunt
I don't think this should be possible (if it happens it's a bug in zk). Perhaps, for some reason, there really are 2 change actions (children created, or the same child created twice) and not just one? Re-registering the watch inside the watch is fine. The server sends watch notifications as

Re: Debugging help for SessionExpiredException

2010-06-11 Thread Patrick Hunt
Session expiration is due to the server not hearing heartbeats from the client. So either the client is partitioned from the server, or the client is not sending heartbeats for some reason, typically this is due to the client JVM gc'ing or swapping. Patrick On 06/10/2010 04:14 PM, Ted

Re: Debugging help for SessionExpiredException

2010-06-09 Thread Patrick Hunt
100mb partition? sounds like virtualization. resource starvation (worse in virtualized env) is a common cause of this. Are your clients gcing/swapping at all? If a client gc's for long periods of time the heartbeat thread won't be able to run and the server will expire the session. There is a

Re: Debugging help for SessionExpiredException

2010-06-09 Thread Patrick Hunt
On 06/09/2010 03:35 PM, Lei Zhang wrote: We've consistently run into issues with vmware workstation (CentOS as guest OS) on Windows host: just by leaving the cluster idle over night leads to zk session expire issue. My theory is: windows may have gone to hibernation, the zk heartbeat logic

Re: Simulating failures?

2010-06-04 Thread Patrick Hunt
Here's how to test session expiration (haven't tried this in a while): http://github.com/phunt/zkexamples It would be great to have some test infrastructure/examples/docs/strategies available for developers (zk client users). If someone would be interested to work on/contribute this we'd be

Re: Locking and Partial Failure

2010-05-31 Thread Patrick Hunt
Hi Charles, any luck with this? Re the issues you found with the recipes please enter a JIRA, it would be good to address the problem(s) you found. https://issues.apache.org/jira/browse/ZOOKEEPER re use of session/thread id, might you use some sort of unique token that's dynamically assigned

Re: Securing ZooKeeper connections

2010-05-27 Thread Patrick Hunt
On 05/27/2010 09:47 AM, Benjamin Reed wrote: actually pat hunt took over that issue: ZOOKEEPER-733. pat has made a lot of progress and the patch looks close to being ready. This is just the server side though, still need to make similar changes on the client. That will likely be a separate

Re: Securing ZooKeeper connections

2010-05-27 Thread Patrick Hunt
Short of someone else stepping up I have it on my todo list. ;-) Still quite a bit of work to do on 733 though getting it back into shape. (not to mention layering the ssl on top). Then there's also the server-server connectivity that also needs to have netty support added (quorum/election

Re: Question about concurrent primitives library

2010-05-26 Thread Patrick Hunt
Hi, this was originally proposed as a google summer of code project, the slots for gsoc have already been given out, this was not one of the projects chosen by apache. So you could still work on this if you like, but not under the gsoc umbrella. We (zk contributor community) would be happy to

Re: Ping and client session timeouts

2010-05-21 Thread Patrick Hunt
Hi Stephen, my comments inline below: On 05/21/2010 09:31 AM, Stephen Green wrote: I feel like I'm missing something fairly fundamental here. I'm building a clustered application that uses ZooKeeper (3.3.1) to store its configuration information. There are 33 nodes in the cluster (Amazon EC2

Re: Ping and client session timeouts

2010-05-21 Thread Patrick Hunt
On 05/21/2010 11:32 AM, Stephen Green wrote: Right. The system can be very memory-intensive, but at the time these are occurring, it's not under a really heavy load, and there's plenty of heap available. However, while looking at a thread dump from one of the nodes, I realized that a very poor

Re: Concurrent reads and writes on BookKeeper

2010-05-20 Thread Patrick Hunt
those JIRAs. Thanks! Patrick -Flavio On May 20, 2010, at 1:36 AM, Patrick Hunt wrote: On 05/19/2010 01:23 PM, Flavio Junqueira wrote: Hi Andre, To guarantee that two clients that read from a ledger will read the same sequence of entries, we need to make sure that there is agreement on the end

[ANNOUNCE] Apache ZooKeeper 3.3.1

2010-05-17 Thread Patrick Hunt
The Apache ZooKeeper team is proud to announce Apache ZooKeeper version 3.3.1 ZooKeeper is a high-performance coordination service for distributed applications. It exposes common services - such as naming, configuration management, synchronization, and group services - in a simple interface

Re: Using ZooKeeper for managing solrCloud

2010-05-14 Thread Patrick Hunt
Mahadev pointed out the ZK monitoring details, but on the solr side of the house I don't think we can provide much insight as solr is acting as a client of the zk service. Your best bet would be to ask on the solr user list. Regards, Patrick On 05/14/2010 04:09 AM, Rakhi Khatwani wrote:

Re: Xid out of order. Got 8 expected 7

2010-05-12 Thread Patrick Hunt
Hi Jordan, you've seen this once or frequently? (having the server + client logs will help a lot) Patrick On 05/12/2010 11:08 AM, Jordan Zimmerman wrote: Sure - if you think it's a bug. We were using Zookeeper without issue. I then refactored a bunch of code and this new behavior started. I'm

Re: Xid out of order. Got 8 expected 7

2010-05-12 Thread Patrick Hunt
the server and now all works again. Sorry to trouble y'all. -Jordan On May 12, 2010, at 11:11 AM, Patrick Hunt wrote: Hi Jordan, you've seen this once or frequently? (having the server + client logs will help a lot) Patrick On 05/12/2010 11:08 AM, Jordan Zimmerman wrote: Sure - if you think

Re: Xid out of order. Got 8 expected 7

2010-05-12 Thread Patrick Hunt
that getChildren (xid 7) got lost. Patrick On 05/12/2010 11:30 AM, Jordan Zimmerman wrote: Oh, OK. When I get a moment I'll restart the 3.2.2 and post logs, etc. Yes, we're calling getChildren with the callback. -JZ On May 12, 2010, at 11:28 AM, Patrick Hunt wrote: I'm still interested though... Are you

Re: Xid out of order. Got 8 expected 7

2010-05-12 Thread Patrick Hunt
On May 12, 2010, at 11:41 AM, Benjamin Reed wrote: is this a bug? shouldn't we be returning an error. ben On 05/12/2010 11:34 AM, Patrick Hunt wrote: I think that explains it then - the server is probably dropping the new (3.3.0) getChildren message (xid 7) as it (3.2.2 server) doesn't know
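The error in this thread's subject can be illustrated with a simplified model of a client that expects replies in the order requests were sent. This is only a sketch under that assumption, not the actual ZooKeeper client code; `PendingQueue` and `XidOutOfOrder` are hypothetical names. If the server silently drops the request with xid 7, the next reply to arrive carries xid 8 and the check fails.

```python
from collections import deque


class XidOutOfOrder(Exception):
    """Raised when a reply does not match the oldest outstanding request."""


class PendingQueue:
    """Hypothetical model: requests are answered strictly in send order."""

    def __init__(self):
        self._next_xid = 1
        self._pending = deque()  # (xid, operation name), oldest first

    def send(self, op):
        # Assign the next xid and remember the request until it is answered.
        xid = self._next_xid
        self._next_xid += 1
        self._pending.append((xid, op))
        return xid

    def on_reply(self, xid):
        # A reply must match the oldest outstanding request.
        if not self._pending:
            raise XidOutOfOrder(f"Got {xid} but nothing is pending")
        expected, op = self._pending[0]
        if xid != expected:
            raise XidOutOfOrder(
                f"Xid out of order. Got {xid} expected {expected}"
            )
        self._pending.popleft()
        return op
```

In this model, sending eight requests, answering xids 1 through 6, and then delivering the reply for xid 8 raises the same message as in the subject line.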

Re: Xid out of order. Got 8 expected 7

2010-05-12 Thread Patrick Hunt
Hm, if you don't mind, enter that JIRA; I'd still like to verify by looking at the logs. Patrick On 05/12/2010 11:52 AM, Jordan Zimmerman wrote: So, I'm off the Jira hook then? -JZ On May 12, 2010, at 11:49 AM, Patrick Hunt wrote: You're right. Ben, would you mind entering a JIRA

Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-05-12 Thread Patrick Hunt
On 05/12/2010 08:30 PM, Aaron Crow wrote: I may have a better idea of what caused the trouble. I way, WAY underestimated the number of nodes we collect over time. Right now we're at 1.9 million. This isn't a bug of our application; it's actually a feature (but perhaps an ill-conceived one). A

Re: zookeeper-3.2.2:Cannot open channel to X at election address / Connection refused

2010-05-11 Thread Patrick Hunt
In the cases where we've seen this reported in the past, the user tracked the issue down to a firewall problem. I'm not sure what the issue is here, given you've verified that's not the problem. The log is clearly saying: Thread:quorumcnxmana...@336] - Cannot open channel to 2 at election

Re: zookeeper-3.2.2:Cannot open channel to X at election address / Connection refused

2010-05-11 Thread Patrick Hunt
Ok, great, good luck! Patrick On 05/10/2010 11:20 PM, chen peng wrote: My question has been decided. *I did not know bin/zkServer start should be executed on each machine!* *I took him to be very close in function with

Re: New ZooKeeper client library Cages

2010-05-11 Thread Patrick Hunt
Hi Dominic, this looks really interesting, thanks for open sourcing it. I really like the idea of providing higher level concepts. I only just looked at the code; it wasn't clear on first pass what happens if you multilock on 3 paths and the first 2 succeed but the third fails. How are the

Re: zookeeper-3.2.2:Cannot open channel to X at election address / Connection refused

2010-05-08 Thread Patrick Hunt
Often this is related to the port(s) being blocked by a firewall. Perhaps you could check this (2888/3888) in both directions? Telnet can help: https://help.maximumasp.com/KB/a445/connectivity-testing-with-ping-telnet-tracert-and-pathping-.aspx Patrick 2010/5/7 chen peng chenpeng0...@hotmail.com
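The telnet check suggested above can also be scripted. The following is a minimal sketch, assuming the servers use ports 2888 and 3888 as mentioned in the message (the real values come from each ensemble's own configuration); `can_connect` and `check_quorum_ports` are hypothetical helper names, not part of ZooKeeper.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_quorum_ports(hosts, ports=(2888, 3888)):
    """Map each (host, port) pair to whether it accepted a connection."""
    return {(h, p): can_connect(h, p) for h in hosts for p in ports}
```

Since the advice is to check both directions, the script would need to be run from each server against the others. A port only accepts connections while a server process is listening on it, so a refused connection can mean either a firewall block or a peer that is not running.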

Re: ZKClient

2010-05-05 Thread Patrick Hunt
Thanks Travis, I've slated this for 3.4.0, I think it would be useful to add more examples so feel free to add more if you have any ideas for useful ones. For future reference, we ask that contributions come in the form of a patch: http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute It's

Re: ZKClient

2010-05-05 Thread Patrick Hunt
While I agree DS is hard, I don't think we should lose the useful feedback given by Jonathan/Adam - that getting started with ZK is challenging and can be frustrating. We need to learn from this feedback and create some action items to address. One of the main things I've heard so far that we

Re: ZKClient

2010-05-04 Thread Patrick Hunt
Take a look at this thread for some background. http://www.mail-archive.com/zookeeper-user@hadoop.apache.org/msg00917.html There were some concerns at the time, not sure if they have been addressed since (It has been a while since that discussion). Patrick On 05/04/2010 01:48 PM, Jonathan

Re: avoiding deadlocks on client handle close w/ python/c api

2010-05-04 Thread Patrick Hunt
Thanks Kapil, Mahadev perhaps you could take a look at this as well? Patrick On 05/04/2010 06:36 AM, Kapil Thangavelu wrote: I've constructed a simple example just using the zkpython library with condition variables, that will deadlock. I've filed a new ticket for it,
