[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-15 Thread Dave Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921459#action_12921459
 ] 

Dave Wright commented on ZOOKEEPER-885:
---

I don't think the cause of this is much of a mystery, as we experienced similar 
problems when we had the zookeeper files on the same filesystem as an IO-heavy 
application that was doing buffered IO. Quite simply, when zookeeper does a 
sync on its own files, it causes the entire filesystem to sync, flushing any 
buffered data from the IO-heavy application and freezing the ZK server process 
for long enough for heartbeats to time out. 

When you say "moderate IO load" I'm curious what the bottleneck is - the dd 
command will copy data as fast as possible, so if you're only getting 4MB/sec 
the underlying device must be pretty slow, which would further explain why a 
sync() request would take a while to complete. 

The only fix we've seen is to put the ZK files on their own device, although 
you may be able to fix it with a different partition on the same device. 
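
For what it's worth, here's a minimal sketch of the zoo.cfg layout that setup 
implies; the /zk-disk mount point is just a placeholder for whatever dedicated 
device or partition you use:
{noformat}
# Keep snapshots and (especially) the transaction log on a device
# with no competing IO; dataLogDir is the fsync-heavy one.
dataDir=/zk-disk/data
dataLogDir=/zk-disk/datalog
{noformat}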

 Zookeeper drops connections under moderate IO load
 --

 Key: ZOOKEEPER-885
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.2.2, 3.3.1
 Environment: Debian (Lenny)
 1Gb RAM
 swap disabled
 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
 Attachments: benchmark.csv, tracezklogs.tar.gz, tracezklogs.tar.gz, 
 WatcherTest.java, zklogs.tar.gz


 A zookeeper server under minimal load, with a number of clients watching 
 exactly one node, will fail to maintain the connection when the machine is 
 subjected to moderate IO load.
 In a specific test example we had three zookeeper servers running on 
 dedicated machines with 45 clients connected, watching exactly one node. The 
 clients would disconnect after moderate load was added to each of the 
 zookeeper servers with the command:
 {noformat}
 dd if=/dev/urandom of=/dev/mapper/nimbula-test
 {noformat}
 The {{dd}} command transferred data at a rate of about 4Mb/s.
 The same thing happens with
 {noformat}
 dd if=/dev/zero of=/dev/mapper/nimbula-test
 {noformat}
 It seems strange that such a moderate load should cause instability in the 
 connection.
 Very few other processes were running; the machines were set up to test the 
 connection instability we have experienced. Clients performed no other read 
 or mutation operations.
 Although the documentation states that only minimal competing IO load should 
 be present on the zookeeper server, it seems reasonable that moderate IO 
 should not cause problems in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-15 Thread Dave Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921528#action_12921528
 ] 

Dave Wright commented on ZOOKEEPER-885:
---

Has it been verified that ZK is doing no disk activity at all during that time? 
What about log file writes? What about sessions being established/torn down 
(which would cause syncs)?



build.xml from 3.3.1 distribution has version=3.3.2-dev

2010-09-17 Thread Dave Wright
Ended up chasing our tail for a while today because the ant build.xml
in the 3.3.1 distribution on the website says the version is actually
3.3.2-dev. Any particular reason for that? The jar included in the
distribution is obviously labeled correctly, and the version number in
its manifest is 3.3.1, so it appears that the build.xml was modified
and bundled up for distribution after the binary was built. Not sure
if it's worth fixing in the distribution, but thought I'd at least
mention it.

-Dave Wright


[jira] Commented: (ZOOKEEPER-704) GSoC 2010: Read-Only Mode

2010-05-08 Thread Dave Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865465#action_12865465
 ] 

Dave Wright commented on ZOOKEEPER-704:
---

This is a great idea, but I'm afraid there is a somewhat fundamental problem 
with this concept. 
What you want is that if enough nodes go down that a quorum can't be formed at 
all, the remaining nodes go into read-only mode.

The problem is that if a partition occurs (say, a single server loses contact 
with the rest of the cluster), but a quorum still exists, we want clients who 
were connected to the partitioned server to re-connect to a server in the 
majority. The current design allows for this by denying connections to minority 
nodes, forcing clients to hunt for the majority. If we allow servers in the 
minority to keep/accept connections, then clients will end up in read-only mode 
when they could have simply reconnected to the majority.

It may be possible to accomplish the desired outcome with some client-side and 
connection protocol changes. Specifically, a flag on the connection request 
from the client that says "allow read-only connections" - if false, the server 
will close the connection, allowing the client to hunt for a server in the 
majority. Once a client has gone through all the servers in the list (and found 
out that none are in the majority) it could flip the flag to true and connect 
to any running server in read-only mode. There is still the question of how to 
get back out of read-only mode (e.g. should we keep hunting in the background 
for a majority, or just wait until the server we are connected to re-forms a 
quorum).
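
To make the idea concrete, here is a rough Java sketch of that two-pass hunting 
sequence. The Server/Connection types and the allowReadOnly flag are purely 
illustrative - this is not the existing client code, just the flow described 
above:
{noformat}
import java.util.List;

public class ReadOnlyHuntSketch {
    interface Connection { boolean isReadOnly(); }

    interface Server {
        // Returns a connection, or null if the server refuses the request
        // (e.g. it is in a minority and allowReadOnly is false).
        Connection connect(boolean allowReadOnly);
    }

    static Connection connect(List<Server> servers) {
        // Pass 1: only accept servers that are part of a quorum.
        for (Server s : servers) {
            Connection c = s.connect(false);
            if (c != null) {
                return c;
            }
        }
        // Pass 2: nobody had a quorum, so fall back to read-only connections.
        for (Server s : servers) {
            Connection c = s.connect(true);
            if (c != null) {
                return c;   // client is now in read-only mode
            }
        }
        return null;        // no server reachable at all
    }
}
{noformat}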

 GSoC 2010: Read-Only Mode
 -

 Key: ZOOKEEPER-704
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-704
 Project: Zookeeper
  Issue Type: Wish
Reporter: Henry Robinson

 Read-only mode
 Possible Mentor
 Henry Robinson (henry at apache dot org)
 Requirements
 Java and TCP/IP networking
 Description
 When a ZooKeeper server loses contact with over half of the other servers in 
 an ensemble ('loses a quorum'), it stops responding to client requests 
 because it cannot guarantee that writes will get processed correctly. For 
 some applications, it would be beneficial if a server still responded to read 
 requests when the quorum is lost, but caused an error condition when a write 
 request was attempted.
 This project would implement a 'read-only' mode for ZooKeeper servers (maybe 
 only for Observers) that allowed read requests to be served as long as the 
 client can contact a server.
 This is a great project for getting really hands-on with the internals of 
 ZooKeeper - you must be comfortable with Java and networking, otherwise 
 you'll have a hard time coming up to speed.



Re: Dynamic adding/removing ZK servers on client

2010-05-03 Thread Dave Wright
 Could you provide some insight into why you need this? Just so we have addl
 background, I'm interested to know the use case.

Sure, we're building a clustered application that will use zookeeper
as part of it. We need to manage ZK ourself. The cluster running the
app & ZK may change over time (nodes added or removed) and we need to
keep ZK itself in-sync with any changes. They won't be common, but we
can't shut the app down to make the changes, it needs to be
transparent.


 Are you expecting all of the servers to change each time, or just
 incremental changes (add/remove a single server, vs say move the entire
 cluster from 3 hosts a/b/c to x/y/z)

I'd expect a small number of changes at any time - a few nodes being
added, a few nodes being removed. Most of the nodes will stay the
same.


 Any chance you could use DNS for this? i.e. change the mapping for the
 hostname from a -> x IP? Since the server a will go down anyway, this would
 cause the client to reconnect to b/c (eventually, when the DNS TTL expires, the
 client would also potentially connect to x).
 https://issues.apache.org/jira/browse/ZOOKEEPER-328
 https://issues.apache.org/jira/browse/ZOOKEEPER-338


Well, there are a lot of issues with DNS (including security & caching)
so I'd prefer to avoid it. Also, the real issue is that the # of servers
is changing, not just their IPs.
Although we probably wouldn't use it, I do think it would be nice to
support a single hostname for the ZK cluster with one A record for
each member, and have the ZK client handle resolving that properly
each time it connects.
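
For example (just a sketch - the hostname is a placeholder), the client could
re-resolve the name to all of its A records on each reconnect:
{noformat}
import java.net.InetAddress;
import java.net.UnknownHostException;

class DnsEnsembleResolver {
    // Returns every A record currently published for the cluster name, so
    // adding/removing a record changes the candidate server set over time
    // (subject to the DNS TTL and caching issues mentioned above).
    static InetAddress[] resolveEnsemble() throws UnknownHostException {
        return InetAddress.getAllByName("zk.example.com");
    }
}
{noformat}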


 You might also look at this patch, we never committed it but it might be
 interesting to you:
 https://issues.apache.org/jira/browse/ZOOKEEPER-146

 The benefit is that you'd only have one place to make the change, esp given
 that clients might be down/unreachable when this change occurs. Clients
 would have to poll this service whenever they get disconnected from the
 ensemble. One drawback of this approach is that the HTTP service now becomes a
 potential SPOF. (although I guess you could always fall back to something,
 or potentially have a list of HTTP hosts to do the lookup, etc...).

Well, that just handles distribution of the list (which isn't really
our problem), it doesn't help with restarting the ZK client when the
list changes - it only pulls the list once, so you still have to
completely shutdown and restart the ZK client.


 It does sound interesting, however once we add something like this it's hard
 to change given that we try very hard to maintain b/w compatibility. If you
 did the testing and were able to verify I don't see why we couldn't add it -
 as it's optional in the sense that it would only be called in the use case
 you describe. I would feel more confident if we had more concrete detail on
 how we intend to do 107 (a basic functional/design doc that at least reviews
 all the issues), and how this would fit in. But I don't see that should
 necessarily be a blocker (although others might feel differently).

Have you ever considered adding features like this via a protected
interface (i.e. they are useful but aren't fully standardized, so if a
client wants to use them they can sub-class ZK and make them public)?

The ability to dynamically modify the server list on the client side
seems like it would be required no matter what approach were taken to
dynamic clusters.

-Dave Wright


Re: Dynamic adding/removing ZK servers on client

2010-05-03 Thread Dave Wright
 Well, that just handles distribution of the list (which isn't really
 our problem), it doesn't help with restarting the ZK client when the
 list changes - it only pulls the list once, so you still have to
 completely shutdown and restart the ZK client.


 Well the old server is being shut down, right? If the client were connected to
 that server this would force the client to reconnect to another server; what
 I was suggesting is that the client would ping the server "lookup service"
 as part of this (so lookup on every disconnect, say).

Perhaps we should clarify what you mean by "client" ("...would ping...").
If you mean the ZK client library, then that would make sense - rather
than use a static list of servers, each time it was disconnected it
would refresh its list and pick one.
I took it to mean the client application (using the ZK library). The
issue is that the client application has no way to tell the ZK client
lib to use a different list of servers, other than a complete teardown
of the ZK object & session, which I'm trying to avoid.


 Hasn't come up before, but yes I agree it's a useful feature.

Ok, thanks. We don't have a specific ETA to implement it; I just
wanted to explore the option a bit before we finalized some aspects of
our design. Should we do the work, I'll submit patches for the Java and
C clients.

-Dave


Re: Dynamic adding/removing ZK servers on client

2010-05-03 Thread Dave Wright

 Yes, that's what I meant - we could update the ZK client lib to do this. It
 would be invisible to the client application (your code) itself.

I don't think that's a bad idea, and the general approach in ZK-146 of
using an interface that gets called to retrieve the list of hosts
seems good (so that you aren't tied to a specific implementation of
host lists, be it HTTP or DNS). That said, I don't think the actual
implementation of ZK-146 is a good solution, since it only resolves
the host list once. An implementation that resolved it on each
disconnection would be better, but would require deeper changes to
ClientCnxn.
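
Something along these lines is what I have in mind - the interface name and
the way ClientCnxn would consume it are just assumptions for illustration,
not the ZK-146 code:
{noformat}
import java.net.InetSocketAddress;
import java.util.List;

public interface HostListProvider {
    // Called by the connection layer each time it needs to (re)connect,
    // so additions/removals are picked up without restarting the client.
    List<InetSocketAddress> currentServers();
}
{noformat}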

-Dave


[jira] Created: (ZOOKEEPER-762) Allow dynamic addition/removal of server nodes in the client API

2010-05-03 Thread Dave Wright (JIRA)
Allow dynamic addition/removal of server nodes in the client API


 Key: ZOOKEEPER-762
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-762
 Project: Zookeeper
  Issue Type: Sub-task
  Components: c client, java client
Reporter: Dave Wright
Priority: Minor


Currently the list of zookeeper servers needs to be provided to the client APIs 
at construction time, and cannot be changed without a complete shutdown/restart 
of the client API. However, there are scenarios that require the server list to 
be updated, such as removal or addition of a ZK cluster node, and it would be 
nice if the list could be updated via a simple API call.

The general approach (in the Java client) would be to add 
RemoveServer()/AddServer() functions to ZooKeeper that call down to 
ClientCnxn, where the servers are just maintained in a list. Of course, if
the server being removed is the one currently connected, we'd need to 
disconnect, but a simple call to disconnect() seems like it would resolve that 
and trigger the automatic re-connection logic.
An equivalent change could be made in the C code. 
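
A rough sketch of the Java side (the class here is a stand-in, not the actual 
ZooKeeper/ClientCnxn code; only the addServer/removeServer idea comes from this 
issue):
{noformat}
import java.net.InetSocketAddress;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class ServerListSketch {
    private final List<InetSocketAddress> servers =
        new CopyOnWriteArrayList<InetSocketAddress>();
    private volatile InetSocketAddress current;   // server we are connected to

    void addServer(InetSocketAddress addr) {
        servers.add(addr);
    }

    void removeServer(InetSocketAddress addr) {
        servers.remove(addr);
        if (addr.equals(current)) {
            disconnect();   // let the normal re-connection logic pick a new server
        }
    }

    private void disconnect() {
        // In the real client this would close the socket so that ClientCnxn
        // reconnects to one of the remaining servers in the list.
    }
}
{noformat}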

This change would also make dynamic cluster membership in ZOOKEEPER-107 easier 
to implement.



Re: Dynamic adding/removing ZK servers on client

2010-05-03 Thread Dave Wright
 Should this be a znode in the privileged namespace?


I think having a znode for the current cluster members is part of the
ZOOKEEPER-107 proposal, with the idea being that you could get/set the
membership just by writing to that node. On the client side, you could
watch that znode and update your server list when it changes. I think
it would be a great solution, but I was thinking the ability to
manually manage the server list would be useful in the interim, or if
ZK-107 takes a different path.
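
Roughly what I'd picture on the client side, as a sketch only - the znode path
and updateServerList() are assumptions, not anything that exists today:
{noformat}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class MembershipWatcher implements Watcher {
    private static final String MEMBERS_ZNODE = "/zookeeper/members"; // assumed path
    private final ZooKeeper zk;

    MembershipWatcher(ZooKeeper zk) { this.zk = zk; }

    void start() throws KeeperException, InterruptedException {
        refresh();
    }

    // Re-registers the watch and reads the current membership in one call.
    private void refresh() throws KeeperException, InterruptedException {
        byte[] data = zk.getData(MEMBERS_ZNODE, this, new Stat());
        updateServerList(new String(data));   // e.g. "host1:2181,host2:2181,..."
    }

    public void process(WatchedEvent event) {
        try {
            refresh();
        } catch (Exception e) {
            // Keep the previous list and try again on the next event.
        }
    }

    private void updateServerList(String commaSeparatedHosts) {
        // Placeholder for the dynamic server-list API proposed in ZOOKEEPER-762.
    }
}
{noformat}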

-Dave


Re: Dynamic adding/removing ZK servers on client

2010-05-03 Thread Dave Wright


 This is tricky: what happens if the server your client is connected to is
 decommissioned by a view change, and you are unable to locate another server
 to connect to because other view changes committed while you are
 reconnecting have removed all the servers you knew about. We'd need to make
 sure that watches on this znode were fired before a view change, but it's
 hard to know how to avoid having to wait for a session timeout before a
 client that might just be migrating servers reappears in order to make sure
 it sees the view change.

 Even then, the problem of 'locating' the cluster still exists in the case
 that there are no clients connected to tell anyone about it.

Yes, this doesn't completely solve two issues:
1. Bootstrapping the cluster itself & the clients
2. Major cluster reconfiguration (e.g. switching out every node before
clients can pick up the changes).

That said, I think it gets close and could still be useful.
For #1, you could simply require that the initial servers in the
cluster be manually configured; after that, servers could be added and
removed as needed. New servers would just need the address of one
other server to join and get the full server list. For clients,
you'd have a similar situation - you still need a way to pass an
initial server list (or at least one valid server) into the client, but
that could be via HTTP, DNS, or a manual list, and from then on the
clients themselves could stay in sync with changes.
For #2, you could simply document that there are limits to how fast
you want to change the cluster, and that if you make too many changes
too fast, clients or servers may not pick up the change fast enough
and need to be restarted. In reality I don't think this will be much
of an issue - as long as at least one server from the starting state
stays up until everyone else gets reconnected, everyone should
eventually find that node and get the full server list.

-Dave