[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921459#action_12921459 ]

Dave Wright commented on ZOOKEEPER-885:
---------------------------------------

I don't think the cause of this is much of a mystery, as we experienced similar problems when we had the zookeeper files on the same filesystem as an IO-heavy application that was doing buffered IO. Quite simply, when zookeeper does a sync on its own files, it causes the entire filesystem to sync, flushing any buffered data from the IO-heavy application and freezing the ZK server process for long enough for heartbeats to time out.

When you say "moderate IO load" I'm curious what the bottleneck is - the dd command will copy data as fast as possible, so if you're only getting 4MB/sec, the underlying device must be pretty slow, which would further explain why a sync() request would take a while to complete.

The only fix we've seen is to put the ZK files on their own device, although you may be able to fix it with a different partition on the same device. (A config sketch follows the issue summary below.)

Zookeeper drops connections under moderate IO load
--------------------------------------------------

Key: ZOOKEEPER-885
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
Project: Zookeeper
Issue Type: Bug
Components: server
Affects Versions: 3.2.2, 3.3.1
Environment: Debian (Lenny), 1Gb RAM, swap disabled, 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
Attachments: benchmark.csv, tracezklogs.tar.gz, tracezklogs.tar.gz, WatcherTest.java, zklogs.tar.gz

A zookeeper server under minimal load, with a number of clients watching exactly one node, will fail to maintain the connection when the machine is subjected to moderate IO load. In a specific test example we had three zookeeper servers running on dedicated machines with 45 clients connected, watching exactly one node. The clients would disconnect after moderate load was added to each of the zookeeper servers with the command:
{noformat}
dd if=/dev/urandom of=/dev/mapper/nimbula-test
{noformat}
The {{dd}} command transferred data at a rate of about 4Mb/s. The same thing happens with:
{noformat}
dd if=/dev/zero of=/dev/mapper/nimbula-test
{noformat}
It seems strange that such a moderate load should cause instability in the connection. Very few other processes were running; the machines were set up to test the connection instability we have experienced. Clients performed no other read or mutation operations.

Although the documentation states that minimal competing IO load should be present on the zookeeper server, it seems reasonable that moderate IO should not cause problems in this case.
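The workaround described in the comment boils down to a config change. A minimal sketch, using the standard dataDir/dataLogDir settings in zoo.cfg; the mount points are made up for illustration, not taken from the reporter's setup:
{noformat}
# zoo.cfg - illustrative paths only
# Snapshots can tolerate sharing a disk with other applications.
dataDir=/var/lib/zookeeper
# The transaction log is fsync'd on every write, so keep it on a
# dedicated device that the IO-heavy application never touches.
dataLogDir=/mnt/zk-txnlog
{noformat}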
[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921528#action_12921528 ]

Dave Wright commented on ZOOKEEPER-885:
---------------------------------------

Has it been verified that ZK is doing no disk activity at all during that time? What about log file writes? What about sessions being established/torn down (which would cause syncs)?
build.xml from 3.3.1 distribution has version=3.3.2-dev
Ended up chasing our tail for a while today because the ant build.xml in the 3.3.1 distribution on the website has its version set to 3.3.2-dev. Any particular reason for that? The jar included in the distribution is obviously labeled correctly, and the version number in its manifest is 3.3.1, so it appears that the build.xml was modified and bundled up for distribution after the binary was built. Not sure if it's worth fixing in the distribution, but I thought I'd at least mention it.

-Dave Wright
[jira] Commented: (ZOOKEEPER-704) GSoC 2010: Read-Only Mode
[ https://issues.apache.org/jira/browse/ZOOKEEPER-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865465#action_12865465 ]

Dave Wright commented on ZOOKEEPER-704:
---------------------------------------

This is a great idea, but I'm afraid there is a somewhat fundamental problem with this concept. What you want is: if enough nodes go down that a quorum can't be formed (at all), the remaining nodes go into read-only mode. The problem is that if a partition occurs (say, a single server loses contact with the rest of the cluster) but a quorum still exists, we want clients who were connected to the partitioned server to reconnect to a server in the majority. The current design allows for this by denying connections to minority nodes, forcing clients to hunt for the majority. If we allow servers in the minority to keep/accept connections, then clients will end up in read-only mode when they could have simply reconnected to the majority.

It may be possible to accomplish the desired outcome with some client-side and connection protocol changes. Specifically, a flag on the connection request from the client that says "allow read-only connections" - if false, the server will close the connection, allowing the client to hunt for a server in the majority. Once a client has gone through all the servers in the list (and found that none are in the majority), it could flip the flag to true and connect to any running server in read-only mode. There is still the question of how to get back out of read-only mode (e.g. should we keep hunting in the background for a majority, or just wait until the server we are connected to re-forms a quorum). A rough sketch of this hunting logic follows the issue summary below.

GSoC 2010: Read-Only Mode
-------------------------

Key: ZOOKEEPER-704
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-704
Project: Zookeeper
Issue Type: Wish
Reporter: Henry Robinson

Read-only mode

Possible Mentor: Henry Robinson (henry at apache dot org)

Requirements: Java and TCP/IP networking

Description: When a ZooKeeper server loses contact with over half of the other servers in an ensemble ('loses a quorum'), it stops responding to client requests because it cannot guarantee that writes will get processed correctly. For some applications, it would be beneficial if a server still responded to read requests when the quorum is lost, but caused an error condition when a write request was attempted.

This project would implement a 'read-only' mode for ZooKeeper servers (maybe only for Observers) that allowed read requests to be served as long as the client can contact a server.

This is a great project for getting really hands-on with the internals of ZooKeeper - you must be comfortable with Java and networking, otherwise you'll have a hard time coming up to speed.
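A minimal sketch of the two-pass hunting logic the comment describes, assuming a hypothetical tryConnect(host, allowReadOnly) that stands in for the real connection handshake; none of this is actual ZooKeeper client code:
{noformat}
import java.util.List;

public class ReadOnlyHunt {
    // Pass 1: refuse read-only service while hunting for the majority.
    // Pass 2: no majority found anywhere, so accept a read-only server.
    static String connect(List<String> servers) {
        for (boolean allowReadOnly : new boolean[] {false, true}) {
            for (String host : servers) {
                if (tryConnect(host, allowReadOnly)) {
                    return host;
                }
            }
        }
        return null; // nothing reachable at all
    }

    // Hypothetical stand-in: sends a connect request carrying the
    // allow-read-only flag; a minority server closes the connection
    // when the flag is false, forcing the client to keep hunting.
    static boolean tryConnect(String host, boolean allowReadOnly) {
        return false; // placeholder
    }
}
{noformat}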
Re: Dynamic adding/removing ZK servers on client
> Could you provide some insight into why you need this? Just so we have
> addl background, I'm interested to know the use case.

Sure, we're building a clustered application that will use zookeeper as part of it. We need to manage ZK ourselves. The cluster running the app and ZK may change over time (nodes added or removed) and we need to keep ZK itself in sync with any changes. They won't be common, but we can't shut the app down to make the changes; it needs to be transparent.

> Are you expecting all of the servers to change each time, or just
> incremental changes (add/remove a single server, vs say move the entire
> cluster from 3 hosts a/b/c to x/y/z)

I'd expect a small number of changes at any time - a few nodes being added, a few nodes being removed. Most of the nodes will stay the same.

> Any chance you could use DNS for this? ie change the mapping for the
> hostname from a -> x ip? Since the server a will go down anyway, this
> would cause the client to reconnect to b/c (eventually when dns ttl
> expires the client would also potentially connect to x).
> https://issues.apache.org/jira/browse/ZOOKEEPER-328
> https://issues.apache.org/jira/browse/ZOOKEEPER-338

Well, there are a lot of issues with DNS (including security and caching) so I'd prefer to avoid it. Also, the real issue is that the # of servers is changing, not just their IPs. Although we probably wouldn't use it, I do think it would be nice to support a single hostname for the ZK cluster with one A record for each member, and have the ZK client handle resolving that properly each time it connects (a DNS lookup sketch follows this message).

> You might also look at this patch, we never committed it but it might be
> interesting to you:
> https://issues.apache.org/jira/browse/ZOOKEEPER-146
> The benefit is that you'd only have one place to make the change, esp
> given that clients might be down/unreachable when this change occurs.
> Clients would have to poll this service whenever they get disconnected
> from the ensemble. One drawback of this approach is that the HTTP service
> now becomes a potential SPOF. (although I guess you could always fall
> back to something, or potentially have a list of HTTP hosts to do the
> lookup, etc...)

Well, that just handles distribution of the list (which isn't really our problem); it doesn't help with restarting the ZK client when the list changes - it only pulls the list once, so you still have to completely shut down and restart the ZK client.

> It does sound interesting, however once we add something like this it's
> hard to change given that we try very hard to maintain b/w compatibility.
> If you did the testing and were able to verify I don't see why we
> couldn't add it - as it's optional in the sense that it would only be
> called in the use case you describe. I would feel more confident if we
> had more concrete detail on how we intend to do 107 (a basic
> functional/design doc that at least reviews all the issues), and how this
> would fit in. But I don't see that should necessarily be a blocker
> (although others might feel differently).

Have you ever considered adding features like this via a protected interface (i.e. they are useful but aren't fully standardized, so if a client wants to use it they can subclass ZK and make them public)? The ability to dynamically modify the server list on the client side seems like it would be required no matter what approach were taken to dynamic clusters.

-Dave Wright
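The single-hostname idea from the thread (one A record per ensemble member, resolved fresh on each connect) can be sketched with the standard java.net API; the cluster name zk.example.com is made up:
{noformat}
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsEnsembleLookup {
    public static void main(String[] args) throws UnknownHostException {
        // Resolve every A record registered for the cluster name.
        InetAddress[] members = InetAddress.getAllByName("zk.example.com");
        StringBuilder connectString = new StringBuilder();
        for (InetAddress member : members) {
            if (connectString.length() > 0) {
                connectString.append(',');
            }
            connectString.append(member.getHostAddress()).append(":2181");
        }
        // "ip1:2181,ip2:2181,..." is the form the ZK client expects;
        // re-running this on each disconnect picks up membership changes
        // as DNS TTLs expire.
        System.out.println(connectString);
    }
}
{noformat}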
Re: Dynamic adding/removing ZK servers on client
>> Well, that just handles distribution of the list (which isn't really
>> our problem), it doesn't help with restarting the ZK client when the
>> list changes - it only pulls the list once, so you still have to
>> completely shutdown and restart the ZK client.
>
> Well the old server is being shutdown right? If the client were
> connected to that server this would force the client to reconnect to
> another server, what I was suggesting is that the client would ping the
> server lookup service as part of this. (so lookup on every disconnect say)

Perhaps we should clarify what you mean by client ("..would ping.."). If you mean the ZK client library, then that would make sense - rather than use a static list of servers, each time it was disconnected it would refresh its list and pick one. I took it to mean the client application (using the ZK library). The issue is that the client application has no way to tell the ZK client lib to use a different list of servers, other than a complete teardown of the ZK object/session, which I'm trying to avoid.

> Hasn't come up before, but yes I agree it's a useful feature.

Ok, thanks. We don't have a specific ETA to implement it, I just wanted to explore the option a bit before we finalized some aspects of our design. Should we do the work, I'll submit patches for the Java and C clients.

-Dave
Re: Dynamic adding/removing ZK servers on client
Yes, that's what I meant - we could update the ZK client lib to do this. It would be invisible to the client application (your code) itself. I don't think that's a bad idea, and the general approach in ZK-146 of using an interface that gets called to retrieve the list of hosts seems good (so that you aren't tied to a specific implementation of host lists, be it HTTP or DNS). That said, I don't think the actual implementation of ZK-146 is a good solution, since it only resolves the host list once. An implementation that resolved it on each disconnection would be better, but would require deeper changes to ClientCnxn.

-Dave
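A sketch of the interface-based shape suggested here, with made-up names (EnsembleProvider and friends are not real ZooKeeper classes): the connection layer asks a pluggable provider for the current server list on every disconnect, instead of resolving it once at construction time.
{noformat}
import java.net.InetSocketAddress;
import java.util.List;

interface EnsembleProvider {
    // Called by the connection layer each time it needs to (re)connect.
    List<InetSocketAddress> currentServers();
}

// Trivial implementation preserving today's static behavior; an HTTP- or
// DNS-backed version would re-fetch the list inside currentServers().
class StaticEnsembleProvider implements EnsembleProvider {
    private final List<InetSocketAddress> servers;

    StaticEnsembleProvider(List<InetSocketAddress> servers) {
        this.servers = servers;
    }

    @Override
    public List<InetSocketAddress> currentServers() {
        return servers;
    }
}
{noformat}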
[jira] Created: (ZOOKEEPER-762) Allow dynamic addition/removal of server nodes in the client API
Allow dynamic addition/removal of server nodes in the client API
-----------------------------------------------------------------

Key: ZOOKEEPER-762
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-762
Project: Zookeeper
Issue Type: Sub-task
Components: c client, java client
Reporter: Dave Wright
Priority: Minor

Currently the list of zookeeper servers needs to be provided to the client APIs at construction time and cannot be changed without a complete shutdown/restart of the client API. However, there are scenarios that require the server list to be updated, such as the removal or addition of a ZK cluster node, and it would be nice if the list could be updated via a simple API call. The general approach (in the Java client) would be to add RemoveServer()/AddServer() functions to ZooKeeper that call down to ClientCnxn, where the servers are just maintained in a list. Of course, if the server being removed is the one currently connected, we'd need to disconnect, but a simple call to disconnect() seems like it would resolve that and trigger the automatic re-connection logic. An equivalent change could be made in the C code. This change would also make dynamic cluster membership in ZOOKEEPER-107 easier to implement.
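A sketch of the API shape the ticket proposes (method names adjusted to Java conventions); none of this exists in the real client, and the list below merely stands in for the one ClientCnxn maintains:
{noformat}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class DynamicServerList {
    private final List<String> servers = new CopyOnWriteArrayList<String>();
    private volatile String connectedServer;

    public void addServer(String hostPort) {
        servers.add(hostPort);
    }

    public void removeServer(String hostPort) {
        servers.remove(hostPort);
        // If we are connected to the server being removed, drop the
        // connection so the normal re-connection logic picks a survivor.
        if (hostPort.equals(connectedServer)) {
            disconnect();
        }
    }

    private void disconnect() {
        connectedServer = null; // placeholder for the real teardown
    }
}
{noformat}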
Re: Dynamic adding/removing ZK servers on client
> Should this be a znode in the privileged namespace?

I think having a znode for the current cluster members is part of the ZOOKEEPER-107 proposal, with the idea being that you could get/set the membership just by writing to that node. On the client side, you could watch that znode and update your server list when it changes. I think it would be a great solution, but I was thinking the ability to manually manage the server list would be useful in the interim, or if ZK-107 takes a different path.

-Dave
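A sketch of the client side of the znode idea, using the real ZooKeeper watch API but a hypothetical membership path (/cluster/members - ZOOKEEPER-107 had not settled on one):
{noformat}
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class MembershipWatcher implements Watcher {
    private final ZooKeeper zk;

    public MembershipWatcher(ZooKeeper zk) throws KeeperException, InterruptedException {
        this.zk = zk;
        refresh();
    }

    private void refresh() throws KeeperException, InterruptedException {
        // Watches are one-shot, so re-register this watcher on every read.
        byte[] data = zk.getData("/cluster/members", this, null);
        String serverList = new String(data, StandardCharsets.UTF_8);
        // ...hand serverList to the proposed addServer()/removeServer() API...
        System.out.println("current members: " + serverList);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                refresh();
            } catch (KeeperException | InterruptedException e) {
                // Real code would retry or resynchronize here.
            }
        }
    }
}
{noformat}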
Re: Dynamic adding/removing ZK servers on client
> This is tricky: what happens if the server your client is connected to
> is decommissioned by a view change, and you are unable to locate another
> server to connect to because other view changes committed while you are
> reconnecting have removed all the servers you knew about. We'd need to
> make sure that watches on this znode were fired before a view change,
> but it's hard to know how to avoid having to wait for a session timeout
> before a client that might just be migrating servers reappears in order
> to make sure it sees the view change. Even then, the problem of
> 'locating' the cluster still exists in the case that there are no
> clients connected to tell anyone about it.

Yes, this doesn't completely solve two issues:
1. Bootstrapping the cluster itself and clients.
2. Major cluster reconfiguration (e.g. switching out every node before clients can pick up the changes).

That said, I think it gets close and could still be useful. For #1, you could simply require that the initial servers in the cluster be manually configured; then servers could be added and removed as needed. New servers would just need the address of one other server to join and get the full server list. For clients, you'd have a similar situation - you still need a way to pass an initial server list (or at least one valid server) in to the client, be it via HTTP, DNS, or a manual list, but then the clients themselves could stay in sync with changes.

For #2, you could simply document that there are limits to how fast you want to change the cluster, and that if you make too many changes too fast, clients or servers may not pick up the change fast enough and will need to be restarted. In reality I don't think this will be much of an issue - as long as at least one server from the starting state stays up until everyone else gets reconnected, everyone should eventually find that node and get the full server list.

-Dave