[jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader

2010-11-04 Thread Alexandre Hardy (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928137#action_12928137
 ] 

Alexandre Hardy commented on ZOOKEEPER-917:
---

The excerpts are extracted from {{hbase-0.20/hbase*.log}}, so the information 
should be readily available.
The tar file contents should be as follows:
{noformat}
drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.10/
drwxr-xr-x ah/users          0 2010-11-03 13:33 192.168.130.10/hbase-0.20/
-rw-r--r-- ah/users          0 2010-11-02 14:42 192.168.130.10/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-d3.out
-rw-r--r-- ah/users   62922921 2010-11-02 14:42 192.168.130.10/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-d3.log
drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.12/
drwxr-xr-x ah/users          0 2010-11-03 13:27 192.168.130.12/hbase-0.20/
drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.13/
drwxr-xr-x ah/users          0 2010-11-03 13:27 192.168.130.13/hbase-0.20/
-rw-r--r-- ah/users   65903411 2010-11-02 14:42 192.168.130.13/hbase-0.20/hbase--zookeeper-e0-cb-4e-65-4d-4e.log
-rw-r--r-- ah/users          0 2010-11-02 14:42 192.168.130.13/hbase-0.20/hbase--zookeeper-e0-cb-4e-65-4d-4e.out
drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.14/
drwxr-xr-x ah/users          0 2010-11-03 13:27 192.168.130.14/hbase-0.20/
-rw-r--r-- ah/users          0 2010-11-02 14:42 192.168.130.14/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-a8.out
-rw-r--r-- ah/users   62835121 2010-11-02 14:42 192.168.130.14/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-a8.log
{noformat}

The only logs that are missing are those for .11, but that should not influence 
the analysis of the leader election (I hope).

We are using monitoring software which determines when a zookeeper instance is 
no longer reachable, and automatically starts a fresh zookeeper instance as a 
replacement. This software can detect the failure and start a new zookeeper 
instance fairly rapidly. Would it be better to delay the start of a fresh 
zookeeper instance to allow the existing instances to elect a new leader? If 
so, do you have any guidelines regarding this delay? (We are considering this 
approach, but would like to avoid it if possible.)
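For reference, the delay approach we are considering amounts to a grace period in the monitor before a replacement is launched. A minimal sketch (the names and the 30-second value are illustrative placeholders, not part of any real monitoring product; a safe value would depend on tickTime/initLimit and observed election times):

```python
import time

# Placeholder grace period before starting a replacement ZooKeeper instance;
# long enough for the surviving ensemble to finish leader election.
GRACE_PERIOD_S = 30.0

def replacement_due(down_since: float, now: float,
                    grace: float = GRACE_PERIOD_S) -> bool:
    """Return True once the failed instance has been unreachable for at
    least the grace period, i.e. the monitor may now start a replacement."""
    return (now - down_since) >= grace

# A node that failed 10s ago is not yet eligible; one down 45s is.
print(replacement_due(down_since=100.0, now=110.0))  # False
print(replacement_due(down_since=100.0, now=145.0))  # True
```

The monitor would call this in its polling loop, recording `down_since` the first time the instance fails its reachability check.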

{quote}
In your case, I'm still not sure why it happens because the initial zxid of 
node 1 is 4294967742 according to your excerpt. 
{quote}
That is indeed the key question that I am trying to find an answer for! :-)

 Leader election selected incorrect leader
 -

 Key: ZOOKEEPER-917
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection, server
Affects Versions: 3.2.2
 Environment: Cloudera distribution of zookeeper (patched to never 
 cache DNS entries)
 Debian lenny
Reporter: Alexandre Hardy
Priority: Critical
 Fix For: 3.3.3, 3.4.0

 Attachments: zklogs-20101102144159SAST.tar.gz


 We had three nodes running zookeeper:
   * 192.168.130.10
   * 192.168.130.11
   * 192.168.130.14
 192.168.130.11 failed, and was replaced by a new node 192.168.130.13 
 (automated startup). The new node had not participated in any zookeeper 
 quorum previously. The node 192.168.130.11 was permanently removed from 
 service and could not contribute to the quorum any further (powered off).
 DNS entries were updated for the new node to allow all the zookeeper servers 
 to find the new node.
 The new node 192.168.130.13 was selected as the LEADER, despite the fact that 
 it had not seen the latest zxid.
 This particular problem has not been verified with later versions of 
 zookeeper, and no attempt has been made to reproduce this problem as yet.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader

2010-11-04 Thread Alexandre Hardy (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928174#action_12928174
 ] 

Alexandre Hardy commented on ZOOKEEPER-917:
---

Hi Flavio,

At first pass this seems to indicate that we can't replace a failed zookeeper 
server with a new one, but that statement is probably far too strong. If I 
understand correctly, you are saying that the server can be replaced only after 
a new leader has been elected? That is, any fresh server should only be started 
once the quorum has been reestablished?

I'm not sure I understand exactly why the election went wrong. Were the old 
election messages resent when the fresh server became contactable? I would have 
thought that election messages should be based on the current state, and never 
carry old state.

This will take some time to digest and think through properly. In the meantime, 
can you suggest how we should deal with this situation? Can we simply wait for 
the two remaining nodes to establish a quorum, and then reintroduce the third 
node? I suppose we could test whether a quorum has been established by checking 
whether we can establish a new zookeeper session.
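A lighter-weight probe than opening a full session is ZooKeeper's four-letter {{stat}} command: a server that has joined a quorum reports a {{Mode:}} line of {{leader}} or {{follower}}. A sketch of such a check (the host/port below are placeholders, and the reply text is based on typical {{stat}} output):

```python
import socket

def four_letter(host: str, port: int, cmd: bytes = b"stat",
                timeout: float = 5.0) -> str:
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(cmd)
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def in_quorum(stat_reply: str) -> bool:
    """True if the stat output shows the server serving as leader or follower
    (i.e. a quorum exists); False for standalone, election, or error replies."""
    for line in stat_reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip() in ("leader", "follower")
    return False

# Usage against a live server (placeholder address):
#   print(in_quorum(four_letter("192.168.130.10", 2181)))
```

Polling this on the two surviving nodes until both report a mode would tell the monitor when it is safe to reintroduce the third node.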

Thanks for the help





[jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader

2010-11-04 Thread Alexandre Hardy (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928184#action_12928184
 ] 

Alexandre Hardy commented on ZOOKEEPER-917:
---

Thanks Flavio, 

We will take the delay approach for the moment. I think the risks are 
acceptable for our purposes.

You are welcome to close the issue if there are no outstanding questions that 
need to be addressed on your part.

Thanks again for spending so much time on this issue and explaining what the 
reasoning and consequences are.





[jira] Created: (ZOOKEEPER-917) Leader election selected incorrect leader

2010-11-03 Thread Alexandre Hardy (JIRA)
Leader election selected incorrect leader
-

 Key: ZOOKEEPER-917
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection, server
Affects Versions: 3.2.2
 Environment: Cloudera distribution of zookeeper (patched to never 
cache DNS entries)
Debian lenny
Reporter: Alexandre Hardy
Priority: Critical


We had three nodes running zookeeper:
  * 192.168.130.10
  * 192.168.130.11
  * 192.168.130.14

192.168.130.11 failed, and was replaced by a new node 192.168.130.13 (automated 
startup). The new node had not participated in any zookeeper quorum previously. 
The node 192.168.130.11 was permanently removed from service and could not 
contribute to the quorum any further (powered off).

DNS entries were updated for the new node to allow all the zookeeper servers to 
find the new node.

The new node 192.168.130.13 was selected as the LEADER, despite the fact that 
it had not seen the latest zxid.

This particular problem has not been verified with later versions of zookeeper, 
and no attempt has been made to reproduce this problem as yet.




[jira] Updated: (ZOOKEEPER-917) Leader election selected incorrect leader

2010-11-03 Thread Alexandre Hardy (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Hardy updated ZOOKEEPER-917:
--

Attachment: zklogs-20101102144159SAST.tar.gz

Logs for the remaining nodes attached.




[jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader

2010-11-03 Thread Alexandre Hardy (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927833#action_12927833
 ] 

Alexandre Hardy commented on ZOOKEEPER-917:
---

Excerpt from logs on 192.168.130.10:
{noformat}
2010-11-02 09:36:28,060 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: New election: 4294967742
2010-11-02 09:36:28,061 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2010-11-02 09:36:28,061 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.10:37781]
2010-11-02 09:36:28,061 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 0, 4294967742, 2, 0, LOOKING, LOOKING, 0
2010-11-02 09:36:28,063 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Adding vote
2010-11-02 09:36:28,064 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2010-11-02 09:36:28,064 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.14:50222]
2010-11-02 09:36:28,064 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 0, LOOKING, FOLLOWING, 1
2010-11-02 09:36:28,065 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2010-11-02 09:36:28,065 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.14:50223]
2010-11-02 09:36:28,068 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2010-11-02 09:36:28,068 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.12:59044]
2010-11-02 09:36:28,073 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 0, LOOKING, LEADING, 2
2010-11-02 09:36:28,073 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2010-11-02 09:36:28,073 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.10:37786]
2010-11-02 09:36:28,073 INFO org.apache.zookeeper.server.quorum.QuorumPeer: FOLLOWING
2010-11-02 09:36:28,073 INFO org.apache.zookeeper.server.ZooKeeperServer: Created server
2010-11-02 09:36:28,074 INFO org.apache.zookeeper.server.quorum.Follower: Following zookeeper3/192.168.130.13:2888
{noformat}

Excerpt from logs on 192.168.130.11:
{noformat}
2010-11-02 09:36:14,065 INFO org.apache.zookeeper.server.quorum.QuorumPeerConfig: Defaulting to majority quorums
2010-11-02 09:36:14,120 INFO org.apache.zookeeper.server.quorum.QuorumPeerMain: Starting quorum peer
2010-11-02 09:36:14,172 INFO org.apache.zookeeper.server.quorum.QuorumCnxManager: My election bind port: 3888
2010-11-02 09:36:14,182 INFO org.apache.zookeeper.server.quorum.QuorumPeer: LOOKING
2010-11-02 09:36:14,183 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: New election: -1
2010-11-02 09:36:14,191 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 2, LOOKING, LOOKING, 2
2010-11-02 09:36:14,191 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Adding vote
2010-11-02 09:36:14,193 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1952)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
    at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:345)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:532)
2010-11-02 09:36:14,194 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Send worker leaving thread
2010-11-02 09:36:14,194 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 2, LOOKING, LOOKING, 1
2010-11-02 09:36:14,194 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Adding vote
2010-11-02 09:36:14,195 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Interrupted while waiting for message on queue
{noformat}

[jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader

2010-11-03 Thread Alexandre Hardy (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927836#action_12927836
 ] 

Alexandre Hardy commented on ZOOKEEPER-917:
---

Sorry, that should have been 192.168.130.13, not 192.168.130.11.




[jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader

2010-11-03 Thread Alexandre Hardy (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927839#action_12927839
 ] 

Alexandre Hardy commented on ZOOKEEPER-917:
---

We noted that certain ephemeral nodes were no longer behaving as expected 
(they started incrementing from zero again), and we are concerned about the 
potential for data loss, since the latest zxids do not seem to be recognized by 
the leader.




[jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader

2010-11-03 Thread Alexandre Hardy (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927888#action_12927888
 ] 

Alexandre Hardy commented on ZOOKEEPER-917:
---

Hi Flavio,

The three zookeeper servers are zookeeper1, zookeeper2 and zookeeper3.
Initially the servers were
* 192.168.130.10: zookeeper1
* 192.168.130.11: zookeeper3
* 192.168.130.14: zookeeper2

After .11 was removed the servers were:
* 192.168.130.10: zookeeper1
* 192.168.130.13: zookeeper3
* 192.168.130.14: zookeeper2

All other settings were set by hbase:
* tickTime=2000
* initLimit=10
* syncLimit=5  
* peerport=2888
* leaderport=3888

* zookeeper1 would have node id 0
* zookeeper2 would have node id 1
* zookeeper3 would have node id 2

I'm not sure what else I can give you concerning the configuration.

I note that on 192.168.130.14 (node id 1) we have 
{noformat}
2010-11-02 09:36:27,988 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: New election: 4294967742
2010-11-02 09:36:27,988 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 1, 4294967742, 2, 1, LOOKING, LOOKING, 1
2010-11-02 09:36:27,988 INFO org.apache.zookeeper.server.quorum.QuorumCnxManager: Have smaller server identifier, so dropping the connection: (2, 1)
2010-11-02 09:36:27,988 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Adding vote
2010-11-02 09:36:27,989 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 1, LOOKING, FOLLOWING, 0
{noformat}
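As I understand it, the "Have smaller server identifier, so dropping the connection: (2, 1)" line is the tie-break QuorumCnxManager applies when two peers open duplicate election connections to each other: only the connection initiated by the server with the larger id is kept. A sketch of that rule as I read it (not the actual ZooKeeper code):

```python
def keep_initiated_connection(my_sid: int, remote_sid: int) -> bool:
    """Whether a server keeps an election connection *it* initiated.
    Only the higher-id side's connection survives, so each pair of peers
    ends up with exactly one election channel between them."""
    return my_sid > remote_sid

# In the excerpt above, node id 1 dropped its own connection to node id 2.
print(keep_initiated_connection(my_sid=1, remote_sid=2))  # False: drop
print(keep_initiated_connection(my_sid=2, remote_sid=1))  # True: keep
```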
 
I don't think a networking misconfiguration is very likely here, but could 
something like that explain what we are seeing?






[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-27 Thread Alexandre Hardy (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925598#action_12925598
 ] 

Alexandre Hardy commented on ZOOKEEPER-885:
---

Hi Flavio,

I've set up some EC2 instances to reproduce the problem. I think the
problem is related to relative disk performance and load.

I have had to use a more aggressive disk benchmark utility to get the
problem to occur, and I realise that this is contrary to the zookeeper
requirements. However, I think we would like to know why a ping would
be affected when no disk access is expected.

Can we discuss access to these instances via e-mail?

Kind regards
Alexandre


 Zookeeper drops connections under moderate IO load
 --

 Key: ZOOKEEPER-885
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.2.2, 3.3.1
 Environment: Debian (Lenny)
 1Gb RAM
 swap disabled
 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
 Attachments: benchmark.csv, tracezklogs.tar.gz, tracezklogs.tar.gz, 
 WatcherTest.java, zklogs.tar.gz


 A zookeeper server under minimum load, with a number of clients watching 
 exactly one node will fail to maintain the connection when the machine is 
 subjected to moderate IO load.
 In a specific test example we had three zookeeper servers running on 
 dedicated machines with 45 clients connected, watching exactly one node. The 
 clients would disconnect after moderate load was added to each of the 
 zookeeper servers with the command:
 {noformat}
 dd if=/dev/urandom of=/dev/mapper/nimbula-test
 {noformat}
 The {{dd}} command transferred data at a rate of about 4Mb/s.
 The same thing happens with
 {noformat}
 dd if=/dev/zero of=/dev/mapper/nimbula-test
 {noformat}
 It seems strange that such a moderate load should cause instability in the 
 connection.
 Very few other processes were running, the machines were setup to test the 
 connection instability we have experienced. Clients performed no other read 
 or mutation operations.
 Although the documentation states that minimal competing IO load should be 
 present on the zookeeper server, it seems reasonable that moderate IO should 
 not cause problems in this case.




[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-15 Thread Alexandre Hardy (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921244#action_12921244
 ] 

Alexandre Hardy commented on ZOOKEEPER-885:
---

{quote}
Hi Alexandre, When you load the machines running the zookeeper servers by 
running the dd command, how much time elapses between running dd and observing 
the connections expiring? I'm not being able to reproduce it, and I wonder how 
long the problem takes to manifest.
{quote}

Hi Flavio,

Problems usually start occurring after about 30 seconds. I have also tested on 
some other machines and response varies somewhat. I suspect that the 30 seconds 
is dictated by some queue that needs to fill up before significant disk traffic 
is initiated. I think that the speed at which this queue is processed 
determines how likely it is that zookeeper will fail to respond to a ping.

Please also see responses to Patrick's questions.




[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-15 Thread Alexandre Hardy (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921251#action_12921251
 ] 

Alexandre Hardy commented on ZOOKEEPER-885:
---

Hi Patrick,

{quote}
1) you are applying the load using dd to all three servers at the same time, is 
that correct? (not just to 1 server)
{quote}
Correct. If {{dd}} is run on only one machine then the likelihood of 
disconnects is reduced. Unfortunately our typical scenario would involve load 
on all three machines.

{quote}
2) /dev/mapper indicates some sort of lvm setup, can you give more detail on 
that? (fyi http://ubuntuforums.org/showthread.php?t=646340)
{quote}
Yes, we have an lvm setup on a single spindle. The nimbula-test logical volume 
is 10G in size and (obviously) shares the same spindle as root and log 
(/var/log) partitions. 

{quote}
3) you mentioned that this:

echo 5 > /proc/sys/vm/dirty_ratio
echo 5 > /proc/sys/vm/dirty_background_ratio

resulting in stability in this test, can you tell us what this was set to 
initially?

Checkout this article: http://lwn.net/Articles/216853/
{quote}

The initial value for {{/proc/sys/vm/dirty_ratio}} is 20, and the initial value 
for {{/proc/sys/vm/dirty_background_ratio}} is 10. These machines have 1G of 
RAM, and thus are less susceptible
to the problems mentioned in http://lwn.net/Articles/216853/ (as I see it). I 
have run a more complete benchmark with random IO instead of {{dd}} sequential 
IO testing session timeouts, and the effect of {{dirty_ratio}} settings. I will 
attach that separately. {{dirty_ratio}} seems to help with the {{dd}} test but 
has much less influence in the random IO test.

{quote}
I notice you are running a bigmem kernel. What's the total memory size? How 
large of a heap have you assigned to the ZK server? (jvm)
{quote}
We have 1G on each machine in this test system and 100M heap size for each 
zookeeper server.

{quote}
4) Can you verify whether or not the JVM is swapping? Any chance that the 
server JVM is swapping, which is causing the server to pause, which then causes 
the clients to time out? This seems to me like it would fit the scenario - esp 
given that when you turn the dirty_ratio down you see stability increase (the 
time it would take to complete the flush would decrease, meaning that the 
server can respond before the client times out).
{quote}
I'm not entirely sure of all the JVM internals, but all swap space on the linux 
system was disabled, so no swapping driven by the linux kernel could happen. I'm 
not sure whether the JVM does any swapping of its own.
I concur with your analysis. What puzzles me is why the system would even get 
into a state where the zookeeper server would have to wait so long for a disk 
flush? In the case of {{dd if=/dev/urandom}} the IO rate is quite low, and 
there should (I think) be more than enough IOPS available for zookeeper to 
flush data to disk in time. Even if the IO scheduling results in this scenario, 
it is still not clear to me why zookeeper would fail to respond to a ping. My 
only conclusion at this stage is that responding to a ping requires information 
to be flushed to disk. Is this correct?

Referring to your private e-mail:
{quote}
 The weird thing here is that there should be no delay for these pings.
{quote}
This would indicate to me that the ping response should not depend on any 
disk IO.
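For what it's worth, server responsiveness during the {{dd}} load can be probed independently of any client session by timing a four-letter-word command over a plain socket. The sketch below is illustrative only (host and port are placeholders, and it assumes the {{ruok}} command available on the standard client port):

```python
import socket
import time

def four_letter_word(host, port, cmd=b"ruok", timeout=5.0):
    """Send a ZooKeeper four-letter-word command and return (reply, seconds).

    A healthy server answers b"ruok" with "imok"; timing this round trip
    during the dd run shows whether the server itself stalls, without any
    session state being involved.
    """
    start = time.time()
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(cmd)
        s.shutdown(socket.SHUT_WR)  # signal the server we are done sending
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode(), time.time() - start
```

Running this in a loop once a second while {{dd}} is active would show whether even ping-style traffic stalls when the flush happens.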

Thanks for all the effort in looking into this!

 Zookeeper drops connections under moderate IO load
 --

 Key: ZOOKEEPER-885
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.2.2, 3.3.1
 Environment: Debian (Lenny)
 1Gb RAM
 swap disabled
 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
 Attachments: tracezklogs.tar.gz, tracezklogs.tar.gz, 
 WatcherTest.java, zklogs.tar.gz


 A zookeeper server under minimal load, with a number of clients watching 
 exactly one node, will fail to maintain the connection when the machine is 
 subjected to moderate IO load.
 In a specific test example we had three zookeeper servers running on 
 dedicated machines with 45 clients connected, watching exactly one node. The 
 clients would disconnect after moderate load was added to each of the 
 zookeeper servers with the command:
 {noformat}
 dd if=/dev/urandom of=/dev/mapper/nimbula-test
 {noformat}
 The {{dd}} command transferred data at a rate of about 4Mb/s.
 The same thing happens with
 {noformat}
 dd if=/dev/zero of=/dev/mapper/nimbula-test
 {noformat}
 It seems strange that such a moderate load should cause instability in the 
 connection.
 Very few other processes were running; the machines were set up to test the 
 connection instability we have experienced. 

[jira] Updated: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-15 Thread Alexandre Hardy (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Hardy updated ZOOKEEPER-885:
--

Attachment: benchmark.csv

Here are some results with different settings of {{dirty_ratio}}, 
{{dirty_bytes}} (finer control), session timeouts, and IO priorities (set 
with {{ionice}}). 

The status field indicates success or failure: 0 = success, and anything else 
is failure, where failure means a zookeeper session disconnected (but did not 
necessarily expire).

Each test ran for a maximum of 5 minutes. A test could also fail if it did 
not connect to the zookeeper servers within the first 60 seconds; 
unfortunately the return code did not differentiate properly between these 
cases. However, pauses of about 4 seconds were allowed between tests, during 
which all IO operations (by the test program) were stopped. This should have 
allowed the system to stabilize somewhat.
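The procedure above can be sketched as a parameter sweep. The concrete values here are hypothetical stand-ins (the actual grid is recorded in benchmark.csv), but the structure — one bounded test per combination, with a short IO-free pause between runs — matches the description:

```python
import itertools

# Hypothetical parameter values; the real grid is in benchmark.csv.
DIRTY_RATIOS = (5, 10, 20, 40)                    # vm.dirty_ratio, percent
SESSION_TIMEOUTS_MS = (5000, 15000, 30000, 60000)

def benchmark_plan(max_test_seconds=300, pause_seconds=4):
    """Yield one test row per (dirty_ratio, session_timeout) combination.

    'status' is filled in by the runner: 0 = success, anything else means a
    session disconnected (or the 60 s initial connect window was missed).
    """
    for dirty_ratio, timeout_ms in itertools.product(
            DIRTY_RATIOS, SESSION_TIMEOUTS_MS):
        yield {
            "dirty_ratio": dirty_ratio,
            "session_timeout_ms": timeout_ms,
            "max_test_seconds": max_test_seconds,
            "pause_seconds": pause_seconds,   # IO-free gap between tests
            "status": None,
        }
```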

In this test the session timeout had the most significant influence on 
stability (not too surprising) and the other system settings had far less 
influence.

We have settled on a 60s session timeout for the moment (to achieve the 
stability we need under the IO loads we are experiencing), but it would be 
great if we could reduce this a bit.
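One caveat for anyone reproducing this: the session timeout a client requests is clamped by the server. By default the negotiable range is 2x{{tickTime}} to 20x{{tickTime}}, so with the default {{tickTime=2000}} a requested 60s timeout would be clamped to 40s unless the server config is adjusted, e.g. (illustrative values):

```
# zoo.cfg
tickTime=3000
# or, in 3.3.x, set the upper bound explicitly:
maxSessionTimeout=60000
```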




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-12 Thread Alexandre Hardy (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Hardy updated ZOOKEEPER-885:
--

Affects Version/s: 3.3.1




[jira] Updated: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-08 Thread Alexandre Hardy (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Hardy updated ZOOKEEPER-885:
--

Attachment: tracezklogs.tar.gz

I accidentally missed the configuration for one of the nodes (switching from 
3.2.2 to 3.3.1). Thanks for spotting that.

Here are updated log files. I have updated the test program to terminate as 
soon as any single connection is dropped (since my goal is to have zero 
connection failures for this simple test).

The client and server clocks are a bit out of sync (about 4 minutes, with the 
servers logging in UTC and client logging in UTC+2), but the test has been set 
up so that only one instability event is recorded.




[jira] Updated: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-08 Thread Alexandre Hardy (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Hardy updated ZOOKEEPER-885:
--

Attachment: tracezklogs.tar.gz

I failed to set up the debug logs correctly in the previous attachment. This 
attachment has full trace logs.




[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-07 Thread Alexandre Hardy (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918940#action_12918940
 ] 

Alexandre Hardy commented on ZOOKEEPER-885:
---

{{maxClientCnxns}} is set to 30, so 45 clients spread across 3 servers should 
not be unreasonable, and I do have confirmation that a session is established 
for every client (all 45 of them) before beginning the disk load with {{dd}}.

I'm aiming for 0 disconnects with this simple example.  




[jira] Updated: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-07 Thread Alexandre Hardy (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Hardy updated ZOOKEEPER-885:
--

Attachment: zklogs.tar.gz

Attached are logs from the two sessions with disconnects. I have not filtered 
the logs in any way. The logs for 3.3.1 are the clearest, with exactly one 
failure roughly 3 minutes after the logs start. Unfortunately the logs don't 
offer much information (as far as I can make out). Should I enable more 
verbose logging?




[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-05 Thread Alexandre Hardy (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12917987#action_12917987
 ] 

Alexandre Hardy commented on ZOOKEEPER-885:
---

Thanks Patrick.

I have now performed the same test with zookeeper-3.3.1 as distributed in the 
Cloudera CDH3 distribution. My results are fairly similar (I don't have hard 
numbers for comparison though). I still experience disconnects and session 
timeouts even though {{dd}} transfers at only 5Mb per second.

I have had improvements in stability with this test by adjusting the following:
{noformat}
echo 5 > /proc/sys/vm/dirty_ratio
echo 5 > /proc/sys/vm/dirty_background_ratio
{noformat}

These settings are sufficient to get stability on this simple test.
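For the record, the equivalent persistent form (surviving a reboot), assuming a stock Debian sysctl setup, would be:

```
# /etc/sysctl.conf
vm.dirty_ratio = 5
vm.dirty_background_ratio = 5
```

(applied with {{sysctl -p}}).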

I'll work on getting some more concrete comparisons of zookeeper-3.2.2 and 
zookeeper-3.3.1.

The output of the {{stat}} command seems reasonable (latency 0/10/91) during 
{{dd}}. It seems to me that the zookeeper server runs fairly well for a 
period of time, then hits a period of instability, and then recovers again. 
Are pings also counted in the server latency?

I suspect that IO scheduling / buffering may be partly responsible for the 
problem, and hence started testing the effect of {{dirty_ratio}} etc. At this 
stage I am presuming that {{dd}} manages to write a fair amount of data before 
the data is actually flushed to disk, and that when it is flushed all processes 
attempting to write to the device are stalled until the flush completes.

What is unclear to me at this point is what gets written to disk. What would 
zookeeper be writing to disk if none of the clients are submitting any 
requests? Something to do with the session?
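To test the stall theory directly, one could time small synchronous writes (roughly what the zookeeper transaction log does on each request) while {{dd}} is running. A minimal sketch — the path and sizes are arbitrary choices, not anything from the actual test setup:

```python
import os
import time

def fsync_latency(path, size=512):
    """Time one small write + fsync, mimicking a transaction-log append.

    Under the stall theory, this latency should spike to multiple seconds
    whenever the kernel flushes the dirty pages queued up by dd.
    """
    payload = b"\0" * size
    start = time.time()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # force the data through the page cache
    elapsed = time.time() - start
    os.remove(path)
    return elapsed
```

Sampling this once a second on the same device during the {{dd}} run should show whether fsync latency spikes line up with the disconnect events.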




[jira] Created: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-01 Thread Alexandre Hardy (JIRA)
Zookeeper drops connections under moderate IO load
--

 Key: ZOOKEEPER-885
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.2.2
 Environment: Debian (Lenny)
1Gb RAM
swap disabled
100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical


A zookeeper server under minimum load, with a number of clients watching 
exactly one node will fail to maintain the connection when the machine is 
subjected to moderate IO load.

In a specific test example we had three zookeeper servers running on dedicated 
machines with 45 clients connected, watching exactly one node. The clients 
would disconnect after moderate load was added to each of the zookeeper servers 
with the command:
{noformat}
dd if=/dev/urandom of=/dev/mapper/nimbula-test
{noformat}

The {{dd}} command transferred data at a rate of about 4Mb/s.

The same thing happens with
{noformat}
dd if=/dev/zero of=/dev/mapper/nimbula-test
{noformat}
transferring at a rate of about 20Mb/s. 

It seems strange that such a moderate load should cause instability in the 
connection.

Very few other processes were running; the machines were set up to test the 
connection instability we have experienced. Clients performed no other read or 
mutation operations.

Although the documentation states that minimal competing IO load should be 
present on the zookeeper server, it seems reasonable that moderate IO should 
not cause problems in this case.




[jira] Updated: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-01 Thread Alexandre Hardy (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Hardy updated ZOOKEEPER-885:
--

Attachment: WatcherTest.java

I have attached the client used to test the zookeeper servers.




[jira] Updated: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-01 Thread Alexandre Hardy (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Hardy updated ZOOKEEPER-885:
--

Description: 
A zookeeper server under minimum load, with a number of clients watching 
exactly one node will fail to maintain the connection when the machine is 
subjected to moderate IO load.

In a specific test example we had three zookeeper servers running on dedicated 
machines with 45 clients connected, watching exactly one node. The clients 
would disconnect after moderate load was added to each of the zookeeper servers 
with the command:
{noformat}
dd if=/dev/urandom of=/dev/mapper/nimbula-test
{noformat}

The {{dd}} command transferred data at a rate of about 4Mb/s.

The same thing happens with
{noformat}
dd if=/dev/zero of=/dev/mapper/nimbula-test
{noformat}
transferring at a rate of about 120Mb/s. 

It seems strange that such a moderate load should cause instability in the 
connection.

Very few other processes were running; the machines were set up to test the 
connection instability we have experienced. Clients performed no other read or 
mutation operations.

Although the documentation states that minimal competing IO load should be 
present on the zookeeper server, it seems reasonable that moderate IO should 
not cause problems in this case.

  was:
A zookeeper server under minimum load, with a number of clients watching 
exactly one node will fail to maintain the connection when the machine is 
subjected to moderate IO load.

In a specific test example we had three zookeeper servers running on dedicated 
machines with 45 clients connected, watching exactly one node. The clients 
would disconnect after moderate load was added to each of the zookeeper servers 
with the command:
{noformat}
dd if=/dev/urandom of=/dev/mapper/nimbula-test
{noformat}

The {{dd}} command transferred data at a rate of about 4Mb/s.

The same thing happens with
{noformat}
dd if=/dev/zero of=/dev/mapper/nimbula-test
{noformat}
transferring at a rate of about 20Mb/s. 

It seems strange that such a moderate load should cause instability in the 
connection.

Very few other processes were running; the machines were set up to test the 
connection instability we have experienced. Clients performed no other read or 
mutation operations.

Although the documentation states that minimal competing IO load should be 
present on the zookeeper server, it seems reasonable that moderate IO should 
not cause problems in this case.





[jira] Updated: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-01 Thread Alexandre Hardy (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Hardy updated ZOOKEEPER-885:
--

Description: 
A zookeeper server under minimum load, with a number of clients watching 
exactly one node will fail to maintain the connection when the machine is 
subjected to moderate IO load.

In a specific test example we had three zookeeper servers running on dedicated 
machines with 45 clients connected, watching exactly one node. The clients 
would disconnect after moderate load was added to each of the zookeeper servers 
with the command:
{noformat}
dd if=/dev/urandom of=/dev/mapper/nimbula-test
{noformat}

The {{dd}} command transferred data at a rate of about 4Mb/s.

The same thing happens with
{noformat}
dd if=/dev/zero of=/dev/mapper/nimbula-test
{noformat}

It seems strange that such a moderate load should cause instability in the 
connection.

Very few other processes were running; the machines were set up to test the 
connection instability we have experienced. Clients performed no other read or 
mutation operations.

Although the documentation states that minimal competing IO load should be 
present on the zookeeper server, it seems reasonable that moderate IO should 
not cause problems in this case.

  was:
A zookeeper server under minimum load, with a number of clients watching 
exactly one node will fail to maintain the connection when the machine is 
subjected to moderate IO load.

In a specific test example we had three zookeeper servers running on dedicated 
machines with 45 clients connected, watching exactly one node. The clients 
would disconnect after moderate load was added to each of the zookeeper servers 
with the command:
{noformat}
dd if=/dev/urandom of=/dev/mapper/nimbula-test
{noformat}

The {{dd}} command transferred data at a rate of about 4Mb/s.

The same thing happens with
{noformat}
dd if=/dev/zero of=/dev/mapper/nimbula-test
{noformat}
transferring at a rate of about 120Mb/s. 

It seems strange that such a moderate load should cause instability in the 
connection.

Very few other processes were running; the machines were set up to test the 
connection instability we have experienced. Clients performed no other read or 
mutation operations.

Although the documentation states that minimal competing IO load should be 
present on the zookeeper server, it seems reasonable that moderate IO should 
not cause problems in this case.

