Replying to the list to make sure that Ben's responses get archived below On Wed, 14 Jan 2015, Ben Smith wrote:
> On Wed, Jan 14, 2015 at 1:28 PM, Flavio Junqueira <[email protected]> > wrote: > > > Do you have anything running along with the ZK servers on the same > > machine? > > > No the zookeeper machines are 'dedicated'. The only other stuff running is > infrastructure related (metrics and log collection agents that incur no/low > load). > > > > Is that a virtualized environment? > > > Yes, this is in in the 'cloud' (: They are on AWS c3.large instances. The > instance CPU and Memory resources are really low. > > > > I'm wondering if a concurrent workload is causing servers to lag behind > > and drop the ensemble. Also, how many disk devices are you using? is there > > a dedicated device for the txn log? > > > > They are only using one device each, it is not dedicated, but it also isn't > anywhere close to saturation, using iostat to validate svctm: > > https://gist.github.com/polynomial/2dc6aefefd7fa31afc73 > > Here is the config if that helps too: > > $ echo 'conf' | nc localhost 2181 > clientPort=2181 > dataDir=/path/to/zookeeper/snapshots/version-2 > dataLogDir=/path/to/zookeeper/transactions/version-2 > tickTime=2000 > maxClientCnxns=60 > minSessionTimeout=4000 > maxSessionTimeout=40000 > serverId=4 > initLimit=10 > syncLimit=5 > electionAlg=3 > electionPort=3888 > quorumPort=2888 > peerType=0 > > -ben > > > > > > -Flavio > > > > > On 14 Jan 2015, at 19:07, R. Tyler Croy <[email protected]> wrote: > > > > > > (replies inline) > > > > > > On Wed, 14 Jan 2015, Flavio Junqueira wrote: > > > > > >> I'm not sure what could be causing the communication between servers to > > fail. > > >> Are you sure this is a ZK problem? Could it be a problem with your > > setting? > > > > > > > > > We're not certain of much at this point to be honest. We've executed some > > > network stress tests and see no disruptions in the layer 1-3 network > > > communications. > > > > > > I'm not aware of any Zookeeper server-side setting that would result in > > the > > > behavior we're seeing but I'm happy to try out some configuration > > changes if > > > someboyd believes those will either help alleviate the problem or help > > identify > > > it with more exactness. > > > > > > > > > > > > > > >> On Wednesday, January 14, 2015 4:33 PM, R. Tyler Croy < > > [email protected]> wrote: > > >> > > >> > > >> A colleague of mine and I are still fully trying to understand the > > behavior > > >> we're seeing from one of our Zookeeper clusters running 3.4.6. > > >> > > >> Some log snippets are included further down, but the symptom of what > > we're > > >> seeing is that nodes in the cluster periodically detach themselves > > entirely > > >> from the rest of the cluster. The period is not however consistent, in > > at least > > >> one known case we've seen multiple nodes detach around the same time > > causing > > >> leader re-election challenges (putting it kindly!). > > >> > > >> Our investigation has led us to see a large number of transaction errors > > >> (txnType: -1) in our trace logs with "n/a" paths. To give you an idea > > of how > > >> large of a number we're talking about, here's a summary from ~20min of > > logs: > > >> > > >> -> # grep txntype zookeeper.log* | perl -pe 's/.*txntype/txntype/' | > > sort | uniq -c | sort -nr > > >> 70585 txntype:-1 reqpath:n/a > > >> 4860 txntype:5 reqpath:n/a > > >> 7 txntype:-11 reqpath:n/a > > >> 7 txntype:-10 reqpath:n/a > > >> [..snip..] > > >> > > >> The number of session create/close transaction types looks about right > > as does > > >> the number of 'set data' transactions; however the "reqpath" being > > "n/a" does > > >> not look entirely correct. > > >> > > >> > > >> We're only using Zookeeper to hold offsets for normal Kafka high-level > > >> consumers and Storm's KafkaSpout consumers, and I'm relatively > > confident that > > >> those consumers are behaving correctly. > > >> > > >> > > >> We have also come across a number of ostensibly resolved bugs in JIRA > > that, to > > >> our untrained eyes, look similar to what we're seeing: > > >> > > >> "Zookeeper 3.3.4 loses ephemeral nodes under stress" > > >> <https://issues.apache.org/jira/browse/ZOOKEEPER-1740> > > >> "the threads number of a zookeeper is increased all the time" > > >> <https://issues.apache.org/jira/browse/ZOOKEEPER-939> (not seeing > > the > > >> thread growth mentioned here, just log similarities) > > >> > > >> > > >> The node detaching issue is not something we've been able to recreate > > on our > > >> own, so I don't entirely have a reproduction case. This is ZK 3.4.6 on > > OpenJDK > > >> 1.7.0u67 running on Ubuntu 12.04, other than that I'm happy to provide > > >> information that might help determine the cause of our problems! > > >> > > >> > > >> Cheers (logs snippets to follow) > > >> > > >> > > >> Error logs from an incident where a node detached itself: > > >> > > >> 2015-01-02 20:10:37,576 [myid:4] - ERROR > > [LearnerHandler-/10.96.100.69:54260:LearnerHandler@633] - Unexpected > > exception causing shutdown while sock still open > > >> java.net.SocketTimeoutException: Read timed out > > >> at java.net.SocketInputStream.socketRead0(Native Method) > > >> at java.net.SocketInputStream.read(SocketInputStream.java:152) > > >> at java.net.SocketInputStream.read(SocketInputStream.java:122) > > >> at > > java.io.BufferedInputStream.fill(BufferedInputStream.java:235) > > >> at > > java.io.BufferedInputStream.read(BufferedInputStream.java:254) > > >> at java.io.DataInputStream.readInt(DataInputStream.java:387) > > >> at > > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > > >> at > > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) > > >> at > > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) > > >> at > > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546) > > >> 2015-01-02 20:10:37,577 [myid:4] - ERROR > > [LearnerHandler-/10.96.100.90:43220:LearnerHandler@633] - Unexpected > > exception causing shutdown while sock still open > > >> java.net.SocketTimeoutException: Read timed out > > >> at java.net.SocketInputStream.socketRead0(Native Method) > > >> at java.net.SocketInputStream.read(SocketInputStream.java:152) > > >> at java.net.SocketInputStream.read(SocketInputStream.java:122) > > >> at > > java.io.BufferedInputStream.fill(BufferedInputStream.java:235) > > >> at > > java.io.BufferedInputStream.read(BufferedInputStream.java:254) > > >> at java.io.DataInputStream.readInt(DataInputStream.java:387) > > >> at > > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > > >> at > > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) > > >> at > > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) > > >> at > > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546) > > >> 2015-01-02 20:10:37,577 [myid:4] - WARN > > [LearnerHandler-/10.96.100.69:54260:LearnerHandler@646] - ******* GOODBYE > > /10.96.100.69:54260 ******** > > >> 2015-01-02 20:10:37,576 [myid:4] - ERROR > > [LearnerHandler-/10.96.100.136:48175:LearnerHandler@633] - Unexpected > > exception causing shutdown while sock still open > > >> java.net.SocketTimeoutException: Read timed out > > >> at java.net.SocketInputStream.socketRead0(Native Method) > > >> at java.net.SocketInputStream.read(SocketInputStream.java:152) > > >> at java.net.SocketInputStream.read(SocketInputStream.java:122) > > >> at > > java.io.BufferedInputStream.fill(BufferedInputStream.java:235) > > >> at > > java.io.BufferedInputStream.read(BufferedInputStream.java:254) > > >> at java.io.DataInputStream.readInt(DataInputStream.java:387) > > >> at > > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > > >> at > > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) > > >> at > > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) > > >> at > > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546) > > >> 2015-01-02 20:10:37,578 [myid:4] - WARN > > [LearnerHandler-/10.96.14.103:47982:LearnerHandler@646] - ******* GOODBYE > > /10.96.14.103:47982 ******** > > >> 2015-01-02 20:10:37,577 [myid:4] - ERROR > > [LearnerHandler-/10.84.11.131:47798:LearnerHandler@633] - Unexpected > > exception causing shutdown while sock still open > > >> java.net.SocketTimeoutException: Read timed out > > >> at java.net.SocketInputStream.socketRead0(Native Method) > > >> at java.net.SocketInputStream.read(SocketInputStream.java:152) > > >> at java.net.SocketInputStream.read(SocketInputStream.java:122) > > >> at > > java.io.BufferedInputStream.fill(BufferedInputStream.java:235) > > >> at > > java.io.BufferedInputStream.read(BufferedInputStream.java:254) > > >> at java.io.DataInputStream.readInt(DataInputStream.java:387) > > >> at > > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > > >> at > > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) > > >> at > > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) > > >> at > > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546) > > >> 2015-01-02 20:10:37,578 [myid:4] - WARN > > [LearnerHandler-/10.84.11.131:47798:LearnerHandler@646] - ******* GOODBYE > > /10.84.11.131:47798 ******** > > >> 2015-01-02 20:10:37,578 [myid:4] - WARN > > [LearnerHandler-/10.96.100.136:48175:LearnerHandler@646] - ******* > > GOODBYE /10.96.100.136:48175 ******** > > >> > > >> > > >> > > >> Some trace logs from "normal operation" that looked suspect to us: > > >> > > >> > > >> 2015-01-12 14:05:12,753 [myid:1] - TRACE > > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooTrace@71] - loadData รข?????? > > session in archive: 381306787892823513 with timeout: 6000 > > >> 2015-01-12 14:05:12,797 [myid:1] - INFO > > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240] - > > Snapshotting: 0xb3ea0ee8f4db to > > /srv/zookeeper/snapshots/version-2/snapshot.b3ea0ee8f4db > > >> 2015-01-12 14:05:12,822 [myid:1] - WARN > > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@374] - Got zxid > > 0xb3ea0ee8f4dc expected 0x1 > > >> 2015-01-12 14:05:12,833 [myid:1] - DEBUG > > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - > > Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483a > > zxid:0xb3ea0ee8f4dc txntype:-1 reqpath:n/a > > >> 2015-01-12 14:05:12,833 [myid:1] - DEBUG > > [CommitProcessor:1:FinalRequestProcessor@88] - Processing request:: > > sessionid:0x24aac647ab40047 type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc > > txntype:-1 reqpath:n/a > > >> 2015-01-12 14:05:12,838 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90] > > - :Esessionid:0x24aac647ab40047 type:error cxid:0x29df483a > > zxid:0xb3ea0ee8f4dc txntype:-1 reqpath:n/a > > >> 2015-01-12 14:05:12,838 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949] > > - Ignoring processTxn failure hdr: -1 : error: -110 > > >> 2015-01-12 14:05:12,838 [myid:1] - DEBUG > > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - > > Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483d > > zxid:0xb3ea0ee8f4dd txntype:-1 reqpath:n/a > > >> 2015-01-12 14:05:12,839 [myid:1] - DEBUG > > [CommitProcessor:1:FinalRequestProcessor@88] - Processing request:: > > sessionid:0x24aac647ab40047 type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd > > txntype:-1 reqpath:n/a > > >> 2015-01-12 14:05:12,839 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90] > > - :Esessionid:0x24aac647ab40047 type:error cxid:0x29df483d > > zxid:0xb3ea0ee8f4dd txntype:-1 reqpath:n/a > > >> 2015-01-12 14:05:12,839 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949] > > - Ignoring processTxn failure hdr: -1 : error: -110 > > >> 2015-01-12 14:05:12,839 [myid:1] - DEBUG > > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - > > Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4840 > > zxid:0xb3ea0ee8f4de txntype:-1 reqpath:n/a > > >> 2015-01-12 14:05:12,839 [myid:1] - DEBUG > > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - > > Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4843 > > zxid:0xb3ea0ee8f4df txntype:-1 reqpath:n/a > > >> > > >> > > >> > > > >
signature.asc
Description: Digital signature
