[jira] Commented: (ZOOKEEPER-907) Spurious "KeeperErrorCode = Session moved" messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928558#action_12928558 ]

Hudson commented on ZOOKEEPER-907:
----------------------------------

Integrated in ZooKeeper-trunk #991 (See [https://hudson.apache.org/hudson/job/ZooKeeper-trunk/991/])
ZOOKEEPER-907. Spurious "KeeperErrorCode = Session moved" messages

> Spurious "KeeperErrorCode = Session moved" messages
> ---------------------------------------------------
>
> Key: ZOOKEEPER-907
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-907
> Project: Zookeeper
> Issue Type: Bug
> Affects Versions: 3.3.1
> Reporter: Vishal K
> Assignee: Vishal K
> Priority: Blocker
> Fix For: 3.3.2, 3.4.0
>
> Attachments: ZOOKEEPER-907.patch, ZOOKEEPER-907.patch_v2
>
>
> The sync request does not set the session owner in Request.
> As a result, the leader keeps printing:
> 2010-07-01 10:55:36,733 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
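
Editorial sketch for the ZOOKEEPER-907 description quoted above: the toy model below illustrates why a request that never sets its session owner could be reported as "Session moved" when the server compares the request's owner with the owner recorded for the session. Every class and method name here (`SessionOwnerSketch`, `ToySessionTracker`, `isRejected`) is hypothetical and chosen for illustration; this is not ZooKeeper's code and may not match how the real check is implemented.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: these names are hypothetical, not ZooKeeper classes. */
class SessionOwnerSketch {

    /** Raised when a request's owner differs from the owner recorded for the session. */
    static class SessionMovedException extends Exception {
        SessionMovedException(long sessionId) {
            super("Session moved: 0x" + Long.toHexString(sessionId));
        }
    }

    /** Remembers which owner last claimed each session. */
    static class ToySessionTracker {
        private final Map<Long, Object> owners = new HashMap<>();

        void setOwner(long sessionId, Object owner) {
            owners.put(sessionId, owner);
        }

        /** Rejects the request if its owner is not the recorded one (identity comparison). */
        void checkSession(long sessionId, Object requestOwner) throws SessionMovedException {
            Object recorded = owners.get(sessionId);
            if (recorded != requestOwner) {
                throw new SessionMovedException(sessionId);
            }
        }
    }

    /** Returns true if the tracker rejects a request carrying the given owner. */
    static boolean isRejected(ToySessionTracker tracker, long sessionId, Object requestOwner) {
        try {
            tracker.checkSession(sessionId, requestOwner);
            return false;
        } catch (SessionMovedException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ToySessionTracker tracker = new ToySessionTracker();
        Object follower = new Object();      // stands in for the connection that owns the session
        long sessionId = 0x298d3b1fa9L;      // session id taken from the quoted log line
        tracker.setOwner(sessionId, follower);

        System.out.println(isRejected(tracker, sessionId, follower)); // owner set on the request
        System.out.println(isRejected(tracker, sessionId, null));     // owner left unset
    }
}
```

In this model the request that carries the owner passes the check, while the one with the owner left unset is rejected even though the session never moved, which is the kind of spurious message the report describes.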
[jira] Commented: (ZOOKEEPER-884) Remove LedgerSequence references from BookKeeper documentation and comments in tests
[ https://issues.apache.org/jira/browse/ZOOKEEPER-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928557#action_12928557 ]

Hudson commented on ZOOKEEPER-884:
----------------------------------

Integrated in ZooKeeper-trunk #991 (See [https://hudson.apache.org/hudson/job/ZooKeeper-trunk/991/])
ZOOKEEPER-884. Remove LedgerSequence references from BookKeeper documentation and comments in tests

> Remove LedgerSequence references from BookKeeper documentation and comments in tests
> ------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-884
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-884
> Project: Zookeeper
> Issue Type: Bug
> Components: contrib-bookkeeper
> Affects Versions: 3.3.1
> Reporter: Flavio Junqueira
> Assignee: Flavio Junqueira
> Fix For: 3.4.0
>
> Attachments: ZOOKEEPER-884.patch
>
>
> We no longer use LedgerSequence, so we need to remove references in documentation and comments sprinkled throughout the code.
[jira] Commented: (ZOOKEEPER-916) Problem receiving messages from subscribed channels in c++ client
[ https://issues.apache.org/jira/browse/ZOOKEEPER-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928559#action_12928559 ]

Hudson commented on ZOOKEEPER-916:
----------------------------------

Integrated in ZooKeeper-trunk #991 (See [https://hudson.apache.org/hudson/job/ZooKeeper-trunk/991/])
ZOOKEEPER-916. Problem receiving messages from subscribed channels in c++ client

> Problem receiving messages from subscribed channels in c++ client
> -----------------------------------------------------------------
>
> Key: ZOOKEEPER-916
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-916
> Project: Zookeeper
> Issue Type: Bug
> Components: contrib-hedwig
> Reporter: Ivan Kelly
> Assignee: Ivan Kelly
> Attachments: ZOOKEEPER-916.patch
>
>
> We see this bug when receiving messages from a subscribed channel. The problem seems to happen with larger messages. The flow is to first read at least 4 bytes from the socket channel, then extract the first 4 bytes to get the message size. If we've already read enough data into the buffer, we're done, so we invoke the messageReadCallbackHandler, passing the channel and message size. If not, we do an async read for at least the remaining number of bytes in the message from the socket channel. When that is done, we invoke the messageReadCallbackHandler.
> The problem seems to be that when the second async read is done, the same sizeReadCallbackHandler is invoked instead of the messageReadCallbackHandler. The result is that we then try to read the first 4 bytes again from the buffer. This will get a random message size and screw things up. I'm not sure if it's an incorrect use of the boost asio async_read function or we're doing the boost bind to the callback function incorrectly.
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - DuplexChannel::sizeReadCallbackHandler system:0,512 channel(0x80b7a18)
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - DuplexChannel::sizeReadCallbackHandler: size of buffer before reading message size: 512 channel(0x80b7a18)
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - DuplexChannel::sizeReadCallbackHandler: size of incoming message 599, currently in buffer 508 channel(0x80b7a18)
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - DuplexChannel::sizeReadCallbackHandler: Still have more data to read, 91 from channel(0x80b7a18)
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - DuplexChannel::sizeReadCallbackHandler system:0, 91 channel(0x80b7a18)
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - DuplexChannel::sizeReadCallbackHandler: size of buffer before reading message size: 599 channel(0x80b7a18)
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - DuplexChannel::sizeReadCallbackHandler: size of incoming message 134287360, currently in buffer 595 channel(0x80b7a18)
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - DuplexChannel::sizeReadCallbackHandler: Still have more data to read, 134286765 from channel(0x80b7a18)
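
Editorial sketch for the ZOOKEEPER-916 description above: the read flow it describes (read a 4-byte size, then the message body) can be modelled as a two-state decoder, which makes it explicit that bytes arriving after the size has been parsed belong to the body and must not be parsed as a size again. This is a Java illustration rather than the Hedwig C++ client; the class name `FrameDecoderSketch` and the big-endian byte order of the size prefix are assumptions, not details taken from the report.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical two-state decoder for frames of the form [4-byte size][body]. */
class FrameDecoderSketch {
    private enum State { READ_SIZE, READ_BODY }

    private State state = State.READ_SIZE;
    private int expected = 0;                                  // body size of the frame in progress
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();

    /** Accepts the next chunk from the socket and returns every message it completes. */
    List<byte[]> feed(byte[] chunk) {
        pending.write(chunk, 0, chunk.length);
        List<byte[]> complete = new ArrayList<>();
        while (true) {
            byte[] buf = pending.toByteArray();
            if (state == State.READ_SIZE) {
                if (buf.length < 4) break;                     // size prefix not complete yet
                expected = ByteBuffer.wrap(buf, 0, 4).getInt(); // assumed big-endian
                if (expected < 0) throw new IllegalStateException("negative frame size");
                keepFrom(buf, 4);
                state = State.READ_BODY;                       // the next bytes are body, not size
            } else {
                if (buf.length < expected) break;              // body not complete yet
                complete.add(Arrays.copyOfRange(buf, 0, expected));
                keepFrom(buf, expected);
                state = State.READ_SIZE;                       // only now is a new size expected
            }
        }
        return complete;
    }

    /** Drops the first `from` bytes of buf and keeps the rest as pending input. */
    private void keepFrom(byte[] buf, int from) {
        pending.reset();
        pending.write(buf, from, buf.length - from);
    }

    public static void main(String[] args) {
        // Same shape as the quoted log: a 599-byte message arriving as 512 bytes, then 91 bytes.
        byte[] frame = ByteBuffer.allocate(4 + 599).putInt(599).array();
        FrameDecoderSketch decoder = new FrameDecoderSketch();
        int afterFirst = decoder.feed(Arrays.copyOfRange(frame, 0, 512)).size();
        int afterSecond = decoder.feed(Arrays.copyOfRange(frame, 512, 603)).size();
        System.out.println(afterFirst + " " + afterSecond);    // prints "0 1"
    }
}
```

With the state kept explicitly, the second 91-byte read completes the 599-byte message. In the quoted log the size handler ran a second time instead and reported a bogus size of 134287360.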
[jira] Commented: (ZOOKEEPER-918) Review of BookKeeper Documentation (Sequence flow and failure scenarios)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928566#action_12928566 ]

Flavio Junqueira commented on ZOOKEEPER-918:
--------------------------------------------

This is really nice, Amit, thanks. I haven't had a chance to go carefully over the document, but my first reaction is that this should be a live document, and perhaps a wiki page would suit this purpose well. What do you think?

> Review of BookKeeper Documentation (Sequence flow and failure scenarios)
> ------------------------------------------------------------------------
>
> Key: ZOOKEEPER-918
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-918
> Project: Zookeeper
> Issue Type: Task
> Components: documentation
> Reporter: Amit Jaiswal
> Priority: Trivial
> Fix For: 3.3.3, 3.4.0
>
> Attachments: BookKeeperInternals.pdf
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> I have prepared a document describing some of the internals of bookkeeper in terms of:
> 1. Sequence of operations
> 2. Files layout
> 3. Failure scenarios
> The document was prepared mostly by reading the code. Can somebody who understands the design review it?
[jira] Commented: (ZOOKEEPER-900) FLE implementation should be improved to use non-blocking sockets
[ https://issues.apache.org/jira/browse/ZOOKEEPER-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928590#action_12928590 ]

Vishal K commented on ZOOKEEPER-900:
------------------------------------

Hi Flavio,

Thanks for your feedback. I will make the code changes.

For point 2 above, I was referring to the code that deletes the SenderWorker and ReceiveWorker pair after receiving a connect request. I was concerned that a peer might send frequent connect requests to the remote peer before the remote peer can initiate a connection back. But I think the

    Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);

in lookForLeader will prevent this scenario. Also, this won't be a concern if we decide to remove the part that kills the pair for each connect.

I am also thinking of adding a sanity check that will accept connections only from peers that are listed in the zoo.cfg file or that identify themselves with OBSERVER_ID. I have not used observers so far. Can you please explain why a node will use OBSERVER_ID instead of its sid? In particular, I am referring to the following code in QuorumCnxManager:

    // Read server id
    sid = Long.valueOf(msgBuffer.getLong());
    if (sid == QuorumPeer.OBSERVER_ID) {
        /*
         * Choose identifier at random. We need a value to identify
         * the connection.
         */
        sid = observerCounter--;
        LOG.info("Setting arbitrary identifier to observer: " + sid);
    }

> FLE implementation should be improved to use non-blocking sockets
> -----------------------------------------------------------------
>
> Key: ZOOKEEPER-900
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-900
> Project: Zookeeper
> Issue Type: Bug
> Reporter: Vishal K
> Assignee: Flavio Junqueira
> Priority: Critical
>
> From earlier email exchanges:
> 1. Blocking connects and accepts:
> a) The first problem is in manager.toSend(). This invokes connectOne(), which does a blocking connect. While testing, I changed the code so that connectOne() starts a new thread called AsyncConnect(). AsyncConnect.run() does a socketChannel.connect(). After starting AsyncConnect, connectOne starts a timer. connectOne continues with normal operations if the connection is established before the timer expires; otherwise, when the timer expires, it interrupts the AsyncConnect() thread and returns. In this way, I can have an upper bound on the amount of time we need to wait for connect to succeed. Of course, this was a quick fix for my testing. Ideally, we should use Selector to do non-blocking connects/accepts. I am planning to do that later once we at least have a quick fix for the problem and consensus from others for the real fix (this problem is a big blocker for us). Note that it is OK to do blocking IO in SenderWorker and RecvWorker threads since they block on IO to the respective peer.
> b) The blocking IO problem is not just restricted to connectOne(), but is also in receiveConnection(). The Listener thread calls receiveConnection() for each incoming connection request. receiveConnection does blocking IO to get the peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that had sent the connection request. All of this is happening from the Listener. In short, if a peer fails after initiating a connection, the Listener thread won't be able to accept connections from other peers, because it would be stuck in read() or connectOne(). Also, the code has an inherent cycle. initiateConnection() and receiveConnection() will have to be very carefully synchronized; otherwise, we could run into deadlocks. This code is going to be difficult to maintain/modify.
> Also see: https://issues.apache.org/jira/browse/ZOOKEEPER-822
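
Editorial sketch for the ZOOKEEPER-900 description above: the report suggests using a Selector for non-blocking connects so that a connect attempt has an upper bound on how long it can take. The following is a minimal illustration of that idea with `java.nio`, under the assumption that one connect at a time is acceptable. The class `BoundedConnectSketch` and the method `connectWithTimeout` are hypothetical names and are not part of QuorumCnxManager or of any attached patch.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

/** Hypothetical helper: a connect that gives up after timeoutMs instead of blocking indefinitely. */
class BoundedConnectSketch {

    static SocketChannel connectWithTimeout(InetSocketAddress addr, long timeoutMs) throws IOException {
        if (timeoutMs <= 0) {
            // select(0) would block without a bound, which defeats the purpose.
            throw new IllegalArgumentException("timeoutMs must be positive");
        }
        SocketChannel channel = SocketChannel.open();
        try (Selector selector = Selector.open()) {
            channel.configureBlocking(false);
            if (!channel.connect(addr)) {                        // false: connect still in progress
                channel.register(selector, SelectionKey.OP_CONNECT);
                if (selector.select(timeoutMs) == 0) {           // nothing became ready in time
                    throw new SocketTimeoutException("connect to " + addr + " timed out");
                }
                if (!channel.finishConnect()) {                  // throws if the connect failed
                    throw new IOException("connect to " + addr + " did not complete");
                }
            }
            return channel;                                      // connected, still non-blocking
        } catch (IOException | RuntimeException e) {
            channel.close();                                     // do not leak the socket on failure
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Local listener standing in for a remote peer's election port.
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            InetSocketAddress addr = (InetSocketAddress) server.getLocalAddress();
            try (SocketChannel channel = connectWithTimeout(addr, 1000)) {
                System.out.println("connected: " + channel.isConnected());
            }
        }
    }
}
```

A full fix along the lines the report describes would presumably share one Selector between the Listener's accepts and the outgoing connects, so that no single unresponsive peer can stall the thread. This sketch bounds only a single connect.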
[jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader
[ https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928605#action_12928605 ]

Vishal K commented on ZOOKEEPER-917:
------------------------------------

Hi Flavio,

Sorry for not making much progress on (http://wiki.apache.org/hadoop/ZooKeeper/ClusterMembership). I have spent some time understanding the code, but it is a bit difficult to focus on development without dedicated development time. I am pushing to get dedicated development time at work for this so that I don't have to rely on my spare time.

A few questions related to your comments:

1. Can you please elaborate on: "At the same time, a server A decides to follow another server B if it receives a message from B saying that B is leading and from a quorum saying that they are following, even if A is in a later election epoch. This mechanism is there to avoid A being locked out of the ensemble in the case it partitions away and comes back later."
2. Why is it not OK for B to give up leadership when it sees that its is lower than others?

Thanks.

> Leader election selected incorrect leader
> -----------------------------------------
>
> Key: ZOOKEEPER-917
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917
> Project: Zookeeper
> Issue Type: Bug
> Components: leaderElection, server
> Affects Versions: 3.2.2
> Environment: Cloudera distribution of zookeeper (patched to never cache DNS entries)
> Debian lenny
> Reporter: Alexandre Hardy
> Priority: Critical
> Fix For: 3.3.3, 3.4.0
>
> Attachments: zklogs-20101102144159SAST.tar.gz
>
>
> We had three nodes running zookeeper:
> * 192.168.130.10
> * 192.168.130.11
> * 192.168.130.14
> 192.168.130.11 failed, and was replaced by a new node 192.168.130.13 (automated startup). The new node had not participated in any zookeeper quorum previously. The node 192.168.130.11 was permanently removed from service and could not contribute to the quorum any further (powered off).
> DNS entries were updated for the new node to allow all the zookeeper servers to find the new node.
> The new node 192.168.130.13 was selected as the LEADER, despite the fact that it had not seen the latest zxid.
> This particular problem has not been verified with later versions of zookeeper, and no attempt has been made to reproduce this problem as yet.
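
Editorial sketch for the ZOOKEEPER-917 report above: the report expects that a node which has not seen the latest zxid should not win the election. One common ordering rule, assumed here for illustration and not verified against the 3.2.2 code, is that a vote with a higher zxid wins and ties are broken by the higher server id. The class `VoteOrderSketch`, the method `supersedes`, and the zxid and server id values below are all hypothetical.

```java
/** Hypothetical vote ordering: prefer the higher zxid, break ties with the higher server id. */
class VoteOrderSketch {

    /** Returns true if the new vote (newZxid, newSid) should replace the current one. */
    static boolean supersedes(long newZxid, long newSid, long curZxid, long curSid) {
        return newZxid > curZxid || (newZxid == curZxid && newSid > curSid);
    }

    public static void main(String[] args) {
        long freshZxid = 0L;               // a replacement node with no transaction history (assumed)
        long establishedZxid = 0x10000002aL; // an arbitrary "latest zxid" seen by an existing node
        long freshSid = 3L;                // server ids are made up for the example
        long establishedSid = 1L;

        // A fresh node should not displace a node that has seen more transactions...
        System.out.println(supersedes(freshZxid, freshSid, establishedZxid, establishedSid));
        // ...while the established node should displace the fresh one despite its lower id.
        System.out.println(supersedes(establishedZxid, establishedSid, freshZxid, freshSid));
    }
}
```

Under this rule the first comparison is false and the second is true, so a node that has not seen the latest zxid being elected, as the report describes, indicates that something other than a plain zxid comparison decided the outcome.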
[jira] Commented: (ZOOKEEPER-918) Review of BookKeeper Documentation (Sequence flow and failure scenarios)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928642#action_12928642 ]

Amit Jaiswal commented on ZOOKEEPER-918:
----------------------------------------

That's a good suggestion, but I don't have access to create a new wiki page. Also, I just saw a couple of new wiki pages devoted to bookkeeper performance and bookie recovery. Please let me know how to publish this in wiki format.

I am attaching the original doc format file too, in case someone wants to take the relevant sections and publish them in different wikis.

> Review of BookKeeper Documentation (Sequence flow and failure scenarios)
> ------------------------------------------------------------------------
>
> Key: ZOOKEEPER-918
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-918
> Project: Zookeeper
> Issue Type: Task
> Components: documentation
> Reporter: Amit Jaiswal
> Priority: Trivial
> Fix For: 3.3.3, 3.4.0
>
> Attachments: BookKeeperInternals.pdf
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> I have prepared a document describing some of the internals of bookkeeper in terms of:
> 1. Sequence of operations
> 2. Files layout
> 3. Failure scenarios
> The document was prepared mostly by reading the code. Can somebody who understands the design review it?
[jira] Updated: (ZOOKEEPER-918) Review of BookKeeper Documentation (Sequence flow and failure scenarios)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amit Jaiswal updated ZOOKEEPER-918:
----------------------------------

Attachment: BookKeeperInternals.doc

Attaching the original document file.

> Review of BookKeeper Documentation (Sequence flow and failure scenarios)
> ------------------------------------------------------------------------
>
> Key: ZOOKEEPER-918
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-918
> Project: Zookeeper
> Issue Type: Task
> Components: documentation
> Reporter: Amit Jaiswal
> Priority: Trivial
> Fix For: 3.3.3, 3.4.0
>
> Attachments: BookKeeperInternals.doc, BookKeeperInternals.pdf
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> I have prepared a document describing some of the internals of bookkeeper in terms of:
> 1. Sequence of operations
> 2. Files layout
> 3. Failure scenarios
> The document is prepared by mostly by reading the code. Can somebody who understands the design review the same.
[jira] Commented: (ZOOKEEPER-918) Review of BookKeeper Documentation (Sequence flow and failure scenarios)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928697#action_12928697 ]

Patrick Hunt commented on ZOOKEEPER-918:
----------------------------------------

There are really two options for docs (today):

1) Put it into svn as a Forrest doc. Typically this is for documentation that's version specific and needs to be versioned along with the code.
2) Put it into the wiki. Usually this is for non-version-specific detail.

Putting it into svn requires a patch for each change, which adds to the overhead. Another way to go is to start on the wiki and, once the doc is fairly stable, move it to svn.

> Review of BookKeeper Documentation (Sequence flow and failure scenarios)
> ------------------------------------------------------------------------
>
> Key: ZOOKEEPER-918
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-918
> Project: Zookeeper
> Issue Type: Task
> Components: documentation
> Reporter: Amit Jaiswal
> Priority: Trivial
> Fix For: 3.3.3, 3.4.0
>
> Attachments: BookKeeperInternals.doc, BookKeeperInternals.pdf
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> I have prepared a document describing some of the internals of bookkeeper in terms of:
> 1. Sequence of operations
> 2. Files layout
> 3. Failure scenarios
> The document was prepared mostly by reading the code. Can somebody who understands the design review it?
[jira] Updated: (ZOOKEEPER-918) Review of BookKeeper Documentation (Sequence flow and failure scenarios)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-918:
----------------------------------

Priority: Minor (was: Trivial)

> Review of BookKeeper Documentation (Sequence flow and failure scenarios)
> ------------------------------------------------------------------------
>
> Key: ZOOKEEPER-918
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-918
> Project: Zookeeper
> Issue Type: Task
> Components: documentation
> Reporter: Amit Jaiswal
> Assignee: Amit Jaiswal
> Priority: Minor
> Fix For: 3.3.3, 3.4.0
>
> Attachments: BookKeeperInternals.doc, BookKeeperInternals.pdf
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> I have prepared a document describing some of the internals of bookkeeper in terms of:
> 1. Sequence of operations
> 2. Files layout
> 3. Failure scenarios
> The document is prepared by mostly by reading the code. Can somebody who understands the design review the same.
[jira] Updated: (ZOOKEEPER-918) Review of BookKeeper Documentation (Sequence flow and failure scenarios)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-918:
----------------------------------

Assignee: Amit Jaiswal

> Review of BookKeeper Documentation (Sequence flow and failure scenarios)
> ------------------------------------------------------------------------
>
> Key: ZOOKEEPER-918
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-918
> Project: Zookeeper
> Issue Type: Task
> Components: documentation
> Reporter: Amit Jaiswal
> Assignee: Amit Jaiswal
> Priority: Trivial
> Fix For: 3.3.3, 3.4.0
>
> Attachments: BookKeeperInternals.doc, BookKeeperInternals.pdf
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> I have prepared a document describing some of the internals of bookkeeper in terms of:
> 1. Sequence of operations
> 2. Files layout
> 3. Failure scenarios
> The document is prepared by mostly by reading the code. Can somebody who understands the design review the same.
[jira] Updated: (ZOOKEEPER-909) Extract NIO specific code from ClientCnxn
[ https://issues.apache.org/jira/browse/ZOOKEEPER-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-909:
-----------------------------------

Status: Open (was: Patch Available)

once a couple of small changes are made to this patch, we should be good to go.

> Extract NIO specific code from ClientCnxn
> -----------------------------------------
>
> Key: ZOOKEEPER-909
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-909
> Project: Zookeeper
> Issue Type: Sub-task
> Components: java client
> Reporter: Thomas Koch
> Assignee: Thomas Koch
> Fix For: 3.4.0
>
> Attachments: ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, ZOOKEEPER-909.patch
>
>
> This patch is mostly the same patch as my last one for ZOOKEEPER-823 minus everything Netty related. This means this patch only extracts all NIO specific code into the class ClientCnxnSocketNIO, which extends ClientCnxnSocket.
> I've redone this patch from current trunk step by step now and couldn't find any logical error. I've already done a couple of successful test runs and will continue to do so tonight.
> It would be nice if we could apply this patch to trunk as soon as possible. This allows us to continue to work on the Netty integration without blocking the ClientCnxn class. Adding Netty after this patch should only be a matter of adding the ClientCnxnSocketNetty class with the appropriate test cases.
> You could help me by reviewing the patch and by running it on whatever test server you have available. Please send me any complete failure log you encounter to thomas at koch point ro. Thx!
> Update: Until now, I've collected 8 successful builds in a row!