[jira] Commented: (ZOOKEEPER-907) Spurious "KeeperErrorCode = Session moved" messages

2010-11-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928558#action_12928558
 ] 

Hudson commented on ZOOKEEPER-907:
--

Integrated in ZooKeeper-trunk #991 (See 
[https://hudson.apache.org/hudson/job/ZooKeeper-trunk/991/])
ZOOKEEPER-907. Spurious "KeeperErrorCode = Session moved" messages


> Spurious "KeeperErrorCode = Session moved" messages
> ---
>
> Key: ZOOKEEPER-907
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-907
> Project: Zookeeper
> Issue Type: Bug
> Affects Versions: 3.3.1
> Reporter: Vishal K
> Assignee: Vishal K
> Priority: Blocker
> Fix For: 3.3.2, 3.4.0
>
> Attachments: ZOOKEEPER-907.patch, ZOOKEEPER-907.patch_v2
>
>
> The sync request does not set the session owner in Request.
> As a result, the leader keeps printing:
> 2010-07-01 10:55:36,733 - INFO  [ProcessThread:-1:preprequestproces...@405] - 
> Got user-level KeeperException when processing sessionid:0x298d3b1fa9 
> type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error 
> Path:null Error:KeeperErrorCode = Session moved

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-884) Remove LedgerSequence references from BookKeeper documentation and comments in tests

2010-11-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928557#action_12928557
 ] 

Hudson commented on ZOOKEEPER-884:
--

Integrated in ZooKeeper-trunk #991 (See 
[https://hudson.apache.org/hudson/job/ZooKeeper-trunk/991/])
ZOOKEEPER-884. Remove LedgerSequence references from BookKeeper 
documentation and comments in tests


> Remove LedgerSequence references from BookKeeper documentation and comments 
> in tests 
> -
>
> Key: ZOOKEEPER-884
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-884
> Project: Zookeeper
> Issue Type: Bug
> Components: contrib-bookkeeper
> Affects Versions: 3.3.1
> Reporter: Flavio Junqueira
> Assignee: Flavio Junqueira
> Fix For: 3.4.0
>
> Attachments: ZOOKEEPER-884.patch
>
>
> We no longer use LedgerSequence, so we need to remove references in 
> documentation and comments sprinkled throughout the code.




[jira] Commented: (ZOOKEEPER-916) Problem receiving messages from subscribed channels in c++ client

2010-11-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928559#action_12928559
 ] 

Hudson commented on ZOOKEEPER-916:
--

Integrated in ZooKeeper-trunk #991 (See 
[https://hudson.apache.org/hudson/job/ZooKeeper-trunk/991/])
ZOOKEEPER-916. Problem receiving messages from subscribed channels in c++ 
client


> Problem receiving messages from subscribed channels in c++ client 
> --
>
> Key: ZOOKEEPER-916
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-916
> Project: Zookeeper
> Issue Type: Bug
> Components: contrib-hedwig
> Reporter: Ivan Kelly
> Assignee: Ivan Kelly
> Attachments: ZOOKEEPER-916.patch
>
>
> We see this bug when receiving messages from a subscribed channel. The 
> problem seems to happen with larger messages. The flow is to first read at 
> least 4 bytes from the socket channel, then extract the first 4 bytes to get 
> the message size. If we've read enough data into the buffer already, we're 
> done, so we invoke the messageReadCallbackHandler, passing the channel and 
> message size. If not, we do an async read for at least the remaining number 
> of bytes in the message from the socket channel. When done, we invoke the 
> messageReadCallbackHandler.
> The problem seems to be that when the second async read is done, the same 
> sizeReadCallbackHandler is invoked instead of the messageReadCallbackHandler. 
> The result is that we then try to read the first 4 bytes again from the 
> buffer. This yields a random message size and screws things up. I'm not 
> sure if it's an incorrect use of the boost asio async_read function or we're 
> doing the boost bind to the callback function incorrectly.
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - 
> DuplexChannel::sizeReadCallbackHandler system:0,512 channel(0x80b7a18)
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - 
> DuplexChannel::sizeReadCallbackHandler: size of buffer before reading message 
> size: 512 channel(0x80b7a18)
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - 
> DuplexChannel::sizeReadCallbackHandler: size of incoming message 599, 
> currently in buffer 508 channel(0x80b7a18)
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - 
> DuplexChannel::sizeReadCallbackHandler: Still have more data to read, 91 from 
> channel(0x80b7a18)
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - 
> DuplexChannel::sizeReadCallbackHandler system:0, 91 channel(0x80b7a18)
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - 
> DuplexChannel::sizeReadCallbackHandler: size of buffer before reading message 
> size: 599 channel(0x80b7a18)
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - 
> DuplexChannel::sizeReadCallbackHandler: size of incoming message 134287360, 
> currently in buffer 595 channel(0x80b7a18)
> 101015 15:30:40.108 DEBUG hedwig.channel.cpp - 
> DuplexChannel::sizeReadCallbackHandler: Still have more data to read, 
> 134286765 from channel(0x80b7a18)
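
The read flow described above can be sketched as an explicit two-state 
decoder, which makes it harder to re-enter the size-reading step by mistake. 
This is only an illustration in Java, not the Hedwig C++ client code: the class 
name FrameDecoder, its feed method, and the big-endian 4-byte length prefix are 
assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical two-state decoder for messages with a 4-byte length prefix. */
final class FrameDecoder {
    private enum State { READ_SIZE, READ_BODY }

    private State state = State.READ_SIZE;
    private int expected = 4;                            // bytes needed to finish the current state
    private ByteBuffer pending = ByteBuffer.allocate(0); // bytes received but not yet consumed

    /** Feeds newly received bytes and returns every message completed by them. */
    public List<byte[]> feed(byte[] chunk) {
        // Append the new chunk to whatever was left over from earlier reads.
        ByteBuffer merged = ByteBuffer.allocate(pending.remaining() + chunk.length);
        merged.put(pending).put(chunk).flip();

        List<byte[]> messages = new ArrayList<>();
        while (merged.remaining() >= expected) {
            if (state == State.READ_SIZE) {
                expected = merged.getInt();              // assumed big-endian size prefix
                if (expected < 0) {
                    throw new IllegalStateException("negative frame size: " + expected);
                }
                state = State.READ_BODY;                 // next completion belongs to the body
            } else {
                byte[] body = new byte[expected];
                merged.get(body);
                messages.add(body);
                expected = 4;                            // back to reading the next size prefix
                state = State.READ_SIZE;
            }
        }
        pending = merged;                                // keep the unconsumed remainder
        return messages;
    }
}
```

With the numbers from the log above, feeding the first 512 bytes (4 size bytes 
plus 508 body bytes) completes no message, and feeding the remaining 91 bytes 
completes one 599-byte message without the size prefix being parsed a second 
time.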




[jira] Commented: (ZOOKEEPER-918) Review of BookKeeper Documentation (Sequence flow and failure scenarios)

2010-11-05 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928566#action_12928566
 ] 

Flavio Junqueira commented on ZOOKEEPER-918:


This is really nice, Amit, thanks. I haven't had a chance to go carefully over 
the document, but my first reaction is that this should be a living document, 
and perhaps a wiki page would suit this purpose well. What do you think?

> Review of BookKeeper Documentation (Sequence flow and failure scenarios)
> 
>
> Key: ZOOKEEPER-918
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-918
> Project: Zookeeper
> Issue Type: Task
> Components: documentation
> Reporter: Amit Jaiswal
> Priority: Trivial
> Fix For: 3.3.3, 3.4.0
>
> Attachments: BookKeeperInternals.pdf
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> I have prepared a document describing some of the internals of bookkeeper in 
> terms of:
> 1. Sequence of operations
> 2. Files layout
> 3. Failure scenarios
> The document was prepared mostly by reading the code. Can somebody who 
> understands the design review it?




[jira] Commented: (ZOOKEEPER-900) FLE implementation should be improved to use non-blocking sockets

2010-11-05 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928590#action_12928590
 ] 

Vishal K commented on ZOOKEEPER-900:


Hi Flavio,

Thanks for your feedback. I will do the code changes.

For point 2 above, I was referring to the code that deletes the SenderWorker 
and ReceiveWorker pair after receiving a connect request. I was concerned that 
a peer might send frequent connect requests to the remote peer before the 
remote peer can initiate a connection back. But I think the 
Notification n = recvqueue.poll(notTimeout,  TimeUnit.MILLISECONDS); in 
lookForLeader will prevent this scenario. Also, this won't be a concern if we 
decide to remove the part that kills the pair for each connect.

I am also thinking of adding a sanity check that will accept connections only 
from peers that are listed in the zoo.cfg file or that use OBSERVER_ID.
I have not used observers so far. Can you please explain why a node will use 
OBSERVER_ID instead of its sid? In particular, I am referring to the following 
code in QuorumCnxManager:
    // Read server id
    sid = Long.valueOf(msgBuffer.getLong());
    if (sid == QuorumPeer.OBSERVER_ID) {
        /*
         * Choose identifier at random. We need a value to identify
         * the connection.
         */

        sid = observerCounter--;
        LOG.info("Setting arbitrary identifier to observer: " + sid);
    }
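
One way to picture the proposed sanity check together with the observer 
handling quoted above is the sketch below. It reads the check as "accept only 
server ids that appear in the configuration, or the observer sentinel". The 
class PeerIdCheck, its resolve method, and the sentinel value Long.MAX_VALUE 
are assumptions made for illustration; this is not the QuorumCnxManager code.

```java
import java.util.Set;

/** Hypothetical helper: decides which id an incoming connection is tracked under. */
final class PeerIdCheck {
    // Assumed sentinel that stands in for QuorumPeer.OBSERVER_ID in this sketch.
    static final long OBSERVER_ID = Long.MAX_VALUE;

    private final Set<Long> configuredSids; // server ids taken from the configuration
    private long observerCounter = -1;      // decreasing ids handed out to observers

    PeerIdCheck(Set<Long> configuredSids) {
        this.configuredSids = configuredSids;
    }

    /** Returns the id to use for the connection, or null if it should be rejected. */
    public Long resolve(long sid) {
        if (sid == OBSERVER_ID) {
            // Observers share one sentinel, so each connection gets its own
            // locally unique negative id.
            return observerCounter--;
        }
        return configuredSids.contains(sid) ? sid : null;
    }
}
```

Because every observer announces the same sentinel, the receiving side needs a 
distinct key per connection, which is what the decreasing counter provides in 
this sketch.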

> FLE implementation should be improved to use non-blocking sockets
> -
>
> Key: ZOOKEEPER-900
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-900
> Project: Zookeeper
> Issue Type: Bug
> Reporter: Vishal K
> Assignee: Flavio Junqueira
> Priority: Critical
>
> From earlier email exchanges:
> 1. Blocking connects and accepts:
> a) The first problem is in manager.toSend(). This invokes connectOne(), which 
> does a blocking connect. While testing, I changed the code so that 
> connectOne() starts a new thread called AsyncConnect. AsyncConnect.run() 
> does a socketChannel.connect(). After starting AsyncConnect, connectOne 
> starts a timer. connectOne continues with normal operations if the connection 
> is established before the timer expires; otherwise, when the timer expires, it 
> interrupts the AsyncConnect thread and returns. In this way, I can have an 
> upper bound on the amount of time we need to wait for connect to succeed. Of 
> course, this was a quick fix for my testing. Ideally, we should use Selector 
> to do non-blocking connects/accepts. I am planning to do that later once we 
> at least have a quick fix for the problem and consensus from others for the 
> real fix (this problem is a big blocker for us). Note that it is OK to do 
> blocking IO in SenderWorker and RecvWorker threads since they block on IO 
> only to their respective peer.
> b) The blocking IO problem is not just restricted to connectOne(), but also 
> in receiveConnection(). The Listener thread calls receiveConnection() for 
> each incoming connection request. receiveConnection does blocking IO to get 
> peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the 
> peer that had sent the connection request. All of this is happening from the 
> Listener. In short, if a peer fails after initiating a connection, the 
> Listener thread won't be able to accept connections from other peers, because 
> it would be stuck in read() or connectOne(). Also, the code has an inherent 
> cycle. initiateConnection() and receiveConnection() will have to be very 
> carefully synchronized; otherwise, we could run into deadlocks. This code is 
> going to be difficult to maintain/modify.
> Also see: https://issues.apache.org/jira/browse/ZOOKEEPER-822
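
For reference, a Selector-based connect with an upper bound on the wait, as 
suggested in 1a, could look roughly like the sketch below. It is a minimal 
illustration and not the eventual patch: the class BoundedConnect, the connect 
method, and the choice to return null on timeout are assumptions made here.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: non-blocking connect that gives up after a deadline. */
final class BoundedConnect {
    /** Returns a connected (still non-blocking) channel, or null on timeout. */
    public static SocketChannel connect(InetSocketAddress peer, long timeoutMs)
            throws IOException {
        SocketChannel ch = SocketChannel.open();
        try (Selector selector = Selector.open()) {
            ch.configureBlocking(false);
            if (ch.connect(peer)) {
                return ch;                        // connected immediately (e.g. loopback)
            }
            ch.register(selector, SelectionKey.OP_CONNECT);
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            while (true) {
                long leftMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (leftMs <= 0) {
                    ch.close();                   // deadline passed: abandon the attempt
                    return null;
                }
                selector.select(leftMs);          // wait for OP_CONNECT or the timeout
                selector.selectedKeys().clear();
                if (ch.finishConnect()) {
                    return ch;
                }
            }
        } catch (IOException e) {
            ch.close();                           // connection refused or other failure
            throw e;
        }
    }
}
```

Unlike the thread-plus-timer workaround, the calling thread here waits on the 
selector itself, so no helper thread has to be interrupted when the deadline 
passes. A production version would more likely share one long-lived Selector 
across all peers instead of opening one per connect.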




[jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader

2010-11-05 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928605#action_12928605
 ] 

Vishal K commented on ZOOKEEPER-917:


Hi Flavio,

Sorry for not making much progress on 
(http://wiki.apache.org/hadoop/ZooKeeper/ClusterMembership). I have spent some 
time trying to understand the code, but it is a bit difficult to focus on 
development without dedicated development time. I am pushing to get dedicated 
development time at work for this so that I don't have to rely on my spare time.

A few questions related to your comments:
1. Can you please elaborate on: "At the same time, a server A decides to 
follow another server B if it receives a message from B saying that B is 
leading and from a quorum saying that they are following, even if A is in a 
later election epoch. This mechanism is there to avoid A being locked out of 
the ensemble in the case it partitions away and comes back later."

2. Why is it not OK for B to give up leadership when it sees that its 
 is lower than others?

Thanks.


> Leader election selected incorrect leader
> -
>
> Key: ZOOKEEPER-917
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917
> Project: Zookeeper
> Issue Type: Bug
> Components: leaderElection, server
> Affects Versions: 3.2.2
> Environment: Cloudera distribution of zookeeper (patched to never 
> cache DNS entries)
> Debian lenny
> Reporter: Alexandre Hardy
> Priority: Critical
> Fix For: 3.3.3, 3.4.0
>
> Attachments: zklogs-20101102144159SAST.tar.gz
>
>
> We had three nodes running zookeeper:
>   * 192.168.130.10
>   * 192.168.130.11
>   * 192.168.130.14
> 192.168.130.11 failed, and was replaced by a new node 192.168.130.13 
> (automated startup). The new node had not participated in any zookeeper 
> quorum previously. The node 192.168.130.11 was permanently removed from 
> service and could not contribute to the quorum any further (powered off).
> DNS entries were updated for the new node to allow all the zookeeper servers 
> to find the new node.
> The new node 192.168.130.13 was selected as the LEADER, despite the fact that 
> it had not seen the latest zxid.
> This particular problem has not been verified with later versions of 
> zookeeper, and no attempt has been made to reproduce this problem as yet.




[jira] Commented: (ZOOKEEPER-918) Review of BookKeeper Documentation (Sequence flow and failure scenarios)

2010-11-05 Thread Amit Jaiswal (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928642#action_12928642
 ] 

Amit Jaiswal commented on ZOOKEEPER-918:


That's a good suggestion, but I don't have access to create a new wiki page. 
Also, I just saw a couple of new wiki pages devoted to bookkeeper performance 
and bookie recovery.

Please let me know how to publish this in wiki format. I am attaching the 
original doc format file too, in case someone wants to take the relevant 
sections and publish them in different wikis.

> Review of BookKeeper Documentation (Sequence flow and failure scenarios)
> 
>
> Key: ZOOKEEPER-918
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-918
> Project: Zookeeper
> Issue Type: Task
> Components: documentation
> Reporter: Amit Jaiswal
> Priority: Trivial
> Fix For: 3.3.3, 3.4.0
>
> Attachments: BookKeeperInternals.pdf
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> I have prepared a document describing some of the internals of bookkeeper in 
> terms of:
> 1. Sequence of operations
> 2. Files layout
> 3. Failure scenarios
> The document was prepared mostly by reading the code. Can somebody who 
> understands the design review it?




[jira] Updated: (ZOOKEEPER-918) Review of BookKeeper Documentation (Sequence flow and failure scenarios)

2010-11-05 Thread Amit Jaiswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Jaiswal updated ZOOKEEPER-918:
---

Attachment: BookKeeperInternals.doc

Attaching the original document file.

> Review of BookKeeper Documentation (Sequence flow and failure scenarios)
> 
>
> Key: ZOOKEEPER-918
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-918
> Project: Zookeeper
> Issue Type: Task
> Components: documentation
> Reporter: Amit Jaiswal
> Priority: Trivial
> Fix For: 3.3.3, 3.4.0
>
> Attachments: BookKeeperInternals.doc, BookKeeperInternals.pdf
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> I have prepared a document describing some of the internals of bookkeeper in 
> terms of:
> 1. Sequence of operations
> 2. Files layout
> 3. Failure scenarios
> The document was prepared mostly by reading the code. Can somebody who 
> understands the design review it?




[jira] Commented: (ZOOKEEPER-918) Review of BookKeeper Documentation (Sequence flow and failure scenarios)

2010-11-05 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928697#action_12928697
 ] 

Patrick Hunt commented on ZOOKEEPER-918:


There are really two options for docs (today):

1) put it into svn as a forrest doc. typically this is for documentation that's 
version specific - needs to be versioned along with the code
2) put it into wiki, usually this is non-version specific detail.

putting into svn requires a patch for each change, which adds to the overhead. 
another way to go is to start on the wiki, once the doc is fairly stable move 
it to svn.



> Review of BookKeeper Documentation (Sequence flow and failure scenarios)
> 
>
> Key: ZOOKEEPER-918
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-918
> Project: Zookeeper
> Issue Type: Task
> Components: documentation
> Reporter: Amit Jaiswal
> Priority: Trivial
> Fix For: 3.3.3, 3.4.0
>
> Attachments: BookKeeperInternals.doc, BookKeeperInternals.pdf
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> I have prepared a document describing some of the internals of bookkeeper in 
> terms of:
> 1. Sequence of operations
> 2. Files layout
> 3. Failure scenarios
> The document was prepared mostly by reading the code. Can somebody who 
> understands the design review it?




[jira] Updated: (ZOOKEEPER-918) Review of BookKeeper Documentation (Sequence flow and failure scenarios)

2010-11-05 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-918:
---

Priority: Minor  (was: Trivial)

> Review of BookKeeper Documentation (Sequence flow and failure scenarios)
> 
>
> Key: ZOOKEEPER-918
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-918
> Project: Zookeeper
> Issue Type: Task
> Components: documentation
> Reporter: Amit Jaiswal
> Assignee: Amit Jaiswal
> Priority: Minor
> Fix For: 3.3.3, 3.4.0
>
> Attachments: BookKeeperInternals.doc, BookKeeperInternals.pdf
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> I have prepared a document describing some of the internals of bookkeeper in 
> terms of:
> 1. Sequence of operations
> 2. Files layout
> 3. Failure scenarios
> The document was prepared mostly by reading the code. Can somebody who 
> understands the design review it?




[jira] Updated: (ZOOKEEPER-918) Review of BookKeeper Documentation (Sequence flow and failure scenarios)

2010-11-05 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-918:
---

Assignee: Amit Jaiswal

> Review of BookKeeper Documentation (Sequence flow and failure scenarios)
> 
>
> Key: ZOOKEEPER-918
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-918
> Project: Zookeeper
> Issue Type: Task
> Components: documentation
> Reporter: Amit Jaiswal
> Assignee: Amit Jaiswal
> Priority: Trivial
> Fix For: 3.3.3, 3.4.0
>
> Attachments: BookKeeperInternals.doc, BookKeeperInternals.pdf
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> I have prepared a document describing some of the internals of bookkeeper in 
> terms of:
> 1. Sequence of operations
> 2. Files layout
> 3. Failure scenarios
> The document was prepared mostly by reading the code. Can somebody who 
> understands the design review it?




[jira] Updated: (ZOOKEEPER-909) Extract NIO specific code from ClientCnxn

2010-11-05 Thread Benjamin Reed (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated ZOOKEEPER-909:


Status: Open  (was: Patch Available)

once a couple of small changes are made to this patch, we should be good to go.

> Extract NIO specific code from ClientCnxn
> -
>
> Key: ZOOKEEPER-909
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-909
> Project: Zookeeper
> Issue Type: Sub-task
> Components: java client
> Reporter: Thomas Koch
> Assignee: Thomas Koch
> Fix For: 3.4.0
>
> Attachments: ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, 
> ZOOKEEPER-909.patch
>
>
> This patch is mostly the same patch as my last one for ZOOKEEPER-823 minus 
> everything Netty related. This means this patch only extracts all NIO-specific 
> code into the class ClientCnxnSocketNIO, which extends ClientCnxnSocket.
> I've redone this patch from current trunk step by step now and couldn't find 
> any logical error. I've already done a couple of successful test runs and 
> will continue to do so this night.
> It would be nice, if we could apply this patch as soon as possible to trunk. 
> This allows us to continue to work on the netty integration without blocking 
> the ClientCnxn class. Adding Netty after this patch should be only a matter 
> of adding the ClientCnxnSocketNetty class with the appropriate test cases.
> You could help me by reviewing the patch and by running it on whatever test 
> server you have available. Please send me any complete failure log you should 
> encounter to thomas at koch point ro. Thx!
> Update: Until now, I've collected 8 successful builds in a row!
