[jira] Commented: (ZOOKEEPER-860) Add alternative search-provider to ZK site

2010-09-03 Thread Alex Baranau (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905821#action_12905821
 ] 

Alex Baranau commented on ZOOKEEPER-860:


Not sure that I follow why this issue was assigned to me. Is there anything I 
can do about it? I think I cannot commit the patch and hence resolve the 
issue...

 Add alternative search-provider to ZK site
 --

 Key: ZOOKEEPER-860
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-860
 Project: Zookeeper
  Issue Type: Improvement
  Components: documentation
Reporter: Alex Baranau
Assignee: Alex Baranau
Priority: Minor
 Attachments: ZOOKEEPER-860.patch


 Use the search-hadoop.com service to make search available across ZK sources, 
 MLs, wiki, etc.
 This was initially proposed on user mailing list 
 (http://search-hadoop.com/m/sTZ4Y1BVKWg1). The search service was already 
 added in site's skin (common for all Hadoop related projects) before (as a 
 part of [AVRO-626|https://issues.apache.org/jira/browse/AVRO-626]) so this 
 issue is about enabling it for ZK. The ultimate goal is to use it at all 
 Hadoop's sub-projects' sites.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-822) Leader election taking a long time to complete

2010-09-03 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905836#action_12905836
 ] 

Flavio Junqueira commented on ZOOKEEPER-822:


{quote}
1. Blocking connects and accepts:
You are right, when the node is down TCP timeouts rule.

a) The first problem is in manager.toSend(). This invokes connectOne(), which 
does a blocking connect. While testing, I changed the code so that connectOne() 
starts a new thread called AsyncConnect(). AsyncConnect.run() does a 
socketChannel.connect(). After starting AsyncConnect, connectOne starts a 
timer. connectOne continues with normal operations if the connection is 
established before the timer expires, otherwise, when the timer expires it 
interrupts AsyncConnect() thread and returns. In this way, I can have an upper 
bound on the amount of time we need to wait for connect to succeed. Of course, 
this was a quick fix for my testing. Ideally, we should use Selector to do 
non-blocking connects/accepts. I am planning to do that later once we at least 
have a quick fix for the problem and consensus from others for the real fix 
(this problem is a big blocker for us). Note that it is OK to do blocking IO in 
SenderWorker and RecvWorker threads since they block IO to the respective peer.
{quote}

As I commented before, it might be ok to make it asynchronous, especially if we 
have a way of checking that there is an attempt to establish a connection in 
progress. 
I'm also still intrigued about why this is a problem for you. I haven't seen 
any of this being a problem before, which of course doesn't mean we shouldn't 
fix it. It would be nice to understand what's special about your setup or if 
others have seen similar problems and I missed the reports.
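
For illustration, a minimal sketch of bounding the connect time without a helper thread, using the socket-level connect timeout; the names are hypothetical and this is not the actual QuorumCnxManager code:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.SocketChannel;

    class BoundedConnect {
        // Attempt a connection, but never block longer than cnxTimeoutMs.
        static SocketChannel connectWithTimeout(InetSocketAddress addr, int cnxTimeoutMs) {
            SocketChannel channel = null;
            try {
                channel = SocketChannel.open();
                channel.socket().setTcpNoDelay(true);
                // Socket.connect(SocketAddress, int) throws SocketTimeoutException
                // once cnxTimeoutMs elapses, which bounds the wait on a dead peer.
                channel.socket().connect(addr, cnxTimeoutMs);
                return channel;
            } catch (IOException e) {
                if (channel != null) {
                    try { channel.close(); } catch (IOException ignore) { }
                }
                return null; // caller retries later instead of hanging on TCP timeouts
            }
        }
    }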

{quote}
b) The blocking IO problem is not just restricted to connectOne(), but also in 
receiveConnection(). The Listener thread calls receiveConnection() for each 
incoming connection request. receiveConnection does blocking IO to get peer's 
info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that 
had sent the connection request. All of this is happening from the Listener. In 
short, if a peer fails after initiating a connection, the Listener thread won't 
be able to accept connections from other peers, because it would be stuck in 
read() or connectOne(). Also, the code has an inherent cycle: 
initiateConnection() and receiveConnection() will have to be very carefully 
synchronized; otherwise, we could run into deadlocks. This code is going to be 
difficult to maintain/modify.
{quote}

If I remember correctly, we currently synchronize connectOne and route all 
connection establishments through it, so that we only do one at a time. My 
understanding is that this should reduce the number of rounds of connection 
attempts, perhaps at the cost of a longer delay in some runs. 
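
A minimal sketch of that serialization, with illustrative names rather than the actual QuorumCnxManager code:

    import java.net.InetSocketAddress;

    class SerializedConnector {
        // A single lock serializes every outgoing connection attempt: one
        // attempt at a time, at the cost of later attempts queueing behind a
        // slow one.
        private final Object connectLock = new Object();

        void connectOne(long sid, InetSocketAddress addr) {
            synchronized (connectLock) {
                // open the channel, exchange server ids, start worker threads ...
            }
        }
    }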

{quote}
2. Buggy senderWorkerMap handling:
The code that manages senderWorkerMap is very buggy. It is causing multiple 
election rounds. While debugging I found that sometimes after FLE a node will 
have its sendWorkerMap empty even if it has SenderWorker and RecvWorker threads 
for each peer.
{quote}

I don't think that having multiple rounds is bad; in fact, I think it is 
unavoidable using reasonable timeout values. The second part, however, sounds 
like a problem we should fix. 

{quote}
a) The receiveConnection() method calls the finish() method, which removes an 
entry from the map. Additionally, the thread itself calls finish() which could 
remove the newly added entry from the map. In short, receiveConnection is 
causing the exact condition that you mentioned above.
{quote}

I thought that we were increasing the intervals between notifications, and if 
so I believe the case you mention above should not happen more than a few 
times. Now, to fix it, it sounds like we need to check that the finish call is 
removing the correct object in sendWorkerMap. That is, obj.finish() should 
remove obj and do nothing if the SendWorker object in sendWorkerMap is a 
different one. What do you think?
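
A minimal sketch of that guarded removal, assuming senderWorkerMap is a ConcurrentHashMap and using illustrative names rather than the actual SendWorker class:

    import java.util.concurrent.ConcurrentHashMap;

    class WorkerSketch {
        static final ConcurrentHashMap<Long, WorkerSketch> senderWorkerMap =
                new ConcurrentHashMap<Long, WorkerSketch>();

        final long sid;

        WorkerSketch(long sid) {
            this.sid = sid;
        }

        void finish() {
            // Remove the map entry only if this worker is still the one registered
            // for sid; a stale worker calling finish() then leaves a newer worker's
            // entry alone instead of evicting it.
            senderWorkerMap.remove(sid, this);
            // ... interrupt the thread and close the channel ...
        }
    }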

{quote}
b) Apart from the bug in finish(), receiveConnection is making an entry in 
senderWorkerMap at the wrong place. Here's the buggy code:
SendWorker vsw = senderWorkerMap.get(sid);
senderWorkerMap.put(sid, sw);
if (vsw != null)
    vsw.finish();
It makes an entry for the new thread and then calls finish, which causes the 
new thread to be removed from the Map. The old thread will also get terminated 
since finish() will interrupt the thread.
{quote}

See my comment above. Perhaps I should wait to see your proposed modifications, 
but I wonder if it works to check that we are removing the correct SendWorker 
object.
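
For comparison, a sketch of the same snippet with the order swapped, so that finishing the old worker cannot evict the entry just added for the new one (same illustrative names as the quoted code):

    SendWorker vsw = senderWorkerMap.get(sid);
    if (vsw != null)
        vsw.finish();               // retires the old worker and its map entry first
    senderWorkerMap.put(sid, sw);   // only then register the new worker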

{quote}
3. Race condition in receiveConnection and initiateConnection:

In theory, two peers can keep disconnecting each other's connection.

Example:
T0: Peer 0 

Build failed in Hudson: ZooKeeper-trunk #923

2010-09-03 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/923/

--
[...truncated 162872 lines...]
[junit] 2010-09-03 10:51:25,922 [myid:] - INFO  
[Thread-285:nioservercnxn$statcomm...@645] - Stat command output
[junit] 2010-09-03 10:51:25,923 [myid:] - INFO  
[Thread-285:nioserverc...@967] - Closed socket connection for client 
/127.0.0.1:33405 (no session established for client)
[junit] 2010-09-03 10:51:25,923 [myid:] - INFO  [main:quorumb...@195] - 
127.0.0.1:11236 is accepting client connections
[junit] 2010-09-03 10:51:25,923 [myid:] - INFO  [main:clientb...@225] - 
connecting to 127.0.0.1 11237
[junit] 2010-09-03 10:51:25,923 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11237:nioservercnxnfact...@196] - 
Accepted socket connection from /127.0.0.1:44608
[junit] 2010-09-03 10:51:25,924 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11237:nioserverc...@791] - Processing 
stat command from /127.0.0.1:44608
[junit] 2010-09-03 10:51:25,924 [myid:] - INFO  
[Thread-286:nioservercnxn$statcomm...@645] - Stat command output
[junit] 2010-09-03 10:51:25,925 [myid:] - INFO  
[Thread-286:nioserverc...@967] - Closed socket connection for client 
/127.0.0.1:44608 (no session established for client)
[junit] 2010-09-03 10:51:25,925 [myid:] - INFO  [main:quorumb...@195] - 
127.0.0.1:11237 is accepting client connections
[junit] 2010-09-03 10:51:25,925 [myid:] - INFO  [main:clientb...@225] - 
connecting to 127.0.0.1 11238
[junit] 2010-09-03 10:51:25,925 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11238:nioservercnxnfact...@196] - 
Accepted socket connection from /127.0.0.1:58661
[junit] 2010-09-03 10:51:25,926 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11238:nioserverc...@791] - Processing 
stat command from /127.0.0.1:58661
[junit] 2010-09-03 10:51:25,926 [myid:] - INFO  
[Thread-287:nioservercnxn$statcomm...@645] - Stat command output
[junit] 2010-09-03 10:51:25,927 [myid:] - INFO  
[Thread-287:nioserverc...@967] - Closed socket connection for client 
/127.0.0.1:58661 (no session established for client)
[junit] 2010-09-03 10:51:25,927 [myid:] - INFO  [main:quorumb...@195] - 
127.0.0.1:11238 is accepting client connections
[junit] 2010-09-03 10:51:25,927 [myid:] - INFO  [main:clientb...@225] - 
connecting to 127.0.0.1 11239
[junit] 2010-09-03 10:51:25,928 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioservercnxnfact...@196] - 
Accepted socket connection from /127.0.0.1:55577
[junit] 2010-09-03 10:51:25,928 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioserverc...@791] - Processing 
stat command from /127.0.0.1:55577
[junit] 2010-09-03 10:51:25,929 [myid:] - INFO  
[Thread-288:nioserverc...@967] - Closed socket connection for client 
/127.0.0.1:55577 (no session established for client)
[junit] 2010-09-03 10:51:26,179 [myid:] - INFO  [main:clientb...@225] - 
connecting to 127.0.0.1 11239
[junit] 2010-09-03 10:51:26,179 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioservercnxnfact...@196] - 
Accepted socket connection from /127.0.0.1:55578
[junit] 2010-09-03 10:51:26,179 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioserverc...@791] - Processing 
stat command from /127.0.0.1:55578
[junit] 2010-09-03 10:51:26,180 [myid:] - INFO  
[Thread-289:nioserverc...@967] - Closed socket connection for client 
/127.0.0.1:55578 (no session established for client)
[junit] 2010-09-03 10:51:26,430 [myid:] - INFO  [main:clientb...@225] - 
connecting to 127.0.0.1 11239
[junit] 2010-09-03 10:51:26,430 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioservercnxnfact...@196] - 
Accepted socket connection from /127.0.0.1:55579
[junit] 2010-09-03 10:51:26,431 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioserverc...@791] - Processing 
stat command from /127.0.0.1:55579
[junit] 2010-09-03 10:51:26,431 [myid:] - INFO  
[Thread-290:nioserverc...@967] - Closed socket connection for client 
/127.0.0.1:55579 (no session established for client)
[junit] 2010-09-03 10:51:26,681 [myid:] - INFO  [main:clientb...@225] - 
connecting to 127.0.0.1 11239
[junit] 2010-09-03 10:51:26,682 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioservercnxnfact...@196] - 
Accepted socket connection from /127.0.0.1:55580
[junit] 2010-09-03 10:51:26,682 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioserverc...@791] - Processing 
stat command from /127.0.0.1:55580
[junit] 2010-09-03 10:51:26,682 [myid:] - INFO  
[Thread-291:nioservercnxn$statcomm...@645] - Stat command output
[junit] 2010-09-03 10:51:26,683 [myid:] - INFO  
[Thread-291:nioserverc...@967] - Closed socket connection for client 
/127.0.0.1:55580 (no session established for client)
[junit] JMXEnv.dump() follows
[junit] 2010-09-03 10:51:26,683 [myid:] - INFO  

[jira] Created: (ZOOKEEPER-863) Runaway thread - Zookeeper inside Eclipse

2010-09-03 Thread Stephen McCants (JIRA)
Runaway thread - Zookeeper inside Eclipse
-

 Key: ZOOKEEPER-863
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-863
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.0
 Environment: Linux; x86
Reporter: Stephen McCants
Priority: Critical


I'm running Zookeeper inside an Eclipse application.  When I launch the 
application from inside Eclipse I use the following arguments:

-Dzoodiscovery.autoStart=true
-Dzoodiscovery.flavor=zoodiscovery.flavor.centralized=localhost

This causes the application to start its own ZooKeeper server inside the 
JVM/application.  It immediately goes into a runaway state.  The name of the 
runaway thread is NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181.  When I suspend 
this thread, the CPU usage returns to 0.  Here is a stack trace from that 
thread when it is suspended:

EPollArrayWrapper.epollWait(long, int, long, int) line: not available [native 
method]   
EPollArrayWrapper.poll(long) line: 215  
EPollSelectorImpl.doSelect(long) line: 77   
EPollSelectorImpl(SelectorImpl).lockAndDoSelect(long) line: 69  
EPollSelectorImpl(SelectorImpl).select(long) line: 80   
NIOServerCnxn$Factory.run() line: 232   

Any ideas what might be going wrong?

Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (ZOOKEEPER-864) Hedwig C++ client improvements

2010-09-03 Thread Ivan Kelly (JIRA)
Hedwig C++ client improvements
--

 Key: ZOOKEEPER-864
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-864
 Project: Zookeeper
  Issue Type: Improvement
Reporter: Ivan Kelly
Assignee: Ivan Kelly
 Fix For: 3.4.0
 Attachments: ZOOKEEPER-864.diff

I changed the socket code to use boost asio. Now the client only creates one 
thread, and all operations are non-blocking. 

Tests are now automated, just run make check.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-864) Hedwig C++ client improvements

2010-09-03 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated ZOOKEEPER-864:
-

Attachment: ZOOKEEPER-864.diff

 Hedwig C++ client improvements
 --

 Key: ZOOKEEPER-864
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-864
 Project: Zookeeper
  Issue Type: Improvement
Reporter: Ivan Kelly
Assignee: Ivan Kelly
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-864.diff


 I changed the socket code to use boost asio. Now the client only creates one 
 thread, and all operations are non-blocking. 
 Tests are now automated, just run make check.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-864) Hedwig C++ client improvements

2010-09-03 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated ZOOKEEPER-864:
-

Status: Patch Available  (was: Open)

 Hedwig C++ client improvements
 --

 Key: ZOOKEEPER-864
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-864
 Project: Zookeeper
  Issue Type: Improvement
Reporter: Ivan Kelly
Assignee: Ivan Kelly
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-864.diff


 I changed the socket code to use boost asio. Now the client only creates one 
 thread, and all operations are non-blocking. 
 Tests are now automated, just run make check.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (ZOOKEEPER-865) Runaway thread

2010-09-03 Thread Stephen McCants (JIRA)
Runaway thread
--

 Key: ZOOKEEPER-865
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-865
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.1, 3.3.0
 Environment: Linux; Java 1.6; x86;
Reporter: Stephen McCants
Priority: Critical


I'm starting a standalone Zookeeper server (v3.3.1).  That starts normally and 
does not have a runaway thread.

Next, I start an Eclipse-based application that is using ZK 3.3.0 to register 
itself with the ZooKeeper server (3.3.1).  The Eclipse application uses the 
following arguments to Eclipse:

-Dzoodiscovery.autoStart=true
-Dzoodiscovery.flavor=zoodiscovery.flavor.centralized=smccants.austin.ibm.com

When the Eclipse application starts, the ZK server prints out:

2010-09-03 09:59:46,006 - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioservercnxn$fact...@250] - 
Accepted socket connection from /9.53.189.11:42271
2010-09-03 09:59:46,039 - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@776] - Client 
attempting to establish new session at /9.53.189.11:42271
2010-09-03 09:59:46,045 - INFO  [SyncThread:0:nioserverc...@1579] - Established 
session 0x12ad81b9002 with negotiated timeout 4000 for client 
/9.53.189.11:42271
2010-09-03 09:59:46,046 - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioservercnxn$fact...@250] - 
Accepted socket connection from /9.53.189.11:42272
2010-09-03 09:59:46,078 - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@776] - Client 
attempting to establish new session at /9.53.189.11:42272
2010-09-03 09:59:46,080 - INFO  [SyncThread:0:nioserverc...@1579] - Established 
session 0x12ad81b9003 with negotiated timeout 4000 for client 
/9.53.189.11:42272

Then both the Eclipse application and the ZK server go into runaway states and 
consume 100% of the CPU.

Here is a view from top:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4949 smccants  15   0  597m  78m 5964 S 66.2  1.0  1:03.14 autosubmitter
 4876 smccants  17   0  554m  27m 6688 S 30.9  0.3  0:34.74 java

PID 4949 (autosubmitter) is the Eclipse application and is using more than 
twice the CPU of PID 4876 (java) which is the ZK server.  They will continue in 
this state indefinitely.

I can attach a debugger to the Eclipse application, and if I suspend the thread 
named pool-1-thread-2-SendThread(smccants.austin.ibm.com:2181), the 
runaway condition stops on both the application and the ZK server.  However, the 
ZK server reports:

2010-09-03 10:03:38,001 - INFO  [SessionTracker:zookeeperser...@315] - Expiring 
session 0x12ad81b9003, timeout of 4000ms exceeded
2010-09-03 10:03:38,002 - INFO  [ProcessThread:-1:preprequestproces...@208] - 
Processed session termination for sessionid: 0x12ad81b9003
2010-09-03 10:03:38,005 - INFO  [SyncThread:0:nioserverc...@1434] - Closed 
socket connection for client /9.53.189.11:42272 which had sessionid 
0x12ad81b9003

Here is the stack trace from the suspended thread:

EPollArrayWrapper.epollWait(long, int, long, int) line: not available [native 
method]   
EPollArrayWrapper.poll(long) line: 215  
EPollSelectorImpl.doSelect(long) line: 77   
EPollSelectorImpl(SelectorImpl).lockAndDoSelect(long) line: 69  
EPollSelectorImpl(SelectorImpl).select(long) line: 80   
ClientCnxn$SendThread.run() line: 1066  

Any ideas what might be going wrong?

Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: (ZOOKEEPER-844) handle auth failure in java client

2010-09-03 Thread Patrick Hunt
I don't see why we couldn't include it. Thanks!

Patrick

On Thu, Sep 2, 2010 at 12:41 PM, Fournier, Camille F. [Tech] 
camille.fourn...@gs.com wrote:

 Hi all,

 I would like to submit this patch into the 3.3 branch as well, since we are
 probably going to go into production with 3.3 and I'd rather not do a
 production release with a patched version of ZK if possible. I added a patch
 for this fix against the 3.3 branch to this ticket. Any idea of the odds of
 getting this into the 3.3.2 release?

 Thanks,
 Camille

 -Original Message-
 From: Giridharan Kesavan (JIRA) [mailto:j...@apache.org]
 Sent: Tuesday, August 31, 2010 7:25 PM
 To: Fournier, Camille F. [Tech]
 Subject: [jira] Updated: (ZOOKEEPER-844) handle auth failure in java client


 [
 https://issues.apache.org/jira/browse/ZOOKEEPER-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]

 Giridharan Kesavan updated ZOOKEEPER-844:
 -

Status: Patch Available  (was: Open)

  handle auth failure in java client
  --
 
  Key: ZOOKEEPER-844
  URL: https://issues.apache.org/jira/browse/ZOOKEEPER-844
  Project: Zookeeper
   Issue Type: Improvement
   Components: java client
 Affects Versions: 3.3.1
 Reporter: Camille Fournier
 Assignee: Camille Fournier
  Fix For: 3.4.0
 
  Attachments: ZOOKEEPER-844.patch
 
 
  ClientCnxn.java currently has the following code:
    if (replyHdr.getXid() == -4) {
        // -2 is the xid for AuthPacket
        // TODO: process AuthPacket here
        if (LOG.isDebugEnabled()) {
            LOG.debug("Got auth sessionid:0x"
                    + Long.toHexString(sessionId));
        }
        return;
    }
  Auth failures appear to cause the server to disconnect but the client
 never gets a proper state change or notification that auth has failed, which
 makes handling this scenario very difficult as it causes the client to go
 into a loop of sending bad auth, getting disconnected, trying to reconnect,
 sending bad auth again, over and over.

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.




[jira] Updated: (ZOOKEEPER-844) handle auth failure in java client

2010-09-03 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-844:
---

Issue Type: Bug  (was: Improvement)

This is really a bug, not an improvement.

 handle auth failure in java client
 --

 Key: ZOOKEEPER-844
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-844
 Project: Zookeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.3.1
Reporter: Camille Fournier
Assignee: Camille Fournier
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-844.patch, ZOOKEEPER332-844


 ClientCnxn.java currently has the following code:
   if (replyHdr.getXid() == -4) {
       // -2 is the xid for AuthPacket
       // TODO: process AuthPacket here
       if (LOG.isDebugEnabled()) {
           LOG.debug("Got auth sessionid:0x"
                   + Long.toHexString(sessionId));
       }
       return;
   }
 Auth failures appear to cause the server to disconnect but the client never 
 gets a proper state change or notification that auth has failed, which makes 
 handling this scenario very difficult as it causes the client to go into a 
 loop of sending bad auth, getting disconnected, trying to reconnect, sending 
 bad auth again, over and over. 
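
 For illustration, a sketch of what a client might do once an explicit auth-failure 
 state is surfaced; this assumes the patch exposes an AuthFailed KeeperState to the 
 default watcher, which may differ from the final API:

     import org.apache.zookeeper.WatchedEvent;
     import org.apache.zookeeper.Watcher;

     class AuthAwareWatcher implements Watcher {
         public void process(WatchedEvent event) {
             // With an explicit state change, a bad auth no longer looks like an
             // ordinary disconnect, so the client can stop its reconnect loop.
             if (event.getState() == Event.KeeperState.AuthFailed) {
                 System.err.println("ZooKeeper authentication failed; not retrying");
                 // surface the failure to the application / shut the handle down
             }
         }
     }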

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: About symbol table of Zookeeper c client

2010-09-03 Thread Patrick Hunt
This is a long-standing issue slated for 4.0:
https://issues.apache.org/jira/browse/ZOOKEEPER-295

Mahadev had done some work to reduce the exported symbols as part of 3.3;
perhaps this slipped through the net?

Mahadev - can we address this using the current mechanism?

https://issues.apache.org/jira/browse/ZOOKEEPER-295

Patrick

On Thu, Sep 2, 2010 at 7:37 AM, Qian Ye yeqian@gmail.com wrote:

 Hi all:

 I'm writing an application in C which needs to link both memcached's lib and
 zookeeper's C client lib. I found a symbol table conflict, because both libs
 provide an implementation (recordio.h/c) of the function htonll. It seems that
 some functions of the zookeeper C client, which can be accessed externally but
 are used internally, have simple names. I think this will cause symbol table
 conflicts from time to time, and we should do something about it, e.g. add a
 specific prefix to these functions.

 thx

 --
 With Regards!

 Ye, Qian



Re: Problems in FLE implementation

2010-09-03 Thread Mahadev Konar
Hi Vishal,
  Thanks for picking this up. My comments are inline:


On 9/2/10 3:31 PM, Vishal K vishalm...@gmail.com wrote:

 Hi All,
 
 I had posted this message as a comment for ZOOKEEPER-822. I thought it might
 be a good idea to give it wider attention so that it will be easier to
 collect feedback.
 
 I found a few problems in the FLE implementation while debugging for:
 https://issues.apache.org/jira/browse/ZOOKEEPER-822. Following the email
 below might require some background. If necessary, please browse the JIRA. I
 have a patch for 1. a) and 2). I will send them out soon.
 
 1. Blocking connects and accepts:
 
 a) The first problem is in manager.toSend(). This invokes connectOne(),
 which does a blocking connect. While testing, I changed the code so that
 connectOne() starts a new thread called AsyncConnect(). AsyncConnect.run()
 does a socketChannel.connect(). After starting AsyncConnect, connectOne
 starts a timer. connectOne continues with normal operations if the
 connection is established before the timer expires, otherwise, when the
 timer expires it interrupts AsyncConnect() thread and returns. In this way,
 I can have an upper bound on the amount of time we need to wait for connect
 to succeed. Of course, this was a quick fix for my testing. Ideally, we
 should use Selector to do non-blocking connects/accepts. I am planning to do
 that later once we at least have a quick fix for the problem and consensus
 from others for the real fix (this problem is a big blocker for us). Note that
 it is OK to do blocking IO in SenderWorker and RecvWorker threads since they
 block IO to the respective peer.
Vishal, I am really concerned about starting up new threads in the server.
We really need a total revamp of this code (using NIO and selector). Is the
quick fix really required? ZooKeeper servers have been running in production
for a while, and this problem hasn't been noticed by anyone. Shouldn't we
fix it with NIO then?
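
For reference, a minimal sketch of the Selector-based direction mentioned above: a connect that never blocks longer than a chosen bound. Illustrative code only, not a proposed patch:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;

    class SelectorConnect {
        // Returns a connected (still non-blocking) channel, or null on timeout.
        static SocketChannel connect(InetSocketAddress addr, long timeoutMs) throws IOException {
            Selector selector = Selector.open();
            SocketChannel channel = SocketChannel.open();
            boolean connected = false;
            try {
                channel.configureBlocking(false);
                if (channel.connect(addr)) {        // may complete immediately (e.g. loopback)
                    connected = true;
                    return channel;
                }
                channel.register(selector, SelectionKey.OP_CONNECT);
                // Wait at most timeoutMs for the connect to become completable.
                if (selector.select(timeoutMs) > 0 && channel.finishConnect()) {
                    connected = true;
                    return channel;
                }
                return null;                        // timed out; caller can retry later
            } finally {
                selector.close();
                if (!connected) {
                    channel.close();
                }
            }
        }
    }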


 
 b) The blocking IO problem is not just restricted to connectOne(), but also
 in receiveConnection(). The Listener thread calls receiveConnection() for
 each incoming connection request. receiveConnection does blocking IO to get
 peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the
 peer that had sent the connection request. All of this is happening from the
 Listener. In short, if a peer fails after initiating a connection, the
 Listener thread won't be able to accept connections from other peers,
 because it would be stuck in read() or connectOne(). Also the code has an
 inherent cycle. initiateConnection() and receiveConnection() will have to be
 very carefully synchronized otherwise, we could run into deadlocks. This
 code is going to be difficult to maintain/modify.
 

 2. Buggy senderWorkerMap handling:
 The code that manages senderWorkerMap is very buggy. It is causing multiple
 election rounds. While debugging I found that sometimes after FLE a node
 will have its sendWorkerMap empty even if it has SenderWorker and RecvWorker
 threads for each peer.
It would be great to clean it up!! I'd be happy to see this class be cleaned
up! :) 

 
 a) The receiveConnection() method calls the finish() method, which removes
 an entry from the map. Additionally, the thread itself calls finish() which
 could remove the newly added entry from the map. In short, receiveConnection
 is causing the exact condition that you mentioned above.
 
 b) Apart from the bug in finish(), receiveConnection is making an entry in
 senderWorkerMap at the wrong place. Here's the buggy code:
 SendWorker vsw = senderWorkerMap.get(sid);
 senderWorkerMap.put(sid, sw);
 if(vsw != null)
 vsw.finish();
 It makes an entry for the new thread and then calls finish, which causes the
 new thread to be removed from the Map. The old thread will also get
 terminated since finish() will interrupt the thread.
 
 3. Race condition in receiveConnection and initiateConnection:
 
 *In theory*, two peers can keep disconnecting each other's connection.
 
 Example:
 T0: Peer 0 initiates a connection (request 1)
 T1: Peer 1 receives the connection from Peer 0
 T2: Peer 1 calls receiveConnection()
 T2: Peer 0 closes the connection to Peer 1 because its ID is lower.
 T3: Peer 0 re-initiates the connection to Peer 1 from manager.toSend() (request 2)
 T3: Peer 1 terminates the older connection to Peer 0
 T4: Peer 1 calls connectOne(), which starts new SendWorker threads for Peer 0
 T5: Peer 1 kills the connection created in T3 because it receives another
 connect request (request 2) from Peer 0
 
 The problem here is that while Peer 0 is accepting a connection from Peer 1
 it can also be initiating a connection to Peer 1. So if they hit the right
 frequencies they could sit in a connect/disconnect loop and cause multiple
 rounds of leader election.
 
 I think 

[jira] Commented: (ZOOKEEPER-863) Runaway thread - Zookeeper inside Eclipse

2010-09-03 Thread Stephen McCants (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906014#action_12906014
 ] 

Stephen McCants commented on ZOOKEEPER-863:
---

Removing the registered service after ZK had stopped running away causes ZK to 
return to using 100% of the CPU.

 Runaway thread - Zookeeper inside Eclipse
 -

 Key: ZOOKEEPER-863
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-863
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.0
 Environment: Linux; x86
Reporter: Stephen McCants
Priority: Critical

 I'm running Zookeeper inside an Eclipse application.  When I launch the 
 application from inside Eclipse I use the following arguments:
 -Dzoodiscovery.autoStart=true
 -Dzoodiscovery.flavor=zoodiscovery.flavor.centralized=localhost
 This causes the application to start its own ZooKeeper server inside the 
 JVM/application.  It immediately goes into a runaway state.  The name of the 
 runaway thread is NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181.  When I 
 suspend this thread, the CPU usage returns to 0.  Here is a stack trace from 
 that thread when it is suspended:
 EPollArrayWrapper.epollWait(long, int, long, int) line: not available [native 
 method] 
 EPollArrayWrapper.poll(long) line: 215
 EPollSelectorImpl.doSelect(long) line: 77 
 EPollSelectorImpl(SelectorImpl).lockAndDoSelect(long) line: 69
 EPollSelectorImpl(SelectorImpl).select(long) line: 80 
 NIOServerCnxn$Factory.run() line: 232 
 Any ideas what might be going wrong?
 Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Problems in FLE implementation

2010-09-03 Thread Vishal K
Hi Mahadev,

To be honest, yes, we need the quick fix. I am really surprised that no one
else is seeing this problem. There is nothing special about our setup. If
you look at the JIRA, I have posted logs from various setups (different OS,
using physical machines, using virtual machines, etc). Also, the bug is
evident from the code. Pretty much every developer in our team has hit this
bug.

Now, we have an application that is highly time-sensitive. Maybe most of the
applications that ZK is running on today can tolerate 60-80 seconds of FLE
convergence. For us, such long delays (under normal conditions) are not
acceptable.

It would be nice if people could provide some feedback on how time-sensitive
their applications are. Is a 60-80 second delay in FLE acceptable?
What has been your experience with running ZK in production? How often do
you have leader reboots?

Feedback will be greatly appreciated.

Thanks.
-Vishal



On Fri, Sep 3, 2010 at 1:44 PM, Mahadev Konar maha...@yahoo-inc.com wrote:

 Hi Vishal,
  Thanks for picking this up. My comments are inline:


 On 9/2/10 3:31 PM, Vishal K vishalm...@gmail.com wrote:

  Hi All,
 
  I had posted this message as a comment for ZOOKEEPER-822. I thought it
 might
  be a good idea to give a wider attention so that it will be easier to
  collect feedback.
 
  I found few problems in the FLE implementation while debugging for:
  https://issues.apache.org/jira/browse/ZOOKEEPER-822. Following the email
  below might require some background. If necessary, please browse the
 JIRA. I
  have a patch for 1. a) and 2). I will send them out soon.
 
  1. Blocking connects and accepts:
 
  a) The first problem is in manager.toSend(). This invokes connectOne(),
  which does a blocking connect. While testing, I changed the code so that
  connectOne() starts a new thread called AsyncConnct(). AsyncConnect.run()
  does a socketChannel.connect(). After starting AsyncConnect, connectOne
  starts a timer. connectOne continues with normal operations if the
  connection is established before the timer expires, otherwise, when the
  timer expires it interrupts AsyncConnect() thread and returns. In this
 way,
  I can have an upper bound on the amount of time we need to wait for
 connect
  to succeed. Of course, this was a quick fix for my testing. Ideally, we
  should use Selector to do non-blocking connects/accepts. I am planning to
 do
  that later once we at least have a quick fix for the problem and
 consensus
  from others for the real fix (this problem is big blocker for us). Note
 that
  it is OK to do blocking IO in SenderWorker and RecvWorker threads since
 they
  block IO to the respective peer.
 Vishal, I am really concerned about starting up new threads in the server.
 We really need a total revamp of this code (using NIO and selector). Is the
 quick fix really required. Zookeeper servers have been running in
 production
 for a while, and this problem hasn't been noticed by anyone. Shouldn't we
 fix it with NIO then?


 
  b) The blocking IO problem is not just restricted to connectOne(), but
 also
  in receiveConnection(). The Listener thread calls receiveConnection() for
  each incoming connection request. receiveConnection does blocking IO to
 get
  peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to
 the
  peer that had sent the connection request. All of this is happening from
 the
  Listener. In short, if a peer fails after initiating a connection, the
  Listener thread won't be able to accept connections from other peers,
  because it would be stuck in read() or connetOne(). Also the code has an
  inherent cycle. initiateConnection() and receiveConnection() will have to
 be
  very carefully synchronized otherwise, we could run into deadlocks. This
  code is going to be difficult to maintain/modify.
 

  2. Buggy senderWorkerMap handling:
  The code that manages senderWorkerMap is very buggy. It is causing
 multiple
  election rounds. While debugging I found that sometimes after FLE a node
  will have its sendWorkerMap empty even if it has SenderWorker and
 RecvWorker
  threads for each peer.
 IT would be great to clean it up!! I'd be happy to see this class be
 cleaned
 up! :)

 
  a) The receiveConnection() method calls the finish() method, which
 removes
  an entry from the map. Additionally, the thread itself calls finish()
 which
  could remove the newly added entry from the map. In short,
 receiveConnection
  is causing the exact condition that you mentioned above.
 
  b) Apart from the bug in finish(), receiveConnection is making an entry
 in
  senderWorkerMap at the wrong place. Here's the buggy code:
  SendWorker vsw = senderWorkerMap.get(sid);
  senderWorkerMap.put(sid, sw);
  if(vsw != null)
  vsw.finish();
  It makes an entry for the new thread and then calls finish, which causes
 the
  new thread to be removed from the Map. The old thread will also get
  terminated since finish() will interrupt the thread.
 
  3. Race condition in 

[jira] Updated: (ZOOKEEPER-863) Runaway thread - Zookeeper inside Eclipse

2010-09-03 Thread Stephen McCants (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen McCants updated ZOOKEEPER-863:
--

Attachment: zookeeper.log

 Runaway thread - Zookeeper inside Eclipse
 -

 Key: ZOOKEEPER-863
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-863
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.0
 Environment: Linux; x86
Reporter: Stephen McCants
Priority: Critical
 Attachments: zookeeper.log


 I'm running Zookeeper inside an Eclipse application.  When I launch the 
 application from inside Eclipse I use the following arguments:
 -Dzoodiscovery.autoStart=true
 -Dzoodiscovery.flavor=zoodiscovery.flavor.centralized=localhost
 This causes the application to start its own ZooKeeper server inside the 
 JVM/application.  It immediately goes into a runaway state.  The name of the 
 runaway thread is NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181.  When I 
 suspend this thread, the CPU usage returns to 0.  Here is a stack trace from 
 that thread when it is suspended:
 EPollArrayWrapper.epollWait(long, int, long, int) line: not available [native 
 method] 
 EPollArrayWrapper.poll(long) line: 215
 EPollSelectorImpl.doSelect(long) line: 77 
 EPollSelectorImpl(SelectorImpl).lockAndDoSelect(long) line: 69
 EPollSelectorImpl(SelectorImpl).select(long) line: 80 
 NIOServerCnxn$Factory.run() line: 232 
 Any ideas what might be going wrong?
 Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Problems in FLE implementation

2010-09-03 Thread Vishal K
Hi Flavio,

On Fri, Sep 3, 2010 at 3:02 PM, Flavio Junqueira f...@yahoo-inc.com wrote:

 Vishal, 60-80 seconds is definitely high, and I would expect people to
 complain if they were observing such long recovery times. I
 personally haven't seen any such case.


Can you describe how you were trying to reproduce the bug? On physical
machines, it took me 15 retries (reboot -n leader) to reproduce the problem.
On VMs it is a lot more frequent.



 On my end, you have good points, but I'm not entirely convinced that we
 need the changes as you're proposing them. Seeing a patch would definitely help
 us decide. If you can't provide a patch due to legal issues, we should
 work on one or more patches to fix at least some of the issues you observed.


You are right, my fixes may not be the best approach. My intention was to
have a quick fix for our internal use and then start off a discussion about the
real fix. I will send out the diff soon.


 I also agree that it would be nice to have the numbers you are requesting.
 I would love to see


 Thanks,
 -Flavio


Thanks.
-Vishal


 On Sep 3, 2010, at 8:51 PM, Vishal K wrote:

 Hi Mahadev,

 To be honest, yes, we need the quick fix. I am really surprised why anyone
 else is not seeing this problem. There is nothing special with our setup.
 If
 you look at the JIRA, I have posted logs from various setups (different OS,
 using physical machines, using virtual machines, etc). Also, the bug is
 evident from the code. Pretty much every developer in our team has hit this
 bug.

 Now, we have an application that is highly time-sensitive. Maybe most of
 the
 applications that ZK is running on today can tolerate a 60-80 seconds of
 FLE
 convergence. For us such a long delays (under normal conidtions) are not
 acceptable.

 It will be nice if people can provide some feedback on how time sensitive
 their application is? Is 60-80 seconds delay in FLE acceptable?
 What has been your experience with running ZK in production? How often do
 you have leader reboots?

 Feedback will be greatly apprecaited.

 Thanks.
 -Vishal



 On Fri, Sep 3, 2010 at 1:44 PM, Mahadev Konar maha...@yahoo-inc.com
 wrote:

 Hi Vishal,

 Thanks for picking this up. My comments are inline:



 On 9/2/10 3:31 PM, Vishal K vishalm...@gmail.com wrote:

  Hi All,

  I had posted this message as a comment for ZOOKEEPER-822. I thought it might
  be a good idea to give a wider attention so that it will be easier to
  collect feedback.

  I found few problems in the FLE implementation while debugging for:
  https://issues.apache.org/jira/browse/ZOOKEEPER-822. Following the email
  below might require some background. If necessary, please browse the JIRA. I
  have a patch for 1. a) and 2). I will send them out soon.

  1. Blocking connects and accepts:

  a) The first problem is in manager.toSend(). This invokes connectOne(),
  which does a blocking connect. While testing, I changed the code so that
  connectOne() starts a new thread called AsyncConnct(). AsyncConnect.run()
  does a socketChannel.connect(). After starting AsyncConnect, connectOne
  starts a timer. connectOne continues with normal operations if the
  connection is established before the timer expires, otherwise, when the
  timer expires it interrupts AsyncConnect() thread and returns. In this way,
  I can have an upper bound on the amount of time we need to wait for connect
  to succeed. Of course, this was a quick fix for my testing. Ideally, we
  should use Selector to do non-blocking connects/accepts. I am planning to do
  that later once we at least have a quick fix for the problem and consensus
  from others for the real fix (this problem is big blocker for us). Note that
  it is OK to do blocking IO in SenderWorker and RecvWorker threads since they
  block IO to the respective peer.

 Vishal, I am really concerned about starting up new threads in the server.
 We really need a total revamp of this code (using NIO and selector). Is the
 quick fix really required. Zookeeper servers have been running in production
 for a while, and this problem hasn't been noticed by anyone. Shouldn't we
 fix it with NIO then?

  b) The blocking IO problem is not just restricted to connectOne(), but also
  in receiveConnection(). The Listener thread calls receiveConnection() for
  each incoming connection request. receiveConnection does blocking IO to get
  peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the
  peer that had sent the connection request. All of this is happening from the
  Listener. In short, if a peer fails after initiating a connection, the
  Listener thread won't be able to accept connections from other peers,
  because it would be stuck in read() or connetOne(). Also the code has an
  inherent cycle. initiateConnection() and receiveConnection() will have to be
  very carefully synchronized otherwise, we could run into deadlocks. This
  code is going to be difficult to maintain/modify.

  2. 

[jira] Commented: (ZOOKEEPER-864) Hedwig C++ client improvements

2010-09-03 Thread Michi Mutsuzaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906077#action_12906077
 ] 

Michi Mutsuzaki commented on ZOOKEEPER-864:
---

Thanks for the patch, Ivan!

What do we need to do before we can check in this patch?

--Michi

 Hedwig C++ client improvements
 --

 Key: ZOOKEEPER-864
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-864
 Project: Zookeeper
  Issue Type: Improvement
Reporter: Ivan Kelly
Assignee: Ivan Kelly
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-864.diff


 I changed the socket code to use boost asio. Now the client only creates one 
 thread, and all operations are non-blocking. 
 Tests are now automated, just run make check.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-864) Hedwig C++ client improvements

2010-09-03 Thread Mahadev konar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906102#action_12906102
 ] 

Mahadev konar commented on ZOOKEEPER-864:
-

Michi, to answer your question: all we need is a careful review.



 Hedwig C++ client improvements
 --

 Key: ZOOKEEPER-864
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-864
 Project: Zookeeper
  Issue Type: Improvement
Reporter: Ivan Kelly
Assignee: Ivan Kelly
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-864.diff


 I changed the socket code to use boost asio. Now the client only creates one 
 thread, and all operations are non-blocking. 
 Tests are now automated, just run make check.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-822) Leader election taking a long time to complete

2010-09-03 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated ZOOKEEPER-822:


 Assignee: Vishal K
Fix Version/s: 3.3.2
   3.4.0

Marking this for 3.3.2, to see if we want it included in that release.



 Leader election taking a long time  to complete
 ---

 Key: ZOOKEEPER-822
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum
Affects Versions: 3.3.0
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.2, 3.4.0

 Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log, 
 test_zookeeper_2.log, zk_leader_election.tar.gz, zookeeper-3.4.0.tar.gz


 Created a 3 node cluster.
 1. Fail the ZK leader.
 2. Let leader election finish. Restart the leader and let it join the 
 3. Repeat.
 After a few rounds, leader election takes anywhere from 25-60 seconds to finish. 
 Note: we didn't have any ZK clients and no new znodes were created.
 zoo.cfg is shown below:
 #Mon Jul 19 12:15:10 UTC 2010
 server.1=192.168.4.12\:2888\:3888
 server.0=192.168.4.11\:2888\:3888
 clientPort=2181
 dataDir=/var/zookeeper
 syncLimit=2
 server.2=192.168.4.13\:2888\:3888
 initLimit=5
 tickTime=2000
 I have attached logs from two nodes that took a long time to form the cluster 
 after failing the leader. The leader was down anyway, so logs from that node 
 shouldn't matter.
 Look for START HERE. Logs after that point should be of interest.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-860) Add alternative search-provider to ZK site

2010-09-03 Thread Mahadev konar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906223#action_12906223
 ] 

Mahadev konar commented on ZOOKEEPER-860:
-

Alex, the assignment just means that you are currently working on the patch. A 
committer will review it and provide feedback, or commit it if it is deemed fit 
for the project. Hope that helps.

 Add alternative search-provider to ZK site
 --

 Key: ZOOKEEPER-860
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-860
 Project: Zookeeper
  Issue Type: Improvement
  Components: documentation
Reporter: Alex Baranau
Assignee: Alex Baranau
Priority: Minor
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-860.patch


 Use the search-hadoop.com service to make search available across ZK sources, 
 MLs, wiki, etc.
 This was initially proposed on user mailing list 
 (http://search-hadoop.com/m/sTZ4Y1BVKWg1). The search service was already 
 added in site's skin (common for all Hadoop related projects) before (as a 
 part of [AVRO-626|https://issues.apache.org/jira/browse/AVRO-626]) so this 
 issue is about enabling it for ZK. The ultimate goal is to use it at all 
 Hadoop's sub-projects' sites.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-860) Add alternative search-provider to ZK site

2010-09-03 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated ZOOKEEPER-860:


Fix Version/s: 3.4.0

Marking it for 3.4 to keep track of it.

 Add alternative search-provider to ZK site
 --

 Key: ZOOKEEPER-860
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-860
 Project: Zookeeper
  Issue Type: Improvement
  Components: documentation
Reporter: Alex Baranau
Assignee: Alex Baranau
Priority: Minor
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-860.patch


 Use the search-hadoop.com service to make search available across ZK sources, 
 MLs, wiki, etc.
 This was initially proposed on user mailing list 
 (http://search-hadoop.com/m/sTZ4Y1BVKWg1). The search service was already 
 added in site's skin (common for all Hadoop related projects) before (as a 
 part of [AVRO-626|https://issues.apache.org/jira/browse/AVRO-626]) so this 
 issue is about enabling it for ZK. The ultimate goal is to use it at all 
 Hadoop's sub-projects' sites.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: race condition in InvalidSnapShotTest on client close

2010-09-03 Thread Mahadev Konar
Hi Thomas,
  Sorry for my late response. Please open a JIRA regarding this. Is this fixed 
in your Netty patch for the client?

Thanks
mahadev


On 9/1/10 9:09 AM, Thomas Koch tho...@koch.ro wrote:

Hi,

I believe that I've found a race condition in
org.apache.zookeeper.server.InvalidSnapshotTest.
In this test the server is closed before the client. The client, on close(),
submits a last packet with type ZooDefs.OpCode.closeSession and waits for
this packet to be finished.
However, nobody is there to wake the thread from packet.wait(). The
sendThread will, on cleanup, call packet.notifyAll() in finishPacket.
The race condition is: if an exception occurs in the sendThread, closing is
already true, so the sendThread breaks out of its loop, calls cleanup and
finishes. If this happens before the main thread calls packet.wait(), then
there's nobody left to wake the main thread.

Regards,

Thomas Koch, http://www.koch.ro
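
For illustration, a minimal sketch of the guarded-wait pattern that avoids the lost wakeup described above: the waiter re-checks a finished flag under the same lock the notifier uses, so a notify that happens first is not lost. The Packet class here is illustrative, not the actual ClientCnxn code:

    class PacketSketch {
        private boolean finished = false;

        synchronized void waitUntilFinished() throws InterruptedException {
            while (!finished) {   // re-check after every wakeup; never wait blindly
                wait();
            }
        }

        synchronized void finish() {
            finished = true;      // set the flag before notifying
            notifyAll();
        }
    }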