[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever
[ https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929345#action_12929345 ]

Flavio Junqueira commented on ZOOKEEPER-914:
--------------------------------------------

Hi Vishal, I appreciate your contributions and your comments. I also understand your frustration when you find issues with the code, but it is possibly equally frustrating for the developer who thought that at least the basic issues were covered, so please keep in mind that we don't introduce bugs on purpose (at least I don't) and that our review process is not perfect.

Regarding clover reports, we have already agreed that code coverage is not bulletproof, and in fact several other metrics have been proposed in the scientific literature, but coverage does indicate that some call path including a given piece of code was exercised. It certainly doesn't measure more complex cases, like race conditions, crashes, and so on. In fact, if you have a better way of measuring test coverage, I'd be happy to hear about it.

I'm not sure if you agree, but it seems to me that we should close this jira, because the technical discussion here seems to be similar to the one in ZOOKEEPER-900. I'll try to address the concerns you raised regardless of what happens to this jira:
# My point about SO_TIMEOUT comes from here: http://download.oracle.com/javase/6/docs/api/java/net/Socket.html#setSoTimeout%28int%29
# I obviously prefer real fixes over hacks, but we need to get release 3.3.2 out, and it sounded like introducing a configurable timeout would fix your problem until the next release;
# About testing beyond the handshake, I'm not sure what you're proposing. If the blocking calls are part of the handshake and this is what is failing for you, then this is what we should target now, no?

> QuorumCnxManager blocks forever
> -------------------------------
>
>                 Key: ZOOKEEPER-914
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection
>            Reporter: Vishal K
>            Assignee: Vishal K
>            Priority: Blocker
>             Fix For: 3.3.3, 3.4.0
>
> This was a disaster. While testing our application we ran into a scenario where a rebooted follower could not join the cluster. Further debugging showed that the follower could not join because the QuorumCnxManager on the leader was blocked for an indefinite amount of time in receiveConnection():
>
> Thread-3 prio=10 tid=0x7fa920005800 nid=0x11bb runnable [0x7fa9275ed000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>         - locked <0x7fa93315f988> (a java.lang.Object)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
>
> I had pointed out this bug, along with several other problems in QuorumCnxManager, earlier in https://issues.apache.org/jira/browse/ZOOKEEPER-900 and https://issues.apache.org/jira/browse/ZOOKEEPER-822. I forgot to patch this one as part of ZOOKEEPER-822. I am working on a fix and a patch will be out soon. The problem is that QuorumCnxManager is using SocketChannel in blocking mode. It does a read() in receiveConnection() and a write() in initiateConnection(). Sorry, but this is really bad programming. This also points to a lack of failure tests for QuorumCnxManager.
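As a minimal sketch of the configurable-timeout approach Flavio mentions (not the actual ZOOKEEPER-914 patch; the class and field names here are invented for illustration): bound each blocking read on an accepted connection with SO_TIMEOUT so a peer that connects but never completes the handshake cannot stall the listener. Note that SO_TIMEOUT bounds reads on a plain Socket's input stream; it has no effect on SocketChannel.read() itself, which is why this sketch reads from a stream rather than the channel.

{code:java}
import java.io.DataInputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical sketch, not the actual ZOOKEEPER-914 fix: a listener whose
// per-connection reads are bounded by a configurable SO_TIMEOUT.
public class TimedListener {

    private final int cnxTimeoutMillis; // assumed to come from configuration

    public TimedListener(int cnxTimeoutMillis) {
        this.cnxTimeoutMillis = cnxTimeoutMillis;
    }

    public void listen(int port) throws IOException {
        try (ServerSocket server = new ServerSocket(port)) {
            while (!Thread.currentThread().isInterrupted()) {
                Socket client = server.accept();
                try {
                    // SO_TIMEOUT bounds each blocking read on this socket's stream.
                    client.setSoTimeout(cnxTimeoutMillis);
                    DataInputStream din =
                            new DataInputStream(client.getInputStream());
                    long sid = din.readLong(); // handshake: peer announces its server id
                    handleConnection(sid, client);
                } catch (SocketTimeoutException e) {
                    // Handshake never completed: drop this peer instead of
                    // blocking the accept loop forever.
                    client.close();
                } catch (IOException e) {
                    client.close(); // keep the listener alive on per-peer errors
                }
            }
        }
    }

    private void handleConnection(long sid, Socket client) {
        // hand the identified connection to the connection manager (elided)
    }
}
{code}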
[jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader
[ https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929354#action_12929354 ]

Flavio Junqueira commented on ZOOKEEPER-917:
--------------------------------------------

Hi Vishal, I certainly understand that not having dedicated development time is an issue. I actually didn't know you were interested in cluster membership... I'm glad to hear it, though. On your questions:
# Suppose we have an ensemble comprising 3 servers: A, B, and C. Now suppose that C is the leader, and both A and B follow C. If A disconnects from C for whatever reason (e.g., a network partition) and tries to elect a leader, it won't find any other process in the LOOKING state. It will actually receive a notification from C saying that it is leading and one from B saying that it is following C, both with an earlier leader election epoch. To avoid having A locked out (unable to elect C as leader), we implemented this exception: a process accepts going back to an earlier leader election epoch only if it receives a notification from the leader saying that it is leading and notifications from a quorum saying that they are following; see the sketch after this message.
# I'm not sure if you are referring to the specific problem of this jira or asking about my hypothetical example. Assuming it is the former, the follower (Follower.followLeader()) checks whether the leader is proposing an earlier epoch, and if not, it accepts the leader's snapshot. Because the epoch is the same, all followers will accept the leader's snapshot and follow it.

> Leader election selected incorrect leader
> ------------------------------------------
>
>                 Key: ZOOKEEPER-917
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>    Affects Versions: 3.2.2
>         Environment: Cloudera distribution of zookeeper (patched to never cache DNS entries), Debian lenny
>            Reporter: Alexandre Hardy
>            Priority: Critical
>             Fix For: 3.3.3, 3.4.0
>         Attachments: zklogs-20101102144159SAST.tar.gz
>
> We had three nodes running zookeeper:
> * 192.168.130.10
> * 192.168.130.11
> * 192.168.130.14
> 192.168.130.11 failed, and was replaced by a new node 192.168.130.13 (automated startup). The new node had not participated in any zookeeper quorum previously. The node 192.168.130.11 was permanently removed from service and could not contribute to the quorum any further (powered off). DNS entries were updated for the new node to allow all the zookeeper servers to find the new node.
> The new node 192.168.130.13 was selected as the LEADER, despite the fact that it had not seen the latest zxid.
> This particular problem has not been verified with later versions of zookeeper, and no attempt has been made to reproduce this problem as yet.
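The "earlier epoch" exception in item 1 can be written as a small predicate. The sketch below is loosely modeled on FastLeaderElection but uses simplified stand-in types and names, not ZooKeeper's actual API.

{code:java}
import java.util.Collection;

// Illustrative sketch of the rule: accept an earlier election epoch only if
// the proposed leader reports LEADING and, together with the processes that
// report FOLLOWING it, forms a quorum. Types are simplified stand-ins.
public class EarlierEpochRule {

    enum ServerState { LOOKING, FOLLOWING, LEADING }

    static final class Notification {
        final long sid;           // sender's server id
        final long leader;        // leader the sender claims to lead/follow
        final ServerState state;
        Notification(long sid, long leader, ServerState state) {
            this.sid = sid; this.leader = leader; this.state = state;
        }
    }

    static boolean acceptEarlierEpoch(long proposedLeader,
                                      Collection<Notification> received,
                                      int ensembleSize) {
        boolean leaderIsLeading = false;
        int followers = 0;
        for (Notification n : received) {
            if (n.sid == proposedLeader && n.state == ServerState.LEADING) {
                leaderIsLeading = true;
            } else if (n.state == ServerState.FOLLOWING && n.leader == proposedLeader) {
                followers++;
            }
        }
        // In the 3-server example: C LEADING plus B FOLLOWING gives 2 > 3/2,
        // so A may rejoin C's established quorum despite the older epoch.
        return leaderIsLeading && (followers + 1) > ensembleSize / 2;
    }
}
{code}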
[jira] Commented: (ZOOKEEPER-909) Extract NIO specific code from ClientCnxn
[ https://issues.apache.org/jira/browse/ZOOKEEPER-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929395#action_12929395 ]

Thomas Koch commented on ZOOKEEPER-909:
---------------------------------------

I've added some javadoc and renamed socket to clientCnxnSocket everywhere. I'll upload the patch once the tests have completed. Moving ZooKeeper.state to ClientCnxn.state came in handy at some point to avoid unnecessary indirection and to clean up the object dependency graph (ZOOKEEPER-837). The state variable is accessed only once from ZooKeeper but several times from ClientCnxn, so it makes a lot of sense to move it to where it is needed. SessionExpiredException can actually remain private, but EndOfStreamException is used in ClientCnxnSocketNIO.

> Extract NIO specific code from ClientCnxn
> -----------------------------------------
>
>                 Key: ZOOKEEPER-909
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-909
>             Project: Zookeeper
>          Issue Type: Sub-task
>          Components: java client
>            Reporter: Thomas Koch
>            Assignee: Thomas Koch
>             Fix For: 3.4.0
>         Attachments: ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, ZOOKEEPER-909.patch
>
> This patch is mostly the same as my last one for ZOOKEEPER-823, minus everything Netty related. This means this patch only extracts all NIO-specific code into the class ClientCnxnSocketNIO, which extends ClientCnxnSocket. I've redone this patch from current trunk step by step now and couldn't find any logical error. I've already done a couple of successful test runs and will continue to do so tonight.
> It would be nice if we could apply this patch to trunk as soon as possible. This allows us to continue to work on the Netty integration without blocking the ClientCnxn class. Adding Netty after this patch should be only a matter of adding the ClientCnxnSocketNetty class with the appropriate test cases.
> You could help me by reviewing the patch and by running it on whatever test server you have available. Please send me any complete failure log you should encounter to thomas at koch point ro. Thx!
> Update: Until now, I've collected 8 successful builds in a row!
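For readers unfamiliar with the refactoring pattern, here is the rough shape of such an extraction: a transport-agnostic base class that ClientCnxn talks to, plus an NIO subclass hiding the channel details. The method names below are assumptions for this sketch, not the actual ZOOKEEPER-909 API.

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;

// Illustrative shape of the extraction; names are invented, not the patch API.
abstract class ClientCnxnSocketSketch {
    abstract void connect(InetSocketAddress addr) throws IOException;
    abstract void doTransport(int waitTimeMillis) throws IOException; // pump I/O once
    abstract void cleanup();
}

class ClientCnxnSocketNIOSketch extends ClientCnxnSocketSketch {
    private SocketChannel sockChannel;

    @Override
    void connect(InetSocketAddress addr) throws IOException {
        sockChannel = SocketChannel.open();
        sockChannel.configureBlocking(false); // NIO detail hidden from the caller
        sockChannel.connect(addr);
    }

    @Override
    void doTransport(int waitTimeMillis) throws IOException {
        // selector-based readiness handling would live here (elided)
    }

    @Override
    void cleanup() {
        try {
            if (sockChannel != null) sockChannel.close();
        } catch (IOException ignored) { }
    }
}

// A future ClientCnxnSocketNetty would subclass the same base class,
// leaving the ClientCnxn logic itself untouched.
{code}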
[jira] Updated: (ZOOKEEPER-909) Extract NIO specific code from ClientCnxn
[ https://issues.apache.org/jira/browse/ZOOKEEPER-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Koch updated ZOOKEEPER-909:
----------------------------------

    Attachment: ZOOKEEPER-909.patch

I couldn't wait for the tests to finish, but it compiles, and since the last patch I have only done renames, comments, a visibility change, and removal of dead code.

> Extract NIO specific code from ClientCnxn
> -----------------------------------------
>
>                 Key: ZOOKEEPER-909
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-909
>             Project: Zookeeper
>          Issue Type: Sub-task
>          Components: java client
>            Reporter: Thomas Koch
>            Assignee: Thomas Koch
>             Fix For: 3.4.0
>         Attachments: ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, ZOOKEEPER-909.patch
>
[jira] Updated: (ZOOKEEPER-909) Extract NIO specific code from ClientCnxn
[ https://issues.apache.org/jira/browse/ZOOKEEPER-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Koch updated ZOOKEEPER-909:
----------------------------------

    Status: Patch Available  (was: Open)

> Extract NIO specific code from ClientCnxn
> -----------------------------------------
>
>                 Key: ZOOKEEPER-909
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-909
>             Project: Zookeeper
>          Issue Type: Sub-task
>          Components: java client
>            Reporter: Thomas Koch
>            Assignee: Thomas Koch
>             Fix For: 3.4.0
>         Attachments: ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, ZOOKEEPER-909.patch
>
[jira] Created: (ZOOKEEPER-920) L7 (application layer) ping support
L7 (application layer) ping support
-----------------------------------

                 Key: ZOOKEEPER-920
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-920
             Project: Zookeeper
          Issue Type: New Feature
          Components: c client
    Affects Versions: 3.3.1
            Reporter: Chang Song
            Priority: Minor

ZooKeeper is used in applications where fault tolerance is important. The client I/O thread sends and receives heartbeats to and from the ZooKeeper ensemble to stay connected. However, a healthy heartbeat does not always mean that the application using the ZooKeeper client is in good health; it only means that the ZK client thread is in good health. Thus I needed something that could be tagged onto the ZooKeeper ping to represent L7 (application) health as well.

I have modified the C client source to support this in a minimal way. I am new to ZooKeeper, so please review this code. I am actually using this code in our in-house solution.

https://github.com/tru64ufs/zookeeper/commit/2196d6d5114a2fd2c0a3bc9a55f4494d47d2aece

Thank you very much.
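The linked commit patches the C client; as a language-neutral illustration of the idea only (not the actual change), here is a hypothetical Java sketch in which the client consults an application-supplied health callback before each heartbeat and tags the ping payload with the result. All names below are invented for this sketch.

{code:java}
import java.util.function.BooleanSupplier;

// Hypothetical illustration of the L7-ping idea: each heartbeat carries one
// extra byte reporting application-level health from a registered callback.
public class L7PingSketch {

    private final BooleanSupplier appHealthProbe; // e.g. () -> requestQueueIsDraining()

    public L7PingSketch(BooleanSupplier appHealthProbe) {
        this.appHealthProbe = appHealthProbe;
    }

    /** Build the heartbeat payload: a single flag byte carrying L7 health. */
    public byte[] buildPingPayload() {
        byte healthy = appHealthProbe.getAsBoolean() ? (byte) 1 : (byte) 0;
        return new byte[] { healthy };
    }
}
{code}

A server receiving the flag set to 0 could then treat the session as unhealthy even though the transport-level heartbeat still arrives, which is the distinction the proposal draws between client-thread health and application health.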