[jira] [Created] (ZOOKEEPER-4246) Resource leaks in org.apache.zookeeper.server.persistence.SnapStream#getInputStream and #getOutputStream

2021-03-11 Thread Martin Kellogg (Jira)
Martin Kellogg created ZOOKEEPER-4246:
-

 Summary: Resource leaks in 
org.apache.zookeeper.server.persistence.SnapStream#getInputStream and 
#getOutputStream
 Key: ZOOKEEPER-4246
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4246
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Reporter: Martin Kellogg


 There are three (related) possible resource leaks in the `getInputStream` and 
`getOutputStream` methods in `SnapStream.java`. I noticed the first because of 
the use of the error-prone `GZIPOutputStream`, and the other two after looking 
at the surrounding code.

Here is the offending code (copied from 
[here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/persistence/SnapStream.java#L102]):
{noformat}
/**
 * Return the CheckedInputStream based on the extension of the fileName.
 *
 * @param file the file the InputStream read from
 * @return the specific InputStream
 * @throws IOException
 */
public static CheckedInputStream getInputStream(File file) throws IOException {
    FileInputStream fis = new FileInputStream(file);
    InputStream is;
    switch (getStreamMode(file.getName())) {
    case GZIP:
        is = new GZIPInputStream(fis);
        break;
    case SNAPPY:
        is = new SnappyInputStream(fis);
        break;
    case CHECKED:
    default:
        is = new BufferedInputStream(fis);
    }
    return new CheckedInputStream(is, new Adler32());
}

/**
 * Return the OutputStream based on predefined stream mode.
 *
 * @param file the file the OutputStream writes to
 * @param fsync sync the file immediately after write
 * @return the specific OutputStream
 * @throws IOException
 */
public static CheckedOutputStream getOutputStream(File file, boolean fsync) throws IOException {
    OutputStream fos = fsync ? new AtomicFileOutputStream(file) : new FileOutputStream(file);
    OutputStream os;
    switch (streamMode) {
    case GZIP:
        os = new GZIPOutputStream(fos);
        break;
    case SNAPPY:
        os = new SnappyOutputStream(fos);
        break;
    case CHECKED:
    default:
        os = new BufferedOutputStream(fos);
    }
    return new CheckedOutputStream(os, new Adler32());
}
{noformat}

All three possible resource leaks are caused by the constructors of the 
intermediate streams (i.e. `is` and `os`), some of which might throw 
`IOException`s:
 * in `getOutputStream`, the call to `new GZIPOutputStream` can throw an 
`IOException`, because `GZIPOutputStream` writes out the GZIP header in its 
constructor. If it does throw, then `fos` is never closed. This behavior makes 
`GZIPOutputStream` hard to use correctly; it was raised as an issue with the 
JDK maintainers [here|https://bugs.openjdk.java.net/browse/JDK-8180899], but 
they closed it as "won't fix" because the constructor is documented to throw 
(hence the need to catch the exception here).
 * in `getInputStream`, the call to `new GZIPInputStream` can throw an 
`IOException` for a similar reason, causing the file handle held by `fis` to 
leak.
 * similarly, the call to `new SnappyInputStream` can throw an `IOException`, 
because it tries to read the file header during construction, which also causes 
`fis` to leak. `SnappyOutputStream` cannot throw; I checked 
[here|https://github.com/xerial/snappy-java/blob/master/src/main/java/org/xerial/snappy/SnappyOutputStream.java].
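The pattern for fixing all three leaks is the same: construct the intermediate stream inside a try/catch and close the underlying file stream before rethrowing. Below is a minimal sketch of that pattern for the input side, not the actual patch; the class name `SnapStreamFixSketch` and the `.gz` extension check are my own stand-ins (the real code dispatches on `getStreamMode`).

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.Adler32;
import java.util.zip.CheckedInputStream;
import java.util.zip.GZIPInputStream;

public class SnapStreamFixSketch {

    /**
     * Like SnapStream#getInputStream, but closes the FileInputStream if the
     * wrapping decompressor constructor throws. GZIPInputStream reads the
     * GZIP header during construction, so it can throw an IOException.
     */
    public static CheckedInputStream getInputStream(File file) throws IOException {
        FileInputStream fis = new FileInputStream(file);
        InputStream is;
        try {
            if (file.getName().endsWith(".gz")) {  // stand-in for getStreamMode
                is = new GZIPInputStream(fis);
            } else {
                is = new BufferedInputStream(fis);
            }
        } catch (IOException e) {
            fis.close();  // avoid leaking the file handle on constructor failure
            throw e;
        }
        return new CheckedInputStream(is, new Adler32());
    }
}
```

An equivalent try/catch around the `GZIPOutputStream` construction in `getOutputStream` would close `fos` on failure in the same way.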

I'll submit a PR with a (simple) fix shortly after this bug report goes up and 
gets assigned an issue number, and add a link to this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[3.4.12] Missing OVERSEER doc. solr-user@lucene....

2019-05-21 Thread Will Martin
Cross-posting this for a sound reporter. He is a top technical resource on 
the list and not given to hyperbole in bug reports.

Is there an ACL'd JIRA for ZooKeeper?


to solr-user

We have a 6.6.2 cluster in prod that appears to have no overseer. In 
/overseer_elect on ZK, there is an election folder, but no leader document. An 
OVERSEERSTATUS request fails with a timeout.

I'm going to try ADDROLE, but I'd be delighted to hear any other ideas. We've 
diverted all the traffic to the backing cluster, so we can blow this one away 
and rebuild.

Looking at the Zookeeper logs, I see a few instances of network failures across 
all three nodes.


I *have the logs* from each of the Zookeepers.

We are running 3.4.12.




[jira] [Commented] (ZOOKEEPER-2125) SSL on Netty client-server communication

2018-10-04 Thread Martin M (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638309#comment-16638309
 ] 

Martin M commented on ZOOKEEPER-2125:
-

In this Jira item it is indicated that, in order to configure SSL, one has to 
specify secureClientPort in zoo.cfg.

However, the documentation 
(https://zookeeper.apache.org/doc/r3.5.2-alpha/zookeeperReconfig.html) says:
{noformat}
Starting with 3.5.0 the clientPort and clientPortAddress configuration 
parameters should no longer be used
{noformat}
I have tried setting the SSL port in the dynamic configuration, but it doesn't 
work. The secureClientPort must be specified in zoo.cfg.
When I specify secureClientPort in both zoo.cfg and the dynamic configuration, 
the ZK server doesn't start anymore.
How can this be fixed?
Thanks

> SSL on Netty client-server communication
> 
>
> Key: ZOOKEEPER-2125
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2125
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Hongchao Deng
>Assignee: Hongchao Deng
>Priority: Major
> Fix For: 3.5.1, 3.6.0
>
> Attachments: ZOOKEEPER-2125-build.patch, ZOOKEEPER-2125.patch, 
> ZOOKEEPER-2125.patch, ZOOKEEPER-2125.patch, ZOOKEEPER-2125.patch, 
> ZOOKEEPER-2125.patch, ZOOKEEPER-2125.patch, ZOOKEEPER-2125.patch, 
> ZOOKEEPER-2125.patch, ZOOKEEPER-2125.patch, ZOOKEEPER-2125.patch, 
> ZOOKEEPER-2125.patch, ZOOKEEPER-2125.patch, ZOOKEEPER-2125.patch, 
> ZOOKEEPER-2125.patch, ZOOKEEPER-2125.patch, ZOOKEEPER-2125.patch, 
> ZOOKEEPER-2125.patch, ZOOKEEPER-2125.patch, testKeyStore.jks, 
> testTrustStore.jks
>
>
> Supporting SSL on Netty client-server communication. 
> 1. It supports keystore and trustore usage. 
> 2. It adds an additional ZK server port which supports SSL. This would be 
> useful for rolling upgrade.
> RB: https://reviews.apache.org/r/31277/
> The patch includes three files: 
> * testing purpose keystore and truststore under 
> "$(ZK_REPO_HOME)/src/java/test/data/ssl". Might need to create "ssl/".
> * latest ZOOKEEPER-2125.patch
> h2. How to use it
> You need to set some parameters on both ZK server and client.
> h3. Server
> You need to specify a listening SSL port in "zoo.cfg":
> {code}
> secureClientPort=2281
> {code}
> Just like what you did with "clientPort". And then set some jvm flags:
> {code}
> export 
> SERVER_JVMFLAGS="-Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
>  -Dzookeeper.ssl.keyStore.location=/root/zookeeper/ssl/testKeyStore.jks 
> -Dzookeeper.ssl.keyStore.password=testpass 
> -Dzookeeper.ssl.trustStore.location=/root/zookeeper/ssl/testTrustStore.jks 
> -Dzookeeper.ssl.trustStore.password=testpass"
> {code}
> Please change keystore and truststore parameters accordingly.
> h3. Client
> You need to set jvm flags:
> {code}
> export 
> CLIENT_JVMFLAGS="-Dzookeeper.clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
>  -Dzookeeper.client.secure=true 
> -Dzookeeper.ssl.keyStore.location=/root/zookeeper/ssl/testKeyStore.jks 
> -Dzookeeper.ssl.keyStore.password=testpass 
> -Dzookeeper.ssl.trustStore.location=/root/zookeeper/ssl/testTrustStore.jks 
> -Dzookeeper.ssl.trustStore.password=testpass"
> {code}
> change keystore and truststore parameters accordingly.
> And then connect to the server's SSL port, in this case:
> {code}
> bin/zkCli.sh -server 127.0.0.1:2281
> {code}
> If you have any feedback, you are more than welcome to discuss it here!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ZOOKEEPER-2125) SSL on Netty client-server communication

2018-10-04 Thread Martin M (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638309#comment-16638309
 ] 

Martin M edited comment on ZOOKEEPER-2125 at 10/4/18 2:40 PM:
--

In this Jira item it is indicated that in order to configure SSL, one has to 
specify the secureClientPort in zoo.cfg.

However, the documentation 
(https://zookeeper.apache.org/doc/r3.5.2-alpha/zookeeperReconfig.html) says:
{noformat}
Starting with 3.5.0 the clientPort and clientPortAddress configuration 
parameters should no longer be used
{noformat}
I have tried setting the SSL port in the dynamic configuration, but it doesn't 
work. The secureClientPort must be specified in zoo.cfg.
When I specify secureClientPort in both zoo.cfg and the dynamic configuration, 
the ZK server doesn't start anymore.
How can this be fixed?
Thanks


was (Author: mar.ian):
In this Jira item it is indicated that in order to configure SSl, one has to 
specified secureClientPort in zoo.cfg.

However, the documentation 
(https://zookeeper.apache.org/doc/r3.5.2-alpha/zookeeperReconfig.html) says:
{noformat}
Starting with 3.5.0 the clientPort and clientPortAddress configuration 
parameters should no longer be used
{noformat}
I have tried setting the SSL port in the dynamic configuration, but it doesnt 
work. The secureClientPort must be specified in zoo.cfg.
When i specify both securePortClient in zoo.cfg and the dynamic configuration, 
ZK server doesnt start anymore.
How to fix this?
Thanks



[jira] [Updated] (ZOOKEEPER-2099) Using txnlog to sync a learner can corrupt the learner's datatree

2016-10-25 Thread Martin Kuchta (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Kuchta updated ZOOKEEPER-2099:
-
Attachment: ZOOKEEPER-2099.patch

I updated the test so it hopefully won't see connection loss as a result of its 
own actions. For the two parts of the test where it causes a new leader to be 
elected, it now looks for the disconnect and reconnect on the correct client to 
make sure things have recovered completely. Using the existing CountdownWatcher 
utility class seemed to be the best fit for that, since using 
QuorumPeerMainTest::waitForOne might miss the disconnection event if the client 
reconnects while waitForOne is sleeping.

I did add some asynchronous methods to wait for connection or disconnection in 
CountdownWatcher to be able to start waiting for the disconnection before 
shutting down the server, avoiding a potential race condition that might cause 
the disconnection to be missed. They're just simple wrappers that return a 
Future and kick off waitForDisconnect or waitForConnect in a thread.
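Under the assumption that the watcher counts down a latch when the disconnect event arrives, such a wrapper might look like the sketch below; the names here are my own stand-ins, not the actual CountdownWatcher API from ZooKeeper's test utilities.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for the CountdownWatcher utility described above.
class WatcherSketch {
    private final CountDownLatch disconnected = new CountDownLatch(1);

    // Called by the watcher when a Disconnected event arrives.
    void onDisconnected() {
        disconnected.countDown();
    }

    // Blocking wait, as in the existing synchronous method.
    void waitForDisconnected(long timeoutMs) throws InterruptedException, TimeoutException {
        if (!disconnected.await(timeoutMs, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException("no disconnect within " + timeoutMs + "ms");
        }
    }

    // Async wrapper: the test can start waiting *before* shutting the server
    // down, so the disconnection event cannot be missed in a race.
    Future<Void> waitForDisconnectedAsync(long timeoutMs) {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        try {
            return ex.submit(() -> {
                waitForDisconnected(timeoutMs);
                return null;
            });
        } finally {
            ex.shutdown();  // the worker finishes its task, then the executor exits
        }
    }
}
```

A test would hold the returned Future, trigger the leader shutdown, and then call `get()` on the Future to confirm the disconnect was observed.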

I still couldn't get the old version of the test to fail locally, so there 
might be issues with this one too, but I think it should be an improvement.

> Using txnlog to sync a learner can corrupt the learner's datatree
> -
>
> Key: ZOOKEEPER-2099
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2099
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.6.0
>Reporter: Santeri (Santtu) Voutilainen
>Assignee: Martin Kuchta
> Attachments: ZOOKEEPER-2099-repro.patch, ZOOKEEPER-2099.patch, 
> ZOOKEEPER-2099.patch
>
>
> When a learner sync's with the leader, it is possible for the Leader to send 
> the learner a DIFF that does NOT contain all the transactions between the 
> learner's zxid and that of the leader's zxid, thus resulting in a corrupted 
> datatree on the learner.
> For this to occur, the leader must have sync'd with a previous leader using a 
> SNAP and the zxid requested by the learner must still exist in the current 
> leader's txnlog files.
> This issue was introduced by ZOOKEEPER-1413.
> *Scenario*
> A sample sequence in which this issue occurs:
> # Hosts H1 and H2 disconnect from the current leader H3 (crash, network 
> partition, etc).  The last zxid on these hosts is Z1.
> # Additional transactions occur on the cluster resulting in the latest zxid 
> being Z2.
> # Host H1 recovers and connects to H3 to sync and sends Z1 as part of its 
> FOLLOWERINFO or OBSERVERINFO packet.
> # The leader, H3, decides to send a SNAP because a) it does not have the 
> necessary records in the in-mem committed log, AND b) the size of the 
> required txnlog to send it larger than the limit.
> # Host H1 successfully sync's with the leader (H3). At this point H1's 
> txnlogs have records up to and including Z1 as well as Z2 and up.  It does 
> NOT have records between Z1 and Z2.
> # Host H3 fails; a leader election occurs and H1 is chosen as the leader
> # Host H2 recovers and connects to H1 to sync and sends Z1 in its 
> FOLLOWERINFO/OBSERVERINFO packet
> # The leader, H1, determines it can send a DIFF.  It concludes this because 
> although it does not have the necessary records in its in-memory commit log, 
> it does have Z1 in its txnlog and the size of the log is less than the limit. 
>  H1 ends up with a different size calculation than H3 because H1 is missing 
> all the records between Z1 and Z2 so it has less log to send.
> # H2 receives the DIFF and applies the records to its data tree. Depending on 
> the type of transactions that occurred between Z1 and Z2 it may not hit any 
> errors when applying these records.
> H2 now has a corrupted view of the data tree because it is missing all the 
> changes made by the transactions between Z1 and Z2.
> *Recovery*
> The way to recover from this situation is to delete the data/snap directory 
> contents from the affected hosts and have them resync with the leader at 
> which point they will receive a SNAP since they will appear as empty hosts.
> *Workaround*
> A quick workaround for anyone concerned about this issue is to disable sync 
> from the txnlog by changing the database size limit to 0.  This is a code 
> change as it is not a configurable setting.
> *Potential fixes*
> There are several ways of fixing this.  A few of options:
> * Delete all snaps and txnlog files on a host when it receives a SNAP from 
> the leader
> * Invalidate sync from txnlog after receiving a SNAP. This state must also be 
> persisted on-disk so that the txnlogs with the gap canno

[jira] [Commented] (ZOOKEEPER-2099) Using txnlog to sync a learner can corrupt the learner's datatree

2016-10-20 Thread Martin Kuchta (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593379#comment-15593379
 ] 

Martin Kuchta commented on ZOOKEEPER-2099:
--

As a start, I think I should change the test to use waitForOne on the client 
being used for the operation after shutting down server 4 instead of 
waitForServerUp. This will at least make sure the test is waiting for the right 
thing before moving on and creating paths. I do expect the clients to be 
disconnected when taking down server 4 (since that should be the current 
leader), and if that's the only source of connection loss, the test might not 
need to retry.

As far as I can tell, most other tests don't retry on connection loss. Should 
they?



> Using txnlog to sync a learner can corrupt the learner's datatree
> -
>
> Key: ZOOKEEPER-2099
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2099
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.6.0
>Reporter: Santeri (Santtu) Voutilainen
>Assignee: Martin Kuchta
> Attachments: ZOOKEEPER-2099-repro.patch, ZOOKEEPER-2099.patch
>
>
> When a learner sync's with the leader, it is possible for the Leader to send 
> the learner a DIFF that does NOT contain all the transactions between the 
> learner's zxid and that of the leader's zxid thus resulting in a corruption 
> datatree on the learner.
> For this to occur, the leader must have sync'd with a previous leader using a 
> SNAP and the zxid requested by the learner must still exist in the current 
> leader's txnlog files.
> This issue was introduced by ZOOKEEPER-1413.
> *Scenario*
> A sample sequence in which this issue occurs:
> # Hosts H1 and H2 disconnect from the current leader H3 (crash, network 
> partition, etc).  The last zxid on these hosts is Z1.
> # Additional transactions occur on the cluster resulting in the latest zxid 
> being Z2.
> # Host H1 recovers and connects to H3 to sync and sends Z1 as part of its 
> FOLLOWERINFO or OBSERVERINFO packet.
> # The leader, H3, decides to send a SNAP because a) it does not have the 
> necessary records in the in-mem committed log, AND b) the size of the 
> required txnlog to send it larger than the limit.
> # Host H1 successfully sync's with the leader (H3). At this point H1's 
> txnlogs have records up to and including Z1 as well as Z2 and up.  It does 
> NOT have records between Z1 and Z2.
> # Host H3 fails; a leader election occurs and H1 is chosen as the leader
> # Host H2 recovers and connects to H1 to sync and sends Z1 in its 
> FOLLOWERINFO/OBSERVERINFO packet
> # The leader, H1, determines it can send a DIFF.  It concludes this because 
> although it does not have the necessary records in its in-memory commit log, 
> it does have Z1 in its txnlog and the size of the log is less than the limit. 
>  H1 ends up with a different size calculation than H3 because H1 is missing 
> all the records between Z1 and Z2 so it has less log to send.
> # H2 receives the DIFF and applies the records to its data tree. Depending on 
> the type of transactions that occurred between Z1 and Z2 it may not hit any 
> errors when applying these records.
> H2 now has a corrupted view of the data tree because it is missing all the 
> changes made by the transactions between Z1 and Z2.
> *Recovery*
> The way to recover from this situation is to delete the data/snap directory 
> contents from the affected hosts and have them resync with the leader at 
> which point they will receive a SNAP since they will appear as empty hosts.
> *Workaround*
> A quick workaround for anyone concerned about this issue is to disable sync 
> from the txnlog by changing the database size limit to 0.  This is a code 
> change as it is not a configurable setting.
> *Potential fixes*
> There are several ways of fixing this.  A few of options:
> * Delete all snaps and txnlog files on a host when it receives a SNAP from 
> the leader
> * Invalidate sync from txnlog after receiving a SNAP. This state must also be 
> persisted on-disk so that the txnlogs with the gap cannot be used to provide 
> a DIFF even after restart.  A couple ways in which the state could be 
> persisted:
> ** Write a file (for example: loggap.) in the data dir indicating that 
> the host was sync'd with a SNAP and thus txnlogs might be missing. Presence 
> of these files would be checked when reading txnlogs.
> ** Write a new record into the txnlog file as "sync'd-by-snap-from-leader" 
> marker. Readers of the txnlog would then check for presence of this record 
> when iterating through it and act appropriately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2099) Using txnlog to sync a learner can corrupt the learner's datatree

2016-10-20 Thread Martin Kuchta (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593051#comment-15593051
 ] 

Martin Kuchta commented on ZOOKEEPER-2099:
--

Looks like the test needs to be hardened a bit. I think I see the issue - 
QuorumBase.waitForServerUp doesn't guarantee that the client the test is using 
to create the nodes is also connected. I ran the test a few dozen times on my 
machine and saw no failures, but that's obviously not good enough.

As for the testLE failure, that seems to be a known flaky test (ZOOKEEPER-1932)

> Using txnlog to sync a learner can corrupt the learner's datatree
> -
>
> Key: ZOOKEEPER-2099
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2099
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.6.0
>Reporter: Santeri (Santtu) Voutilainen
>Assignee: Martin Kuchta
> Attachments: ZOOKEEPER-2099-repro.patch, ZOOKEEPER-2099.patch
>
>
> When a learner sync's with the leader, it is possible for the Leader to send 
> the learner a DIFF that does NOT contain all the transactions between the 
> learner's zxid and that of the leader's zxid, thus resulting in a corrupted 
> datatree on the learner.
> For this to occur, the leader must have sync'd with a previous leader using a 
> SNAP and the zxid requested by the learner must still exist in the current 
> leader's txnlog files.
> This issue was introduced by ZOOKEEPER-1413.
> *Scenario*
> A sample sequence in which this issue occurs:
> # Hosts H1 and H2 disconnect from the current leader H3 (crash, network 
> partition, etc).  The last zxid on these hosts is Z1.
> # Additional transactions occur on the cluster resulting in the latest zxid 
> being Z2.
> # Host H1 recovers and connects to H3 to sync and sends Z1 as part of its 
> FOLLOWERINFO or OBSERVERINFO packet.
> # The leader, H3, decides to send a SNAP because a) it does not have the 
> necessary records in the in-mem committed log, AND b) the size of the 
> required txnlog to send is larger than the limit.
> # Host H1 successfully sync's with the leader (H3). At this point H1's 
> txnlogs have records up to and including Z1 as well as Z2 and up.  It does 
> NOT have records between Z1 and Z2.
> # Host H3 fails; a leader election occurs and H1 is chosen as the leader
> # Host H2 recovers and connects to H1 to sync and sends Z1 in its 
> FOLLOWERINFO/OBSERVERINFO packet
> # The leader, H1, determines it can send a DIFF.  It concludes this because 
> although it does not have the necessary records in its in-memory commit log, 
> it does have Z1 in its txnlog and the size of the log is less than the limit. 
>  H1 ends up with a different size calculation than H3 because H1 is missing 
> all the records between Z1 and Z2 so it has less log to send.
> # H2 receives the DIFF and applies the records to its data tree. Depending on 
> the type of transactions that occurred between Z1 and Z2 it may not hit any 
> errors when applying these records.
> H2 now has a corrupted view of the data tree because it is missing all the 
> changes made by the transactions between Z1 and Z2.
> *Recovery*
> The way to recover from this situation is to delete the data/snap directory 
> contents from the affected hosts and have them resync with the leader at 
> which point they will receive a SNAP since they will appear as empty hosts.
> *Workaround*
> A quick workaround for anyone concerned about this issue is to disable sync 
> from the txnlog by changing the database size limit to 0.  This is a code 
> change as it is not a configurable setting.
> *Potential fixes*
> There are several ways of fixing this.  A few options:
> * Delete all snaps and txnlog files on a host when it receives a SNAP from 
> the leader
> * Invalidate sync from txnlog after receiving a SNAP. This state must also be 
> persisted on-disk so that the txnlogs with the gap cannot be used to provide 
> a DIFF even after restart.  A couple ways in which the state could be 
> persisted:
> ** Write a file (for example: loggap.) in the data dir indicating that 
> the host was sync'd with a SNAP and thus txnlogs might be missing. Presence 
> of these files would be checked when reading txnlogs.
> ** Write a new record into the txnlog file as "sync'd-by-snap-from-leader" 
> marker. Readers of the txnlog would then check for presence of this record 
> when iterating through it and act appropriately.





[jira] [Updated] (ZOOKEEPER-2099) Using txnlog to sync a learner can corrupt the learner's datatree

2016-10-20 Thread Martin Kuchta (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Kuchta updated ZOOKEEPER-2099:
-
Attachment: ZOOKEEPER-2099.patch

Attached is an initial attempt at fixing this.

I was investigating an issue with massive inconsistency between ZK servers and 
developed my own test to reproduce the issue. I originally thought our issue 
was isolated to ephemeral znodes so I didn't find this issue in my searches, 
but the test I wrote ended up being almost exactly the same as the one provided 
by [~svoutil]. I did find another detail in my investigation that seems 
important - this bug not only causes the leader to not send the correct updates 
to the follower, but the leader can also incorrectly tell the follower to 
truncate its log, resulting in the loss of even more transactions on the 
follower. This requires a slightly different sequence of events (a few more 
steps). I found it easier to use my original test as a base and make some 
improvements using the test submitted here, so that's why the test I'm 
submitting looks so different.

h4. Test

Here's a description of what the test does:

# Enable forceSnapSync
# Set a very high snapshotSizeFactor to guarantee a DIFF if forceSnapSync is off
# Start 5 servers
# Create a baseline znode /w (not necessary, but shows where the data loss 
starts)
# Shutdown SID 4
# Create /x while SID 4 is down
# Shutdown SID 0
# Create /y while SIDs 0 and 4 are down
# Start SID 4 (which receives a SNAP from the current leader because of 
forceSnapSync=true)
# Create /z while SID 0 is down
# Disable forceSnapSync
# Shutdown current leader - SID 4 becomes leader
# Start SID 0 (which receives a TRUNC from SID 4 without the fix and a SNAP 
with the fix)
# Check for the presence of all znodes on all servers (without the fix, SID 0 
is missing /x and /y)

More detail on what goes wrong in step 13:

(Using W = the zxid of the transaction which creates /w, X for /x, etc.)

At this point, SID 4 has W and Z in its log and it has a snapshot containing 
the updates from W, X, and Y. It tries to sync with SID 0 (whose last zxid is 
Y), and iterates through its log until it finds a zxid > Y. It then looks back 
at the previous log entry (W), sees that W < Y, and tells SID 0 to truncate its 
log to W. After this, it starts sending updates at Z. SID 0 therefore deletes X 
and misses Y. The only correct thing for SID 4 to do here is to send a snapshot.
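The faulty truncation decision can be modeled in isolation. The sketch below is illustrative only (method and variable names are made up, and this is not the actual LearnerHandler code): the leader scans its log for the first zxid newer than the peer's, and because the previous log entry (W) is older than the peer's zxid (Y), it wrongly concludes the peer is ahead and must truncate.

```java
import java.util.List;

// Toy model of the flawed sync decision (illustrative; not actual ZooKeeper
// code). The leader's log holds W and Z but is missing X and Y due to the gap.
public class TruncBugSketch {
    // Returns the zxid the peer is told to truncate to, or -1 for no TRUNC.
    static long decideTrunc(List<Long> leaderLogZxids, long peerLastZxid) {
        long prev = 0;
        for (long zxid : leaderLogZxids) {
            if (zxid > peerLastZxid) {
                // First proposal newer than the peer: if the previous log
                // entry is older, the leader wrongly concludes the peer is
                // AHEAD of it and orders a truncation.
                return (prev < peerLastZxid) ? prev : -1;
            }
            prev = zxid;
        }
        return -1;
    }

    public static void main(String[] args) {
        long W = 1, X = 2, Y = 3, Z = 4;            // zxids of the create txns
        // SID 4's log after the forced SNAP sync: W and Z, gap at X and Y.
        long trunc = decideTrunc(List.of(W, Z), Y); // peer (SID 0) is at Y
        System.out.println(trunc);                  // "truncate to W": deletes X,
                                                    // and Y is never sent
    }
}
```

With a complete log (W, X, Y, Z) the same scan correctly finds no truncation is needed, which is why the bug only appears after a SNAP-induced gap.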

h4. Fix

The approach of writing a file with the last SNAP received to the data dir and 
checking that value when trying to sync with a follower seems best. The patch 
adds code to ZKDatabase to handle this file (called lastSnapReceived). 
LearnerHandler checks this lastSnapReceived value, and if it falls in the range 
of transactions a follower needs in syncFollower, a snapshot is sent.
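A minimal model of that guard is below. The name mustSendSnap and its standalone form are hypothetical; in the patch the equivalent logic sits in LearnerHandler/ZKDatabase and consults the persisted lastSnapReceived value.

```java
// Hedged sketch of the proposed guard (illustrative names, not patch code).
public class SnapGuard {
    // True when the follower's last zxid predates the zxid at which THIS
    // server was itself synced by a SNAP: the local txnlog may have a gap in
    // that range, so only a full snapshot is safe to send.
    static boolean mustSendSnap(long lastSnapReceived, long peerLastZxid) {
        return peerLastZxid < lastSnapReceived;
    }

    public static void main(String[] args) {
        // Scenario zxids W=1, X=2, Y=3; this server received a SNAP at zxid 4.
        System.out.println(mustSendSnap(4, 3));  // peer inside the gap: SNAP
        System.out.println(mustSendSnap(4, 5));  // peer past the gap: DIFF ok
    }
}
```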

We desperately need this fix because of the massive issues the bug is causing, 
so I will be doing as much testing as I can around it before fixing our 
internal version of ZK. It would be great to also get it polished to a state 
where it could be included in a future 3.5.x version.

Some big points to discuss:
* What should ZKDatabase/Learner do if it can't create or write to the file? It 
currently doesn't handle any exceptions which will result in the Learner 
stopping. This ensures correctness, but introduces another way for a Learner to 
fail.
* What should ZKDatabase/LearnerHandler do if it can't read the file? 
LearnerHandler currently catches all exceptions and falls back to sending a 
SNAP. This is always correct, but there will be performance loss in syncing new 
learners if the file becomes unreadable/corrupted somehow.
* Is there risk with upgrades or downgrades? It doesn't seem like there should 
be. Versions without the fix will just ignore the file if it's present in their 
data dir. Upgrading from a version without the fix to a version with the fix 
will result in the file being written when initializing the ZKDatabase.

Smaller points I couldn't decide on:
* Is it acceptable to enforce snapLog being non-null when constructing a 
ZKDatabase now? I had to modify some unit tests, but I liked that better than a 
test-only null check in the constructor.
* Should zookeeper.forceSnapSync and zookeeper.snapshotSizeFactor be settable 
system properties? The property names were included in the relevant classes but 
never used, and I wasn't sure if that was intended or not.
* Is IOUtils a good home for writeLongToFileAtomic since QuorumPeer and 
ZKDatabase both need that logic now?

Patch generated against master and seems to apply to branch-3.5. Not needed in 
branch-3.4 since the issue was introduced in 3.5.0



[jira] [Commented] (ZOOKEEPER-1485) client xid overflow is not handled

2016-06-13 Thread Martin Kuchta (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328679#comment-15328679
 ] 

Martin Kuchta commented on ZOOKEEPER-1485:
--

I was thinking about the value of a test case around the overflow condition 
here. Like you said, we would need to add a set_xid function and move the xid 
variable outside the get_xid function to let both functions access it. I didn't 
see an existing pattern to add test-only code either. Do you have any 
suggestions?

As for the performance issue, that's also a valid point. I was a bit hesitant 
to replace the atomic operation at first, but I'm not sure it's actually an 
issue. The locking happens once per client request, the lock is held for a very 
short period of time, and each request already performs other locking on the ZK 
handle (enter_critical(), lock_buffer_list()). The one difference here might be 
that this is a global lock not tied to a particular zhandle, which could cause 
performance issues with multiple threads making requests with different 
zhandles.

I don't think you can correctly implement this using a plain CAS and atomic 
add. Any way you combine those operations, I think there's a chance the CAS 
won't trigger on the value you're checking against. Did you have a specific 
implementation in mind? I might be approaching it from the wrong angle when 
trying to reason about possible implementations.
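To make the concern concrete: the race only disappears when the increment itself happens through the CAS, i.e. a retry loop, rather than a separate unconditional atomic add combined with a CAS. A Java sketch of that shape (illustrative only; the client in question here is the C one, and this is not its actual code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a wrap-around XID generator as a single CAS retry loop
// (illustrative, not actual client code). The CAS succeeds only against the
// exact value observed, so two threads can never claim the same XID, and the
// counter wraps to 1 before it would go negative (negative XIDs are reserved
// for special packets such as ping and auth).
public class XidGenerator {
    private final AtomicInteger xid = new AtomicInteger(1);

    public int nextXid() {
        for (;;) {
            int current = xid.get();
            int next = (current == Integer.MAX_VALUE) ? 1 : current + 1;
            if (xid.compareAndSet(current, next)) {
                return current;  // hand out the value we observed
            }
            // CAS lost to another thread; re-read and retry.
        }
    }
}
```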

> client xid overflow is not handled
> --
>
> Key: ZOOKEEPER-1485
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1485
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, java client
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Michi Mutsuzaki
>Assignee: Martin Kuchta
> Attachments: ZOOKEEPER-1485.patch
>
>
> Both Java and C clients use signed 32-bit int as XIDs. XIDs are assumed to be 
> non-negative, and zookeeper uses some negative values as special XIDs (e.g. 
> -2 for ping, -4 for auth). However, neither Java nor C client ensures the 
> XIDs it generates are non-negative, and the server doesn't reject negative 
> XIDs.
> Pat had some suggestions on how to fix this:
> - (bin-compat) Expire the session when the client sends a negative XID.
> - (bin-incompat) In addition to expiring the session, use 64-bit int for XID 
> so that overflow will practically never happen.
> --Michi





[jira] [Updated] (ZOOKEEPER-1485) client xid overflow is not handled

2016-06-13 Thread Martin Kuchta (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Kuchta updated ZOOKEEPER-1485:
-
Attachment: ZOOKEEPER-1485.patch

I am uploading a patch that takes the approach of wrapping the XID around when 
the max value is reached. The Java client and both multi- and single-threaded C 
implementations have been modified. They also now all initialize the XID to 1 
to match the Java client's behavior (the C implementations used epoch time 
before), though I don't think it matters.

We're likely going to deploy this fix internally, since we're hitting the 
overflow issue somewhat frequently given the number of ZooKeeper deployments we 
have and the volume of requests they're processing. It sounds like [~fanster.z] 
fixed the issue in a similar way.

We can discuss whether mainline ZooKeeper should take a different approach 
toward fixing this based on the points made above.

> client xid overflow is not handled
> --
>
> Key: ZOOKEEPER-1485
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1485
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, java client
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Michi Mutsuzaki
>Assignee: Martin Kuchta
> Attachments: ZOOKEEPER-1485.patch





[jira] [Assigned] (ZOOKEEPER-900) FLE implementation should be improved to use non-blocking sockets

2016-06-09 Thread Martin Kuchta (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Kuchta reassigned ZOOKEEPER-900:
---

Assignee: Martin Kuchta  (was: Vishal Kher)

> FLE implementation should be improved to use non-blocking sockets
> -
>
> Key: ZOOKEEPER-900
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-900
> Project: ZooKeeper
>  Issue Type: Bug
>Reporter: Vishal Kher
>    Assignee: Martin Kuchta
>Priority: Critical
> Fix For: 3.5.2, 3.6.0
>
> Attachments: ZOOKEEPER-900-part2.patch, ZOOKEEPER-900.patch, 
> ZOOKEEPER-900.patch1, ZOOKEEPER-900.patch2
>
>
> From earlier email exchanges:
> 1. Blocking connects and accepts:
> a) The first problem is in manager.toSend(). This invokes connectOne(), which 
> does a blocking connect. While testing, I changed the code so that 
> connectOne() starts a new thread called AsyncConnct(). AsyncConnect.run() 
> does a socketChannel.connect(). After starting AsyncConnect, connectOne 
> starts a timer. connectOne continues with normal operations if the connection 
> is established before the timer expires, otherwise, when the timer expires it 
> interrupts AsyncConnect() thread and returns. In this way, I can have an 
> upper bound on the amount of time we need to wait for connect to succeed. Of 
> course, this was a quick fix for my testing. Ideally, we should use Selector 
> to do non-blocking connects/accepts. I am planning to do that later once we 
> at least have a quick fix for the problem and consensus from others for the 
> real fix (this problem is big blocker for us). Note that it is OK to do 
> blocking IO in SenderWorker and RecvWorker threads since they block IO to the 
> respective peer.
> b) The blocking IO problem is not just restricted to connectOne(), but also 
> in receiveConnection(). The Listener thread calls receiveConnection() for 
> each incoming connection request. receiveConnection does blocking IO to get 
> peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the 
> peer that had sent the connection request. All of this is happening from the 
> Listener. In short, if a peer fails after initiating a connection, the 
> Listener thread won't be able to accept connections from other peers, because 
> it would be stuck in read() or connectOne(). Also the code has an inherent 
> cycle. initiateConnection() and receiveConnection() will have to be very 
> carefully synchronized otherwise, we could run into deadlocks. This code is 
> going to be difficult to maintain/modify.
> Also see: https://issues.apache.org/jira/browse/ZOOKEEPER-822
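The Selector-based approach the email calls for can be sketched as follows. This is an illustration of the technique (bounded, non-blocking connect), not the eventual ZOOKEEPER-900 patch:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Sketch of a bounded non-blocking connect via Selector (illustrative; not
// the actual ZOOKEEPER-900 fix). The calling thread is never stuck inside
// connect(): it either finishes within timeoutMs or gives up cleanly.
public class NonBlockingConnect {
    public static SocketChannel connectWithTimeout(InetSocketAddress addr, long timeoutMs)
            throws IOException {
        SocketChannel ch = SocketChannel.open();
        ch.configureBlocking(false);
        if (ch.connect(addr)) {
            return ch;                              // connected immediately
        }
        try (Selector selector = Selector.open()) {
            ch.register(selector, SelectionKey.OP_CONNECT);
            if (selector.select(timeoutMs) > 0 && ch.finishConnect()) {
                return ch;                          // handshake done in time
            }
        } catch (IOException e) {
            ch.close();                             // connect failed outright
            throw e;
        }
        ch.close();                                 // timed out
        return null;
    }
}
```

A Listener using this shape can keep accepting other peers while a slow connect runs out its timeout, which is exactly the property the blocking connectOne() lacks.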





[jira] [Updated] (ZOOKEEPER-2164) fast leader election keeps failing

2016-06-09 Thread Martin Kuchta (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Kuchta updated ZOOKEEPER-2164:
-
Assignee: Hongchao Deng  (was: Martin Kuchta)

> fast leader election keeps failing
> --
>
> Key: ZOOKEEPER-2164
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection
>Affects Versions: 3.4.5
>Reporter: Michi Mutsuzaki
>Assignee: Hongchao Deng
> Fix For: 3.5.2, 3.6.0
>
>
> I have a 3-node cluster with sids 1, 2 and 3. Originally 2 is the leader. 
> When I shut down 2, 1 and 3 keep going back to leader election. Here is what 
> seems to be happening.
> - Both 1 and 3 elect 3 as the leader.
> - 1 receives votes from 3 and itself, and starts trying to connect to 3 as a 
> follower.
> - 3 doesn't receive votes for 5 seconds because connectOne() to 2 doesn't 
> time out for 5 seconds: 
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L346
> - By the time 3 receives votes, 1 has given up trying to connect to 3: 
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L247
> I'm using 3.4.5, but it looks like this part of the code hasn't changed for a 
> while, so I'm guessing later versions have the same issue.





[jira] [Assigned] (ZOOKEEPER-2164) fast leader election keeps failing

2016-06-09 Thread Martin Kuchta (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Kuchta reassigned ZOOKEEPER-2164:


Assignee: Martin Kuchta  (was: Hongchao Deng)

> fast leader election keeps failing
> --
>
> Key: ZOOKEEPER-2164
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection
>Affects Versions: 3.4.5
>Reporter: Michi Mutsuzaki
>Assignee: Martin Kuchta
> Fix For: 3.5.2, 3.6.0





Re: Unable to contribute on JIRA

2016-06-08 Thread Martin Kuchta
Confirming that everything seems to be working now. Thanks for the help, 
Patrick.

> On June 8, 2016 at 4:34 PM Marshall McMullen <marshall.mcmul...@gmail.com> 
> wrote:
> 
> Yep, it works now. I was able to assign the Jira to Martin without 
> problems now. Again, thanks.
> 
> On Wed, Jun 8, 2016 at 4:33 PM, Marshall McMullen 
> <marshall.mcmul...@gmail.com mailto:marshall.mcmul...@gmail.com > wrote:
> 
> > > Thank you very much for the assistance Patrick.
> > 
> > On Wed, Jun 8, 2016 at 4:32 PM, Patrick Hunt <ph...@apache.org 
> > mailto:ph...@apache.org > wrote:
> > 
> > > > > I've added Martin as a contributor, give it another try.
> > > 
> > > Patrick
> > > 
> > > On Wed, Jun 8, 2016 at 3:21 PM, Marshall McMullen <
> > > marshall.mcmul...@gmail.com 
> > > mailto:marshall.mcmul...@gmail.com > wrote:
> > > 
> > > > That makes sense. I would appreciate if a committer can 
> > > change Martin's
> > > > role to be contributor. Otherwise we'll reach out to the 
> > > Infra team to get
> > > > some assistance on that.
> > > >
> > > > Thanks!
> > > >
> > > > On Wed, Jun 8, 2016 at 4:04 PM, Michael Han 
> > > <h...@cloudera.com mailto:h...@cloudera.com > wrote:
> > > >
> > > > > I think someone (a committer probably only) just needs 
> > > make Martin as a
> > > > > 'contributor' role.
> > > > >
> > > > > The best way to contact Apache Infra is through their 
> > > Hipchat channel
> > > > > http://www.apache.org/dev/infra-contact
> > > > >
> > > > > On Wed, Jun 8, 2016 at 3:01 PM, Marshall McMullen <
> > > > > marshall.mcmul...@gmail.com 
> > > mailto:marshall.mcmul...@gmail.com > wrote:
> > > > >
> > > > > > Should Martin contact the "Apache Infrastructure Team" 
> > > regarding this?
> > > > If
> > > > > > so, how does he do that?
> > > > > >
> > > > > > On Wed, Jun 8, 2016 at 4:00 PM, Marshall McMullen <
> > > > > > marshall.mcmul...@gmail.com 
> > > mailto:marshall.mcmul...@gmail.com > wrote:
> > > > > >
> > > > > > > I tried to assign this Jira to him and got an error 
> > > message back:
> > > > > > >
> > > > > > > User 'makuchta' cannot be assigned issues.
> > > > > > >
> > > > > > > On Wed, Jun 8, 2016 at 3:58 PM, Michael Han 
> > > <h...@cloudera.com mailto:h...@cloudera.com >
> > > > wrote:
> > > > > > >
> > > > > > >> Martin,
> > > > > > >>
> > > > > > >> I had met similar issue earlier, here is an email 
> > > sent earlier to
> > > > dev
> > > > > > >> list:
> > > > > > >>
> > > > > > >> >>
> > > > > > >> FYI, I met an issue today that I can't attach files 
> > > to a JIRA issue
> > > > > with
> > > > > > >> the role of 'contributor'. Contacted Apache 
> > > Infrastructure team and
> > >     > > > >> confirmed that:
> > > > > > >>
> > > > > > >> - For a given JIRA issue, only *reporter*, or 
> > > *assignee*, or
> > > > > *committer*
> > > > > > >> can attach file.
> > > > > > >> - A contributor can only attach files to issues 
> > > that's assigned
> > > > and/or
> > > > > > >> reporting to the contributor.
> > > > > > >> - A workaround for a contributor to attach files to 
> > > any issue is to
> > 

Unable to contribute on JIRA

2016-06-08 Thread Martin Kuchta
Hi,

Does anyone know if I need to do anything special to have the ability to submit 
attachments and be assigned issues on JIRA? I was recently trying to submit a 
patch for ZOOKEEPER-2355 and realized the option was missing for me. It's not 
present on any other ZooKeeper JIRAs that I can see, although I can see it on 
JIRAs from other Apache projects.

I was working with Marshall McMullen to get the patch submitted, and our first 
thought was that the issue might need to be assigned to me, but even though he 
was able to reassign the issue, I was not a valid user to assign it to.

My account username is makuchta. I created it almost two weeks ago if that's of 
any relevance.


Thanks,

Martin

[jira] [Commented] (ZOOKEEPER-2355) Ephemeral node is never deleted if follower fails while reading the proposal packet

2016-06-08 Thread Martin Kuchta (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321459#comment-15321459
 ] 

Martin Kuchta commented on ZOOKEEPER-2355:
--

A search for the new test failures shows them as known flaky tests 
(ZOOKEEPER-1806, ZOOKEEPER-1807, ZOOKEEPER-2137), and they seem far removed 
from anything this patch touches. I haven't seen them at all locally.



> Ephemeral node is never deleted if follower fails while reading the proposal 
> packet
> ---
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>Assignee: Marshall McMullen
>Priority: Critical
> Fix For: 3.4.9
>
> Attachments: ZOOKEEPER-2355-01.patch, ZOOKEEPER-2355-02.patch, 
> ZOOKEEPER-2355-03.patch
>
>
> ZooKeeper ephemeral node is never deleted if follower fails while reading the 
> proposal packet
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster; let's say the nodes are A, B and 
> C. Start all, and assume A is the leader with B and C as followers
> # Connect to any of the server and create ephemeral node /e1
> # Close the session, ephemeral node /e1 will go for deletion
> # While receiving the delete proposal, make Follower B fail with 
> {{SocketTimeoutException}}. This is needed to reproduce the scenario; in a 
> production environment it happens because of a network fault.
> # Remove the fault, then check that the faulted Follower is now connected to 
> the quorum
> # Connect to any of the servers and create the same ephemeral node /e1; 
> creation succeeds.
> # Close the session,  ephemeral node /e1 will go for deletion
> # {color:red}/e1 is not deleted from the faulted Follower B, It should have 
> been deleted as it was again created with another session{color}
> # {color:green}/e1 is deleted from Leader A and other Follower C{color}





[jira] [Commented] (ZOOKEEPER-2355) Ephemeral node is never deleted if follower fails while reading the proposal packet

2016-06-08 Thread Martin Kuchta (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321334#comment-15321334
 ] 

Martin Kuchta commented on ZOOKEEPER-2355:
--

I've been playing with some variations on the proposed fix and trying to reason 
about what's actually going wrong. When syncing with the leader, of the three 
leader responses (DIFF, SNAP, TRUNC), I think there's only an issue with 
setting the last processed ZXID the way it's currently done in the DIFF case. 
In the SNAP and TRUNC cases, we've already deserialized the snapshot or 
truncated the log by the time setLastProcessedZxid is called. In the DIFF case, 
it's incorrect because we're setting the last processed ZXID as if we've 
already committed all the transactions we're about to receive, so a failure 
before that actually happens leaves us in an inconsistent state.

The logic in the patch of moving the call to setLastProcessedZxid to when the 
follower receives UPTODATE or NEWLEADER makes sense to me, but this isn't 
consistent with the behavior expected by some of the other unit tests.

I don't think setLastProcessedZxid needs to be explicitly called at all when 
the follower receives a DIFF message because we will update the last processed 
ZXID as we commit transactions received from the leader anyway. I do think it 
needs to be preserved as-is for SNAP and TRUNC to keep the currently expected 
behavior. Whether there are other problematic scenarios associated with how 
SNAP and TRUNC are processed can be investigated separately since there may 
still be cases where the last processed ZXID and the actual transaction log 
state are out of sync.
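A toy model of the behavior being proposed (illustrative only; the real code paths are in Learner and ZKDatabase, and the names here are made up): only SNAP and TRUNC may set the last processed ZXID eagerly, because their on-disk effect is already complete at that point, while under DIFF the ZXID advances only as each proposal actually commits.

```java
// Toy model of the proposed sync handling (illustrative; not the patch code).
public class SyncModel {
    enum SyncType { DIFF, SNAP, TRUNC }

    long lastProcessedZxid;

    void onSyncPacket(SyncType type, long packetZxid) {
        switch (type) {
            case SNAP:   // snapshot has already been deserialized here
            case TRUNC:  // log has already been truncated here
                lastProcessedZxid = packetZxid;
                break;
            case DIFF:
                // Deliberately nothing: the ZXID advances per committed
                // proposal, so a crash mid-DIFF cannot leave the ZXID
                // ahead of what the log actually contains.
                break;
        }
    }

    void commit(long zxid) {  // called as each DIFF proposal is committed
        lastProcessedZxid = zxid;
    }
}
```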

I'm submitting a modified version of the patch provided by [~arshad.mohammad]. 
The patch includes his original unit test which still fails against trunk and 
passes with the patch, but the changes to Learner.java are the slightly 
different ones that I'm proposing.

(I do have two unit tests failing locally that are also failing against trunk, 
so I think it's an unrelated issue with my environment that I'll need to look 
into when I get time. If that turns out to not be the case based on the Jenkins 
build, I'll investigate.)

> Ephemeral node is never deleted if follower fails while reading the proposal 
> packet
> ---
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>Assignee: Arshad Mohammad
>Priority: Critical
> Fix For: 3.4.9
>
> Attachments: ZOOKEEPER-2355-01.patch, ZOOKEEPER-2355-02.patch





[jira] [Commented] (ZOOKEEPER-2355) Ephemeral node is never deleted if follower fails while reading the proposal packet

2016-06-06 Thread Martin Kuchta (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316819#comment-15316819
 ] 

Martin Kuchta commented on ZOOKEEPER-2355:
--

[~arshad.mohammad]:

I'm looking into fixing this since I'm seeing the same issue. I understand the 
reasoning behind your fix, but it seems to be causing some other tests to fail 
consistently when applied to trunk. The Jenkins build is too old to view, but 
I'm guessing it failed for similar reasons. Were you seeing these failures and 
did you look at what was happening? I've only scratched the surface with 
investigating this bug and your patch, but I wanted to check to avoid repeating 
any work you had already done. I'll keep investigating to see if I can find a 
solution.

Failures listed below:

Zab1_0Test:
{noformat}
Testcase: testNormalFollowerRun took 4.198 sec
FAILED
expected:<4294967297> but was:<4294967296>
junit.framework.AssertionFailedError: expected:<4294967297> but was:<4294967296>
at 
org.apache.zookeeper.server.quorum.Zab1_0Test$4.converseWithFollower(Zab1_0Test.java:705)
at 
org.apache.zookeeper.server.quorum.Zab1_0Test.testFollowerConversation(Zab1_0Test.java:511)
at 
org.apache.zookeeper.server.quorum.Zab1_0Test.testNormalFollowerRun(Zab1_0Test.java:643)
at 
org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:79)

Testcase: testNormalFollowerRunWithDiff took 4.073 sec
FAILED
expected:<4294967298> but was:<4294967296>
junit.framework.AssertionFailedError: expected:<4294967298> but was:<4294967296>
at 
org.apache.zookeeper.server.quorum.Zab1_0Test$5.converseWithFollower(Zab1_0Test.java:847)
at 
org.apache.zookeeper.server.quorum.Zab1_0Test.testFollowerConversation(Zab1_0Test.java:511)
at 
org.apache.zookeeper.server.quorum.Zab1_0Test.testNormalFollowerRunWithDiff(Zab1_0Test.java:771)
at 
org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:79)

Testcase: testNormalObserverRun took 4.054 sec
FAILED
expected:<4294967298> but was:<4294967296>
junit.framework.AssertionFailedError: expected:<4294967298> but was:<4294967296>
at 
org.apache.zookeeper.server.quorum.Zab1_0Test$8.converseWithObserver(Zab1_0Test.java:1072)
at 
org.apache.zookeeper.server.quorum.Zab1_0Test.testObserverConversation(Zab1_0Test.java:562)
at 
org.apache.zookeeper.server.quorum.Zab1_0Test.testNormalObserverRun(Zab1_0Test.java:997)
at 
org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:79)

{noformat}

ZxidRolloverTest:
{noformat}
Testcase: testRolloverThenFollowerRestart took 23.677 sec
Caused an ERROR
KeeperErrorCode = ConnectionLoss for /foofoofoo-connected
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /foofoofoo-connected
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1846)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1874)
at 
org.apache.zookeeper.server.ZxidRolloverTest.checkClientConnected(ZxidRolloverTest.java:119)
at 
org.apache.zookeeper.server.ZxidRolloverTest.checkClientsConnected(ZxidRolloverTest.java:90)
at 
org.apache.zookeeper.server.ZxidRolloverTest.start(ZxidRolloverTest.java:165)
at 
org.apache.zookeeper.server.ZxidRolloverTest.testRolloverThenFollowerRestart(ZxidRolloverTest.java:345)
at 
org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:79)
{noformat}

WatchEventWhenAutoResetTest:
{noformat}
Testcase: testNodeChildrenChanged took 0.001 sec
Caused an ERROR
Timeout occurred. Please note the time in the report does not reflect the time 
until the timeout.
junit.framework.AssertionFailedError: Timeout occurred. Please note the time in 
the report does not reflect the time until the timeout.
{noformat}

> Ephemeral node is never deleted if follower fails while reading the proposal 
> packet
> ---
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>Assignee: Arshad Mohammad
>Priority: Critical
> Fix For: 3.4.9
>
> Attachments: ZOOKEEPER-2355-01.patch, ZOOKEEPER-2355-02.patch
>
>
> ZooKeeper ephemeral node is never deleted if follower fail while reading the 
> proposal packet
> The scenario is as follows:
> # Configure three node Zo

[jira] [Comment Edited] (ZOOKEEPER-1485) client xid overflow is not handled

2016-06-03 Thread Martin Kuchta (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314712#comment-15314712
 ] 

Martin Kuchta edited comment on ZOOKEEPER-1485 at 6/3/16 8:12 PM:
--

That makes sense. I probably should have brushed up on my knowledge of the Java 
client and ZooKeeper states in general before asking, but thanks for the 
clarification.

With that information, I think having the server expire a client's session when 
it receives a negative XID would fix the issue in the Java client. The issue 
would remain in the C client because XID initialization is still tied to the 
start of the process and not the ZooKeeper handle initialization. I can see two 
potential fixes for the overall problem.

a) Expire the session upon receiving a negative XID. Modify the C client to 
make the XID a member of the zhandle struct to more closely mirror the Java 
client. Initialize the new XID struct member in zookeeper_init_internal.

b) Modify both clients to wrap the XID around as described above.

These aren't mutually exclusive either, but I think only one should be needed. 
Option (a) would ensure XID uniqueness for all requests made by a single 
connection (not currently done because XIDs will be reused when they wrap all 
the way around to the starting value). Option (b) is a simpler (and probably 
lower risk) fix, being a small value check in the client code only.


was (Author: makuchta):
That makes sense. I probably should have brushed up on my knowledge of the Java 
client and ZooKeeper states in general before asking, but thanks for the 
clarification.

With that information, I think having the server expire a client's session when 
it receives a negative XID would fix the issue in the Java client. The issue 
would remain in the C client because XID initialization is still tied to the 
start of the process and not the ZooKeeper handle initialization. I can see two 
potential fixes for the overall problem.

a) Expire the session upon receiving a negative XID. Modify the C client to 
make the XID a member of the zhandle struct to more closely mirror the Java 
client. Initialize the new XID struct member in zookeeper_init_internal.

b) Modify both clients to wrap the XID around as described above.

These aren't mutually exclusive either. Option (a) would ensure XID uniqueness 
for all requests made by a single connection (not currently done because XIDs 
will be reused when they wrap all the way around to the starting value). Option 
(b) is a simpler (and probably lower risk) fix, being a small value check in 
the client code only.
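Option (b) above can be sketched concretely. The class below is illustrative only, not the actual ClientCnxn internals; the name `XidGenerator` and the extra seeding constructor are invented here so the wrap can be exercised without 2^31 calls. The key point is that the counter wraps back to 1 rather than overflowing into the negative range reserved for special packets.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class XidGenerator {
    private final AtomicInteger xid;

    public XidGenerator() {
        this(1);
    }

    // Seeding constructor (for illustration/testing only) so the wrap
    // behavior can be observed without incrementing 2^31 times.
    public XidGenerator(int start) {
        this.xid = new AtomicInteger(start);
    }

    /**
     * Returns the next xid. Instead of overflowing past Integer.MAX_VALUE
     * into the negative range (where -2 is ping, -4 is auth, etc.), the
     * counter wraps back around to 1.
     */
    public int nextXid() {
        while (true) {
            int current = xid.get();
            int next = (current == Integer.MAX_VALUE) ? 1 : current + 1;
            if (xid.compareAndSet(current, next)) {
                return current;
            }
        }
    }

    public static void main(String[] args) {
        XidGenerator g = new XidGenerator(Integer.MAX_VALUE);
        System.out.println(g.nextXid()); // 2147483647
        System.out.println(g.nextXid()); // 1, not Integer.MIN_VALUE
    }
}
```

As noted above, this only guarantees uniqueness until the counter wraps all the way around; it does not, by itself, detect a stale in-flight request reusing an old xid.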

> client xid overflow is not handled
> --
>
> Key: ZOOKEEPER-1485
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1485
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, java client
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Michi Mutsuzaki
>Assignee: Bruce Gao
>
> Both Java and C clients use signed 32-bit int as XIDs. XIDs are assumed to be 
> non-negative, and zookeeper uses some negative values as special XIDs (e.g. 
> -2 for ping, -4 for auth). However, neither Java nor C client ensures the 
> XIDs it generates are non-negative, and the server doesn't reject negative 
> XIDs.
> Pat had some suggestions on how to fix this:
> - (bin-compat) Expire the session when the client sends a negative XID.
> - (bin-incompat) In addition to expiring the session, use 64-bit int for XID 
> so that overflow will practically never happen.
> --Michi
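The overflow the quoted description refers to is plain two's-complement wrap-around; a minimal demonstration (illustrative, not client code):

```java
public class XidOverflowDemo {
    public static void main(String[] args) {
        int xid = Integer.MAX_VALUE;  // 2147483647, the largest legal xid
        xid++;                        // signed 32-bit overflow wraps around
        System.out.println(xid);      // -2147483648 (Integer.MIN_VALUE)
        // From here, continued increments march up through the negative
        // range, eventually passing the reserved special values such as
        // -4 (auth) and -2 (ping).
    }
}
```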





[jira] [Commented] (ZOOKEEPER-1485) client xid overflow is not handled

2016-05-31 Thread Martin Kuchta (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308091#comment-15308091
 ] 

Martin Kuchta commented on ZOOKEEPER-1485:
--

[~fpj], thanks for providing your thoughts.

To clarify a bit on the issue I had with just expiring the session:

In the C client, the only time the xid variable is initialized is when the 
client process starts. Expiring the session would force the client to 
reconnect, but I can't see how that would reset the xid since its lifetime is 
tied to the process. In the Java client, the xid is a field in ClientCnxn, 
which is a final field in the ZooKeeper class. Does session expiration force 
you to construct a new ZooKeeper object? My understanding of this is probably 
incomplete, but that's the angle I was looking at it from. 

That's not to say that expiring the session is wrong if the client sends a 
request with an invalid XID, but I still think the client itself needs to 
handle overflow in some way.

> client xid overflow is not handled
> --
>
> Key: ZOOKEEPER-1485
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1485
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, java client
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Michi Mutsuzaki
>Assignee: Bruce Gao
>
> Both Java and C clients use signed 32-bit int as XIDs. XIDs are assumed to be 
> non-negative, and zookeeper uses some negative values as special XIDs (e.g. 
> -2 for ping, -4 for auth). However, neither Java nor C client ensures the 
> XIDs it generates are non-negative, and the server doesn't reject negative 
> XIDs.
> Pat had some suggestions on how to fix this:
> - (bin-compat) Expire the session when the client sends a negative XID.
> - (bin-incompat) In addition to expiring the session, use 64-bit int for XID 
> so that overflow will practically never happen.
> --Michi





[jira] [Commented] (ZOOKEEPER-2318) segfault in auth_completion_func

2016-05-26 Thread Martin Kuchta (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302709#comment-15302709
 ] 

Martin Kuchta commented on ZOOKEEPER-2318:
--

This segfault seems to be a symptom of 
[ZOOKEEPER-1485|https://issues.apache.org/jira/browse/ZOOKEEPER-1485].

I reproduced the same log error messages and backtrace by modifying the client 
xid initialization to set it to a negative value close to the special XIDs and 
performing a few simple operations from the C CLI. When the client sends a 
request with an xid of -4, it treats the response it receives as an auth 
response and accesses zh->auth_h.auth->scheme, which was never set.
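The failure mode generalizes: the response dispatcher matches the special xids before checking whether the client actually sent a request with that xid, so a wrapped client xid of -4 is misread as an auth response. A minimal sketch of the dispatch shape (illustrative names only, not the actual C client code):

```java
public class XidDispatchDemo {
    static final int AUTH_XID = -4; // reserved for auth packets
    static final int PING_XID = -2; // reserved for ping packets

    /**
     * Mirrors the dispatch pattern: special xids are matched first, with no
     * check that a matching special request is actually outstanding.
     */
    static String classify(int responseXid) {
        if (responseXid == AUTH_XID) return "auth";
        if (responseXid == PING_XID) return "ping";
        return "regular";
    }

    public static void main(String[] args) {
        // An xid of -4 produced by counter overflow, not by an auth request:
        int wrappedClientXid = -4;
        System.out.println(classify(wrappedClientXid)); // prints "auth"
        // The client then reads auth state (zh->auth_h.auth) that was
        // never initialized, which is the reported segfault.
    }
}
```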

> segfault in auth_completion_func
> 
>
> Key: ZOOKEEPER-2318
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2318
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.5.0
>Reporter: Marshall McMullen
>
> We have seen some sporadic issues with unexplained segfaults inside 
> auth_completion_func. The interesting thing is we are not using any auth 
> mechanism at all. This happened against this version of the code:
> svn.apache.org/repos/asf/zookeeper/trunk@1547702
> Here's the stacktrace we are seeing:
> {code}
> Thread 1 (Thread 0x7f21d13ff700 ? (LWP 5230)):
> #0  0x7f21efff42f0 in auth_completion_func (rc=0, zh=0x7f21e7470800) at 
> src/zookeeper.c:1696
> #1  0x7f21efff7898 in zookeeper_process (zh=0x7f21e7470800, events=2) at 
> src/zookeeper.c:2708
> #2  0x7f21f0006583 in do_io (v=0x7f21e7470800) at src/mt_adaptor.c:440
> #3  0x7f21eeab7e9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #4  0x7f21ed1803fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #5  0x in ?? ()
> {code}
> The offending line in our case is:
> 1696LOG_INFO(LOGCALLBACK(zh), "Authentication scheme %s 
> succeeded", zh->auth_h.auth->scheme);
> It must be the case that zh->auth_h.auth is NULL for this to happen since the 
> code path returns if zh is NULL.
> Interesting log messages around this time:
> {code}
> Socket [10.170.243.7:2181] zk retcode=-2, errno=115(Operation now in 
> progress): unexpected server response: expected 0xfff9, but received 
> 0xfff8
> Priming connection to [10.170.243.4:2181]: last_zxid=0x370eb4d
> initiated connection to server [10.170.243.4:2181]
> Oct 13 12:03:21.273384 zookeeper - INFO  
> [NIOServerCxnFactory.AcceptThread:/10.170.243.4:2181:NIOServerCnxnFactory$AcceptThread@296]
>  - Accepted socket connection from /10.170.243.4:48523
> Oct 13 12:03:21.274321 zookeeper - WARN  
> [NIOWorkerThread-24:ZooKeeperServer@822] - Connection request from old client 
> /10.170.243.4:48523; will be dropped if server is in r-o mode
> Oct 13 12:03:21.274452 zookeeper - INFO  
> [NIOWorkerThread-24:ZooKeeperServer@869] - Client attempting to renew session 
> 0x311596d004a at /10.170.243.4:48523; client last zxid is 0x30370eb4d; 
> server last zxid is 0x30370eb4d
> Oct 13 12:03:21.274584 zookeeper - INFO  [NIOWorkerThread-24:Learner@115] - 
> Revalidating client: 0x311596d004a
> session establishment complete on server [10.170.243.4:2181], 
> sessionId=0x311596d004a, negotiated timeout=2
> Oct 13 12:03:21.275693 zookeeper - INFO  
> [QuorumPeer[myid=1]/10.170.243.4:2181:ZooKeeperServer@611] - Established 
> session 0x311596d004a with negotiated timeout 2 for client 
> /10.170.243.4:48523
> Oct 13 12:03:24.229590 zookeeper - WARN  
> [NIOWorkerThread-8:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x311596d004a, likely client has closed socket
> Oct 13 12:03:24.230018 zookeeper - INFO  
> [NIOWorkerThread-8:NIOServerCnxn@999] - Closed socket connection for client 
> /10.170.243.4:48523 which had sessionid 0x311596d004a
> Oct 13 12:03:24.230257 zookeeper - WARN  
> [NIOWorkerThread-19:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x12743aa0001, likely client has closed socket
> {code}





[jira] [Comment Edited] (ZOOKEEPER-2318) segfault in auth_completion_func

2016-05-26 Thread Martin Kuchta (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302709#comment-15302709
 ] 

Martin Kuchta edited comment on ZOOKEEPER-2318 at 5/26/16 7:09 PM:
---

This segfault seems to be a symptom of 
[ZOOKEEPER-1485|https://issues.apache.org/jira/browse/ZOOKEEPER-1485].

I reproduced the same log error messages and backtrace by modifying the client 
xid initialization to set it to a negative value close to the special XIDs and 
performing a few simple operations from the c cli. When the client sends a 
request with an xid of \-4, it treats the response it receives as an auth 
response and accesses zh->auth_h.auth->scheme, which was never set.


was (Author: makuchta):
This segfault seems to be a symptom of 
[ZOOKEEPER-1485|https://issues.apache.org/jira/browse/ZOOKEEPER-1485].

I reproduced the same log error messages and backtrace by modifying the client 
xid initialization to set it to a negative value close to the special XIDs and 
performing a few simple operations from the c cli. When the client sends a 
request with an xid of -4, it treats the response it receives as an auth 
response and accesses zh->auth_h.auth->scheme, which was never set.

> segfault in auth_completion_func
> 
>
> Key: ZOOKEEPER-2318
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2318
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.5.0
>Reporter: Marshall McMullen
>
> We have seen some sporadic issues with unexplained segfaults inside 
> auth_completion_func. The interesting thing is we are not using any auth 
> mechanism at all. This happened against this version of the code:
> svn.apache.org/repos/asf/zookeeper/trunk@1547702
> Here's the stacktrace we are seeing:
> {code}
> Thread 1 (Thread 0x7f21d13ff700 ? (LWP 5230)):
> #0  0x7f21efff42f0 in auth_completion_func (rc=0, zh=0x7f21e7470800) at 
> src/zookeeper.c:1696
> #1  0x7f21efff7898 in zookeeper_process (zh=0x7f21e7470800, events=2) at 
> src/zookeeper.c:2708
> #2  0x7f21f0006583 in do_io (v=0x7f21e7470800) at src/mt_adaptor.c:440
> #3  0x7f21eeab7e9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #4  0x7f21ed1803fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #5  0x in ?? ()
> {code}
> The offending line in our case is:
> 1696LOG_INFO(LOGCALLBACK(zh), "Authentication scheme %s 
> succeeded", zh->auth_h.auth->scheme);
> It must be the case that zh->auth_h.auth is NULL for this to happen since the 
> code path returns if zh is NULL.
> Interesting log messages around this time:
> {code}
> Socket [10.170.243.7:2181] zk retcode=-2, errno=115(Operation now in 
> progress): unexpected server response: expected 0xfff9, but received 
> 0xfff8
> Priming connection to [10.170.243.4:2181]: last_zxid=0x370eb4d
> initiated connection to server [10.170.243.4:2181]
> Oct 13 12:03:21.273384 zookeeper - INFO  
> [NIOServerCxnFactory.AcceptThread:/10.170.243.4:2181:NIOServerCnxnFactory$AcceptThread@296]
>  - Accepted socket connection from /10.170.243.4:48523
> Oct 13 12:03:21.274321 zookeeper - WARN  
> [NIOWorkerThread-24:ZooKeeperServer@822] - Connection request from old client 
> /10.170.243.4:48523; will be dropped if server is in r-o mode
> Oct 13 12:03:21.274452 zookeeper - INFO  
> [NIOWorkerThread-24:ZooKeeperServer@869] - Client attempting to renew session 
> 0x311596d004a at /10.170.243.4:48523; client last zxid is 0x30370eb4d; 
> server last zxid is 0x30370eb4d
> Oct 13 12:03:21.274584 zookeeper - INFO  [NIOWorkerThread-24:Learner@115] - 
> Revalidating client: 0x311596d004a
> session establishment complete on server [10.170.243.4:2181], 
> sessionId=0x311596d004a, negotiated timeout=2
> Oct 13 12:03:21.275693 zookeeper - INFO  
> [QuorumPeer[myid=1]/10.170.243.4:2181:ZooKeeperServer@611] - Established 
> session 0x311596d004a with negotiated timeout 2 for client 
> /10.170.243.4:48523
> Oct 13 12:03:24.229590 zookeeper - WARN  
> [NIOWorkerThread-8:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x311596d004a, likely client has closed socket
> Oct 13 12:03:24.230018 zookeeper - INFO  
> [NIOWorkerThread-8:NIOServerCnxn@999] - Closed socket connection for client 
> /10.170.243.4:48523 which had sessionid 0x311596d004a
> Oct 13 12:03:24.230257 zookeeper - WARN  
> [NIOWorkerThread-19:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x12743aa0001, likely client has closed socket
> {code}





Re: Compilation error: syntax error in VERSION script

2016-02-20 Thread will martin
google didn’t author those pages.

> On Feb 20, 2016, at 9:00 PM, Patrick Hunt  wrote:
> 
> I've never seen this, but google seems to indicate that you might have
> grep options specified and it could be interfering with libtool?
> 
> https://www.google.com/webhp?sourceid=chrome-instant=1=2=UTF-8#q=%22ignoring%20invalid%20character%20%60%5C033%27%20in%20script%22
> 
> Patrick
> 
> 
> On Sat, Feb 20, 2016 at 5:30 PM, John Elaine  wrote:
>> Hi,
>> 
>> I am trying to compile the C bindings for Zookeeper (uses libtool), but I
>> am unable to do so. When I run make, I receive the following error:
>> 
>> /usr/bin/ld:.libs/libzookeeper_st.ver:2: ignoring invalid character
>> `\033' in script/usr/bin/ld:.libs/libzookeeper_st.ver:2: ignoring
>> invalid character `3' in
>> script/usr/bin/ld:.libs/libzookeeper_st.ver:2: ignoring invalid
>> character `5' in script/usr/bin/ld:.libs/libzookeeper_st.ver:2: syntax
>> error in VERSION script
>> collect2: error: ld returned 1 exit status
>> 
>> The detailed output of make command is available here:
>> http://stackoverflow.com/questions/35530964/zookeeper-compiling-error-syntax-error-in-version-script
>> 
>> Please let me know how I can resolve this issue.
>> 
>> - John



[jira] [Commented] (ZOOKEEPER-1460) IPv6 literal address not supported for quorum members

2014-07-14 Thread Dr. Martin Menzel (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061665#comment-14061665
 ] 

Dr. Martin Menzel commented on ZOOKEEPER-1460:
--

We should use a typical IPv6 address/port configuration like 

server.1=[1234::f4b5:3fff:fe0f:e96e]:2888:3888 

in the case of literal IPv6 addresses. I think this would be handier than 
splitting the configuration into several different properties (host/port) as in 
the client case.

From my point of view this issue is not an IPv6 showstopper, because 
especially in the IPv6 case there are usually hostnames defined in DNS or 
/etc/hosts. If we use hostnames, the configuration is not a problem.
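The bracketed form proposed above can be parsed without ambiguity, since the brackets delimit the colons inside the IPv6 literal. A sketch for a single host:port pair (illustrative only, not the actual QuorumPeer code; the class name is invented here):

```java
import java.net.InetSocketAddress;

public class ServerAddressParser {
    /**
     * Parses "host:port", where host may be a bracketed IPv6 literal such as
     * "[1234::f4b5:3fff:fe0f:e96e]:2888". A plain split(":") breaks on IPv6
     * literals because the address itself contains colons.
     */
    public static InetSocketAddress parse(String value) {
        String host;
        String portPart;
        if (value.startsWith("[")) {
            // Bracketed IPv6 literal: host is everything inside [ ].
            int close = value.indexOf(']');
            if (close < 0 || close + 1 >= value.length()
                    || value.charAt(close + 1) != ':') {
                throw new IllegalArgumentException("bad address: " + value);
            }
            host = value.substring(1, close);
            portPart = value.substring(close + 2);
        } else {
            // Hostname or IPv4 literal: split on the last colon.
            int colon = value.lastIndexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("bad address: " + value);
            }
            host = value.substring(0, colon);
            portPart = value.substring(colon + 1);
        }
        return new InetSocketAddress(host, Integer.parseInt(portPart));
    }
}
```

A real server.N entry carries two or three ports (quorum:election[;client]); the same bracket-aware scan would be applied before splitting the remaining colon-separated ports.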

 IPv6 literal address not supported for quorum members
 -

 Key: ZOOKEEPER-1460
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1460
 Project: ZooKeeper
  Issue Type: Bug
  Components: quorum
Affects Versions: 3.4.3
Reporter: Chris Dolan
Assignee: Thawan Kooburat

 Via code inspection, I see that the server.nnn configuration key does not 
 support literal IPv6 addresses because the property value is split on :. In 
 v3.4.3, the problem is in QuorumPeerConfig:
 {noformat}
 String parts[] = value.split(":");
 InetSocketAddress addr = new InetSocketAddress(parts[0],
 Integer.parseInt(parts[1]));
 {noformat}
 In the current trunk 
 (http://svn.apache.org/viewvc/zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java?view=markup)
  this code has been refactored into QuorumPeer.QuorumServer, but the bug 
 remains:
 {noformat}
 String serverClientParts[] = addressStr.split(";");
 String serverParts[] = serverClientParts[0].split(":");
 addr = new InetSocketAddress(serverParts[0],
 Integer.parseInt(serverParts[1]));
 {noformat}
 This bug probably affects very few users because most will naturally use a 
 hostname rather than a literal IP address. But given that IPv6 addresses are 
 supported for clients via ZOOKEEPER-667 it seems that server support should 
 be fixed too.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Survey on Project Conventions

2014-05-20 Thread Martin Brandtner
Hello

My name is Martin Brandtner [1] and I’m a software engineering researcher
at the University of Zurich, Switzerland.
Together with Philipp Leitner [2], I currently work on an approach to
detect violations of project conventions based on data from the source code
repository, the issue tracker (e.g. Jira), and the build system (e.g.
Jenkins).

One example for such a project convention is: “You need to make sure that
the commit message contains at least the name of the contributor and
ideally a reference to the Bugzilla or JIRA issue where the patch was
submitted.” [3]

The idea is that our approach can detect violation of such a convention
automatically and therefore support the development process.

First of all we need conventions and that’s why we ask you to take part in
our survey. In the survey, we present five conventions and want you to rate
their relevance in your Apache project. Everybody contributing to your
Apache project can take part in this survey because we also want to see if
different roles may have different opinions about a convention.
The survey is totally anonymous and it will take about 15 minutes to answer
it.

We would be happy if you could fill out our survey under:
http://ww3.unipark.de/uc/SEAL_Research/1abe/ before May 30, 2014.

With the data collected in this survey we will implement a convention
violation detection in our tool called SQA-Timeline [4]. If you are
interested in our work, contact us via email or provide your email address
in the survey.

Best regards,
Martin and Philipp

[1] http://www.ifi.uzh.ch/seal/people/brandtner.html
[2] http://www.ifi.uzh.ch/seal/people/leitner.html
[3] http://www.apache.org/dev/committers.html#applying-patches
[4] https://www.youtube.com/watch?v=ZIsOODUapAE


[jira] [Commented] (ZOOKEEPER-978) ZookeeperServer does not close zk database on shutdwon

2011-04-11 Thread Martin Serrano (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13018462#comment-13018462
 ] 

Martin Serrano commented on ZOOKEEPER-978:
--

I encountered the same problem.  A simple patch:

{noformat}
### Eclipse Workspace Patch 1.0
#P Attivio - App
Index: src/org/apache/zookeeper/server/ZooKeeperServerMain.java
===
--- src/org/apache/zookeeper/server/ZooKeeperServerMain.java(revision 40109)
+++ src/org/apache/zookeeper/server/ZooKeeperServerMain.java(working copy)
@@ -111,6 +111,7 @@
   if (zkServer.isRunning()) {
 zkServer.shutdown();
   }
+  ftxn.close();
 } catch (InterruptedException e) {
   // warn, but generally this is ok
   LOG.warn("Server interrupted", e);

{noformat}
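One refinement to the patch above: placing the close in a finally block guarantees the transaction log is closed even if shutdown throws. The sketch below shows only the shape of that change with stand-in names ("TxnLog" here is a minimal invented stand-in, not the real FileTxnSnapLog):

```java
public class CloseInFinallyDemo {
    /** Minimal stand-in for the transaction log resource. */
    static class TxnLog {
        boolean closed = false;
        void close() { closed = true; }
    }

    /**
     * Runs shutdown, then closes the log in a finally block so its file
     * handles are released even when shutdown exits abnormally.
     */
    static void shutdownAndClose(Runnable shutdown, TxnLog log) {
        try {
            shutdown.run();
        } finally {
            log.close();
        }
    }

    public static void main(String[] args) {
        TxnLog log = new TxnLog();
        try {
            shutdownAndClose(() -> { throw new IllegalStateException("boom"); }, log);
        } catch (IllegalStateException expected) {
            // shutdown failed, but the log was still closed below
        }
        System.out.println(log.closed); // true: closed despite the exception
    }
}
```

This matters for the Windows failure mentioned in the issue: open log file handles are exactly what prevents test cleanup from deleting the files.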

 ZookeeperServer does not close zk database on shutdwon
 --

 Key: ZOOKEEPER-978
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-978
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.2
Reporter: Sergei Bobovich

 ZookeeperServer does not close zk database on shutdown leaving log files 
 open. Not sure if this is an intention, but looks like a possible bug to me. 
 Database is getting closed only from QuorumPeer class. 
 Hit this when executing regression tests on Windows: cleanup failed to delete 
 log files.
  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira