[jira] [Resolved] (HADOOP-17312) S3AInputStream to be resilient to failures in abort(); translate AWS Exceptions

2020-10-30 Thread Yongjun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang resolved HADOOP-17312.

Resolution: Duplicate

> S3AInputStream to be resilient to failures in abort(); translate AWS Exceptions
> --
>
> Key: HADOOP-17312
> URL: https://issues.apache.org/jira/browse/HADOOP-17312
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 3.3.0, 3.2.1
>Reporter: Steve Loughran
>Priority: Major
>
> A Stack Overflow question complains about a ConnectionClosedException during 
> S3AInputStream close(), seemingly triggered by an EOF exception in abort(). 
> That is: we are trying to close the stream and it is failing because the 
> stream is already closed. Oops.
> https://stackoverflow.com/questions/64412010/pyspark-org-apache-http-connectionclosedexception-premature-end-of-content-leng
> Looking at the stack, we aren't translating AWS exceptions in abort() to 
> IOEs, which may be a factor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org



[jira] [Created] (HADOOP-17338) Intermittent S3AInputStream failures: Premature end of Content-Length delimited message body etc

2020-10-30 Thread Yongjun Zhang (Jira)
Yongjun Zhang created HADOOP-17338:
--

 Summary: Intermittent S3AInputStream failures: Premature end of 
Content-Length delimited message body etc
 Key: HADOOP-17338
 URL: https://issues.apache.org/jira/browse/HADOOP-17338
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs/s3
Affects Versions: 3.3.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang


We are seeing the following exceptions intermittently when using S3AInputStream 
(see Symptoms at the bottom).

Inspired by
https://stackoverflow.com/questions/9952815/s3-java-client-fails-a-lot-with-premature-end-of-content-length-delimited-messa
and
https://forums.aws.amazon.com/thread.jspa?threadID=83326, we arrived at a 
solution that has helped us, and would like to contribute the fix to the 
community version.

The problem is that S3AInputStream held only a short-lived reference to the 
S3Object used to create wrappedStream. That object could be garbage collected 
at a random time, which closed the stream and caused the symptoms reported.

https://github.com/aws/aws-sdk-java/blob/1.11.295/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/model/S3Object.java#L225
is the AWS SDK code that closes the stream when the S3Object is garbage 
collected.

Here is the code in S3AInputStream that creates the temporary S3Object and 
uses it to create wrappedStream:

{code}
S3Object object = Invoker.once(text, uri,
    () -> client.getObject(request));

changeTracker.processResponse(object, operation, targetPos);
wrappedStream = object.getObjectContent();
{code}
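As a self-contained illustration of the fix direction (the class names here are hypothetical, not the actual S3A patch): keep a strong reference to the owning object for as long as the wrapped stream is in use, so the owner stays reachable and its close-on-finalize logic cannot run mid-read.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the wrapper pins the owner (think: the SDK's S3Object)
// in memory until the stream is closed, so the owner cannot be garbage
// collected -- and cannot close the underlying connection -- while reads are
// still in flight.
class GcPinnedStream extends FilterInputStream {
    private Object owner; // strong reference keeps the owner reachable

    GcPinnedStream(InputStream wrapped, Object owner) {
        super(wrapped);
        this.owner = owner;
    }

    @Override
    public void close() throws IOException {
        super.close();
        owner = null; // only now may the owner be collected
    }
}

public class Demo {
    public static void main(String[] args) throws IOException {
        Object fakeS3Object = new Object(); // stands in for the SDK's S3Object
        InputStream content =
            new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        try (InputStream in = new GcPinnedStream(content, fakeS3Object)) {
            byte[] buf = new byte[5];
            int n = in.read(buf);
            System.out.println(n + ":" + new String(buf, StandardCharsets.UTF_8));
        }
    }
}
```

In S3AInputStream the analogous change would be to keep the S3Object reachable (for example, in a field next to wrappedStream) until the stream is closed.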

Symptoms:

1.
{code}
Caused by: com.amazonaws.thirdparty.apache.http.ConnectionClosedException: 
Premature end of Content-Length delimited message body (expected: 156463674; 
received: 150001089)
at 
com.amazonaws.thirdparty.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at 
com.amazonaws.thirdparty.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135)
at 
com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at 
com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at 
com.amazonaws.services.s3.internal.S3AbortableInputStream.read(S3AbortableInputStream.java:125)
at 
com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at 
com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at 
com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at 
com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at 
com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:107)
at 
com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:181)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at 
org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:779)
at 
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:511)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:130)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:208)
at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:63)
at 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
... 15 more
{code}

2.
{code}
Caused by: javax.net.ssl.SSLException: SSL peer shut down incorrectly
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:596)
at sun.security.ssl.InputRecord.read(InputRecord.java:532)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:990)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:948)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at 
com.amazonaws.thirdparty.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at 
com.amazonaws.thirdparty.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:198)
at 
com.amazonaws.thirdparty.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176)
at 
com.amazonaws.thirdparty.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135)
at 

[jira] [Created] (HADOOP-15720) rpcTimeout may not have been applied correctly

2018-09-04 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-15720:
--

 Summary: rpcTimeout may not have been applied correctly
 Key: HADOOP-15720
 URL: https://issues.apache.org/jira/browse/HADOOP-15720
 Project: Hadoop Common
  Issue Type: Bug
  Components: common
Reporter: Yongjun Zhang


org.apache.hadoop.ipc.Client sends multiple RPC calls to the server 
synchronously via the same connection, as in the following synchronized code 
block:
{code:java}
  synchronized (sendRpcRequestLock) {
Future senderFuture = sendParamsExecutor.submit(new Runnable() {
  @Override
  public void run() {
try {
  synchronized (Connection.this.out) {
if (shouldCloseConnection.get()) {
  return;
}

if (LOG.isDebugEnabled()) {
  LOG.debug(getName() + " sending #" + call.id
  + " " + call.rpcRequest);
}
 
byte[] data = d.getData();
int totalLength = d.getLength();
out.writeInt(totalLength); // Total Length
out.write(data, 0, totalLength);// RpcRequestHeader + RpcRequest
out.flush();
  }
} catch (IOException e) {
  // exception at this point would leave the connection in an
  // unrecoverable state (eg half a call left on the wire).
  // So, close the connection, killing any outstanding calls
  markClosed(e);
} finally {
  //the buffer is just an in-memory buffer, but it is still polite to
  // close early
  IOUtils.closeStream(d);
}
  }
});
  
try {
  senderFuture.get();
} catch (ExecutionException e) {
  Throwable cause = e.getCause();
  
  // cause should only be a RuntimeException as the Runnable above
  // catches IOException
  if (cause instanceof RuntimeException) {
throw (RuntimeException) cause;
  } else {
throw new RuntimeException("unexpected checked exception", cause);
  }
}
  }
{code}
It then waits for the result asynchronously via
{code:java}
/* Receive a response.
 * Because only one receiver, so no synchronization on in.
 */
private void receiveRpcResponse() {
  if (shouldCloseConnection.get()) {
return;
  }
  touch();
  
  try {
int totalLen = in.readInt();
RpcResponseHeaderProto header = 
RpcResponseHeaderProto.parseDelimitedFrom(in);
checkResponse(header);

int headerLen = header.getSerializedSize();
headerLen += CodedOutputStream.computeRawVarint32Size(headerLen);

int callId = header.getCallId();
if (LOG.isDebugEnabled())
  LOG.debug(getName() + " got value #" + callId);

Call call = calls.get(callId);
RpcStatusProto status = header.getStatus();
..
{code}
However, the responses handled by {{receiveRpcResponse()}} above may arrive in 
any order.

The following code
{code:java}
int totalLen = in.readInt();
{code}
eventually calls one of the following two methods, in which the accumulated 
waiting time is checked against rpcTimeout:
{code:java}
  /** Read a byte from the stream.
   * Send a ping if timeout on read. Retries if no failure is detected
   * until a byte is read.
   * @throws IOException for any IO problem other than socket timeout
   */
  @Override
  public int read() throws IOException {
int waiting = 0;
do {
  try {
return super.read();
  } catch (SocketTimeoutException e) {
waiting += soTimeout;
handleTimeout(e, waiting);
  }
} while (true);
  }

  /** Read bytes into a buffer starting from offset off
   * Send a ping if timeout on read. Retries if no failure is detected
   * until a byte is read.
   * 
   * @return the total number of bytes read; -1 if the connection is closed.
   */
  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
int waiting = 0;
do {
  try {
return super.read(buf, off, len);
  } catch (SocketTimeoutException e) {
waiting += soTimeout;
handleTimeout(e, waiting);
  }
} while (true);
  }
{code}
But the waiting time is initialized to 0 for each of the above read calls, so 
each call can take up to rpcTimeout, and the effective time to time out a whole 
RPC is cumulative.

For example, if the client issues call1 and call2 and then waits for results: 
if call1 took (rpcTimeout - 1), there is no timeout; if call2 then took 
(rpcTimeout - 1), there is still no timeout, but it 
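A sketch of the alternative behaviour implied above (the constants and method names are illustrative, not Hadoop's actual code): let the accumulated wait persist across low-level read attempts, so the whole response is bounded by rpcTimeout rather than rpcTimeout per read() call.

```java
import java.net.SocketTimeoutException;

// Illustrative sketch: "waiting" is NOT reset between attempts of one logical
// read, so the loop gives up once the whole-response budget is exhausted.
public class TimeoutDemo {
    static final int SO_TIMEOUT = 1000;  // ms per low-level read attempt
    static final int RPC_TIMEOUT = 3000; // ms budget for the whole response

    // Simulated low-level read that always times out.
    static int rawRead() throws SocketTimeoutException {
        throw new SocketTimeoutException("simulated socket timeout");
    }

    static int readWithBudget() throws SocketTimeoutException {
        int waiting = 0; // accumulated across all attempts of this read
        while (true) {
            try {
                return rawRead();
            } catch (SocketTimeoutException e) {
                waiting += SO_TIMEOUT;
                if (waiting >= RPC_TIMEOUT) {
                    throw e; // total wait exceeded the budget: fail the call
                }
                // otherwise retry (the real client would send a ping here)
            }
        }
    }

    public static void main(String[] args) {
        try {
            readWithBudget();
        } catch (SocketTimeoutException e) {
            System.out.println("timed out after budget exhausted");
        }
    }
}
```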

[jira] [Created] (HADOOP-15590) Two gpg related errors when doing hadoop release

2018-07-09 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-15590:
--

 Summary: Two gpg related errors when doing hadoop release
 Key: HADOOP-15590
 URL: https://issues.apache.org/jira/browse/HADOOP-15590
 Project: Hadoop Common
  Issue Type: Bug
Reporter: Yongjun Zhang


When doing the 3.0.3 release by running the command

dev-support/bin/create-release --asfrelease --docker --dockercache

documented at

https://wiki.apache.org/hadoop/HowToRelease

I hit the following problems:

1. 
{quote}
starting gpg agent ERROR: Unable to launch or acquire gpg-agent. Disable 
signing.
{quote}
The script expects the GPG_AGENT_INFO environment variable to be set with the 
needed info by gpg-agent. However, it was not, because of changes in newer 
gpg-agent versions. The workaround I found is to add the following line to the 
dev-support/bin/create-release script right after it starts gpg-agent:
{quote}
export GPG_AGENT_INFO="~/.gnupg/S.gpg-agent:$(pgrep gpg-agent):1"
{quote}

2.
{quote}
gpg: can't connect to `~/.gnupg/S.gpg-agent': invalid value
{quote}
I found that this is caused by mismatched gpg and gpg-agent versions installed 
via Docker. I modified dev-support/docker/Dockerfile to install gnupg2 instead 
of gnupg, which made gpg and gpg-agent both 2.1.11 (instead of one at 2.1.11 
and the other at 1.14), and this solved the above problem.










[jira] [Created] (HADOOP-15538) Possible dead lock in Client

2018-06-14 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-15538:
--

 Summary: Possible dead lock in Client
 Key: HADOOP-15538
 URL: https://issues.apache.org/jira/browse/HADOOP-15538
 Project: Hadoop Common
  Issue Type: Bug
  Components: common
Reporter: Yongjun Zhang


We have a jstack collection that spans 13 minutes, one frame per ~1.5 minutes. 
In each frame, I observed the following:

{code}
Found one Java-level deadlock:
=
"IPC Parameter Sending Thread #294":
  waiting to lock monitor 0x7f68f21f3188 (object 0x000621745390, a 
java.lang.Object),
  which is held by UNKNOWN_owner_addr=0x7f68332e2800

Java stack information for the threads listed above:
===
"IPC Parameter Sending Thread #294":
at 
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:268)
- waiting to lock <0x000621745390> (a java.lang.Object)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
- locked <0x000621745380> (a java.lang.Object)
at 
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
- locked <0x000621749850> (a java.io.BufferedOutputStream)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.hadoop.ipc.Client$Connection$3.run(Client.java:1072)
- locked <0x00062174b878> (a java.io.DataOutputStream)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Found one Java-level deadlock:
=
"IPC Client (297602875) connection to x.y.z.p:8020 from impala":
  waiting to lock monitor 0x7f68f21f3188 (object 0x000621745390, a 
java.lang.Object),
  which is held by UNKNOWN_owner_addr=0x7f68332e2800

Java stack information for the threads listed above:
===
"IPC Client (297602875) connection to x.y.z.p:8020 from impala":
at 
sun.nio.ch.SocketChannelImpl.readerCleanup(SocketChannelImpl.java:279)
- waiting to lock <0x000621745390> (a java.lang.Object)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:390)
- locked <0x000621745370> (a java.lang.Object)
at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at 
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:553)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x0006217476f0> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1006)

Found 2 deadlocks.
{code}

This happens with jdk1.8.0_162, and the code appears to match 
https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/tree/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java.

The first thread is blocked at:

https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/blob/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java?line=268

The second thread is blocked at:
https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/blob/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java?line=279

There are two issues here:

1. There seems to be a real deadlock, because the stacks remain the same even 
though the first and last jstack frames captured are 13 minutes apart.

2. The Java deadlock report seems problematic: two threads that deadlock 
should not be blocked on the same lock, but they appear to 

[jira] [Created] (HADOOP-15530) RPC could get stuck at senderFuture.get()

2018-06-11 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-15530:
--

 Summary: RPC could get stuck at senderFuture.get()
 Key: HADOOP-15530
 URL: https://issues.apache.org/jira/browse/HADOOP-15530
 Project: Hadoop Common
  Issue Type: Bug
  Components: common
Reporter: Yongjun Zhang


In Client.java, sendRpcRequest() does the following:

{code}
   /** Initiates a rpc call by sending the rpc request to the remote server.
 * Note: this is not called from the Connection thread, but by other
 * threads.
 * @param call - the rpc request
 */
public void sendRpcRequest(final Call call)
throws InterruptedException, IOException {
  if (shouldCloseConnection.get()) {
return;
  }

  // Serialize the call to be sent. This is done from the actual
  // caller thread, rather than the sendParamsExecutor thread,
  // so that if the serialization throws an error, it is reported
  // properly. This also parallelizes the serialization.
  //
  // Format of a call on the wire:
  // 0) Length of rest below (1 + 2)
  // 1) RpcRequestHeader  - is serialized Delimited hence contains length
  // 2) RpcRequest
  //
  // Items '1' and '2' are prepared here. 
  RpcRequestHeaderProto header = ProtoUtil.makeRpcRequestHeader(
  call.rpcKind, OperationProto.RPC_FINAL_PACKET, call.id, call.retry,
  clientId);

  final ResponseBuffer buf = new ResponseBuffer();
  header.writeDelimitedTo(buf);
  RpcWritable.wrap(call.rpcRequest).writeTo(buf);

  synchronized (sendRpcRequestLock) {
Future senderFuture = sendParamsExecutor.submit(new Runnable() {
  @Override
  public void run() {
try {
  synchronized (ipcStreams.out) {
if (shouldCloseConnection.get()) {
  return;
}
if (LOG.isDebugEnabled()) {
  LOG.debug(getName() + " sending #" + call.id
  + " " + call.rpcRequest);
}
// RpcRequestHeader + RpcRequest
ipcStreams.sendRequest(buf.toByteArray());
ipcStreams.flush();
  }
} catch (IOException e) {
  // exception at this point would leave the connection in an
  // unrecoverable state (eg half a call left on the wire).
  // So, close the connection, killing any outstanding calls
  markClosed(e);
} finally {
  //the buffer is just an in-memory buffer, but it is still polite to
  // close early
  IOUtils.closeStream(buf);
}
  }
});

try {
  senderFuture.get();
} catch (ExecutionException e) {
  Throwable cause = e.getCause();

  // cause should only be a RuntimeException as the Runnable above
  // catches IOException
  if (cause instanceof RuntimeException) {
throw (RuntimeException) cause;
  } else {
throw new RuntimeException("unexpected checked exception", cause);
  }
}
  }
}
{code}

It's observed that the call can get stuck at {{senderFuture.get()}}.

Given that we support rpcTimeout, we could choose the second variant of 
Future.get() below:
{code}
  /**
 * Waits if necessary for the computation to complete, and then
 * retrieves its result.
 *
 * @return the computed result
 * @throws CancellationException if the computation was cancelled
 * @throws ExecutionException if the computation threw an
 * exception
 * @throws InterruptedException if the current thread was interrupted
 * while waiting
 */
V get() throws InterruptedException, ExecutionException;

/**
 * Waits if necessary for at most the given time for the computation
 * to complete, and then retrieves its result, if available.
 *
 * @param timeout the maximum time to wait
 * @param unit the time unit of the timeout argument
 * @return the computed result
 * @throws CancellationException if the computation was cancelled
 * @throws ExecutionException if the computation threw an
 * exception
 * @throws InterruptedException if the current thread was interrupted
 * while waiting
 * @throws TimeoutException if the wait timed out
 */
V get(long timeout, TimeUnit unit)
throws InterruptedException, ExecutionException, TimeoutException;
{code}
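A minimal, hypothetical sketch of that suggestion (not the actual Client.java change): bound the wait on the sender future with the timed overload so a stuck sender thread cannot block the caller forever.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: a timed get() in place of the unbounded one. The timeout value here
// stands in for the configured rpcTimeout.
public class FutureTimeoutDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> senderFuture = pool.submit(() -> {
            try {
                Thread.sleep(60_000); // simulate a sender stuck on I/O
            } catch (InterruptedException ignored) {
                // interrupted by cancel(true) below
            }
        });
        try {
            // rpcTimeout-bounded wait instead of the unbounded get()
            senderFuture.get(100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("sender timed out");
            senderFuture.cancel(true); // interrupt the stuck sender
        } finally {
            pool.shutdownNow();
        }
    }
}
```

On timeout the real client would additionally need to close the connection, since a half-written request leaves it in an unrecoverable state.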

In theory, since RPCs at the client are serialized, we could just use the 
calling thread to do the send, instead of creating a new thread via a thread 
pool. This can be discussed in a separate jira.




  




[jira] [Resolved] (HADOOP-14262) rpcTimeOut is not set up correctly in Client thus client doesn't time out

2017-08-12 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang resolved HADOOP-14262.

Resolution: Duplicate

> rpcTimeOut is not set up correctly in Client thus client doesn't time out
> -
>
> Key: HADOOP-14262
> URL: https://issues.apache.org/jira/browse/HADOOP-14262
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
>
> NameNodeProxies.createNNProxyWithClientProtocol  does
> {code}
>   ClientNamenodeProtocolPB proxy = RPC.getProtocolProxy(
> ClientNamenodeProtocolPB.class, version, address, ugi, conf,
> NetUtils.getDefaultSocketFactory(conf),
> org.apache.hadoop.ipc.Client.getTimeout(conf), defaultPolicy,
> fallbackToSimpleAuth).getProxy();
> {code}
> which calls Client.getTimeout(conf) to get the timeout value. 
> Client.getTimeout(conf) doesn't consider IPC_CLIENT_RPC_TIMEOUT_KEY right 
> now. Thus rpcTimeOut doesn't take effect for the relevant RPC calls, and 
> they hang!
> For example, receiveRpcResponse blocked forever at:
> {code}
> Thread 16127: (state = BLOCKED)   
>   
>  - sun.nio.ch.SocketChannelImpl.readerCleanup() @bci=6, line=279 (Compiled 
> frame)   
>  - sun.nio.ch.SocketChannelImpl.read(java.nio.ByteBuffer) @bci=205, line=390 
> (Compiled frame)   
>  - 
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(java.nio.ByteBuffer) 
> @bci=5, line=57 (Compiled frame)
>  - org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer, int) 
> @bci=35, line=142 (Compiled frame)
>  - org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer) @bci=6, 
> line=161 (Compiled frame)
>  - org.apache.hadoop.net.SocketInputStream.read(byte[], int, int) @bci=7, 
> line=131 (Compiled frame) 
>  - java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=133 
> (Compiled frame)   
>  - java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=133 
> (Compiled frame)   
>  - org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(byte[], int, 
> int) @bci=4, line=521 (Compiled frame)
>  - java.io.BufferedInputStream.fill() @bci=214, line=246 (Compiled frame) 
>   
>  - java.io.BufferedInputStream.read() @bci=12, line=265 (Compiled frame)  
>   
>  - java.io.DataInputStream.readInt() @bci=4, line=387 (Compiled frame)
>   
>  - org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse() @bci=19, 
> line=1081 (Compiled frame) 
>  - org.apache.hadoop.ipc.Client$Connection.run() @bci=62, line=976 (Compiled 
> frame) 
> {code}
> Filing this jira to fix it.






[jira] [Created] (HADOOP-14526) Examine code base for cases that exception is thrown from finally block

2017-06-14 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-14526:
--

 Summary: Examine code base for cases that exception is thrown from 
finally block
 Key: HADOOP-14526
 URL: https://issues.apache.org/jira/browse/HADOOP-14526
 Project: Hadoop Common
  Issue Type: Bug
Reporter: Yongjun Zhang


If exception X is thrown in the try block and exception Y is thrown in the 
finally block, X will be swallowed.

In addition, a finally block is generally used to ensure that resources are 
released properly. If we throw an exception from there, some resources may be 
leaked, so throwing exceptions from a finally block is not recommended.

I caught one such case today and reported HDFS-11794; I am filing this jira as 
an umbrella to catch other similar cases.
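For illustration, a small self-contained example of the swallowing behaviour: when both the try block and the finally block throw, the caller sees only the finally block's exception; try-with-resources instead keeps the primary exception and records the cleanup failure as suppressed.

```java
import java.io.Closeable;
import java.io.IOException;

public class FinallyDemo {
    // Anti-pattern: the finally block's exception Y replaces the try block's
    // exception X entirely; the caller never sees X.
    static void throwingFinally() throws IOException {
        try {
            throw new IOException("X");
        } finally {
            throw new IOException("Y");
        }
    }

    // try-with-resources keeps X primary and records the close() failure Y
    // as a suppressed exception instead of swallowing X.
    static void withResources() throws IOException {
        Closeable failingClose = () -> {
            throw new IOException("Y");
        };
        try (Closeable c = failingClose) {
            throw new IOException("X");
        }
    }

    public static void main(String[] args) {
        try {
            throwingFinally();
        } catch (IOException e) {
            System.out.println("finally: " + e.getMessage());
        }
        try {
            withResources();
        } catch (IOException e) {
            System.out.println("twr: " + e.getMessage()
                + ", suppressed: " + e.getSuppressed()[0].getMessage());
        }
    }
}
```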








[jira] [Resolved] (HADOOP-14496) Logs for KMS delegation token lifecycle

2017-06-06 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang resolved HADOOP-14496.

Resolution: Duplicate

> Logs for KMS delegation token lifecycle
> ---
>
> Key: HADOOP-14496
> URL: https://issues.apache.org/jira/browse/HADOOP-14496
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Yongjun Zhang
>







[jira] [Created] (HADOOP-14496) Logs for KMS delegation token lifecycle

2017-06-06 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-14496:
--

 Summary: Logs for KMS delegation token lifecycle
 Key: HADOOP-14496
 URL: https://issues.apache.org/jira/browse/HADOOP-14496
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Yongjun Zhang









[jira] [Resolved] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size

2017-05-25 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang resolved HADOOP-14407.

Resolution: Fixed

> DistCp - Introduce a configurable copy buffer size
> --
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Omkar Aradhya K S
>Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch, 
> HADOOP-14407.002.patch, HADOOP-14407.003.patch, 
> HADOOP-14407.004.branch2.patch, HADOOP-14407.004.patch, 
> HADOOP-14407.004.patch, HADOOP-14407.branch2.002.patch, 
> TotalTime-vs-CopyBufferSize.jpg
>
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just 
> 8KB. In our performance tests we saw up to a ~3x performance boost with 
> bigger buffer sizes. Hence, we are making the copy buffer size a 
> configurable setting via the new parameter .






[jira] [Reopened] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size

2017-05-15 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang reopened HADOOP-14407:


> DistCp - Introduce a configurable copy buffer size
> --
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Omkar Aradhya K S
>Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch
>
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just 
> 8KB. In our performance tests we saw up to a ~3x performance boost with 
> bigger buffer sizes. Hence, we are making the copy buffer size a 
> configurable setting via the new parameter .






[jira] [Resolved] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size

2017-05-15 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang resolved HADOOP-14407.

Resolution: Information Provided

> DistCp - Introduce a configurable copy buffer size
> --
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Omkar Aradhya K S
>Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch
>
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just 
> 8KB. In our performance tests we saw up to a ~3x performance boost with 
> bigger buffer sizes. Hence, we are making the copy buffer size a 
> configurable setting via the new parameter .






[jira] [Created] (HADOOP-14333) HADOOP-14104 changed DFSClient API isHDFSEncryptionEnabled, impacted hacky hive code

2017-04-20 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-14333:
--

 Summary: HADOOP-14104 changed DFSClient API 
isHDFSEncryptionEnabled, impacted hacky hive code 
 Key: HADOOP-14333
 URL: https://issues.apache.org/jira/browse/HADOOP-14333
 Project: Hadoop Common
  Issue Type: Bug
Reporter: Yongjun Zhang


Though Hive should be fixed not to access DFSClient, which is private to 
Hadoop, removing the throws clause added by HADOOP-14104 is a quicker solution 
to unblock Hive.








[jira] [Created] (HADOOP-14322) Incorrect host info may be reported in failover message

2017-04-18 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-14322:
--

 Summary: Incorrect host info may be reported in failover message
 Key: HADOOP-14322
 URL: https://issues.apache.org/jira/browse/HADOOP-14322
 Project: Hadoop Common
  Issue Type: Bug
  Components: common
Reporter: Yongjun Zhang


This may apply to other components, but using HDFS as an example.

When multiple threads use the same DFSClient to make RPC calls, they may report 
an incorrect NN host name in the failover message:
{code}
INFO [pool-3-thread-13] retry.RetryInvocationHandler 
(RetryInvocationHandler.java:invoke(148)) - Exception while invoking delete of 
class ClientNamenodeProtocolTranslatorPB over *a.b.c.d*:8020. Trying to fail 
over immediately.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): 
Operation category WRITE is not supported in state standby. Visit 
https://s.apache.org/sbnn-error
{code}

where *a.b.c.d* is the RPC proxy corresponding to the active NN. This confuses 
users into thinking failover is misbehaving, because *a.b.c.d* is expected to 
be the proxy corresponding to the standby NN here.

The reason is that the ProxyDescriptor data field of RetryInvocationHandler may 
be shared by multiple threads doing RPC calls, so a failover done by one thread 
(which changes the rpc proxy) may be visible to the other threads at the moment 
they report the above message.

An example sequence: 
# multiple threads start with the same SNN to do RPC calls, 
# all threads discover that a failover is needed, 
# thread X fails over first, changing the ProxyDescriptor's proxyInfo to the ANN
# the other threads then report the above message with the proxyInfo already 
changed by thread X, and so name the ANN instead of the SNN.

Some details:

RetryInvocationHandler does the following when failing over:
{code}
  synchronized void failover(long expectedFailoverCount, Method method,
   int callId) {
  // Make sure that concurrent failed invocations only cause a single
  // actual failover.
  if (failoverCount == expectedFailoverCount) {
fpp.performFailover(proxyInfo.proxy);
failoverCount++;
  } else {
LOG.warn("A failover has occurred since the start of call #" + callId
+ " " + proxyInfo.getString(method.getName()));
  }
  proxyInfo = fpp.getProxy();
}
{code}
and changed the proxyInfo in the ProxyDescriptor.

While the log method below reports the message with the ProxyDescriptor's proxyInfo:
{code}
private void log(final Method method, final boolean isFailover,
  final int failovers, final long delay, final Exception ex) {
..
   final StringBuilder b = new StringBuilder()
.append(ex + ", while invoking ")
.append(proxyDescriptor.getProxyInfo().getString(method.getName()));
if (failovers > 0) {
  b.append(" after ").append(failovers).append(" failover attempts");
}
b.append(isFailover? ". Trying to failover ": ". Retrying ");
b.append(delay > 0? "after sleeping for " + delay + "ms.": "immediately.");
{code}
and so does the {{handleException}} method:
{code}
if (LOG.isDebugEnabled()) {
  LOG.debug("Exception while invoking call #" + callId + " "
  + proxyDescriptor.getProxyInfo().getString(method.getName())
  + ". Not retrying because " + retryInfo.action.reason, e);
}
{code}

FailoverProxyProvider
{code}
   public String getString(String methodName) {
  return proxy.getClass().getSimpleName() + "." + methodName
  + " over " + proxyInfo;
}

@Override
public String toString() {
  return proxy.getClass().getSimpleName() + " over " + proxyInfo;
}
{code}
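The race can be shown in isolation: if the logging path rereads the shared proxyInfo after another thread has failed over, the message names the wrong proxy, while a snapshot taken when the call was issued reports the right one. A minimal sketch of the idea (class, field, and method names here are illustrative, not the actual Hadoop classes):

```java
// Sketch only: ProxyDescriptorSketch stands in for the shared ProxyDescriptor;
// the host strings are made up. Not the actual RetryInvocationHandler code.
public class ProxyDescriptorSketch {
    private volatile String proxyInfo = "standby-nn:8020"; // shared by threads

    String getProxyInfo() { return proxyInfo; }

    synchronized void failover() { proxyInfo = "active-nn:8020"; }

    // Buggy pattern: reread the shared field at log time, so a concurrent
    // failover makes the message name the *new* proxy, not the one used.
    String buggyMessage() { return "failed while invoking over " + getProxyInfo(); }

    // Fixed pattern: log the snapshot captured when the call was issued.
    String fixedMessage(String snapshot) { return "failed while invoking over " + snapshot; }

    public static void main(String[] args) {
        ProxyDescriptorSketch pd = new ProxyDescriptorSketch();
        String snapshot = pd.getProxyInfo(); // captured when the RPC was issued
        pd.failover();                       // another thread fails over
        System.out.println(pd.buggyMessage());         // names active-nn (wrong)
        System.out.println(pd.fixedMessage(snapshot)); // names standby-nn (right)
    }
}
```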






[jira] [Resolved] (HADOOP-14198) Should have a way to let PingInputStream to abort

2017-03-31 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang resolved HADOOP-14198.

Resolution: Duplicate

> Should have a way to let PingInputStream to abort
> -
>
> Key: HADOOP-14198
> URL: https://issues.apache.org/jira/browse/HADOOP-14198
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Yongjun Zhang
>
> We observed a case where an RPC call got stuck, since PingInputStream does the 
> following
> {code}
>  /** This class sends a ping to the remote side when timeout on
>  * reading. If no failure is detected, it retries until at least
>  * a byte is read.
>  */
> private class PingInputStream extends FilterInputStream {
> {code}
> It seems that in this case no data is ever received, and it keeps pinging.
> Should we ping forever here? Maybe we should introduce a config to stop the 
> ping after pinging a certain number of times, report back a timeout, and let 
> the caller retry the RPC?
> I wonder if there is a chance the RPC gets dropped somehow by the server so 
> that no response is ever received.
> See 
> {code}
> Thread 16127: (state = BLOCKED)   
>   
>  - sun.nio.ch.SocketChannelImpl.readerCleanup() @bci=6, line=279 (Compiled 
> frame)   
>  - sun.nio.ch.SocketChannelImpl.read(java.nio.ByteBuffer) @bci=205, line=390 
> (Compiled frame)   
>  - 
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(java.nio.ByteBuffer) 
> @bci=5, line=57 (Compiled frame)
>  - org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer, int) 
> @bci=35, line=142 (Compiled frame)
>  - org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer) @bci=6, 
> line=161 (Compiled frame)
>  - org.apache.hadoop.net.SocketInputStream.read(byte[], int, int) @bci=7, 
> line=131 (Compiled frame) 
>  - java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=133 
> (Compiled frame)   
>  - java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=133 
> (Compiled frame)   
>  - org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(byte[], int, 
> int) @bci=4, line=521 (Compiled frame)
>  - java.io.BufferedInputStream.fill() @bci=214, line=246 (Compiled frame) 
>   
>  - java.io.BufferedInputStream.read() @bci=12, line=265 (Compiled frame)  
>   
>  - java.io.DataInputStream.readInt() @bci=4, line=387 (Compiled frame)
>   
>  - org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse() @bci=19, 
> line=1081 (Compiled frame) 
>  - org.apache.hadoop.ipc.Client$Connection.run() @bci=62, line=976 (Compiled 
> frame) 
> {code}
>  






[jira] [Created] (HADOOP-14262) rpcTimeOut is not set up correctly in Client thus client doesn't time out

2017-03-31 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-14262:
--

 Summary: rpcTimeOut is not set up correctly in Client thus client 
doesn't time out
 Key: HADOOP-14262
 URL: https://issues.apache.org/jira/browse/HADOOP-14262
 Project: Hadoop Common
  Issue Type: Bug
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang


NameNodeProxies.createNNProxyWithClientProtocol  does

{code}
  ClientNamenodeProtocolPB proxy = RPC.getProtocolProxy(
ClientNamenodeProtocolPB.class, version, address, ugi, conf,
NetUtils.getDefaultSocketFactory(conf),
org.apache.hadoop.ipc.Client.getTimeout(conf), defaultPolicy,
fallbackToSimpleAuth).getProxy();
{code}
which calls Client.getTimeOut(conf) to get the timeout value. 

Client.getTimeOut(conf) doesn't consider IPC_CLIENT_RPC_TIMEOUT_KEY right now. 
Thus rpcTimeOut doesn't take effect for relevant RPC calls, and they hang!

For example, receiveRpcResponse blocked forever at:
{code}
Thread 16127: (state = BLOCKED) 

 - sun.nio.ch.SocketChannelImpl.readerCleanup() @bci=6, line=279 (Compiled 
frame)   
 - sun.nio.ch.SocketChannelImpl.read(java.nio.ByteBuffer) @bci=205, line=390 
(Compiled frame)   
 - 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(java.nio.ByteBuffer) 
@bci=5, line=57 (Compiled frame)
 - org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer, int) 
@bci=35, line=142 (Compiled frame)
 - org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer) @bci=6, 
line=161 (Compiled frame)
 - org.apache.hadoop.net.SocketInputStream.read(byte[], int, int) @bci=7, 
line=131 (Compiled frame) 
 - java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=133 (Compiled 
frame)   
 - java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=133 (Compiled 
frame)   
 - org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(byte[], int, 
int) @bci=4, line=521 (Compiled frame)
 - java.io.BufferedInputStream.fill() @bci=214, line=246 (Compiled frame)   

 - java.io.BufferedInputStream.read() @bci=12, line=265 (Compiled frame)

 - java.io.DataInputStream.readInt() @bci=4, line=387 (Compiled frame)  

 - org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse() @bci=19, 
line=1081 (Compiled frame) 
 - org.apache.hadoop.ipc.Client$Connection.run() @bci=62, line=976 (Compiled 
frame) 
{code}

Filing this jira to fix it.
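A sketch of the lookup order the fix would need: consult the explicit RPC timeout key first, and only fall back to the ping-based behavior when it is unset. The key name string, defaults, and fallback branches below are assumptions for illustration, not the actual logic of Hadoop's Client.getTimeout:

```java
import java.util.Map;

// Hypothetical sketch: an explicit rpcTimeOut should take effect before any
// ping-based fallback. Key name and defaults are assumptions, not Hadoop's.
public class TimeoutLookupSketch {
    static final String IPC_CLIENT_RPC_TIMEOUT_KEY = "ipc.client.rpc-timeout.ms";

    static int getTimeout(Map<String, Integer> conf,
                          boolean pingEnabled, int pingIntervalMs) {
        Integer rpcTimeout = conf.getOrDefault(IPC_CLIENT_RPC_TIMEOUT_KEY, 0);
        if (rpcTimeout > 0) {
            return rpcTimeout;     // explicit rpcTimeOut takes effect
        }
        if (!pingEnabled) {
            return pingIntervalMs; // legacy behavior: ping interval as timeout
        }
        return -1;                 // no timeout: reads may block forever
    }

    public static void main(String[] args) {
        System.out.println(getTimeout(
            Map.of(IPC_CLIENT_RPC_TIMEOUT_KEY, 30000), true, 60000)); // 30000
        System.out.println(getTimeout(Map.of(), true, 60000));        // -1
    }
}
```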







[jira] [Created] (HADOOP-14198) Should have a way to let PingInputStream to abort

2017-03-19 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-14198:
--

 Summary: Should have a way to let PingInputStream to abort
 Key: HADOOP-14198
 URL: https://issues.apache.org/jira/browse/HADOOP-14198
 Project: Hadoop Common
  Issue Type: Bug
Reporter: Yongjun Zhang


We observed a case where an RPC call got stuck, since PingInputStream does the 
following

{code}
 /** This class sends a ping to the remote side when timeout on
 * reading. If no failure is detected, it retries until at least
 * a byte is read.
 */
private class PingInputStream extends FilterInputStream {
{code}

It seems that in this case no data is ever received, and it keeps pinging.

Should we ping forever here? Maybe we should introduce a config to stop the 
ping after pinging a certain number of times, report back a timeout, and let 
the caller retry the RPC?

I wonder if there is a chance the RPC gets dropped somehow by the server so 
that no response is ever received.

See 
{code}
Thread 16127: (state = BLOCKED) 

 - sun.nio.ch.SocketChannelImpl.readerCleanup() @bci=6, line=279 (Compiled 
frame)   
 - sun.nio.ch.SocketChannelImpl.read(java.nio.ByteBuffer) @bci=205, line=390 
(Compiled frame)   
 - 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(java.nio.ByteBuffer) 
@bci=5, line=57 (Compiled frame)
 - org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer, int) 
@bci=35, line=142 (Compiled frame)
 - org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer) @bci=6, 
line=161 (Compiled frame)
 - org.apache.hadoop.net.SocketInputStream.read(byte[], int, int) @bci=7, 
line=131 (Compiled frame) 
 - java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=133 (Compiled 
frame)   
 - java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=133 (Compiled 
frame)   
 - org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(byte[], int, 
int) @bci=4, line=521 (Compiled frame)
 - java.io.BufferedInputStream.fill() @bci=214, line=246 (Compiled frame)   

 - java.io.BufferedInputStream.read() @bci=12, line=265 (Compiled frame)

 - java.io.DataInputStream.readInt() @bci=4, line=387 (Compiled frame)  

 - org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse() @bci=19, 
line=1081 (Compiled frame) 
 - org.apache.hadoop.ipc.Client$Connection.run() @bci=62, line=976 (Compiled 
frame) 
{code}
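The proposed config could bound the ping loop roughly like this sketch; the method shape, the config-driven maxPings parameter, and the commented-out sendPing hook are hypothetical, not the actual Client internals:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;

// Hypothetical sketch of the proposed bound: after maxPings read timeouts,
// give up and surface a timeout to the caller instead of pinging forever.
public class BoundedPingSketch {
    static int readWithBoundedPings(InputStream in, int maxPings)
            throws IOException {
        int pings = 0;
        while (true) {
            try {
                return in.read();
            } catch (SocketTimeoutException e) {
                if (++pings >= maxPings) {
                    throw new IOException(
                        "no response after " + maxPings + " pings", e);
                }
                // sendPing(); // would ping the server here, then retry
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stream that times out twice, then delivers a byte.
        InputStream flaky = new InputStream() {
            int timeouts = 2;
            @Override public int read() throws IOException {
                if (timeouts-- > 0) throw new SocketTimeoutException();
                return 42;
            }
        };
        System.out.println(readWithBoundedPings(flaky, 5)); // 42
        // A plain stream needs no pings at all:
        System.out.println(readWithBoundedPings(
            new ByteArrayInputStream(new byte[]{7}), 5));   // 7
    }
}
```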


 






[jira] [Created] (HADOOP-13720) Add more info to "token ... is expired" message

2016-10-13 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-13720:
--

 Summary: Add more info to "token ... is expired" message
 Key: HADOOP-13720
 URL: https://issues.apache.org/jira/browse/HADOOP-13720
 Project: Hadoop Common
  Issue Type: Bug
  Components: common, security
Reporter: Yongjun Zhang


Currently AbstractDelegationTokenSecretManager$checkToken does

{code}
  protected DelegationTokenInformation checkToken(TokenIdent identifier)
  throws InvalidToken {
assert Thread.holdsLock(this);
DelegationTokenInformation info = getTokenInfo(identifier);
if (info == null) {
  throw new InvalidToken("token (" + identifier.toString()
  + ") can't be found in cache");
}
if (info.getRenewDate() < Time.now()) {
  throw new InvalidToken("token (" + identifier.toString() + ") is 
expired");
}
return info;
  } 
{code}

When a token is expired, we throw the above exception without printing out 
{{info.getRenewDate()}} in the message. If we printed it out, we could know how 
long the token has gone without being renewed. This will help us investigate 
certain issues.

Create this jira as a request to add that part.
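A sketch of what the enriched message could look like; the exact wording and the extra fields shown are a suggestion, not the final fix:

```java
// Sketch: include the renew date (and the current time) so the log shows how
// long the token has gone unrenewed. Wording is a suggestion, not the fix.
public class ExpiredTokenMessageSketch {
    static String expiredMessage(String identifier, long renewDate, long now) {
        return "token (" + identifier + ") is expired"
            + ", current time: " + now
            + ", expected renewal time: " + renewDate;
    }

    public static void main(String[] args) {
        // Hypothetical identifier string for illustration.
        System.out.println(expiredMessage("owner=hdfs, seq=42", 1000L, 5000L));
    }
}
```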








[jira] [Created] (HADOOP-12604) Exception may be swallowed in KMSClientProvider

2015-11-26 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-12604:
--

 Summary: Exception may be swallowed in KMSClientProvider
 Key: HADOOP-12604
 URL: https://issues.apache.org/jira/browse/HADOOP-12604
 Project: Hadoop Common
  Issue Type: Bug
  Components: kms
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang


In KMSClientProvider# createConnection
{code}
  try {
is = conn.getInputStream();
ret = mapper.readValue(is, klass);
  } catch (IOException ex) {
if (is != null) {
  is.close(); <== close may throw exception
}
throw ex;
  } finally {
if (is != null) {
  is.close();
}
  }
}
{code}

{{ex}} may be swallowed when the {{close}} highlighted in the code throws an 
exception.  Thanks [~qwertymaniac] for pointing this out.

BTW, I think we should be able to consolidate the two {{is.close()}} calls in 
the above code, so we don't close the same stream twice. The one in the 
{{finally}} block may be called whether or not an exception was thrown, and it 
may throw an exception too; we need to be careful not to swallow exceptions 
there either.
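One way to get both behaviors at once is try-with-resources: it closes the stream exactly once, and a close() failure that happens while an exception is already in flight is attached to it via addSuppressed instead of replacing it. A sketch of the shape, not the actual KMSClientProvider fix:

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch: try-with-resources closes once; a close() failure during an
// in-flight exception is recorded as a suppressed exception rather than
// swallowing the original error. Not the actual KMSClientProvider code.
public class CloseOnceSketch {
    static int readFirstByte(InputStream is) throws IOException {
        try (InputStream in = is) { // closed exactly once, success or failure
            return in.read();
        }
    }

    public static void main(String[] args) {
        // Stream whose read and close both fail, to show the precedence.
        InputStream doubleFault = new InputStream() {
            @Override public int read() throws IOException {
                throw new IOException("read failed");
            }
            @Override public void close() throws IOException {
                throw new IOException("close failed");
            }
        };
        try {
            readFirstByte(doubleFault);
        } catch (IOException e) {
            // The read failure wins; the close failure rides along.
            System.out.println(e.getMessage());                    // read failed
            System.out.println(e.getSuppressed()[0].getMessage()); // close failed
        }
    }
}
```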








[jira] [Created] (HADOOP-12517) Findbugs reported 0 issues, but summary

2015-10-27 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-12517:
--

 Summary: Findbugs reported 0 issues, but summary 
 Key: HADOOP-12517
 URL: https://issues.apache.org/jira/browse/HADOOP-12517
 Project: Hadoop Common
  Issue Type: Bug
  Components: build
Reporter: Yongjun Zhang


https://issues.apache.org/jira/browse/HDFS-9231?focusedCommentId=14975559&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14975559

stated -1 for findbugs, however, 

https://builds.apache.org/job/PreCommit-HDFS-Build/13205/artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html

says 0.

Thanks a lot for looking into it.






[jira] [Created] (HADOOP-12103) Small refactoring of DelegationTokenAuthenticationFilter to allow code sharing

2015-06-18 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-12103:
--

 Summary: Small refactoring of DelegationTokenAuthenticationFilter 
to allow code sharing
 Key: HADOOP-12103
 URL: https://issues.apache.org/jira/browse/HADOOP-12103
 Project: Hadoop Common
  Issue Type: Bug
  Components: security
Affects Versions: 2.7.1
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
Priority: Minor


This is the hadoop-common portion change for HDFS-8337 patch rev 003.






[jira] [Created] (HADOOP-11597) Factor OSType out from Shell: change in common

2015-02-15 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-11597:
--

 Summary: Factor OSType out from Shell: change in common
 Key: HADOOP-11597
 URL: https://issues.apache.org/jira/browse/HADOOP-11597
 Project: Hadoop Common
  Issue Type: Sub-task
  Components: util
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang








[jira] [Created] (HADOOP-11551) Let nightly jenkins jobs run the tool of HADOOP-11045 and include the result in the job report

2015-02-05 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-11551:
--

 Summary: Let nightly jenkins jobs run the tool of HADOOP-11045 and 
include the result in the job report
 Key: HADOOP-11551
 URL: https://issues.apache.org/jira/browse/HADOOP-11551
 Project: Hadoop Common
  Issue Type: Bug
  Components: build, tools
Reporter: Yongjun Zhang


This jira is to propose running the tool created with HADOOP-11045 at the end 
of jenkins test job - I am thinking about trunk jobs currently - and report the 
results at the job report. This way when we look at test failure, we can tell 
the failure pattern, and whether failed test is likely a flaky test or not.






[jira] [Created] (HADOOP-11408) TestRetryCacheWithHA.testUpdatePipeline failed in trunk

2014-12-15 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-11408:
--

 Summary: TestRetryCacheWithHA.testUpdatePipeline failed in trunk
 Key: HADOOP-11408
 URL: https://issues.apache.org/jira/browse/HADOOP-11408
 Project: Hadoop Common
  Issue Type: Bug
Reporter: Yongjun Zhang


https://builds.apache.org/job/Hadoop-Hdfs-trunk/1974/testReport/

Error Message
{quote}
After waiting the operation updatePipeline still has not taken effect on NN yet
Stacktrace

java.lang.AssertionError: After waiting the operation updatePipeline still has 
not taken effect on NN yet
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at 
org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testClientRetryWithFailover(TestRetryCacheWithHA.java:1278)
at 
org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testUpdatePipeline(TestRetryCacheWithHA.java:1176)
{quote}

Found by tool proposed in HADOOP-11045:

{quote}
[yzhang@localhost jenkinsftf]$ ./determine-flaky-tests-hadoop.py -j 
Hadoop-Hdfs-trunk -n 28 | tee bt.log
Recently FAILED builds in url: 
https://builds.apache.org//job/Hadoop-Hdfs-trunk
THERE ARE 4 builds (out of 6) that have failed tests in the past 28 days, 
as listed below:

===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1974/testReport (2014-12-15 
03:30:01)
Failed test: 
org.apache.hadoop.hdfs.TestDecommission.testIncludeByRegistrationName
Failed test: 
org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testNumVersionsReportedCorrect
Failed test: 
org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testUpdatePipeline
===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1972/testReport (2014-12-13 
10:32:27)
Failed test: 
org.apache.hadoop.hdfs.TestDecommission.testIncludeByRegistrationName
===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1971/testReport (2014-12-13 
03:30:01)
Failed test: 
org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testUpdatePipeline
===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1969/testReport (2014-12-11 
03:30:01)
Failed test: 
org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testNumVersionsReportedCorrect
Failed test: 
org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testUpdatePipeline
Failed test: 
org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover.testFailoverRightBeforeCommitSynchronization

Among 6 runs examined, all failed tests #failedRuns: testName:
3: 
org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testUpdatePipeline
2: org.apache.hadoop.hdfs.TestDecommission.testIncludeByRegistrationName
2: 
org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testNumVersionsReportedCorrect
1: 
org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover.testFailoverRightBeforeCommitSynchronization
{quote}






[jira] [Resolved] (HADOOP-11408) TestRetryCacheWithHA.testUpdatePipeline failed in trunk

2014-12-15 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang resolved HADOOP-11408.

Resolution: Duplicate

 TestRetryCacheWithHA.testUpdatePipeline failed in trunk
 ---

 Key: HADOOP-11408
 URL: https://issues.apache.org/jira/browse/HADOOP-11408
 Project: Hadoop Common
  Issue Type: Bug
Reporter: Yongjun Zhang

 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1974/testReport/
 Error Message
 {quote}
 After waiting the operation updatePipeline still has not taken effect on NN 
 yet
 Stacktrace
 java.lang.AssertionError: After waiting the operation updatePipeline still 
 has not taken effect on NN yet
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testClientRetryWithFailover(TestRetryCacheWithHA.java:1278)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testUpdatePipeline(TestRetryCacheWithHA.java:1176)
 {quote}
 Found by tool proposed in HADOOP-11045:
 {quote}
 [yzhang@localhost jenkinsftf]$ ./determine-flaky-tests-hadoop.py -j 
 Hadoop-Hdfs-trunk -n 5 | tee bt.log
 Recently FAILED builds in url: 
 https://builds.apache.org//job/Hadoop-Hdfs-trunk
 THERE ARE 4 builds (out of 6) that have failed tests in the past 5 days, 
 as listed below:
 ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1974/testReport 
 (2014-12-15 03:30:01)
 Failed test: 
 org.apache.hadoop.hdfs.TestDecommission.testIncludeByRegistrationName
 Failed test: 
 org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testNumVersionsReportedCorrect
 Failed test: 
 org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testUpdatePipeline
 ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1972/testReport 
 (2014-12-13 10:32:27)
 Failed test: 
 org.apache.hadoop.hdfs.TestDecommission.testIncludeByRegistrationName
 ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1971/testReport 
 (2014-12-13 03:30:01)
 Failed test: 
 org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testUpdatePipeline
 ===https://builds.apache.org/job/Hadoop-Hdfs-trunk/1969/testReport 
 (2014-12-11 03:30:01)
 Failed test: 
 org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testNumVersionsReportedCorrect
 Failed test: 
 org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testUpdatePipeline
 Failed test: 
 org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover.testFailoverRightBeforeCommitSynchronization
 Among 6 runs examined, all failed tests #failedRuns: testName:
 3: 
 org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testUpdatePipeline
 2: org.apache.hadoop.hdfs.TestDecommission.testIncludeByRegistrationName
 2: 
 org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testNumVersionsReportedCorrect
 1: 
 org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover.testFailoverRightBeforeCommitSynchronization
 {quote}





[jira] [Reopened] (HADOOP-11320) Submitting a hadoop patch doesn't trigger jenkins test run

2014-12-02 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang reopened HADOOP-11320:


 Submitting a hadoop patch doesn't trigger jenkins test run
 --

 Key: HADOOP-11320
 URL: https://issues.apache.org/jira/browse/HADOOP-11320
 Project: Hadoop Common
  Issue Type: Bug
  Components: build
Reporter: Yongjun Zhang
 Attachments: HADOOP-11293.003.patch


 See details in INFRA-8655.
 Per [~abayer] and [~cnauroth]'s feedback there , I'm creating this jira to 
 investigate the possible bug in dev-support/test-patch.sh script.
 Thanks Andrew and Chris.





[jira] [Created] (HADOOP-11320) Submitting a hadoop patch doesn't trigger jenkins test run

2014-11-19 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-11320:
--

 Summary: Submitting a hadoop patch doesn't trigger jenkins test run
 Key: HADOOP-11320
 URL: https://issues.apache.org/jira/browse/HADOOP-11320
 Project: Hadoop Common
  Issue Type: Bug
  Components: build
Reporter: Yongjun Zhang


See details in INFRA-8655.

Per [~abayer] and [~cnauroth]'s feedback there , I'm creating this jira to 
investigate the possible bug in dev-support/test-patch.sh script.

Thanks Andrew and Chris.






[jira] [Created] (HADOOP-11293) Factor OSType out from Shell

2014-11-10 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-11293:
--

 Summary: Factor OSType out from Shell
 Key: HADOOP-11293
 URL: https://issues.apache.org/jira/browse/HADOOP-11293
 Project: Hadoop Common
  Issue Type: Improvement
  Components: util
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang


Currently the code that detects the OS type is located in Shell.java. Code that 
needs to check the OS type refers to Shell, even if nothing else from Shell is 
needed. 

I am proposing to refactor OSType out into its own class, to make the OSType 
easier to access and the dependency cleaner.
 





[jira] [Created] (HADOOP-11208) Replace daemon with better name in scripts like hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs

2014-10-17 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-11208:
--

 Summary: Replace daemon with better name in scripts like 
hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs
 Key: HADOOP-11208
 URL: https://issues.apache.org/jira/browse/HADOOP-11208
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Yongjun Zhang


Per discussion in HDFS-7204, creating this jira.

Thanks [~aw] for the work on HDFS-7204.






[jira] [Created] (HADOOP-11195) Move Id-Name mapping in NFS to the hadoop-common area for better maintenance

2014-10-13 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-11195:
--

 Summary: Move Id-Name mapping in NFS to the hadoop-common area for 
better maintenance
 Key: HADOOP-11195
 URL: https://issues.apache.org/jira/browse/HADOOP-11195
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang


Per [~aw]'s suggestion in HDFS-7146, creating this jira to move the id-name 
mapping implementation (IdUserGroup.java) to the framework that caches user and 
group info in the hadoop-common area 
(hadoop-common/src/main/java/org/apache/hadoop/security). 

Thanks [~brandonli] and [~aw] for the review and discussion in HDFS-7146.






[jira] [Created] (HADOOP-11189) TestDNFencing.testQueueingWithAppend failed often in latest test

2014-10-10 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-11189:
--

 Summary: TestDNFencing.testQueueingWithAppend failed often in 
latest test
 Key: HADOOP-11189
 URL: https://issues.apache.org/jira/browse/HADOOP-11189
 Project: Hadoop Common
  Issue Type: Bug
  Components: ha
Reporter: Yongjun Zhang


Using tool from HADOOP-11045, got the following report:

{code}
[yzhang@localhost jenkinsftf]$ ./determine-flaky-tests-hadoop.py -j 
PreCommit-HDFS-Build -n 1 

Recently FAILED builds in url: 
https://builds.apache.org//job/PreCommit-HDFS-Build
THERE ARE 9 builds (out of 9) that have failed tests in the past 1 days, as 
listed below:

===https://builds.apache.org/job/PreCommit-HDFS-Build/8390/testReport 
(2014-10-10 05:20:58)
Failed test: 
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend
Failed test: 
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress
Failed test: 
org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots
===https://builds.apache.org/job/PreCommit-HDFS-Build/8389/testReport 
(2014-10-10 01:10:58)
No failed tests in testReport, check job's Console Output for why it was 
reported failed
===https://builds.apache.org/job/PreCommit-HDFS-Build/8388/testReport 
(2014-10-10 00:30:54)
Failed test: 
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend
Failed test: 
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress
..
Among 9 runs examined, all failed tests #failedRuns: testName:
7: 
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend
6: 
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress
3: 
org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots
1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testFailedOpen
1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testSyncBatching
..
{code}

TestDNFencingWithReplication.testFencingStress was reported as HDFS-7221. 

Creating this jira for TestDNFencing.testQueueingWithAppend.

Symptom:
{code}
Failed

org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend

Failing for the past 1 build (Since Failed#8390 )
Took 2.9 sec.
Error Message

expected:<18> but was:<12>
Stacktrace

java.lang.AssertionError: expected:<18> but was:<12>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend(TestDNFencing.java:448)
{code}







[jira] [Resolved] (HADOOP-11189) TestDNFencing.testQueueingWithAppend failed often in latest test

2014-10-10 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang resolved HADOOP-11189.

Resolution: Duplicate

 TestDNFencing.testQueueingWithAppend failed often in latest test
 

 Key: HADOOP-11189
 URL: https://issues.apache.org/jira/browse/HADOOP-11189
 Project: Hadoop Common
  Issue Type: Bug
  Components: ha
Reporter: Yongjun Zhang

 Using tool from HADOOP-11045, got the following report:
 {code}
 [yzhang@localhost jenkinsftf]$ ./determine-flaky-tests-hadoop.py -j 
 PreCommit-HDFS-Build -n 1 
 Recently FAILED builds in url: 
 https://builds.apache.org//job/PreCommit-HDFS-Build
 THERE ARE 9 builds (out of 9) that have failed tests in the past 1 days, 
 as listed below:
 ..
 Among 9 runs examined, all failed tests #failedRuns: testName:
 7: 
 org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend
 6: 
 org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress
 3: 
 org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots
 1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testFailedOpen
 1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testSyncBatching
 ..
 {code}
 TestDNFencingWithReplication.testFencingStress was reported as HDFS-7221. 
 Creating this jira for TestDNFencing.testQueueingWithAppend.
 Symptom:
 {code}
 Failed
 org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend
 Failing for the past 1 build (Since Failed#8390 )
 Took 2.9 sec.
 Error Message
 expected:<18> but was:<12>
 Stacktrace
 java.lang.AssertionError: expected:<18> but was:<12>
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at org.junit.Assert.assertEquals(Assert.java:555)
   at org.junit.Assert.assertEquals(Assert.java:542)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend(TestDNFencing.java:448)
 {code}





[jira] [Created] (HADOOP-11056) OsSecureRandom.setConf() might leak resource

2014-09-03 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-11056:
--

 Summary: OsSecureRandom.setConf() might leak resource
 Key: HADOOP-11056
 URL: https://issues.apache.org/jira/browse/HADOOP-11056
 Project: Hadoop Common
  Issue Type: Bug
  Components: security
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang


OsSecureRandom.setConf() might leak a resource: if {{fillReservoir(0)}} throws 
an exception, the stream is not closed.
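The fix shape can be sketched as: if the initial fill fails after the stream has been opened, close the stream before propagating the exception. The class, method, and field names below are illustrative stand-ins, not OsSecureRandom's actual ones:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: if the initial fill throws, close the just-opened stream before
// rethrowing so the file descriptor is not leaked. Names are illustrative.
public class CloseOnInitFailureSketch {
    private InputStream stream;

    void setConf(InputStream opened, boolean failFill) throws IOException {
        stream = opened;
        try {
            fillReservoir(failFill); // may throw after the stream is open
        } catch (IOException e) {
            stream.close();          // release the resource before propagating
            stream = null;
            throw e;
        }
    }

    private void fillReservoir(boolean fail) throws IOException {
        if (fail) throw new IOException("fill failed");
        stream.read();
    }

    public static void main(String[] args) throws IOException {
        CloseOnInitFailureSketch s = new CloseOnInitFailureSketch();
        s.setConf(new ByteArrayInputStream(new byte[]{1}), false); // ok
        try {
            s.setConf(new ByteArrayInputStream(new byte[]{1}), true);
        } catch (IOException e) {
            System.out.println(e.getMessage()); // fill failed, stream closed
        }
    }
}
```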

 





[jira] [Created] (HADOOP-11045) Introducing a tool to detect flaky tests of hadoop jenkins test job

2014-09-01 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-11045:
--

 Summary: Introducing a tool to detect flaky tests of hadoop 
jenkins test job
 Key: HADOOP-11045
 URL: https://issues.apache.org/jira/browse/HADOOP-11045
 Project: Hadoop Common
  Issue Type: Improvement
  Components: build, tools
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang


Filing this jira to introduce a tool to detect flaky tests of hadoop jenkins 
test jobs. 

I developed the tool on top of some initial work [~tlipcon] did. We find it 
quite useful. With Todd's agreement, I'd like to push it upstream so all of us 
can share it (thanks Todd for the initial work and support). I hope you find 
the tool useful.

This is a tool for hadoop contributors rather than hadoop users. Thanks 
[~tedyu] for the advice to put to dev-support dir.

Description of the tool:

#
# Given a jenkins test job, this script examines all runs of the job done
# within a specified period of time (number of days prior to the execution
# time of this script), and reports all failed tests.
#
# The output of this script includes a section for each run that has failed
# tests, with each failed test name listed.
#
# More importantly, at the end, it outputs a summary section listing all
# failed tests across all examined runs, indicating how many runs each test
# failed in, with the failed tests sorted by that count.
#
# This way, when we see failed tests in a PreCommit build, we can quickly tell
# whether a failed test is a new failure or has failed before, in which case
# it may just be a flaky test.
#
# Of course, to be 100% sure about the reason for a failed test, a closer look
# at the failed test in the specific run is necessary.
#
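The summary step described above (count how many runs each test failed in, then sort descending) can be sketched in a few lines. FlakySummary and summarize are hypothetical names for illustration; this is not the actual dev-support script, only the aggregation it describes.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FlakySummary {
    // runs: one entry per examined Jenkins run, each a list of failed test names.
    // Returns (test, run-count) pairs sorted by how many runs the test failed in.
    static List<Map.Entry<String, Long>> summarize(List<List<String>> runs) {
        return runs.stream()
                // count each test at most once per run
                .flatMap(run -> run.stream().distinct())
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());
    }
}
```

A test that fails in most examined runs floats to the top of the summary, which is exactly the signal for "probably flaky" described above.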






[jira] [Created] (HADOOP-10888) org.apache.hadoop.ipc.TestIPC.testRetryProxy failed often with timeout

2014-07-23 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-10888:
--

 Summary: org.apache.hadoop.ipc.TestIPC.testRetryProxy failed often 
with timeout
 Key: HADOOP-10888
 URL: https://issues.apache.org/jira/browse/HADOOP-10888
 Project: Hadoop Common
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Yongjun Zhang


As an example, 
https://builds.apache.org/job/PreCommit-HADOOP-Build/4333//testReport/org.apache.hadoop.ipc/TestIPC/testRetryProxy/
{code}
Error Message

test timed out after 6 milliseconds

Stacktrace

java.lang.Exception: test timed out after 6 milliseconds
at java.net.Inet4AddressImpl.getLocalHostName(Native Method)
at java.net.InetAddress.getLocalHost(InetAddress.java:1374)
at org.apache.hadoop.net.NetUtils.getConnectAddress(NetUtils.java:372)
at org.apache.hadoop.net.NetUtils.getConnectAddress(NetUtils.java:359)
at org.apache.hadoop.ipc.TestIPC$TestInvocationHandler.invoke(TestIPC.java:212)
at org.apache.hadoop.ipc.$Proxy11.dummyRun(Unknown Source)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
at org.apache.hadoop.ipc.$Proxy11.dummyRun(Unknown Source)
at org.apache.hadoop.ipc.TestIPC.testRetryProxy(TestIPC.java:1060)
{code}






[jira] [Created] (HADOOP-10889) Fix misuse of test.build.data in various places

2014-07-23 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-10889:
--

 Summary: Fix misuse of test.build.data in various places
 Key: HADOOP-10889
 URL: https://issues.apache.org/jira/browse/HADOOP-10889
 Project: Hadoop Common
  Issue Type: Bug
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang


Per [~arpitagarwal]'s comments in HDFS-6719, I'm filing this jira as a 
follow-up. The goal is to fix the misuse of test.build.data in quite a few 
places. Thanks Arpit!
{code}
FSTestWrapper.java
FileContextMainOperationsBaseTest.java
FileContextTestHelper.java
FileContextURIBase.java
FileSystemTestHelper.java
MiniDFSCluster.java
TestBlocksWithNotEnoughRacks.java
TestChecksumFileSystem.java
TestCopyPreserveFlag.java
TestCreateEditsLog.java
TestDFSUpgradeFromImage.java
TestDecommissioningStatus.java
TestEnhancedByteBufferAccess.java
TestFSImageWithSnapshot.java
TestFileUtil.java
TestFsShellReturnCode.java
TestHadoopArchives.java
TestHarFileSystemBasics.java
TestHardLink.java
TestHdfsTextCommand.java
TestHostsFiles.java
TestJHLA.java
TestListFiles.java
TestLocalFileSystem.java
TestNameNodeRecovery.java
TestNativeIO.java
TestPathData.java
TestPread.java
TestRenameWithSnapshots.java
TestSeekBug.java
TestSlive.java
TestSnapshot.java
TestStartup.java
TestTextCommand.java
etc
{code}






[jira] [Created] (HADOOP-10872) org.apache.hadoop.fs.shell.TestPathData failed intermittently with Mkdirs failed to create d1

2014-07-22 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-10872:
--

 Summary: org.apache.hadoop.fs.shell.TestPathData failed 
intermittently with Mkdirs failed to create d1
 Key: HADOOP-10872
 URL: https://issues.apache.org/jira/browse/HADOOP-10872
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs
Affects Versions: 2.5.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang


A bunch of TestPathData tests failed intermittently, e.g.
https://builds.apache.org/job/PreCommit-HDFS-Build/7416//testReport/

Example failure log:
{code}
Failed

org.apache.hadoop.fs.shell.TestPathData.testUnqualifiedUriContents
Failing for the past 1 build (Since Failed#7416 )
Took 0.46 sec.
Error Message

Mkdirs failed to create d1

Stacktrace

java.io.IOException: Mkdirs failed to create d1
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:426)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
at org.apache.hadoop.fs.shell.TestPathData.initialize(TestPathData.java:54)
{code}







[jira] [Resolved] (HADOOP-10510) TestSymlinkLocalFSFileContext tests are failing

2014-07-21 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang resolved HADOOP-10510.


Resolution: Duplicate

I'm marking this as a duplicate per [~andrew.wang]'s comments in HADOOP-10866. 
Thanks to you all!


 TestSymlinkLocalFSFileContext tests are failing
 ---

 Key: HADOOP-10510
 URL: https://issues.apache.org/jira/browse/HADOOP-10510
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs
Affects Versions: 2.4.0
 Environment: Linux
Reporter: Daniel Darabos
 Attachments: TestSymlinkLocalFSFileContext-output.txt, 
 TestSymlinkLocalFSFileContext.txt


 Test results:
 https://gist.github.com/oza/9965197
 This was mentioned on hadoop-common-dev:
 http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201404.mbox/%3CCAAD07OKRSmx9VSjmfk1YxyBmnFM8mwZSp%3DizP8yKKwoXYvn3Qg%40mail.gmail.com%3E
 Can you suggest a workaround in the meantime? I'd like to send a pull request 
 for an unrelated bug, but these failures mean I cannot build hadoop-common to 
 test my fix. Thanks.





[jira] [Created] (HADOOP-10543) RemoteException's unwrapRemoteException method failed for PathIOException

2014-04-27 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-10543:
--

 Summary: RemoteException's unwrapRemoteException method failed for 
PathIOException
 Key: HADOOP-10543
 URL: https://issues.apache.org/jira/browse/HADOOP-10543
 Project: Hadoop Common
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang


If the cause of a RemoteException is a PathIOException, RemoteException's 
unwrapRemoteException methods fail, because PathIOException overwrites the 
cause with null, which makes Throwable throw an exception at
{code}
public synchronized Throwable initCause(Throwable cause) {
    if (this.cause != this)
        throw new IllegalStateException("Can't overwrite cause");
{code} 
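The JDK behavior the report refers to can be demonstrated in a few lines: once a Throwable's cause has been set (even via a constructor that takes a cause), a later initCause() call is rejected. This is a minimal standalone demo of the Throwable contract, not Hadoop's unwrapping code.

```java
public class InitCauseDemo {
    // Returns true if initCause() on an exception with an already-set cause
    // throws IllegalStateException, as java.lang.Throwable specifies.
    static boolean overwriteRejected() {
        Exception e = new Exception("wrapped", new RuntimeException("root"));
        try {
            e.initCause(new RuntimeException("second")); // cause already set
            return false;
        } catch (IllegalStateException ise) {
            return true; // "Can't overwrite cause"
        }
    }

    public static void main(String[] args) {
        System.out.println("initCause rejected: " + overwriteRejected());
    }
}
```

The same rejection happens when the constructor set the cause field to null (as PathIOException does), since the field no longer holds its "unset" sentinel value of `this`.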






[jira] [Resolved] (HADOOP-10293) Though symlink is disabled by default, related code interprets path to be link incorrectly

2014-01-31 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang resolved HADOOP-10293.


Resolution: Fixed

See commit log for 
"Addendum patch for HADOOP-9652 to fix performance problems. Contributed by 
Andrew Wang"


 Though symlink is disabled by default,  related code interprets path to be 
 link incorrectly
 ---

 Key: HADOOP-10293
 URL: https://issues.apache.org/jira/browse/HADOOP-10293
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs
Affects Versions: 2.3.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang

 File path ...xyz/abc`/tfile is interpreted as a link, due to the existence of 
 a backtick in the file path (abc` is a directory name here).
 There are two issues here:
 1. When symlink support is disabled, the code that interprets symlinks should 
 be disabled too. This is the issue to resolve in this jira.
 2. When symlink support is enabled, the use of backtick ` as a delimiter to 
 decide whether a path is a link needs to be revisited; will file a different JIRA.
  





[jira] [Created] (HADOOP-10250) VersionUtil returns wrong value when comparing two versions

2014-01-21 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HADOOP-10250:
--

 Summary: VersionUtil returns wrong value when comparing two 
versions
 Key: HADOOP-10250
 URL: https://issues.apache.org/jira/browse/HADOOP-10250
 Project: Hadoop Common
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Yongjun Zhang


VersionUtil.compareVersions(1.0.0-beta-1, 1.0.0) returns 7 instead of a 
negative number, which is wrong, because 1.0.0-beta-1 is older than 1.0.0.
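The expected ordering can be sketched as a SemVer-style comparison, where a version carrying a pre-release tag orders before the same release version. This is an illustrative sketch of the desired semantics, not Hadoop's actual VersionUtil implementation; the class and method names are hypothetical.

```java
public class VersionCompareSketch {
    // Compares two versions of the form "x.y.z" or "x.y.z-<tag>".
    // Negative means a < b, positive means a > b, zero means equal.
    static int compare(String a, String b) {
        String[] pa = a.split("-", 2), pb = b.split("-", 2);
        int core = compareCore(pa[0], pb[0]);
        if (core != 0) return core;
        boolean preA = pa.length > 1, preB = pb.length > 1;
        if (preA && !preB) return -1;   // 1.0.0-beta-1 < 1.0.0
        if (!preA && preB) return 1;
        if (!preA) return 0;
        return pa[1].compareTo(pb[1]);  // lexicographic tie-break on tags
    }

    // Numeric, component-wise comparison of the dotted core ("1.0.0").
    private static int compareCore(String a, String b) {
        String[] xa = a.split("\\."), xb = b.split("\\.");
        int n = Math.max(xa.length, xb.length);
        for (int i = 0; i < n; i++) {
            int va = i < xa.length ? Integer.parseInt(xa[i]) : 0;
            int vb = i < xb.length ? Integer.parseInt(xb[i]) : 0;
            if (va != vb) return Integer.compare(va, vb);
        }
        return 0;
    }
}
```

Under these semantics compare("1.0.0-beta-1", "1.0.0") is negative, matching the expectation in this report.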


 



