[jira] [Created] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
Chengbing Liu created HDFS-7798:
-----------------------------------

Summary: Checkpointing failure caused by shared KerberosAuthenticator
Key: HDFS-7798
URL: https://issues.apache.org/jira/browse/HDFS-7798
Project: Hadoop HDFS
Issue Type: Bug
Components: security
Reporter: Chengbing Liu
Priority: Critical

We have observed occasional checkpointing failures in our real cluster: the standby NameNode was not able to upload its image to the active NameNode.

After some digging, the root cause appears to be the shared {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is designed as a use-once instance and is not stateless: it has attributes such as {{HttpURLConnection}} and {{URL}}. When multiple threads call {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is subject to a race condition, resulting in failed image uploads.

Therefore, as a first step and without breaking the current API, I propose we create a new {{KerberosAuthenticator}} instance for each connection, to make checkpointing work. We may consider making the {{Authenticator}} design and implementation stateless afterwards, as {{ConnectionConfigurator}} is.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
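A minimal sketch of the per-connection approach proposed above, using stand-in classes (the names below are illustrative, not the actual Hadoop source): the factory takes a supplier and builds a fresh use-once authenticator for every {{openConnection(...)}} call, so no mutable state is shared between threads.

```java
import java.util.function.Supplier;

// Stand-in for a use-once, stateful authenticator such as KerberosAuthenticator:
// it remembers the URL (and, in Hadoop, the HttpURLConnection) it is working on.
class OneShotAuthenticator {
    private String url;
    void authenticate(String url) { this.url = url; }
    String authenticatedUrl() { return url; }
}

// Stand-in for URLConnectionFactory: instead of one shared authenticator
// field, it asks the supplier for a fresh instance on every call.
class ConnectionFactory {
    private final Supplier<OneShotAuthenticator> authenticators;

    ConnectionFactory(Supplier<OneShotAuthenticator> authenticators) {
        this.authenticators = authenticators;
    }

    OneShotAuthenticator openConnection(String url) {
        OneShotAuthenticator auth = authenticators.get(); // fresh, unshared instance
        auth.authenticate(url);
        return auth;
    }
}
```

Because each call gets its own instance, a concurrent image upload and edit-log fetch can no longer clobber each other's connection state.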
[jira] [Created] (HDFS-7800) Improve documentation for FileSystem.concat()
Steve Armstrong created HDFS-7800:
-------------------------------------

Summary: Improve documentation for FileSystem.concat()
Key: HDFS-7800
URL: https://issues.apache.org/jira/browse/HDFS-7800
Project: Hadoop HDFS
Issue Type: Task
Affects Versions: 2.6.0, 2.2.0
Reporter: Steve Armstrong

This is a documentation request. [FileSystem.concat()|https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html] says it will "Concat existing files together." It seems to be a NameNode-only operation though, mapping the data blocks into a single file. This means:

# The destination must exist
# The destination must be non-empty
# The destination must have its last block exactly full
# All but the last of the source files must have their last block full
# All the source files will be deleted by this operation

HDFS-6641 brought up some of these limitations, but was closed as not a problem. I think the javadoc should be improved so it's clear this function was never intended to work like a general-purpose file concatenation.
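The block-boundary constraints listed above can be made concrete with a small pre-flight check. This helper is hypothetical (not part of the Hadoop API), and assumes a single fixed block size and that file lengths alone determine block fullness:

```java
// Hypothetical validation mirroring the constraints above: the target must be
// non-empty and end exactly on a block boundary, and every source except the
// last must also end on a block boundary.
final class ConcatPreflight {
    static void check(long blockSize, long targetLen, long[] sourceLens) {
        if (targetLen == 0) {
            throw new IllegalArgumentException("target must be non-empty");
        }
        if (targetLen % blockSize != 0) {
            throw new IllegalArgumentException("target's last block must be exactly full");
        }
        // only the last source file may have a partially filled last block
        for (int i = 0; i < sourceLens.length - 1; i++) {
            if (sourceLens[i] % blockSize != 0) {
                throw new IllegalArgumentException(
                    "source " + i + " must have its last block full");
            }
        }
    }
}
```

A check like this makes it obvious why concat can be a metadata-only operation: no block ever needs to be rewritten, only relinked under the target file.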
[jira] [Commented] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321801#comment-14321801 ]

Hadoop QA commented on HDFS-7798:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12698944/HDFS-7798.01.patch
against trunk revision 3338f6d.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9587//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9587//console

This message is automatically generated.
[jira] [Resolved] (HDFS-6081) TestRetryCacheWithHA#testCreateSymlink occasionally fails in trunk
[ https://issues.apache.org/jira/browse/HDFS-6081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved HDFS-6081.
--------------------------
Resolution: Cannot Reproduce

Summary: TestRetryCacheWithHA#testCreateSymlink occasionally fails in trunk
Key: HDFS-6081
URL: https://issues.apache.org/jira/browse/HDFS-6081
Project: Hadoop HDFS
Issue Type: Test
Reporter: Ted Yu

From https://builds.apache.org/job/Hadoop-Hdfs-trunk/1696/testReport/junit/org.apache.hadoop.hdfs.server.namenode.ha/TestRetryCacheWithHA/testCreateSymlink/ :
{code}
2014-03-09 13:18:47,515 WARN security.UserGroupInformation (UserGroupInformation.java:doAs(1600)) - PriviledgedActionException as:jenkins (auth:SIMPLE) cause:java.io.IOException: failed to create link /testlink either because the filename is invalid or the file exists
2014-03-09 13:18:47,515 INFO ipc.Server (Server.java:run(2093)) - IPC Server handler 0 on 39303, call org.apache.hadoop.hdfs.protocol.ClientProtocol.createSymlink from 127.0.0.1:32909 Call#682 Retry#1: error: java.io.IOException: failed to create link /testlink either because the filename is invalid or the file exists
java.io.IOException: failed to create link /testlink either because the filename is invalid or the file exists
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.createSymlinkInt(FSNamesystem.java:2053)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.createSymlink(FSNamesystem.java:2023)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.createSymlink(NameNodeRpcServer.java:965)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.createSymlink(ClientNamenodeProtocolServerSideTranslatorPB.java:844)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:932)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2071)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2067)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1597)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2065)
2014-03-09 13:18:47,522 INFO blockmanagement.BlockManager (BlockManager.java:processMisReplicatesAsync(2475)) - Total number of blocks = 1
2014-03-09 13:18:47,523 INFO blockmanagement.BlockManager (BlockManager.java:processMisReplicatesAsync(2476)) - Number of invalid blocks = 0
2014-03-09 13:18:47,523 INFO blockmanagement.BlockManager (BlockManager.java:processMisReplicatesAsync(2477)) - Number of under-replicated blocks = 0
2014-03-09 13:18:47,523 INFO ha.TestRetryCacheWithHA (TestRetryCacheWithHA.java:run(1162)) - Got Exception while calling createSymlink
org.apache.hadoop.ipc.RemoteException(java.io.IOException): failed to create link /testlink either because the filename is invalid or the file exists
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.createSymlinkInt(FSNamesystem.java:2053)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.createSymlink(FSNamesystem.java:2023)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.createSymlink(NameNodeRpcServer.java:965)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.createSymlink(ClientNamenodeProtocolServerSideTranslatorPB.java:844)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:932)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2071)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2067)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1597)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2065)
	at org.apache.hadoop.ipc.Client.call(Client.java:1409)
	at org.apache.hadoop.ipc.Client.call(Client.java:1362)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at $Proxy17.createSymlink(Unknown Source)
	at
{code}
[jira] [Resolved] (HDFS-6177) TestHttpFSServer fails occasionally in trunk
[ https://issues.apache.org/jira/browse/HDFS-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved HDFS-6177.
--------------------------
Resolution: Cannot Reproduce

Summary: TestHttpFSServer fails occasionally in trunk
Key: HDFS-6177
URL: https://issues.apache.org/jira/browse/HDFS-6177
Project: Hadoop HDFS
Issue Type: Test
Reporter: Ted Yu
Priority: Minor

From https://builds.apache.org/job/Hadoop-hdfs-trunk/1716/consoleFull :
{code}
Running org.apache.hadoop.fs.http.server.TestHttpFSServer
Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.424 sec <<< FAILURE! - in org.apache.hadoop.fs.http.server.TestHttpFSServer
testDelegationTokenOperations(org.apache.hadoop.fs.http.server.TestHttpFSServer)  Time elapsed: 0.559 sec  <<< FAILURE!
java.lang.AssertionError: expected:<401> but was:<403>
	at org.junit.Assert.fail(Assert.java:93)
	at org.junit.Assert.failNotEquals(Assert.java:647)
	at org.junit.Assert.assertEquals(Assert.java:128)
	at org.junit.Assert.assertEquals(Assert.java:472)
	at org.junit.Assert.assertEquals(Assert.java:456)
	at org.apache.hadoop.fs.http.server.TestHttpFSServer.testDelegationTokenOperations(TestHttpFSServer.java:352)
{code}
[jira] [Commented] (HDFS-7787) Split QUEUE_HIGHEST_PRIORITY in UnderReplicatedBlocks to give more priority to blocks on nodes being decommissioned
[ https://issues.apache.org/jira/browse/HDFS-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321968#comment-14321968 ]

Frode Halvorsen commented on HDFS-7787:
---------------------------------------

Sorry, my grep was wrong and included a lot of replications from earlier times, but it was still on the same decom node. The correct stats for the 10 minutes between 13:00 and 13:10 today are: a total of 3161 started threads. None of those were for blocks with two live replicas, but 2430 were for blocks with one live replica and only 731 were for blocks without live replicas. That means that only 1/4 of the blocks replicated were of the 'highest priority'. And of course this made my day worse; I now have to wait one month before I can take down the node...

Summary: Split QUEUE_HIGHEST_PRIORITY in UnderReplicatedBlocks to give more priority to blocks on nodes being decommissioned
Key: HDFS-7787
URL: https://issues.apache.org/jira/browse/HDFS-7787
Project: Hadoop HDFS
Issue Type: Improvement
Components: datanode
Affects Versions: 2.6.0
Environment: 2 NameNodes (HA), 6 DataNodes in two racks
Reporter: Frode Halvorsen
Labels: balance, hdfs, replication-performance

Each file has a setting of 3 replicas, split on different racks. After a simulated crash of one rack (shutdown of all nodes, deleted data directory, and restarted nodes) and decommission of one of the nodes in the other rack, the replication does not follow 'normal' rules...

My cluster has approx. 25 million files, and the one node I am now trying to decommission has 9 million under-replicated blocks, and 3.5 million blocks with 'no live replicas'. After a restart of the node, it starts to replicate both types of blocks, but after a while it only replicates under-replicated blocks that have other live copies. I would think that the 'normal' way to do this would be to make sure that all blocks this node keeps the only copy of are the first to be replicated/balanced?
[jira] [Commented] (HDFS-7787) Split QUEUE_HIGHEST_PRIORITY in UnderReplicatedBlocks to give more priority to blocks on nodes being decommissioned
[ https://issues.apache.org/jira/browse/HDFS-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321963#comment-14321963 ]

Frode Halvorsen commented on HDFS-7787:
---------------------------------------

I just did a log analysis of the decommissioning node, and looked at what it actually started to replicate during a ten-minute period. I filtered on the log lines for 'Starting thread to transfer' and counted lines, divided into replication to one, two or three nodes (blocks with 2, 1 and 0 live replicas). It started 5036 threads during the 10 minutes I looked at:

53 blocks to one node (2 live replicas in the cluster)
3127 blocks to two nodes (blocks with one live replica)
1856 blocks to three nodes (blocks with no live replicas)

Of course this is a problem for me, as I won't be able to kill the node completely before all blocks with no live replicas have been transferred. There are still 3.3 million of them, and at this rate I won't be able to kill the node for another week and a half :(
[jira] [Resolved] (HDFS-6501) TestCrcCorruption#testCorruptionDuringWrt sometimes fails in trunk
[ https://issues.apache.org/jira/browse/HDFS-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved HDFS-6501.
--------------------------
Resolution: Cannot Reproduce

Summary: TestCrcCorruption#testCorruptionDuringWrt sometimes fails in trunk
Key: HDFS-6501
URL: https://issues.apache.org/jira/browse/HDFS-6501
Project: Hadoop HDFS
Issue Type: Test
Reporter: Ted Yu
Priority: Minor

From https://builds.apache.org/job/Hadoop-Hdfs-trunk/1767/ :
{code}
REGRESSION: org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt

Error Message:
test timed out after 5 milliseconds

Stack Trace:
java.lang.Exception: test timed out after 5 milliseconds
	at java.lang.Object.wait(Native Method)
	at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2024)
	at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2008)
	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2107)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:98)
	at org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:133)
{code}
[jira] [Resolved] (HDFS-6726) TestNamenodeCapacityReport fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved HDFS-6726.
--------------------------
Resolution: Cannot Reproduce

Summary: TestNamenodeCapacityReport fails intermittently
Key: HDFS-6726
URL: https://issues.apache.org/jira/browse/HDFS-6726
Project: Hadoop HDFS
Issue Type: Test
Reporter: Ted Yu
Priority: Minor

From https://builds.apache.org/job/Hadoop-Hdfs-trunk/1812/testReport/junit/org.apache.hadoop.hdfs.server.namenode/TestNamenodeCapacityReport/testXceiverCount/ :
{code}
java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
	at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2141)
	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2109)
	at org.apache.hadoop.hdfs.server.namenode.TestNamenodeCapacityReport.testXceiverCount(TestNamenodeCapacityReport.java:281)
{code}
There were multiple occurrences of 'Broken pipe', 'Connection reset by peer' and 'Premature EOF from inputStream' exceptions in the test output.
[jira] [Commented] (HDFS-7787) Split QUEUE_HIGHEST_PRIORITY in UnderReplicatedBlocks to give more priority to blocks on nodes being decommissioned
[ https://issues.apache.org/jira/browse/HDFS-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322120#comment-14322120 ]

Frode Halvorsen commented on HDFS-7787:
---------------------------------------

And later, it seems to get better :) Now (after another parameter tuning for faster replication) it replicates 43,000 blocks/hour, and every block is one that has zero live replicas in the cluster :) It actually seems that the name-node needs time to calculate which blocks have higher priority. Now I only need three more days before I can take down the data-node :) As it turns out, it might just be the parameters that made me believe it had a bad prioritizing algorithm :) Too bad a lot of the parameters I have now changed are undocumented, but 'revealed' in different forum postings...

A quick look at the logs on the active name-node reveals that it actually only asks the decommissioning node to replicate. No other nodes are contacted, thus it now only replicates blocks with no live replicas. It might be my parameter settings, but it could actually have asked any of the other 5 datanodes to replicate the blocks with one live replica... I'll try to add even more replication requests per heartbeat to see if it is able to make the other datanodes do any work as well.
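The priority split the summary asks for can be sketched with stand-in types (this is illustrative, not Hadoop's actual UnderReplicatedBlocks implementation): ordering the replication work queue by live-replica count makes blocks with zero live replicas drain strictly before blocks that still have a copy elsewhere.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Stand-in for a block pending replication, tagged with how many live
// replicas of it remain in the cluster.
class PendingBlock {
    final String blockId;
    final int liveReplicas;
    PendingBlock(String blockId, int liveReplicas) {
        this.blockId = blockId;
        this.liveReplicas = liveReplicas;
    }
}

// Work queue that always hands out the block with the fewest live replicas
// first, so "last copy lives on a decommissioning node" beats blocks that
// are merely under-replicated.
class ReplicationQueue {
    private final PriorityQueue<PendingBlock> queue =
        new PriorityQueue<>(Comparator.comparingInt((PendingBlock b) -> b.liveReplicas));
    void add(PendingBlock b) { queue.add(b); }
    PendingBlock next() { return queue.poll(); }
}
```

With a single undifferentiated highest-priority queue, a zero-replica block can sit behind thousands of one-replica blocks; the ordering above is the behavior the reporter expected.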
[jira] [Updated] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chengbing Liu updated HDFS-7798:
--------------------------------
Attachment: HDFS-7798.01.patch
[jira] [Updated] (HDFS-7800) Improve documentation for FileSystem.concat()
[ https://issues.apache.org/jira/browse/HDFS-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Armstrong updated HDFS-7800:
----------------------------------
Attachment: HDFS-7800-1.patch
[jira] [Updated] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chengbing Liu updated HDFS-7798:
--------------------------------
Description:
We have observed in our real cluster occasional checkpointing failure. The standby NameNode was not able to upload image to the active NameNode.

After some digging, the root cause appears to be a shared {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is designed as a use-once instance, and is not stateless. It has attributes such as {{HttpURLConnection}} and {{URL}}. When multiple threads are calling {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is going to have race condition, resulting in a failed image uploading.

Therefore for the first step, without breaking the current API, I propose we create a new {{KerberosAuthenticator}} instance for each connection, to make checkpointing work. We may consider making {{Authenticator}} design and implementation stateless afterwards, as {{ConnectionConfigurator}} does.

was:
We have observed in our real cluster occasionally checkpointing failure. The standby NameNode was not able to upload image to the active NameNode.

After some digging, the root cause appears to be a shared {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is designed as a use-once instance, and is not stateless. It has attributes such as {{HttpURLConnection}} and {{URL}}. When multiple threads are calling {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is going to have race condition, resulting in a failed image uploading.

Therefore for the first step, without breaking the current API, I propose we create a new {{KerberosAuthenticator}} instance for each connection, to make checkpointing work. We may consider making {{Authenticator}} design and implementation stateless afterwards, as {{ConnectionConfigurator}} does.
[jira] [Created] (HDFS-7799) Factor OSType out from Shell: change in HDFS
Yongjun Zhang created HDFS-7799:
-----------------------------------

Summary: Factor OSType out from Shell: change in HDFS
Key: HDFS-7799
URL: https://issues.apache.org/jira/browse/HDFS-7799
Project: Hadoop HDFS
Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
[jira] [Updated] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chengbing Liu updated HDFS-7798:
--------------------------------
Assignee: Chengbing Liu
Status: Patch Available  (was: Open)
[jira] [Commented] (HDFS-7537) fsck is confusing when dfs.namenode.replication.min > 1 && missing replicas && NN restart
[ https://issues.apache.org/jira/browse/HDFS-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322278#comment-14322278 ]

GAO Rui commented on HDFS-7537:
-------------------------------

I want to try to do this. Please assign it to me, thank you!

Summary: fsck is confusing when dfs.namenode.replication.min > 1 && missing replicas && NN restart
Key: HDFS-7537
URL: https://issues.apache.org/jira/browse/HDFS-7537
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Allen Wittenauer
Attachments: dfs-min-2-fsck.png, dfs-min-2.png

If minimum replication is set to 2 or higher and some of those replicas are missing and the namenode restarts, it isn't always obvious that the missing replicas are the reason why the namenode isn't leaving safemode. We should improve the output of fsck and the web UI to make it obvious that the missing blocks are from unmet minimum replication vs. completely missing.
[jira] [Commented] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322279#comment-14322279 ]

Yi Liu commented on HDFS-7798:
------------------------------

Yes, {{KerberosAuthenticator}} is not thread-safe; we need to create a new instance when opening a new connection. +1
[jira] [Commented] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321793#comment-14321793 ]

Chengbing Liu commented on HDFS-7798:
-------------------------------------

The checkpointing failure happens when image uploading and edit log fetching occur at the same time.
[jira] [Updated] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7798: Affects Version/s: 2.6.0
[jira] [Commented] (HDFS-7604) Track and display failed DataNode storage locations in NameNode.
[ https://issues.apache.org/jira/browse/HDFS-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1436#comment-1436 ] Jitendra Nath Pandey commented on HDFS-7604: [~cnauroth], thanks for the due diligence to ensure a clean merge to branch-2. +1 for the latest patch. Track and display failed DataNode storage locations in NameNode. Key: HDFS-7604 URL: https://issues.apache.org/jira/browse/HDFS-7604 Project: Hadoop HDFS Issue Type: Improvement Components: datanode, namenode Reporter: Chris Nauroth Assignee: Chris Nauroth Attachments: HDFS-7604-screenshot-1.png, HDFS-7604-screenshot-2.png, HDFS-7604-screenshot-3.png, HDFS-7604-screenshot-4.png, HDFS-7604-screenshot-5.png, HDFS-7604-screenshot-6.png, HDFS-7604-screenshot-7.png, HDFS-7604.001.patch, HDFS-7604.002.patch, HDFS-7604.004.patch, HDFS-7604.005.patch, HDFS-7604.006.patch, HDFS-7604.prototype.patch During heartbeats, the DataNode can report a list of its storage locations that have been taken out of service due to failure (such as due to a bad disk or a permissions problem). The NameNode can track these failed storage locations and then report them in JMX and the NameNode web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7633) BlockPoolSliceScanner fails when Datanode has too many blocks
[ https://issues.apache.org/jira/browse/HDFS-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walter Su resolved HDFS-7633. - Resolution: Done. HDFS-7430 refactored the BlockScanner, and the buggy code was deleted. BlockPoolSliceScanner fails when Datanode has too many blocks - Key: HDFS-7633 URL: https://issues.apache.org/jira/browse/HDFS-7633 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Walter Su Assignee: Walter Su Priority: Minor Attachments: HDFS-7633.patch Issue: when the total number of blocks on one of my DataNodes reaches 33554432, it refuses to accept more blocks. This is the error: 2015-01-16 15:21:44,571 | ERROR | DataXceiver for client at /172.1.1.8:50490 [Receiving block BP-1976278848-172.1.1.2-1419846518085:blk_1221043436_147936990] | datasight-198:25009:DataXceiver error processing WRITE_BLOCK operation src: /172.1.1.8:50490 dst: /172.1.1.11:25009 | org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:250) java.lang.IllegalArgumentException: n must be positive at java.util.Random.nextInt(Random.java:300) at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.getNewBlockScanTime(BlockPoolSliceScanner.java:263) at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.addBlock(BlockPoolSliceScanner.java:276) at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.addBlock(DataBlockScanner.java:193) at org.apache.hadoop.hdfs.server.datanode.DataNode.closeBlock(DataNode.java:1733) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:765) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) at java.lang.Thread.run(Thread.java:745) Analysis: in org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.getNewBlockScanTime(), when blockMap.size() is too big:
- Math.max(blockMap.size(), 1) * 600 is an int and overflows to a negative value;
- Math.max(blockMap.size(), 1) * 600 * 1000L is a long and negative;
- (int) period is Integer.MIN_VALUE;
- Math.abs((int) period) is Integer.MIN_VALUE, which is still negative;
- so DFSUtil.getRandom().nextInt(periodInt) throws IllegalArgumentException.
I use Java HotSpot (build 1.7.0_05-b05). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
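The overflow chain described in this analysis can be reproduced in isolation. A minimal sketch, using the block count from the report (this does not reproduce getNewBlockScanTime itself, only the arithmetic):

```java
import java.util.Random;

public class ScanPeriodOverflow {
    public static void main(String[] args) {
        int blockCount = 33554432;  // 2^25 blocks, as in the report

        // The int multiplication by 600 overflows before widening to long,
        // so the whole product comes out negative.
        long period = Math.max(blockCount, 1) * 600 * 1000L;
        System.out.println(period);                   // negative

        // Truncating to int lands exactly on Integer.MIN_VALUE here, and
        // Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE.
        int periodInt = Math.abs((int) period);
        System.out.println(periodInt == Integer.MIN_VALUE);  // true

        try {
            new Random().nextInt(periodInt);          // bound must be positive
        } catch (IllegalArgumentException e) {
            System.out.println("thrown: " + e.getMessage());
        }
    }
}
```

The root cause is thus two stacked pitfalls: int overflow in a mixed int/long expression, and the fact that `Math.abs` cannot return a positive value for `Integer.MIN_VALUE`.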
[jira] [Updated] (HDFS-7633) BlockPoolSliceScanner fails when Datanode has too many blocks
[ https://issues.apache.org/jira/browse/HDFS-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walter Su updated HDFS-7633: Status: Open (was: Patch Available). HDFS-7430 refactored the BlockScanner, and the buggy code was deleted.
[jira] [Commented] (HDFS-7430) Rewrite the BlockScanner to use O(1) memory and use multiple threads
[ https://issues.apache.org/jira/browse/HDFS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322284#comment-14322284 ] Walter Su commented on HDFS-7430: - The buggy code from HDFS-7633 is removed. Rewrite the BlockScanner to use O(1) memory and use multiple threads Key: HDFS-7430 URL: https://issues.apache.org/jira/browse/HDFS-7430 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.7.0 Attachments: HDFS-7430.002.patch, HDFS-7430.003.patch, HDFS-7430.004.patch, HDFS-7430.005.patch, HDFS-7430.006.patch, HDFS-7430.007.patch, HDFS-7430.008.patch, HDFS-7430.009.patch, HDFS-7430.010.patch, HDFS-7430.011.patch, HDFS-7430.012.patch, memory.png We should update the BlockScanner to use a constant amount of memory by keeping track of what block was scanned last, rather than by tracking the scan status of all blocks in memory. Also, instead of having just one thread, we should have a verification thread per hard disk (or other volume), scanning at a configurable rate of bytes per second. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
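The "track only the last-scanned block" idea can be sketched in a few lines. This is a toy illustration of the cursor approach, not the HDFS-7430 implementation; the class and method names are hypothetical:

```java
// O(1)-state scan cursor: remember only the last block ID scanned on this
// volume, instead of per-block scan status for every block.
class VolumeScanCursor {
    private long cursor = Long.MIN_VALUE;  // last block ID scanned

    /** Returns the next block ID to scan from a sorted (ascending) list,
     *  wrapping around to the start once a full pass completes. */
    long nextToScan(long[] sortedBlockIds) {
        for (long id : sortedBlockIds) {
            if (id > cursor) {
                cursor = id;
                return id;
            }
        }
        // Completed a pass: reset the cursor and begin a new pass.
        cursor = Long.MIN_VALUE;
        return sortedBlockIds.length > 0 ? nextToScan(sortedBlockIds) : -1;
    }
}
```

With one such cursor per volume, a per-volume thread can scan at a throttled byte rate while the NameNode-visible memory cost stays constant regardless of block count.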
[jira] [Commented] (HDFS-7633) BlockPoolSliceScanner fails when Datanode has too many blocks
[ https://issues.apache.org/jira/browse/HDFS-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322283#comment-14322283 ] Walter Su commented on HDFS-7633: - HDFS-7430 refactored the BlockScanner, so the buggy code from HDFS-7633 is deleted.
[jira] [Updated] (HDFS-7633) BlockPoolSliceScanner fails when Datanode has too many blocks
[ https://issues.apache.org/jira/browse/HDFS-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walter Su updated HDFS-7633: Fix Version/s: 2.6.1
[jira] [Assigned] (HDFS-7537) fsck is confusing when dfs.namenode.replication.min > 1 && missing replicas && NN restart
[ https://issues.apache.org/jira/browse/HDFS-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] GAO Rui reassigned HDFS-7537: - Assignee: GAO Rui fsck is confusing when dfs.namenode.replication.min > 1 && missing replicas && NN restart - Key: HDFS-7537 URL: https://issues.apache.org/jira/browse/HDFS-7537 Project: Hadoop HDFS Issue Type: Improvement Reporter: Allen Wittenauer Assignee: GAO Rui Attachments: dfs-min-2-fsck.png, dfs-min-2.png If minimum replication is set to 2 or higher and some of those replicas are missing and the namenode restarts, it isn't always obvious that the missing replicas are the reason why the namenode isn't leaving safemode. We should improve the output of fsck and the web UI to make it obvious that the missing blocks are from unmet replicas vs. completely/totally missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7648) Verify the datanode directory layout
[ https://issues.apache.org/jira/browse/HDFS-7648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh R updated HDFS-7648: --- Attachment: HDFS-7648-4.patch Verify the datanode directory layout Key: HDFS-7648 URL: https://issues.apache.org/jira/browse/HDFS-7648 Project: Hadoop HDFS Issue Type: Bug Components: datanode Reporter: Tsz Wo Nicholas Sze Assignee: Rakesh R Attachments: HDFS-7648-3.patch, HDFS-7648-4.patch, HDFS-7648.patch, HDFS-7648.patch HDFS-6482 changed datanode layout to use block ID to determine the directory to store the block. We should have some mechanism to verify it. Either DirectoryScanner or block report generation could do the check. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7764) DirectoryScanner should cancel the future tasks when #compileReport throws exception
[ https://issues.apache.org/jira/browse/HDFS-7764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh R updated HDFS-7764: --- Issue Type: Improvement (was: Bug) DirectoryScanner should cancel the future tasks when #compileReport throws exception Key: HDFS-7764 URL: https://issues.apache.org/jira/browse/HDFS-7764 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.7.0 Reporter: Rakesh R Assignee: Rakesh R Attachments: HDFS-7764.patch If there is an exception while preparing the ScanInfo for the blocks in the directory, DirectoryScanner immediately throws the exception and exits the current scan cycle. It would be good if it could signal #cancel() to the other pending tasks. DirectoryScanner.java {code} for (Entry<Integer, Future<ScanInfoPerBlockPool>> report : compilersInProgress.entrySet()) { try { dirReports[report.getKey()] = report.getValue().get(); } catch (Exception ex) { LOG.error("Error compiling report", ex); // Propagate ex to DataBlockScanner to deal with throw new RuntimeException(ex); } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
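The suggested change can be sketched as follows. This is a simplified stand-in, not the DirectoryScanner code: {{ScanInfoPerBlockPool}} is replaced by {{String}} and the collector name is hypothetical, but the cancel-on-failure shape is the one proposed above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Future;

public class ReportCollector {
    /** Gathers the result of every in-progress report task; if any task
     *  fails, cancels the remaining pending tasks before propagating. */
    static List<String> collect(Map<Integer, Future<String>> inProgress) {
        List<String> reports = new ArrayList<>();
        try {
            for (Map.Entry<Integer, Future<String>> e : inProgress.entrySet()) {
                reports.add(e.getValue().get());
            }
        } catch (Exception ex) {
            // Proposed addition: signal cancel() to the other tasks instead
            // of leaving them queued or running for a doomed scan cycle.
            for (Future<String> f : inProgress.values()) {
                f.cancel(true);  // interrupt if already running
            }
            throw new RuntimeException(ex);
        }
        return reports;
    }
}
```

{{Future#cancel}} on a task that has already completed simply returns false, so cancelling the whole collection after a failure is safe; only the still-pending or running tasks are affected.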
[jira] [Assigned] (HDFS-7765) FSOutputSummer throwing ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/HDFS-7765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] surendra singh lilhore reassigned HDFS-7765: Assignee: surendra singh lilhore FSOutputSummer throwing ArrayIndexOutOfBoundsException -- Key: HDFS-7765 URL: https://issues.apache.org/jira/browse/HDFS-7765 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 2.6.0 Environment: Centos 6, Open JDK 7, Amazon EC2, Accumulo 1.6.2RC4 Reporter: Keith Turner Assignee: surendra singh lilhore While running an Accumulo test, we saw exceptions like the following while trying to write to the write-ahead log in HDFS. The exception occurs at [FSOutputSummer.java:76|https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSOutputSummer.java#L76], which attempts to update a byte array. {noformat} 2015-02-06 19:46:49,769 [log.DfsLogger] WARN : Exception syncing java.lang.reflect.InvocationTargetException java.lang.ArrayIndexOutOfBoundsException: 4608 at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:76) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:50) at java.io.DataOutputStream.write(DataOutputStream.java:88) at java.io.DataOutputStream.writeByte(DataOutputStream.java:153) at org.apache.accumulo.tserver.logger.LogFileKey.write(LogFileKey.java:87) at org.apache.accumulo.tserver.log.DfsLogger.write(DfsLogger.java:526) at org.apache.accumulo.tserver.log.DfsLogger.logFileData(DfsLogger.java:540) at org.apache.accumulo.tserver.log.DfsLogger.logManyTablets(DfsLogger.java:573) at org.apache.accumulo.tserver.log.TabletServerLogger$6.write(TabletServerLogger.java:373) at org.apache.accumulo.tserver.log.TabletServerLogger.write(TabletServerLogger.java:274) at org.apache.accumulo.tserver.log.TabletServerLogger.logManyTablets(TabletServerLogger.java:365) at
org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.flush(TabletServer.java:1667) at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.closeUpdate(TabletServer.java:1754) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.accumulo.trace.instrument.thrift.RpcServerInvocationHandler.invoke(RpcServerInvocationHandler.java:46) at org.apache.accumulo.server.util.RpcWrapper$1.invoke(RpcWrapper.java:47) at com.sun.proxy.$Proxy22.closeUpdate(Unknown Source) at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$closeUpdate.getResult(TabletClientService.java:2370) at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$closeUpdate.getResult(TabletClientService.java:2354) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:168) at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:516) at org.apache.accumulo.server.util.CustomNonBlockingServer$1.run(CustomNonBlockingServer.java:77) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47) at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34) at java.lang.Thread.run(Thread.java:744) 2015-02-06 19:46:49,769 [log.DfsLogger] WARN : Exception syncing java.lang.reflect.InvocationTargetException 2015-02-06 19:46:49,772 [log.DfsLogger] ERROR: java.lang.ArrayIndexOutOfBoundsException: 4609 java.lang.ArrayIndexOutOfBoundsException: 4609 at 
org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:76) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:50) at java.io.DataOutputStream.write(DataOutputStream.java:88) at java.io.DataOutputStream.writeByte(DataOutputStream.java:153) at org.apache.accumulo.tserver.logger.LogFileKey.write(LogFileKey.java:87) at org.apache.accumulo.tserver.log.DfsLogger.write(DfsLogger.java:526) at
[jira] [Commented] (HDFS-7797) Add audit log for setQuota operation
[ https://issues.apache.org/jira/browse/HDFS-7797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322339#comment-14322339 ] Rakesh R commented on HDFS-7797: The test failure is not related to the patch. Add audit log for setQuota operation Key: HDFS-7797 URL: https://issues.apache.org/jira/browse/HDFS-7797 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.6.0 Reporter: Rakesh R Assignee: Rakesh R Attachments: 001-HDFS-7797.patch SetQuota operation should be included in audit log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7801) IOException:NameNode still not started cause DFSClient operation failure without retry.
zhihai xu created HDFS-7801: --- Summary: IOException:NameNode still not started cause DFSClient operation failure without retry. Key: HDFS-7801 URL: https://issues.apache.org/jira/browse/HDFS-7801 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client, namenode Reporter: zhihai xu IOException:NameNode still not started cause DFSClient operation failure without retry. In YARN-1778, TestFSRMStateStore failed randomly, it is due to the java.io.IOException: NameNode still not started. The stack trace for this Exception is the following: {code} 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2128) at org.apache.hadoop.ipc.Client.call(Client.java:1474) at org.apache.hadoop.ipc.Client.call(Client.java:1405) at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy23.mkdirs(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:557) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy24.mkdirs(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2991) at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2961) at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:973) at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:969) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:969) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:962) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1869) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.storeApplicationStateInternal(FileSystemRMStateStore.java:364) at org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore$2.run(TestFSRMStateStore.java:273) 2015-02-03 00:09:19,089 INFO [IPC Server handler 0 on 57792] ipc.Server (Server.java:run(2155)) - IPC Server handler 0 on 57792, call org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 127.0.0.1:57805 Call#14 Retry#1 java.io.IOException: NameNode still not started at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at
[jira] [Commented] (HDFS-4266) BKJM: Separate write and ack quorum
[ https://issues.apache.org/jira/browse/HDFS-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322400#comment-14322400 ] Rakesh R commented on HDFS-4266: It seems the test case failure is not related to this patch. [~umamaheswararao], could you please have a look at the latest patch? Thanks! BKJM: Separate write and ack quorum --- Key: HDFS-4266 URL: https://issues.apache.org/jira/browse/HDFS-4266 Project: Hadoop HDFS Issue Type: Sub-task Components: ha Reporter: Ivan Kelly Assignee: Rakesh R Attachments: 001-HDFS-4266.patch, 002-HDFS-4266.patch, 003-HDFS-4266.patch, 004-HDFS-4266.patch, 005-HDFS-4266.patch BOOKKEEPER-208 allows the ack and write quorums to be different sizes, so that writes are unaffected by any bookie failure. BKJM should be able to take advantage of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6662) [ UI ] Not able to open file from UI if file path contains %
[ https://issues.apache.org/jira/browse/HDFS-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322415#comment-14322415 ] Brahma Reddy Battula commented on HDFS-6662: Hi [~gerson23], can you please address [~ajisakaa]'s comments? [ UI ] Not able to open file from UI if file path contains % -- Key: HDFS-6662 URL: https://issues.apache.org/jira/browse/HDFS-6662 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.1 Reporter: Brahma Reddy Battula Assignee: Gerson Carlos Priority: Critical Attachments: hdfs-6662.001.patch, hdfs-6662.002.patch, hdfs-6662.patch 1. Write a file into HDFS in such a way that the file name is like 1%2%3%4. 2. Browsing the file using the NameNode UI throws the following exception: Path does not exist on HDFS or WebHDFS is disabled. Please check your path or enable WebHDFS. HBase writes its WAL file data in HDFS using % in file names, e.g. /hbase/WALs/HOST-,60020,1404731504691/HOST-***-130%2C60020%2C1404731504691.1404812663950.meta; the above file cannot be opened in the UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7764) DirectoryScanner should cancel the future tasks when #compileReport throws exception
[ https://issues.apache.org/jira/browse/HDFS-7764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322442#comment-14322442 ] Rakesh R commented on HDFS-7764: Thanks [~cmccabe] for the interest. bq. Can you explain why we would want to cancel the other pending tasks? I was reviewing the code and saw that {{public ScanInfoPerBlockPool call() throws Exception}} can throw an exception. My thought was: while iterating, if any of the {{compilersInProgress}} tasks throws an exception, we propagate that exception and skip the current cycle. So as a good coding practice, it would be good to cancel all the given future tasks before exiting. Does this make sense to you? bq. Did you see this code throw an exception in production? Truly not.