[jira] [Created] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
Chengbing Liu created HDFS-7798:
-----------------------------------

Summary: Checkpointing failure caused by shared KerberosAuthenticator
Key: HDFS-7798
URL: https://issues.apache.org/jira/browse/HDFS-7798
Project: Hadoop HDFS
Issue Type: Bug
Components: security
Reporter: Chengbing Liu
Priority: Critical

We have observed occasional checkpointing failures in our real cluster: the standby NameNode was not able to upload its image to the active NameNode.

After some digging, the root cause appears to be the shared {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is designed as a use-once instance and is not stateless: it has attributes such as {{HttpURLConnection}} and {{URL}}. When multiple threads call {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is subject to a race condition, resulting in failed image uploads.

Therefore, as a first step and without breaking the current API, I propose we create a new {{KerberosAuthenticator}} instance for each connection, to make checkpointing work. We may consider making the {{Authenticator}} design and implementation stateless afterwards, as {{ConnectionConfigurator}} is.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
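A minimal sketch of the per-connection approach proposed above, using stand-in classes (the names below are illustrative, not the actual Hadoop source): the factory takes a supplier and builds a fresh use-once authenticator for every {{openConnection(...)}} call, so no mutable state is shared between threads.

```java
import java.util.function.Supplier;

// Stand-in for a use-once, stateful authenticator such as KerberosAuthenticator:
// it remembers the URL (and, in Hadoop, the HttpURLConnection) it is working on.
class OneShotAuthenticator {
    private String url;
    void authenticate(String url) { this.url = url; }
    String authenticatedUrl() { return url; }
}

// Stand-in for URLConnectionFactory: instead of one shared authenticator
// field, it asks the supplier for a fresh instance on every call.
class ConnectionFactory {
    private final Supplier<OneShotAuthenticator> authenticators;

    ConnectionFactory(Supplier<OneShotAuthenticator> authenticators) {
        this.authenticators = authenticators;
    }

    OneShotAuthenticator openConnection(String url) {
        OneShotAuthenticator auth = authenticators.get(); // fresh, unshared instance
        auth.authenticate(url);
        return auth;
    }
}
```

Because each call gets its own instance, a concurrent image upload and edit-log fetch can no longer clobber each other's connection state.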
[jira] [Created] (HDFS-7800) Improve documentation for FileSystem.concat()
Steve Armstrong created HDFS-7800:
-------------------------------------

Summary: Improve documentation for FileSystem.concat()
Key: HDFS-7800
URL: https://issues.apache.org/jira/browse/HDFS-7800
Project: Hadoop HDFS
Issue Type: Task
Affects Versions: 2.6.0, 2.2.0
Reporter: Steve Armstrong

This is a documentation request. [FileSystem.concat()|https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html] says it will "Concat existing files together." It seems to be a NameNode-only operation though, mapping the data blocks into a single file. This means:

# The destination must exist
# The destination must be non-empty
# The destination must have its last block exactly full
# All but the last of the source files must have their last block full
# All the source files will be deleted by this operation

HDFS-6641 brought up some of these limitations, but was closed as not a problem. I think the javadoc should be improved so it's clear this function was never intended to work like a general-purpose file concatenation.
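The block-boundary constraints listed above can be made concrete with a small pre-flight check. This helper is hypothetical (not part of the Hadoop API), and assumes a single fixed block size and that file lengths alone determine block fullness:

```java
// Hypothetical validation mirroring the constraints above: the target must be
// non-empty and end exactly on a block boundary, and every source except the
// last must also end on a block boundary.
final class ConcatPreflight {
    static void check(long blockSize, long targetLen, long[] sourceLens) {
        if (targetLen == 0) {
            throw new IllegalArgumentException("target must be non-empty");
        }
        if (targetLen % blockSize != 0) {
            throw new IllegalArgumentException("target's last block must be exactly full");
        }
        // only the last source file may have a partially filled last block
        for (int i = 0; i < sourceLens.length - 1; i++) {
            if (sourceLens[i] % blockSize != 0) {
                throw new IllegalArgumentException(
                    "source " + i + " must have its last block full");
            }
        }
    }
}
```

A check like this makes it obvious why concat can be a metadata-only operation: no block ever needs to be rewritten, only relinked under the target file.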
[jira] [Commented] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321801#comment-14321801 ]

Hadoop QA commented on HDFS-7798:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12698944/HDFS-7798.01.patch
against trunk revision 3338f6d.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9587//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9587//console

This message is automatically generated.
[jira] [Resolved] (HDFS-6081) TestRetryCacheWithHA#testCreateSymlink occasionally fails in trunk
[ https://issues.apache.org/jira/browse/HDFS-6081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved HDFS-6081.
--------------------------
Resolution: Cannot Reproduce

Summary: TestRetryCacheWithHA#testCreateSymlink occasionally fails in trunk
Key: HDFS-6081
URL: https://issues.apache.org/jira/browse/HDFS-6081
Project: Hadoop HDFS
Issue Type: Test
Reporter: Ted Yu

From https://builds.apache.org/job/Hadoop-Hdfs-trunk/1696/testReport/junit/org.apache.hadoop.hdfs.server.namenode.ha/TestRetryCacheWithHA/testCreateSymlink/ :
{code}
2014-03-09 13:18:47,515 WARN security.UserGroupInformation (UserGroupInformation.java:doAs(1600)) - PriviledgedActionException as:jenkins (auth:SIMPLE) cause:java.io.IOException: failed to create link /testlink either because the filename is invalid or the file exists
2014-03-09 13:18:47,515 INFO ipc.Server (Server.java:run(2093)) - IPC Server handler 0 on 39303, call org.apache.hadoop.hdfs.protocol.ClientProtocol.createSymlink from 127.0.0.1:32909 Call#682 Retry#1: error: java.io.IOException: failed to create link /testlink either because the filename is invalid or the file exists
java.io.IOException: failed to create link /testlink either because the filename is invalid or the file exists
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.createSymlinkInt(FSNamesystem.java:2053)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.createSymlink(FSNamesystem.java:2023)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.createSymlink(NameNodeRpcServer.java:965)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.createSymlink(ClientNamenodeProtocolServerSideTranslatorPB.java:844)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:932)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2071)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2067)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1597)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2065)
2014-03-09 13:18:47,522 INFO blockmanagement.BlockManager (BlockManager.java:processMisReplicatesAsync(2475)) - Total number of blocks = 1
2014-03-09 13:18:47,523 INFO blockmanagement.BlockManager (BlockManager.java:processMisReplicatesAsync(2476)) - Number of invalid blocks = 0
2014-03-09 13:18:47,523 INFO blockmanagement.BlockManager (BlockManager.java:processMisReplicatesAsync(2477)) - Number of under-replicated blocks = 0
2014-03-09 13:18:47,523 INFO ha.TestRetryCacheWithHA (TestRetryCacheWithHA.java:run(1162)) - Got Exception while calling createSymlink
org.apache.hadoop.ipc.RemoteException(java.io.IOException): failed to create link /testlink either because the filename is invalid or the file exists
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.createSymlinkInt(FSNamesystem.java:2053)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.createSymlink(FSNamesystem.java:2023)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.createSymlink(NameNodeRpcServer.java:965)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.createSymlink(ClientNamenodeProtocolServerSideTranslatorPB.java:844)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:932)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2071)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2067)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1597)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2065)
	at org.apache.hadoop.ipc.Client.call(Client.java:1409)
	at org.apache.hadoop.ipc.Client.call(Client.java:1362)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at $Proxy17.createSymlink(Unknown Source)
	at
{code}
[jira] [Resolved] (HDFS-6177) TestHttpFSServer fails occasionally in trunk
[ https://issues.apache.org/jira/browse/HDFS-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved HDFS-6177.
--------------------------
Resolution: Cannot Reproduce

Summary: TestHttpFSServer fails occasionally in trunk
Key: HDFS-6177
URL: https://issues.apache.org/jira/browse/HDFS-6177
Project: Hadoop HDFS
Issue Type: Test
Reporter: Ted Yu
Priority: Minor

From https://builds.apache.org/job/Hadoop-hdfs-trunk/1716/consoleFull :
{code}
Running org.apache.hadoop.fs.http.server.TestHttpFSServer
Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.424 sec <<< FAILURE! - in org.apache.hadoop.fs.http.server.TestHttpFSServer
testDelegationTokenOperations(org.apache.hadoop.fs.http.server.TestHttpFSServer)  Time elapsed: 0.559 sec  <<< FAILURE!
java.lang.AssertionError: expected:<401> but was:<403>
	at org.junit.Assert.fail(Assert.java:93)
	at org.junit.Assert.failNotEquals(Assert.java:647)
	at org.junit.Assert.assertEquals(Assert.java:128)
	at org.junit.Assert.assertEquals(Assert.java:472)
	at org.junit.Assert.assertEquals(Assert.java:456)
	at org.apache.hadoop.fs.http.server.TestHttpFSServer.testDelegationTokenOperations(TestHttpFSServer.java:352)
{code}
[jira] [Commented] (HDFS-7787) Split QUEUE_HIGHEST_PRIORITY in UnderReplicatedBlocks to give more priority to blocks on nodes being decommissioned
[ https://issues.apache.org/jira/browse/HDFS-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321968#comment-14321968 ]

Frode Halvorsen commented on HDFS-7787:
---------------------------------------

Sorry, my grep was wrong and included a lot of replications from earlier times, but it was still on the same decom node. The correct stats for the 10 minutes between 13:00 and 13:10 today are: a total of 3161 started threads. None of those were for blocks with two live replicas, but 2430 were for blocks with one live replica and only 731 were for blocks without live replicas. That means that only 1/4 of the blocks replicated were of the 'highest priority'. And of course this made my day worse; I now have to wait one month before I can take down the node...

Summary: Split QUEUE_HIGHEST_PRIORITY in UnderReplicatedBlocks to give more priority to blocks on nodes being decommissioned
Key: HDFS-7787
URL: https://issues.apache.org/jira/browse/HDFS-7787
Project: Hadoop HDFS
Issue Type: Improvement
Components: datanode
Affects Versions: 2.6.0
Environment: 2 NameNodes (HA), 6 DataNodes in two racks
Reporter: Frode Halvorsen
Labels: balance, hdfs, replication-performance

Each file has a setting of 3 replicas, split on different racks. After a simulated crash of one rack (shutdown of all nodes, deleted data directory, and restarted nodes) and decommission of one of the nodes in the other rack, the replication does not follow 'normal' rules...

My cluster has approx. 25 million files, and the one node I am now trying to decommission has 9 million under-replicated blocks, and 3.5 million blocks with 'no live replicas'. After a restart of the node, it starts to replicate both types of blocks, but after a while it only replicates under-replicated blocks that have other live copies. I would think that the 'normal' way to do this would be to make sure that all blocks this node keeps the only copy of are the first to be replicated/balanced?
[jira] [Commented] (HDFS-7787) Split QUEUE_HIGHEST_PRIORITY in UnderReplicatedBlocks to give more priority to blocks on nodes being decommissioned
[ https://issues.apache.org/jira/browse/HDFS-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321963#comment-14321963 ]

Frode Halvorsen commented on HDFS-7787:
---------------------------------------

I just did a log analysis of the decommissioning node, and looked at what it actually started to replicate during a ten-minute period. I filtered on the log lines for 'Starting thread to transfer' and counted lines, divided into replication to one, two or three nodes (blocks with 2, 1 and 0 live replicas). It started 5036 threads during the 10 minutes I looked at:

53 blocks to one node (2 live replicas in the cluster)
3127 blocks to two nodes (blocks with one live replica)
1856 blocks to three nodes (blocks with no live replicas)

Of course this is a problem for me, as I won't be able to kill the node completely before all blocks with no live replicas have been transferred. There are still 3.3 million of them, and at this rate I won't be able to kill the node for another week and a half :(
[jira] [Resolved] (HDFS-6501) TestCrcCorruption#testCorruptionDuringWrt sometimes fails in trunk
[ https://issues.apache.org/jira/browse/HDFS-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved HDFS-6501.
--------------------------
Resolution: Cannot Reproduce

Summary: TestCrcCorruption#testCorruptionDuringWrt sometimes fails in trunk
Key: HDFS-6501
URL: https://issues.apache.org/jira/browse/HDFS-6501
Project: Hadoop HDFS
Issue Type: Test
Reporter: Ted Yu
Priority: Minor

From https://builds.apache.org/job/Hadoop-Hdfs-trunk/1767/ :
{code}
REGRESSION: org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt

Error Message:
test timed out after 5 milliseconds

Stack Trace:
java.lang.Exception: test timed out after 5 milliseconds
	at java.lang.Object.wait(Native Method)
	at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2024)
	at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2008)
	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2107)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:98)
	at org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:133)
{code}
[jira] [Resolved] (HDFS-6726) TestNamenodeCapacityReport fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved HDFS-6726.
--------------------------
Resolution: Cannot Reproduce

Summary: TestNamenodeCapacityReport fails intermittently
Key: HDFS-6726
URL: https://issues.apache.org/jira/browse/HDFS-6726
Project: Hadoop HDFS
Issue Type: Test
Reporter: Ted Yu
Priority: Minor

From https://builds.apache.org/job/Hadoop-Hdfs-trunk/1812/testReport/junit/org.apache.hadoop.hdfs.server.namenode/TestNamenodeCapacityReport/testXceiverCount/ :
{code}
java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
	at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2141)
	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2109)
	at org.apache.hadoop.hdfs.server.namenode.TestNamenodeCapacityReport.testXceiverCount(TestNamenodeCapacityReport.java:281)
{code}
There were multiple occurrences of 'Broken pipe', 'Connection reset by peer' and 'Premature EOF from inputStream' exceptions in the test output.
[jira] [Commented] (HDFS-7787) Split QUEUE_HIGHEST_PRIORITY in UnderReplicatedBlocks to give more priority to blocks on nodes being decommissioned
[ https://issues.apache.org/jira/browse/HDFS-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322120#comment-14322120 ]

Frode Halvorsen commented on HDFS-7787:
---------------------------------------

And later, it seems to get better :) Now (after another parameter tuning for faster replication) it replicates 43,000 blocks/hour, and every block is one that has zero live replicas in the cluster :) It actually seems that the name-node needs time to calculate which blocks have higher priority. Now I only need three more days before I can take down the data-node :) As it turns out, it might just be the parameters that made me believe it had a bad prioritizing algorithm :) Too bad a lot of the parameters I have now changed are undocumented, but 'revealed' in different forum postings...

A quick look at the logs on the active name-node reveals that it actually only asks the decommissioning node to replicate. No other nodes are contacted, thus it now only replicates blocks with no live replicas. It might be my parameter settings, but it could actually have asked any of the other 5 datanodes to replicate the blocks with one live replica... I'll try to add even more replication requests per heartbeat to see if it is able to make the other datanodes do any work as well.
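The priority split the summary asks for can be sketched with stand-in types (this is illustrative, not Hadoop's actual UnderReplicatedBlocks implementation): ordering the replication work queue by live-replica count makes blocks with zero live replicas drain strictly before blocks that still have a copy elsewhere.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Stand-in for a block pending replication, tagged with how many live
// replicas of it remain in the cluster.
class PendingBlock {
    final String blockId;
    final int liveReplicas;
    PendingBlock(String blockId, int liveReplicas) {
        this.blockId = blockId;
        this.liveReplicas = liveReplicas;
    }
}

// Work queue that always hands out the block with the fewest live replicas
// first, so "last copy lives on a decommissioning node" beats blocks that
// are merely under-replicated.
class ReplicationQueue {
    private final PriorityQueue<PendingBlock> queue =
        new PriorityQueue<>(Comparator.comparingInt((PendingBlock b) -> b.liveReplicas));
    void add(PendingBlock b) { queue.add(b); }
    PendingBlock next() { return queue.poll(); }
}
```

With a single undifferentiated highest-priority queue, a zero-replica block can sit behind thousands of one-replica blocks; the ordering above is the behavior the reporter expected.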
[jira] [Updated] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chengbing Liu updated HDFS-7798:
--------------------------------
Attachment: HDFS-7798.01.patch
[jira] [Updated] (HDFS-7800) Improve documentation for FileSystem.concat()
[ https://issues.apache.org/jira/browse/HDFS-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Armstrong updated HDFS-7800:
----------------------------------
Attachment: HDFS-7800-1.patch
[jira] [Updated] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chengbing Liu updated HDFS-7798:
--------------------------------
Description:
We have observed in our real cluster occasional checkpointing failure. The standby NameNode was not able to upload image to the active NameNode.

After some digging, the root cause appears to be a shared {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is designed as a use-once instance, and is not stateless. It has attributes such as {{HttpURLConnection}} and {{URL}}. When multiple threads are calling {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is going to have race condition, resulting in a failed image uploading.

Therefore for the first step, without breaking the current API, I propose we create a new {{KerberosAuthenticator}} instance for each connection, to make checkpointing work. We may consider making {{Authenticator}} design and implementation stateless afterwards, as {{ConnectionConfigurator}} does.

was:
We have observed in our real cluster occasionally checkpointing failure. The standby NameNode was not able to upload image to the active NameNode.

After some digging, the root cause appears to be a shared {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is designed as a use-once instance, and is not stateless. It has attributes such as {{HttpURLConnection}} and {{URL}}. When multiple threads are calling {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is going to have race condition, resulting in a failed image uploading.

Therefore for the first step, without breaking the current API, I propose we create a new {{KerberosAuthenticator}} instance for each connection, to make checkpointing work. We may consider making {{Authenticator}} design and implementation stateless afterwards, as {{ConnectionConfigurator}} does.
[jira] [Created] (HDFS-7799) Factor OSType out from Shell: change in HDFS
Yongjun Zhang created HDFS-7799:
-----------------------------------

Summary: Factor OSType out from Shell: change in HDFS
Key: HDFS-7799
URL: https://issues.apache.org/jira/browse/HDFS-7799
Project: Hadoop HDFS
Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
[jira] [Updated] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chengbing Liu updated HDFS-7798:
--------------------------------
Assignee: Chengbing Liu
Status: Patch Available  (was: Open)
[jira] [Commented] (HDFS-7537) fsck is confusing when dfs.namenode.replication.min > 1 && missing replicas && NN restart
[ https://issues.apache.org/jira/browse/HDFS-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322278#comment-14322278 ]

GAO Rui commented on HDFS-7537:
-------------------------------

I want to try to do this. Please assign it to me, thank you!

Summary: fsck is confusing when dfs.namenode.replication.min > 1 && missing replicas && NN restart
Key: HDFS-7537
URL: https://issues.apache.org/jira/browse/HDFS-7537
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Allen Wittenauer
Attachments: dfs-min-2-fsck.png, dfs-min-2.png

If minimum replication is set to 2 or higher and some of those replicas are missing and the namenode restarts, it isn't always obvious that the missing replicas are the reason why the namenode isn't leaving safemode. We should improve the output of fsck and the web UI to make it obvious that the missing blocks are from unmet minimum replication vs. completely missing.
[jira] [Commented] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322279#comment-14322279 ]

Yi Liu commented on HDFS-7798:
------------------------------

Yes, {{KerberosAuthenticator}} is not thread-safe; we need to create a new instance when opening a new connection. +1
[jira] [Commented] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321793#comment-14321793 ]

Chengbing Liu commented on HDFS-7798:
-------------------------------------

The checkpointing failure happens when image uploading and edit log fetching occur at the same time.
[jira] [Updated] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7798: Affects Version/s: 2.6.0
[jira] [Commented] (HDFS-7604) Track and display failed DataNode storage locations in NameNode.
[ https://issues.apache.org/jira/browse/HDFS-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1436#comment-1436 ] Jitendra Nath Pandey commented on HDFS-7604: [~cnauroth], thanks for the due diligence to ensure a clean merge to branch-2. +1 for the latest patch. Track and display failed DataNode storage locations in NameNode. Key: HDFS-7604 URL: https://issues.apache.org/jira/browse/HDFS-7604 Project: Hadoop HDFS Issue Type: Improvement Components: datanode, namenode Reporter: Chris Nauroth Assignee: Chris Nauroth Attachments: HDFS-7604-screenshot-1.png, HDFS-7604-screenshot-2.png, HDFS-7604-screenshot-3.png, HDFS-7604-screenshot-4.png, HDFS-7604-screenshot-5.png, HDFS-7604-screenshot-6.png, HDFS-7604-screenshot-7.png, HDFS-7604.001.patch, HDFS-7604.002.patch, HDFS-7604.004.patch, HDFS-7604.005.patch, HDFS-7604.006.patch, HDFS-7604.prototype.patch During heartbeats, the DataNode can report a list of its storage locations that have been taken out of service due to failure (such as due to a bad disk or a permissions problem). The NameNode can track these failed storage locations and then report them in JMX and the NameNode web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7633) BlockPoolSliceScanner fails when Datanode has too many blocks
[ https://issues.apache.org/jira/browse/HDFS-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walter Su resolved HDFS-7633. - Resolution: Done. HDFS-7430 refactored the BlockScanner, and the buggy code was deleted. BlockPoolSliceScanner fails when Datanode has too many blocks - Key: HDFS-7633 URL: https://issues.apache.org/jira/browse/HDFS-7633 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Walter Su Assignee: Walter Su Priority: Minor Attachments: HDFS-7633.patch Issue: when the total number of blocks on one of my DataNodes reaches 33554432, it refuses to accept more blocks. This is the error: 2015-01-16 15:21:44,571 | ERROR | DataXceiver for client at /172.1.1.8:50490 [Receiving block BP-1976278848-172.1.1.2-1419846518085:blk_1221043436_147936990] | datasight-198:25009:DataXceiver error processing WRITE_BLOCK operation src: /172.1.1.8:50490 dst: /172.1.1.11:25009 | org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:250) java.lang.IllegalArgumentException: n must be positive at java.util.Random.nextInt(Random.java:300) at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.getNewBlockScanTime(BlockPoolSliceScanner.java:263) at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.addBlock(BlockPoolSliceScanner.java:276) at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.addBlock(DataBlockScanner.java:193) at org.apache.hadoop.hdfs.server.datanode.DataNode.closeBlock(DataNode.java:1733) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:765) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) at java.lang.Thread.run(Thread.java:745) Analysis: in org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.getNewBlockScanTime(), when blockMap.size() is too big:
- Math.max(blockMap.size(), 1) * 600 is an int and overflows to a negative value;
- Math.max(blockMap.size(), 1) * 600 * 1000L is a long and negative;
- (int) period is Integer.MIN_VALUE;
- Math.abs((int) period) is Integer.MIN_VALUE, which is still negative;
- so DFSUtil.getRandom().nextInt(periodInt) throws IllegalArgumentException.
I use Java HotSpot (build 1.7.0_05-b05). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
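The overflow chain described in this analysis can be reproduced in isolation. A minimal sketch, using the block count from the report (this does not reproduce getNewBlockScanTime itself, only the arithmetic):

```java
import java.util.Random;

public class ScanPeriodOverflow {
    public static void main(String[] args) {
        int blockCount = 33554432;  // 2^25 blocks, as in the report

        // The int multiplication by 600 overflows before widening to long,
        // so the whole product comes out negative.
        long period = Math.max(blockCount, 1) * 600 * 1000L;
        System.out.println(period);                   // negative

        // Truncating to int lands exactly on Integer.MIN_VALUE here, and
        // Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE.
        int periodInt = Math.abs((int) period);
        System.out.println(periodInt == Integer.MIN_VALUE);  // true

        try {
            new Random().nextInt(periodInt);          // bound must be positive
        } catch (IllegalArgumentException e) {
            System.out.println("thrown: " + e.getMessage());
        }
    }
}
```

The root cause is thus two stacked pitfalls: int overflow in a mixed int/long expression, and the fact that `Math.abs` cannot return a positive value for `Integer.MIN_VALUE`.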
[jira] [Updated] (HDFS-7633) BlockPoolSliceScanner fails when Datanode has too many blocks
[ https://issues.apache.org/jira/browse/HDFS-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walter Su updated HDFS-7633: Status: Open (was: Patch Available). HDFS-7430 refactored the BlockScanner, and the buggy code was deleted.
[jira] [Commented] (HDFS-7430) Rewrite the BlockScanner to use O(1) memory and use multiple threads
[ https://issues.apache.org/jira/browse/HDFS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322284#comment-14322284 ] Walter Su commented on HDFS-7430: - The buggy code from HDFS-7633 is removed. Rewrite the BlockScanner to use O(1) memory and use multiple threads Key: HDFS-7430 URL: https://issues.apache.org/jira/browse/HDFS-7430 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.7.0 Attachments: HDFS-7430.002.patch, HDFS-7430.003.patch, HDFS-7430.004.patch, HDFS-7430.005.patch, HDFS-7430.006.patch, HDFS-7430.007.patch, HDFS-7430.008.patch, HDFS-7430.009.patch, HDFS-7430.010.patch, HDFS-7430.011.patch, HDFS-7430.012.patch, memory.png We should update the BlockScanner to use a constant amount of memory by keeping track of what block was scanned last, rather than by tracking the scan status of all blocks in memory. Also, instead of having just one thread, we should have a verification thread per hard disk (or other volume), scanning at a configurable rate of bytes per second. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
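The "track only the last-scanned block" idea can be sketched in a few lines. This is a toy illustration of the cursor approach, not the HDFS-7430 implementation; the class and method names are hypothetical:

```java
// O(1)-state scan cursor: remember only the last block ID scanned on this
// volume, instead of per-block scan status for every block.
class VolumeScanCursor {
    private long cursor = Long.MIN_VALUE;  // last block ID scanned

    /** Returns the next block ID to scan from a sorted (ascending) list,
     *  wrapping around to the start once a full pass completes. */
    long nextToScan(long[] sortedBlockIds) {
        for (long id : sortedBlockIds) {
            if (id > cursor) {
                cursor = id;
                return id;
            }
        }
        // Completed a pass: reset the cursor and begin a new pass.
        cursor = Long.MIN_VALUE;
        return sortedBlockIds.length > 0 ? nextToScan(sortedBlockIds) : -1;
    }
}
```

With one such cursor per volume, a per-volume thread can scan at a throttled byte rate while the NameNode-visible memory cost stays constant regardless of block count.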
[jira] [Commented] (HDFS-7633) BlockPoolSliceScanner fails when Datanode has too many blocks
[ https://issues.apache.org/jira/browse/HDFS-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322283#comment-14322283 ] Walter Su commented on HDFS-7633: - HDFS-7430 refactored the BlockScanner, so the buggy code from HDFS-7633 is deleted.
[jira] [Updated] (HDFS-7633) BlockPoolSliceScanner fails when Datanode has too many blocks
[ https://issues.apache.org/jira/browse/HDFS-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walter Su updated HDFS-7633: Fix Version/s: 2.6.1
[jira] [Assigned] (HDFS-7537) fsck is confusing when dfs.namenode.replication.min > 1 && missing replicas && NN restart
[ https://issues.apache.org/jira/browse/HDFS-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] GAO Rui reassigned HDFS-7537: - Assignee: GAO Rui fsck is confusing when dfs.namenode.replication.min > 1 && missing replicas && NN restart - Key: HDFS-7537 URL: https://issues.apache.org/jira/browse/HDFS-7537 Project: Hadoop HDFS Issue Type: Improvement Reporter: Allen Wittenauer Assignee: GAO Rui Attachments: dfs-min-2-fsck.png, dfs-min-2.png If minimum replication is set to 2 or higher and some of those replicas are missing and the namenode restarts, it isn't always obvious that the missing replicas are the reason why the namenode isn't leaving safemode. We should improve the output of fsck and the web UI to make it obvious that the missing blocks are from unmet replicas vs. completely/totally missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7648) Verify the datanode directory layout
[ https://issues.apache.org/jira/browse/HDFS-7648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh R updated HDFS-7648: --- Attachment: HDFS-7648-4.patch Verify the datanode directory layout Key: HDFS-7648 URL: https://issues.apache.org/jira/browse/HDFS-7648 Project: Hadoop HDFS Issue Type: Bug Components: datanode Reporter: Tsz Wo Nicholas Sze Assignee: Rakesh R Attachments: HDFS-7648-3.patch, HDFS-7648-4.patch, HDFS-7648.patch, HDFS-7648.patch HDFS-6482 changed datanode layout to use block ID to determine the directory to store the block. We should have some mechanism to verify it. Either DirectoryScanner or block report generation could do the check. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7764) DirectoryScanner should cancel the future tasks when #compileReport throws exception
[ https://issues.apache.org/jira/browse/HDFS-7764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh R updated HDFS-7764: --- Issue Type: Improvement (was: Bug) DirectoryScanner should cancel the future tasks when #compileReport throws exception Key: HDFS-7764 URL: https://issues.apache.org/jira/browse/HDFS-7764 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.7.0 Reporter: Rakesh R Assignee: Rakesh R Attachments: HDFS-7764.patch If there is an exception while preparing the ScanInfo for the blocks in the directory, DirectoryScanner immediately throws the exception and exits the current scan cycle. It would be good if it could signal #cancel() to the other pending tasks. DirectoryScanner.java {code} for (Entry<Integer, Future<ScanInfoPerBlockPool>> report : compilersInProgress.entrySet()) { try { dirReports[report.getKey()] = report.getValue().get(); } catch (Exception ex) { LOG.error("Error compiling report", ex); // Propagate ex to DataBlockScanner to deal with throw new RuntimeException(ex); } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
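The suggested change can be sketched as follows. This is a simplified stand-in, not the DirectoryScanner code: {{ScanInfoPerBlockPool}} is replaced by {{String}} and the collector name is hypothetical, but the cancel-on-failure shape is the one proposed above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Future;

public class ReportCollector {
    /** Gathers the result of every in-progress report task; if any task
     *  fails, cancels the remaining pending tasks before propagating. */
    static List<String> collect(Map<Integer, Future<String>> inProgress) {
        List<String> reports = new ArrayList<>();
        try {
            for (Map.Entry<Integer, Future<String>> e : inProgress.entrySet()) {
                reports.add(e.getValue().get());
            }
        } catch (Exception ex) {
            // Proposed addition: signal cancel() to the other tasks instead
            // of leaving them queued or running for a doomed scan cycle.
            for (Future<String> f : inProgress.values()) {
                f.cancel(true);  // interrupt if already running
            }
            throw new RuntimeException(ex);
        }
        return reports;
    }
}
```

{{Future#cancel}} on a task that has already completed simply returns false, so cancelling the whole collection after a failure is safe; only the still-pending or running tasks are affected.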
[jira] [Assigned] (HDFS-7765) FSOutputSummer throwing ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/HDFS-7765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] surendra singh lilhore reassigned HDFS-7765: Assignee: surendra singh lilhore FSOutputSummer throwing ArrayIndexOutOfBoundsException -- Key: HDFS-7765 URL: https://issues.apache.org/jira/browse/HDFS-7765 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 2.6.0 Environment: Centos 6, Open JDK 7, Amazon EC2, Accumulo 1.6.2RC4 Reporter: Keith Turner Assignee: surendra singh lilhore While running an Accumulo test, we saw exceptions like the following while trying to write to the write-ahead log in HDFS. The exception occurs at [FSOutputSummer.java:76|https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSOutputSummer.java#L76], which attempts to update a byte array. {noformat} 2015-02-06 19:46:49,769 [log.DfsLogger] WARN : Exception syncing java.lang.reflect.InvocationTargetException java.lang.ArrayIndexOutOfBoundsException: 4608 at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:76) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:50) at java.io.DataOutputStream.write(DataOutputStream.java:88) at java.io.DataOutputStream.writeByte(DataOutputStream.java:153) at org.apache.accumulo.tserver.logger.LogFileKey.write(LogFileKey.java:87) at org.apache.accumulo.tserver.log.DfsLogger.write(DfsLogger.java:526) at org.apache.accumulo.tserver.log.DfsLogger.logFileData(DfsLogger.java:540) at org.apache.accumulo.tserver.log.DfsLogger.logManyTablets(DfsLogger.java:573) at org.apache.accumulo.tserver.log.TabletServerLogger$6.write(TabletServerLogger.java:373) at org.apache.accumulo.tserver.log.TabletServerLogger.write(TabletServerLogger.java:274) at org.apache.accumulo.tserver.log.TabletServerLogger.logManyTablets(TabletServerLogger.java:365) at
org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.flush(TabletServer.java:1667) at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.closeUpdate(TabletServer.java:1754) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.accumulo.trace.instrument.thrift.RpcServerInvocationHandler.invoke(RpcServerInvocationHandler.java:46) at org.apache.accumulo.server.util.RpcWrapper$1.invoke(RpcWrapper.java:47) at com.sun.proxy.$Proxy22.closeUpdate(Unknown Source) at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$closeUpdate.getResult(TabletClientService.java:2370) at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$closeUpdate.getResult(TabletClientService.java:2354) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:168) at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:516) at org.apache.accumulo.server.util.CustomNonBlockingServer$1.run(CustomNonBlockingServer.java:77) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47) at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34) at java.lang.Thread.run(Thread.java:744) 2015-02-06 19:46:49,769 [log.DfsLogger] WARN : Exception syncing java.lang.reflect.InvocationTargetException 2015-02-06 19:46:49,772 [log.DfsLogger] ERROR: java.lang.ArrayIndexOutOfBoundsException: 4609 java.lang.ArrayIndexOutOfBoundsException: 4609 at 
org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:76) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:50) at java.io.DataOutputStream.write(DataOutputStream.java:88) at java.io.DataOutputStream.writeByte(DataOutputStream.java:153) at org.apache.accumulo.tserver.logger.LogFileKey.write(LogFileKey.java:87) at org.apache.accumulo.tserver.log.DfsLogger.write(DfsLogger.java:526) at
[jira] [Commented] (HDFS-7797) Add audit log for setQuota operation
[ https://issues.apache.org/jira/browse/HDFS-7797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322339#comment-14322339 ] Rakesh R commented on HDFS-7797: The test failure is not related to the patch. Add audit log for setQuota operation Key: HDFS-7797 URL: https://issues.apache.org/jira/browse/HDFS-7797 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.6.0 Reporter: Rakesh R Assignee: Rakesh R Attachments: 001-HDFS-7797.patch SetQuota operation should be included in audit log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7801) IOException:NameNode still not started cause DFSClient operation failure without retry.
zhihai xu created HDFS-7801: --- Summary: IOException:NameNode still not started cause DFSClient operation failure without retry. Key: HDFS-7801 URL: https://issues.apache.org/jira/browse/HDFS-7801 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client, namenode Reporter: zhihai xu IOException:NameNode still not started cause DFSClient operation failure without retry. In YARN-1778, TestFSRMStateStore failed randomly, it is due to the java.io.IOException: NameNode still not started. The stack trace for this Exception is the following: {code} 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2128) at org.apache.hadoop.ipc.Client.call(Client.java:1474) at org.apache.hadoop.ipc.Client.call(Client.java:1405) at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy23.mkdirs(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:557) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy24.mkdirs(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2991) at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2961) at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:973) at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:969) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:969) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:962) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1869) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.storeApplicationStateInternal(FileSystemRMStateStore.java:364) at org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore$2.run(TestFSRMStateStore.java:273) 2015-02-03 00:09:19,089 INFO [IPC Server handler 0 on 57792] ipc.Server (Server.java:run(2155)) - IPC Server handler 0 on 57792, call org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 127.0.0.1:57805 Call#14 Retry#1 java.io.IOException: NameNode still not started at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at
[jira] [Commented] (HDFS-4266) BKJM: Separate write and ack quorum
[ https://issues.apache.org/jira/browse/HDFS-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322400#comment-14322400 ] Rakesh R commented on HDFS-4266: It seems the test case failure is not related to this patch. [~umamaheswararao], could you please have a look at the latest patch? Thanks! BKJM: Separate write and ack quorum --- Key: HDFS-4266 URL: https://issues.apache.org/jira/browse/HDFS-4266 Project: Hadoop HDFS Issue Type: Sub-task Components: ha Reporter: Ivan Kelly Assignee: Rakesh R Attachments: 001-HDFS-4266.patch, 002-HDFS-4266.patch, 003-HDFS-4266.patch, 004-HDFS-4266.patch, 005-HDFS-4266.patch BOOKKEEPER-208 allows the ack and write quorums to be different sizes, so that writes are unaffected by any bookie failure. BKJM should be able to take advantage of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6662) [ UI ] Not able to open file from UI if file path contains %
[ https://issues.apache.org/jira/browse/HDFS-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322415#comment-14322415 ] Brahma Reddy Battula commented on HDFS-6662: Hi [~gerson23], can you please address [~ajisakaa]'s comments? [ UI ] Not able to open file from UI if file path contains % -- Key: HDFS-6662 URL: https://issues.apache.org/jira/browse/HDFS-6662 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.1 Reporter: Brahma Reddy Battula Assignee: Gerson Carlos Priority: Critical Attachments: hdfs-6662.001.patch, hdfs-6662.002.patch, hdfs-6662.patch 1. Write a file into HDFS in such a way that the file name is like 1%2%3%4. 2. Browsing the file using the NameNode UI throws the following exception: Path does not exist on HDFS or WebHDFS is disabled. Please check your path or enable WebHDFS. HBase writes its WAL file data in HDFS using % in file names, e.g. /hbase/WALs/HOST-,60020,1404731504691/HOST-***-130%2C60020%2C1404731504691.1404812663950.meta; the above file cannot be opened in the UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7764) DirectoryScanner should cancel the future tasks when #compileReport throws exception
[ https://issues.apache.org/jira/browse/HDFS-7764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322442#comment-14322442 ] Rakesh R commented on HDFS-7764: Thanks [~cmccabe] for the interest. bq. Can you explain why we would want to cancel the other pending tasks? I was reviewing the code and saw that {{public ScanInfoPerBlockPool call() throws Exception}} can throw an exception. My thought was: while iterating, if any of the {{compilersInProgress}} tasks throws an exception, we propagate that exception and skip the current cycle. So as a good coding practice, it would be good to cancel all the given future tasks before exiting. Does this make sense to you? bq. Did you see this code throw an exception in production? Truly not.