[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954304#comment-16954304
 ] 

Hadoop QA commented on HDFS-14768:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
58s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 39m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
 9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
31m 11s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
40s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
32s{color} | {color:green} hadoop-hdfs-project/hadoop-hdfs: The patch generated 
0 new + 170 unchanged - 1 fixed = 170 total (was 171) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
26m  8s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
11s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}156m  8s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  1m 
30s{color} | {color:red} The patch generated 50 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}284m 48s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.server.datanode.TestNNHandlesCombinedBlockReport |
|   | hadoop.hdfs.tools.TestECAdmin |
|   | hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics |
|   | hadoop.hdfs.TestFileChecksum |
|   | hadoop.hdfs.security.TestDelegationTokenForProxyUser |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
|   | hadoop.hdfs.TestDFSStripedInputStream |
|   | hadoop.hdfs.server.datanode.TestBlockHasMultipleReplicasOnSameDN |
|   | hadoop.hdfs.qjournal.server.TestJournalNodeMXBean |
|   | hadoop.hdfs.TestDecommission |
|   | hadoop.hdfs.TestFileCreation |
|   | hadoop.hdfs.server.namenode.TestQuotaWithStripedBlocksWithRandomECPolicy |
|   | hadoop.hdfs.server.namenode.TestAddStripedBlocks |
|   | hadoop.hdfs.TestPread |
|   | hadoop.hdfs.server.datanode.TestDataNodeReconfiguration |
|   | hadoop.hdfs.tools.TestStoragePolicySatisfyAdminCommands |
|   | hadoop.hdfs.server.datanode.TestDataNodeUUID |
|   | hadoop.hdfs.server.namenode.TestDeleteRace |
|   | hadoop.hdfs.tools.TestDelegationTokenFetcher |
|   | hadoop.hdfs.TestParallelShortCircuitRead |
|   | hadoop.hdfs.TestFileCorruption |
|   | hadoop.hdfs.TestDatanodeReport |
|   | hadoop.hdfs.TestEncryptionZo

[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954296#comment-16954296
 ] 

guojh commented on HDFS-14768:
--

rebase trunk

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.006.patch, HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024k; the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission the
> datanodes holding indices [3,4] and increase the index-6 datanode's
> pendingReplicationWithoutTargets so that it exceeds
> replicationStreamsHardLimit (we set it to 14). Then, after
> BlockManager#chooseSourceDatanodes, liveBlockIndices is [0,1,2,3,4,5,7,8] and
> the block counters are Live: 7, Decommission: 2.
> In BlockManager#scheduleReconstruction, additionalReplRequired is 9 - 7 = 2.
> After the NameNode chooses two target datanodes, it assigns an erasure-coding
> reconstruction task to them.
> When a datanode receives the task, it builds targetIndices from
> liveBlockIndices and the number of targets. The code is below.
> {code:java}
> // Build targetIndices from the live-index bitmap: an index is a
> // reconstruction target if it is not live but has a non-zero block length.
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, while targetIndices[1] stays 0 from its initial value.
> The StripedReader always creates readers from the first six live block
> indices, i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers
> the ISA-L bug: the reconstructed data of block index 6 is corrupt (all zeros).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.g
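
The indexing mistake described above is easy to see in isolation. Below is a short
standalone sketch (illustrative names only, not HDFS code): with live indices
[0,1,2,3,4,5,7,8] and two requested targets, the loop only ever writes
targetIndices[0], so targetIndices[1] keeps its default of 0 and a perfectly
healthy block index is handed to the decoder.

{code:java}
import java.util.Arrays;
import java.util.BitSet;

public class TargetIndicesSketch {
  public static void main(String[] args) {
    int dataBlkNum = 6;
    int parityBlkNum = 3;
    int[] liveIndices = {0, 1, 2, 3, 4, 5, 7, 8};  // index 6 is the only missing block
    BitSet live = new BitSet(dataBlkNum + parityBlkNum);
    for (int idx : liveIndices) {
      live.set(idx);
    }

    // Two targets were requested (additionalReplRequired == 2), but only one
    // internal block is actually missing, so only one slot is ever written.
    short[] targetIndices = new short[2];
    int m = 0;
    for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
      if (!live.get(i) && m < targetIndices.length) {
        targetIndices[m++] = (short) i;
      }
    }
    System.out.println(Arrays.toString(targetIndices));  // prints [6, 0]

    // Trimming to the m populated entries keeps the spurious default 0 out of
    // the decoder's target list.
    short[] validTargets = Arrays.copyOf(targetIndices, m);
    System.out.println(Arrays.toString(validTargets));   // prints [6]
  }
}
{code}

The trimming at the end only illustrates the shape of the problem; the attached patches should be consulted for the actual fix.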

[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Attachment: HDFS-14768.006.patch

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.006.patch, HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024k; the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission the
> datanodes holding indices [3,4] and increase the index-6 datanode's
> pendingReplicationWithoutTargets so that it exceeds
> replicationStreamsHardLimit (we set it to 14). Then, after
> BlockManager#chooseSourceDatanodes, liveBlockIndices is [0,1,2,3,4,5,7,8] and
> the block counters are Live: 7, Decommission: 2.
> In BlockManager#scheduleReconstruction, additionalReplRequired is 9 - 7 = 2.
> After the NameNode chooses two target datanodes, it assigns an erasure-coding
> reconstruction task to them.
> When a datanode receives the task, it builds targetIndices from
> liveBlockIndices and the number of targets. The code is below.
> {code:java}
> // Build targetIndices from the live-index bitmap: an index is a
> // reconstruction target if it is not live but has a non-zero block length.
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, while targetIndices[1] stays 0 from its initial value.
> The StripedReader always creates readers from the first six live block
> indices, i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers
> the ISA-L bug: the reconstructed data of block index 6 is corrupt (all zeros).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode

[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954292#comment-16954292
 ] 

guojh commented on HDFS-14768:
--

[~hadoopqa] test

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024k; the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission the
> datanodes holding indices [3,4] and increase the index-6 datanode's
> pendingReplicationWithoutTargets so that it exceeds
> replicationStreamsHardLimit (we set it to 14). Then, after
> BlockManager#chooseSourceDatanodes, liveBlockIndices is [0,1,2,3,4,5,7,8] and
> the block counters are Live: 7, Decommission: 2.
> In BlockManager#scheduleReconstruction, additionalReplRequired is 9 - 7 = 2.
> After the NameNode chooses two target datanodes, it assigns an erasure-coding
> reconstruction task to them.
> When a datanode receives the task, it builds targetIndices from
> liveBlockIndices and the number of targets. The code is below.
> {code:java}
> // Build targetIndices from the live-index bitmap: an index is a
> // reconstruction target if it is not live but has a non-zero block length.
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, while targetIndices[1] stays 0 from its initial value.
> The StripedReader always creates readers from the first six live block
> indices, i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers
> the ISA-L bug: the reconstructed data of block index 6 is corrupt (all zeros).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager(

[jira] [Updated] (HDDS-2323) Mem allocation: Optimise AuditMessage::build()

2019-10-17 Thread Siddharth Wagle (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Wagle updated HDDS-2323:
--
Status: Patch Available  (was: Open)

> Mem allocation: Optimise AuditMessage::build()
> --
>
> Key: HDDS-2323
> URL: https://issues.apache.org/jira/browse/HDDS-2323
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Rajesh Balamohan
>Assignee: Siddharth Wagle
>Priority: Major
>  Labels: performance
> Attachments: HDDS-2323.01.patch, Screenshot 2019-10-18 at 8.24.52 
> AM.png
>
>
> String.format allocates/processes more than OzoneAclUtil.fromProtobuf in the
> write benchmark.
> It would be good to use + (string concatenation) instead of format.
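
As a rough illustration of the suggestion (the helper and the audit fields below are made up; the real AuditMessage::build may look different), replacing String.format with plain concatenation avoids the per-call Formatter allocation and varargs boxing:

{code:java}
// Hypothetical helper for illustration only.
static String buildAuditLine(String user, String ip, String op,
                             String params, String ret) {
  // Before: String.format parses the pattern and allocates a Formatter on every call.
  // return String.format("user=%s | ip=%s | op=%s | %s | ret=%s",
  //     user, ip, op, params, ret);

  // After: plain concatenation, which javac compiles to a single StringBuilder chain.
  return "user=" + user + " | ip=" + ip + " | op=" + op
      + " | " + params + " | ret=" + ret;
}
{code}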






[jira] [Updated] (HDDS-2323) Mem allocation: Optimise AuditMessage::build()

2019-10-17 Thread Siddharth Wagle (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Wagle updated HDDS-2323:
--
Attachment: HDDS-2323.01.patch

> Mem allocation: Optimise AuditMessage::build()
> --
>
> Key: HDDS-2323
> URL: https://issues.apache.org/jira/browse/HDDS-2323
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: performance
> Attachments: HDDS-2323.01.patch, Screenshot 2019-10-18 at 8.24.52 
> AM.png
>
>
> String.format allocates/processes more than OzoneAclUtil.fromProtobuf in the
> write benchmark.
> It would be good to use + (string concatenation) instead of format.






[jira] [Assigned] (HDDS-2323) Mem allocation: Optimise AuditMessage::build()

2019-10-17 Thread Siddharth Wagle (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Wagle reassigned HDDS-2323:
-

Assignee: Siddharth Wagle

> Mem allocation: Optimise AuditMessage::build()
> --
>
> Key: HDDS-2323
> URL: https://issues.apache.org/jira/browse/HDDS-2323
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Rajesh Balamohan
>Assignee: Siddharth Wagle
>Priority: Major
>  Labels: performance
> Attachments: HDDS-2323.01.patch, Screenshot 2019-10-18 at 8.24.52 
> AM.png
>
>
> String.format allocates/processes more than OzoneAclUtil.fromProtobuf in the
> write benchmark.
> It would be good to use + (string concatenation) instead of format.






[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954278#comment-16954278
 ] 

Hadoop QA commented on HDFS-14768:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
26s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 33m 
22s{color} | {color:red} root in trunk failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
54s{color} | {color:red} hadoop-hdfs in trunk failed. {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
51s{color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red}  0m 
40s{color} | {color:red} hadoop-hdfs in trunk failed. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red}  3m 
52s{color} | {color:red} branch has errors when building and testing our client 
artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
51s{color} | {color:red} hadoop-hdfs in trunk failed. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
37s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
21s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
41s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 41s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 21s{color} | {color:orange} The patch fails to run checkstyle in hadoop-hdfs 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red} 12m 
39s{color} | {color:red} patch has errors when building and testing our client 
artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
18s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 37s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 
58s{color} | {color:red} The patch generated 1 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 74m  4s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.3 Server=19.03.3 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14768 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12983334/HDFS-14768.005.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 5d416d38d52b 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 54dc6b7 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| mvninstall | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28109/artifact/out/branch-mvninstall-root.txt
 |
| compile | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28109/artifact/out/branch-compile-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| mvnsite | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28109/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| findbugs | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28109/artifact/out/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| compile | 
https://builds.apache.org/job/PreCommit-HDFS-Build/

[jira] [Commented] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues

2019-10-17 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954273#comment-16954273
 ] 

Fei Hui commented on HDFS-14852:


Uploaded the v004 patch.
* Move the test to TestLowRedundancyBlockQueues.
* Call decrementBlockStat when the block is successfully removed from the
QUEUE_WITH_CORRUPT_BLOCKS level.

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
> -
>
> Key: HDFS-14852
> URL: https://issues.apache.org/jira/browse/HDFS-14852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch, 
> HDFS-14852.002.patch, HDFS-14852.003.patch, HDFS-14852.004.patch
>
>
> LowRedundancyBlocks.java
> {code:java}
> // Some comments here
> if(priLevel >= 0 && priLevel < LEVEL
> && priorityQueues.get(priLevel).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>   + " from priority queue {}",
>   block, priLevel);
>   decrementBlockStat(block, priLevel, oldExpectedReplicas);
>   return true;
> } else {
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
>   for (int i = 0; i < LEVEL; i++) {
> if (i != priLevel && priorityQueues.get(i).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>   " {} from priority queue {}", block, i);
>   decrementBlockStat(block, i, oldExpectedReplicas);
>   return true;
> }
>   }
> }
> return false;
>   }
> {code}
> The source code is above; the comment in it reads:
> {quote}
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
> {quote}
> However, the remove function does NOT remove the block from all queues: it
> returns as soon as it removes the block from one queue.
> LowRedundancyBlocks#add is called from several places, so a block can end up
> in two or more queues at the same time.
> We found that the corrupt-block count does not match the corrupt-file count
> on the NameNode web UI; this may be related.
> Uploading an initial patch.
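
For what it is worth, a minimal sketch of the "remove from every queue" behaviour described above could look like the following; it reuses the priorityQueues, decrementBlockStat and LEVEL names from the snippet, and the actual patch may differ:

{code:java}
/**
 * Sketch only: remove the block from every priority queue it appears in and
 * decrement the per-level statistics for each successful removal.
 */
boolean removeFromAllQueues(BlockInfo block, int oldExpectedReplicas) {
  boolean removed = false;
  for (int i = 0; i < LEVEL; i++) {
    if (priorityQueues.get(i).remove(block)) {
      NameNode.blockStateChangeLog.debug(
          "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
              + " from priority queue {}", block, i);
      decrementBlockStat(block, i, oldExpectedReplicas);
      removed = true;  // keep scanning: the block may sit in more than one queue
    }
  }
  return removed;
}
{code}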






[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954261#comment-16954261
 ] 

Hadoop QA commented on HDFS-14768:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
41s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
7s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 36s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
23s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
40s{color} | {color:green} hadoop-hdfs-project/hadoop-hdfs: The patch generated 
0 new + 170 unchanged - 1 fixed = 170 total (was 171) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 12s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
17s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 87m 42s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
34s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}145m 21s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.blockmanagement.TestBlockManager |
|   | hadoop.hdfs.TestMultipleNNPortQOP |
|   | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
|   | hadoop.hdfs.TestDFSClientRetries |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.3 Server=19.03.3 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14768 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12983324/HDFS-14768.006.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 88370772c07f 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 54dc6b7 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28108/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28108/testReport/ |
| Max. process+thread count | 3571 (vs. 

[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954257#comment-16954257
 ] 

guojh commented on HDFS-14768:
--

[~surendrasingh] Sorry about my thoughtlessness, and thanks for your review.

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024k; the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission the
> datanodes holding indices [3,4] and increase the index-6 datanode's
> pendingReplicationWithoutTargets so that it exceeds
> replicationStreamsHardLimit (we set it to 14). Then, after
> BlockManager#chooseSourceDatanodes, liveBlockIndices is [0,1,2,3,4,5,7,8] and
> the block counters are Live: 7, Decommission: 2.
> In BlockManager#scheduleReconstruction, additionalReplRequired is 9 - 7 = 2.
> After the NameNode chooses two target datanodes, it assigns an erasure-coding
> reconstruction task to them.
> When a datanode receives the task, it builds targetIndices from
> liveBlockIndices and the number of targets. The code is below.
> {code:java}
> // Build targetIndices from the live-index bitmap: an index is a
> // reconstruction target if it is not live but has a non-zero block length.
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, while targetIndices[1] stays 0 from its initial value.
> The StripedReader always creates readers from the first six live block
> indices, i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers
> the ISA-L bug: the reconstructed data of block index 6 is corrupt (all zeros).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.g

[jira] [Updated] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues

2019-10-17 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-14852:
---
Attachment: HDFS-14852.004.patch

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
> -
>
> Key: HDFS-14852
> URL: https://issues.apache.org/jira/browse/HDFS-14852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch, 
> HDFS-14852.002.patch, HDFS-14852.003.patch, HDFS-14852.004.patch
>
>
> LowRedundancyBlocks.java
> {code:java}
> // Some comments here
> if(priLevel >= 0 && priLevel < LEVEL
> && priorityQueues.get(priLevel).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>   + " from priority queue {}",
>   block, priLevel);
>   decrementBlockStat(block, priLevel, oldExpectedReplicas);
>   return true;
> } else {
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
>   for (int i = 0; i < LEVEL; i++) {
> if (i != priLevel && priorityQueues.get(i).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>   " {} from priority queue {}", block, i);
>   decrementBlockStat(block, i, oldExpectedReplicas);
>   return true;
> }
>   }
> }
> return false;
>   }
> {code}
> The source code is above; the comment in it reads:
> {quote}
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
> {quote}
> However, the remove function does NOT remove the block from all queues: it
> returns as soon as it removes the block from one queue.
> LowRedundancyBlocks#add is called from several places, so a block can end up
> in two or more queues at the same time.
> We found that the corrupt-block count does not match the corrupt-file count
> on the NameNode web UI; this may be related.
> Uploading an initial patch.






[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Attachment: HDFS-14768.005.patch

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024k; the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission the
> datanodes holding indices [3,4] and increase the index-6 datanode's
> pendingReplicationWithoutTargets so that it exceeds
> replicationStreamsHardLimit (we set it to 14). Then, after
> BlockManager#chooseSourceDatanodes, liveBlockIndices is [0,1,2,3,4,5,7,8] and
> the block counters are Live: 7, Decommission: 2.
> In BlockManager#scheduleReconstruction, additionalReplRequired is 9 - 7 = 2.
> After the NameNode chooses two target datanodes, it assigns an erasure-coding
> reconstruction task to them.
> When a datanode receives the task, it builds targetIndices from
> liveBlockIndices and the number of targets. The code is below.
> {code:java}
> // Build targetIndices from the live-index bitmap: an index is a
> // reconstruction target if it is not live but has a non-zero block length.
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, while targetIndices[1] stays 0 from its initial value.
> The StripedReader always creates readers from the first six live block
> indices, i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers
> the ISA-L bug: the reconstructed data of block index 6 is corrupt (all zeros).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode(datanodeDescriptor);

[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Attachment: (was: HDFS-14768.006.patch)

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.jpg, 
> guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, 
> zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024k; the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission the
> datanodes holding indices [3,4] and increase the index-6 datanode's
> pendingReplicationWithoutTargets so that it exceeds
> replicationStreamsHardLimit (we set it to 14). Then, after
> BlockManager#chooseSourceDatanodes, liveBlockIndices is [0,1,2,3,4,5,7,8] and
> the block counters are Live: 7, Decommission: 2.
> In BlockManager#scheduleReconstruction, additionalReplRequired is 9 - 7 = 2.
> After the NameNode chooses two target datanodes, it assigns an erasure-coding
> reconstruction task to them.
> When a datanode receives the task, it builds targetIndices from
> liveBlockIndices and the number of targets. The code is below.
> {code:java}
> // Build targetIndices from the live-index bitmap: an index is a
> // reconstruction target if it is not live but has a non-zero block length.
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, while targetIndices[1] stays 0 from its initial value.
> The StripedReader always creates readers from the first six live block
> indices, i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers
> the ISA-L bug: the reconstructed data of block index 6 is corrupt (all zeros).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode(datanodeDescriptor);
>   //assertNu

[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Attachment: (was: HDFS-14768.005.patch)

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.jpg, 
> guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, 
> zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024k; the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission the
> datanodes holding indices [3,4] and increase the index-6 datanode's
> pendingReplicationWithoutTargets so that it exceeds
> replicationStreamsHardLimit (we set it to 14). Then, after
> BlockManager#chooseSourceDatanodes, liveBlockIndices is [0,1,2,3,4,5,7,8] and
> the block counters are Live: 7, Decommission: 2.
> In BlockManager#scheduleReconstruction, additionalReplRequired is 9 - 7 = 2.
> After the NameNode chooses two target datanodes, it assigns an erasure-coding
> reconstruction task to them.
> When a datanode receives the task, it builds targetIndices from
> liveBlockIndices and the number of targets. The code is below.
> {code:java}
> // Build targetIndices from the live-index bitmap: an index is a
> // reconstruction target if it is not live but has a non-zero block length.
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, while targetIndices[1] stays 0 from its initial value.
> The StripedReader always creates readers from the first six live block
> indices, i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers
> the ISA-L bug: the reconstructed data of block index 6 is corrupt (all zeros).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List decommisionNodes = new ArrayList();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode(datanodeDescriptor);
>   //assertNu
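To make the index mismatch in the description above concrete, the following standalone sketch (plain Java, not HDFS code; all names are invented for illustration) mimics how targetIndices gets filled when index 6 is absent from the live bitset but two targets were scheduled: only one missing index is found, so the second slot silently keeps its default value 0.
{code:java}
import java.util.BitSet;

public class TargetIndicesDemo {
  public static void main(String[] args) {
    int dataBlkNum = 6;
    int parityBlkNum = 3;
    // Index 6 is missing from the live set because its datanode is too busy.
    int[] liveIndices = {0, 1, 2, 3, 4, 5, 7, 8};
    BitSet live = new BitSet(dataBlkNum + parityBlkNum);
    for (int i : liveIndices) {
      live.set(i);
    }
    // Two reconstruction targets were scheduled (additionalReplRequired = 2),
    // but only one index is actually absent from the bitset.
    short[] targetIndices = new short[2];
    int m = 0;
    for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
      if (!live.get(i) && m < targetIndices.length) {
        targetIndices[m++] = (short) i;
      }
    }
    // Prints "targetIndices = [6, 0]"; the second slot keeps its default 0,
    // so a healthy block index ends up being treated as a reconstruction target.
    System.out.println("targetIndices = [" + targetIndices[0] + ", "
        + targetIndices[1] + "]");
  }
}
{code}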

[jira] [Commented] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId

2019-10-17 Thread Ravuri Sushma sree (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954248#comment-16954248
 ] 

Ravuri Sushma sree commented on HDFS-14442:
---

[~xkrogen], thank you so much for your valuable suggestions.

I have uploaded a patch following the first approach.

Can you please review it?

> Disagreement between HAUtil.getAddressOfActive and 
> RpcInvocationHandler.getConnectionId
> ---
>
> Key: HDFS-14442
> URL: https://issues.apache.org/jira/browse/HDFS-14442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Erik Krogen
>Priority: Major
> Attachments: HDFS-14442.001.patch
>
>
> While working on HDFS-14245, we noticed a discrepancy in some proxy-handling 
> code.
> The description of {{RpcInvocationHandler.getConnectionId()}} states:
> {code}
>   /**
>* Returns the connection id associated with the InvocationHandler instance.
>* @return ConnectionId
>*/
>   ConnectionId getConnectionId();
> {code}
> It does not make any claims about whether this connection ID will be an 
> active proxy or not. Yet in {{HAUtil}} we have:
> {code}
>   /**
>* Get the internet address of the currently-active NN. This should rarely 
> be
>* used, since callers of this method who connect directly to the NN using 
> the
>* resulting InetSocketAddress will not be able to connect to the active NN 
> if
>* a failover were to occur after this method has been called.
>* 
>* @param fs the file system to get the active address of.
>* @return the internet address of the currently-active NN.
>* @throws IOException if an error occurs while resolving the active NN.
>*/
>   public static InetSocketAddress getAddressOfActive(FileSystem fs)
>   throws IOException {
> if (!(fs instanceof DistributedFileSystem)) {
>   throw new IllegalArgumentException("FileSystem " + fs + " is not a 
> DFS.");
> }
> // force client address resolution.
> fs.exists(new Path("/"));
> DistributedFileSystem dfs = (DistributedFileSystem) fs;
> DFSClient dfsClient = dfs.getClient();
> return RPC.getServerAddress(dfsClient.getNamenode());
>   }
> {code}
> Where the call {{RPC.getServerAddress()}} eventually terminates into 
> {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> 
> {{RPC.getConnectionIdForProxy()}} -> 
> {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making 
> an incorrect assumption that {{RpcInvocationHandler}} will necessarily return 
> an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a 
> counter-example to this, since the current connection ID may be pointing at, 
> for example, an Observer NameNode.
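As a hedged illustration of the discrepancy described above (not a test from any patch; it assumes fs.defaultFS points at an HA HDFS cluster whose client is configured with ObserverReadProxyProvider), the address returned below is derived from the proxy's current ConnectionId and may therefore name an Observer NameNode rather than the Active one:
{code:java}
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.HAUtil;

public class ActiveAddressExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Internally: RPC.getServerAddress() -> RPC.getConnectionIdForProxy()
    //             -> RpcInvocationHandler#getConnectionId()
    // With ObserverReadProxyProvider the handler's current connection can be
    // bound to an Observer, even though the method name promises the Active.
    InetSocketAddress addr = HAUtil.getAddressOfActive(fs);
    System.out.println("Reported 'active' NN address: " + addr);
  }
}
{code}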



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId

2019-10-17 Thread Ravuri Sushma sree (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravuri Sushma sree updated HDFS-14442:
--
Attachment: HDFS-14442.001.patch

> Disagreement between HAUtil.getAddressOfActive and 
> RpcInvocationHandler.getConnectionId
> ---
>
> Key: HDFS-14442
> URL: https://issues.apache.org/jira/browse/HDFS-14442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Erik Krogen
>Priority: Major
> Attachments: HDFS-14442.001.patch
>
>
> While working on HDFS-14245, we noticed a discrepancy in some proxy-handling 
> code.
> The description of {{RpcInvocationHandler.getConnectionId()}} states:
> {code}
>   /**
>* Returns the connection id associated with the InvocationHandler instance.
>* @return ConnectionId
>*/
>   ConnectionId getConnectionId();
> {code}
> It does not make any claims about whether this connection ID will be an 
> active proxy or not. Yet in {{HAUtil}} we have:
> {code}
>   /**
>* Get the internet address of the currently-active NN. This should rarely 
> be
>* used, since callers of this method who connect directly to the NN using 
> the
>* resulting InetSocketAddress will not be able to connect to the active NN 
> if
>* a failover were to occur after this method has been called.
>* 
>* @param fs the file system to get the active address of.
>* @return the internet address of the currently-active NN.
>* @throws IOException if an error occurs while resolving the active NN.
>*/
>   public static InetSocketAddress getAddressOfActive(FileSystem fs)
>   throws IOException {
> if (!(fs instanceof DistributedFileSystem)) {
>   throw new IllegalArgumentException("FileSystem " + fs + " is not a 
> DFS.");
> }
> // force client address resolution.
> fs.exists(new Path("/"));
> DistributedFileSystem dfs = (DistributedFileSystem) fs;
> DFSClient dfsClient = dfs.getClient();
> return RPC.getServerAddress(dfsClient.getNamenode());
>   }
> {code}
> Where the call {{RPC.getServerAddress()}} eventually terminates into 
> {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> 
> {{RPC.getConnectionIdForProxy()}} -> 
> {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making 
> an incorrect assumption that {{RpcInvocationHandler}} will necessarily return 
> an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a 
> counter-example to this, since the current connection ID may be pointing at, 
> for example, an Observer NameNode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Attachment: HDFS-14768.006.patch

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.006.patch, HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024K, version is hadoop 3.0.2;
> We suppose a file's block indices are [0,1,2,3,4,5,6,7,8], we decommission
> indices [3,4], and we increase the index 6 datanode's
> pendingReplicationWithoutTargets so that it is larger than
> replicationStreamsHardLimit (we set 14). Then, after the method
> chooseSourceDatanodes of BlockManager, the liveBlockIndices is
> [0,1,2,3,4,5,7,8] and the block counters are Live: 7, Decommission: 2.
> In the method scheduleReconstruction of BlockManager, additionalReplRequired
> is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it assigns an
> erasure-coding reconstruction task to them.
> When the datanode gets the task, it builds targetIndices from liveBlockIndices
> and the target length. The code is below.
> {code:java}
> // targetIndices is sized by the number of reconstruction targets
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   // an index becomes a reconstruction target if it is absent from the live bitset
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] is always 0 from its initial value.
> The StripedReader always creates readers from the first 6 indexed blocks,
> i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to build target indices [6,0] triggers the
> ISA-L bug: block index 6's data is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List decommisionNodes = new ArrayList();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode

[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Attachment: (was: HDFS-14768.006.patch)

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024K, version is hadoop 3.0.2;
> We suppose a file's block indices are [0,1,2,3,4,5,6,7,8], we decommission
> indices [3,4], and we increase the index 6 datanode's
> pendingReplicationWithoutTargets so that it is larger than
> replicationStreamsHardLimit (we set 14). Then, after the method
> chooseSourceDatanodes of BlockManager, the liveBlockIndices is
> [0,1,2,3,4,5,7,8] and the block counters are Live: 7, Decommission: 2.
> In the method scheduleReconstruction of BlockManager, additionalReplRequired
> is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it assigns an
> erasure-coding reconstruction task to them.
> When the datanode gets the task, it builds targetIndices from liveBlockIndices
> and the target length. The code is below.
> {code:java}
> // targetIndices is sized by the number of reconstruction targets
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   // an index becomes a reconstruction target if it is absent from the live bitset
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] is always 0 from its initial value.
> The StripedReader always creates readers from the first 6 indexed blocks,
> i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to build target indices [6,0] triggers the
> ISA-L bug: block index 6's data is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List decommisionNodes = new ArrayList();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode(datanodeDe

[jira] [Commented] (HDFS-14699) Erasure Coding: Storage not considered in live replica when replication streams hard limit reached to threshold

2019-10-17 Thread guojh (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954245#comment-16954245
 ] 

guojh commented on HDFS-14699:
--

[~surendrasingh] The UT is testChooseSrcDatanodesWithDupEC in class 
TestBlockManager.java. The code:
{code:java}
bm.chooseSourceDatanodes(
aBlockInfoStriped,
cntNodes,
liveNodes,
numReplicas, liveBlockIndices,
liveBusyBlockIndices,
LowRedundancyBlocks.QUEUE_HIGHEST_PRIORITY);
{code}
should be changed to:
{code:java}
bm.chooseSourceDatanodes(
aBlockInfoStriped,
cntNodes,
liveNodes,
numReplicas, liveBlockIndices,
liveBusyBlockIndices,
LowRedundancyBlocks.QUEUE_VERY_LOW_REDUNDANCY);
{code}

> Erasure Coding: Storage not considered in live replica when replication 
> streams hard limit reached to threshold
> ---
>
> Key: HDFS-14699
> URL: https://issues.apache.org/jira/browse/HDFS-14699
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.2.0, 3.1.1, 3.3.0
>Reporter: Zhao Yi Ming
>Assignee: Zhao Yi Ming
>Priority: Critical
>  Labels: patch
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14699.00.patch, HDFS-14699.01.patch, 
> HDFS-14699.02.patch, HDFS-14699.03.patch, HDFS-14699.04.patch, 
> HDFS-14699.05.patch, image-2019-08-20-19-58-51-872.png, 
> image-2019-09-02-17-51-46-742.png
>
>
> We tried the EC function on an 80-node cluster with hadoop 3.1.1 and hit the 
> same scenario as described in https://issues.apache.org/jira/browse/HDFS-8881. 
> Following are our testing steps; hope they are helpful. (The following DNs hold 
> the testing internal blocks.)
>  # we customized a new 10-2-1024k policy and used it on a path; now we have 12 
> internal blocks (12 live blocks)
>  # decommission one DN; after the decommission completes, we have 13 
> internal blocks (12 live blocks and 1 decommissioned block)
>  # then shut down one DN which does not hold the same block id as the 
> decommissioned block; now we have 12 internal blocks (11 live blocks and 1 
> decommissioned block)
>  # after waiting about 600s (before the heartbeat comes), recommission the 
> decommissioned DN; now we have 12 internal blocks (11 live blocks and 1 
> duplicate block)
>  # then EC does not reconstruct the missed block
> We think this is a critical issue for using the EC function in a production 
> env. Could you help? Thanks a lot!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14699) Erasure Coding: Storage not considered in live replica when replication streams hard limit reached to threshold

2019-10-17 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954241#comment-16954241
 ] 

Surendra Singh Lilhore commented on HDFS-14699:
---

Where is it failing? Can you give the link?

> Erasure Coding: Storage not considered in live replica when replication 
> streams hard limit reached to threshold
> ---
>
> Key: HDFS-14699
> URL: https://issues.apache.org/jira/browse/HDFS-14699
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.2.0, 3.1.1, 3.3.0
>Reporter: Zhao Yi Ming
>Assignee: Zhao Yi Ming
>Priority: Critical
>  Labels: patch
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14699.00.patch, HDFS-14699.01.patch, 
> HDFS-14699.02.patch, HDFS-14699.03.patch, HDFS-14699.04.patch, 
> HDFS-14699.05.patch, image-2019-08-20-19-58-51-872.png, 
> image-2019-09-02-17-51-46-742.png
>
>
> We tried the EC function on an 80-node cluster with hadoop 3.1.1 and hit the 
> same scenario as described in https://issues.apache.org/jira/browse/HDFS-8881. 
> Following are our testing steps; hope they are helpful. (The following DNs hold 
> the testing internal blocks.)
>  # we customized a new 10-2-1024k policy and used it on a path; now we have 12 
> internal blocks (12 live blocks)
>  # decommission one DN; after the decommission completes, we have 13 
> internal blocks (12 live blocks and 1 decommissioned block)
>  # then shut down one DN which does not hold the same block id as the 
> decommissioned block; now we have 12 internal blocks (11 live blocks and 1 
> decommissioned block)
>  # after waiting about 600s (before the heartbeat comes), recommission the 
> decommissioned DN; now we have 12 internal blocks (11 live blocks and 1 
> duplicate block)
>  # then EC does not reconstruct the missed block
> We think this is a critical issue for using the EC function in a production 
> env. Could you help? Thanks a lot!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954239#comment-16954239
 ] 

Surendra Singh Lilhore commented on HDFS-14768:
---

{quote}I don't understand why the code that you point out is not unnecessary?  
liveBusyBlockIndices should include the condition below:
{quote}
Please check my comment again. I only asked you to remove the changes where just 
some whitespace was added, with no code change.

 

Will wait for build...

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.006.patch, HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024K, version is hadoop 3.0.2;
> We suppose a file's block indices are [0,1,2,3,4,5,6,7,8], we decommission
> indices [3,4], and we increase the index 6 datanode's
> pendingReplicationWithoutTargets so that it is larger than
> replicationStreamsHardLimit (we set 14). Then, after the method
> chooseSourceDatanodes of BlockManager, the liveBlockIndices is
> [0,1,2,3,4,5,7,8] and the block counters are Live: 7, Decommission: 2.
> In the method scheduleReconstruction of BlockManager, additionalReplRequired
> is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it assigns an
> erasure-coding reconstruction task to them.
> When the datanode gets the task, it builds targetIndices from liveBlockIndices
> and the target length. The code is below.
> {code:java}
> // targetIndices is sized by the number of reconstruction targets
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   // an index becomes a reconstruction target if it is absent from the live bitset
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] is always 0 from its initial value.
> The StripedReader always creates readers from the first 6 indexed blocks,
> i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to build target indices [6,0] triggers the
> ISA-L bug: block index 6's data is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List decommisionNodes = new ArrayList();
>   // 

[jira] [Updated] (HDDS-2311) Fix logic of RetryPolicy in OzoneClientSideTranslatorPB

2019-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-2311:
-
Labels: pull-request-available  (was: )

> Fix logic of RetryPolicy in OzoneClientSideTranslatorPB
> ---
>
> Key: HDDS-2311
> URL: https://issues.apache.org/jira/browse/HDDS-2311
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Bharat Viswanadham
>Priority: Blocker
>  Labels: pull-request-available
>
> OzoneManagerProtocolClientSideTranslatorPB.java
> L251: if (cause instanceof NotLeaderException) {
>  NotLeaderException notLeaderException = (NotLeaderException) cause;
>  omFailoverProxyProvider.performFailoverIfRequired(
>  notLeaderException.getSuggestedLeaderNodeId());
>  return getRetryAction(RetryAction.RETRY, retries, failovers);
>  }
>  
> The suggested leader returned from the server is not used during failover,
> because the cause is actually a RemoteException wrapping it. So with the
> current code the suggested leader is never used for failover, and by default
> the client exhausts the maximum retries against each OM.
>  
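For illustration only, here is a minimal sketch of the kind of check the description implies is missing (this is not the HDDS-2311 patch, and the fully-qualified NotLeaderException class name is an assumption): on the client, the server-side NotLeaderException arrives wrapped in an org.apache.hadoop.ipc.RemoteException, so an instanceof test on the raw cause never matches; the wrapped class name has to be inspected (or the exception unwrapped) before failing over to the suggested leader.
{code:java}
import org.apache.hadoop.ipc.RemoteException;

public final class RetryCauseExample {
  // Assumed class name, for illustration only.
  private static final String NOT_LEADER =
      "org.apache.ratis.protocol.NotLeaderException";

  static boolean causedByNotLeader(Throwable cause) {
    if (!(cause instanceof RemoteException)) {
      // This is the branch the quoted instanceof check effectively never leaves.
      return false;
    }
    // RemoteException carries the original server-side exception class name,
    // which is what the retry policy would need to look at before failing over.
    return NOT_LEADER.equals(((RemoteException) cause).getClassName());
  }
}
{code}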



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2311) Fix logic of RetryPolicy in OzoneClientSideTranslatorPB

2019-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2311?focusedWorklogId=330261&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330261
 ]

ASF GitHub Bot logged work on HDDS-2311:


Author: ASF GitHub Bot
Created on: 18/Oct/19 03:55
Start Date: 18/Oct/19 03:55
Worklog Time Spent: 10m 
  Work Description: hanishakoneru commented on pull request #51: HDDS-2311. 
Fix logic of RetryPolicy in OzoneClientSideTranslatorPB.
URL: https://github.com/apache/hadoop-ozone/pull/51
 
 
   OzoneManagerProtocolClientSideTranslatorPB.java

   L251: if (cause instanceof NotLeaderException) {
     NotLeaderException notLeaderException = (NotLeaderException) cause;
     omFailoverProxyProvider.performFailoverIfRequired(
         notLeaderException.getSuggestedLeaderNodeId());
     return getRetryAction(RetryAction.RETRY, retries, failovers);
   }

   The suggested leader returned from the server is not used during failover,
   because the cause is actually a RemoteException wrapping it. So with the
   current code the suggested leader is never used for failover, and by default
   the client exhausts the maximum retries against each OM.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 330261)
Remaining Estimate: 0h
Time Spent: 10m

> Fix logic of RetryPolicy in OzoneClientSideTranslatorPB
> ---
>
> Key: HDDS-2311
> URL: https://issues.apache.org/jira/browse/HDDS-2311
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Bharat Viswanadham
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> OzoneManagerProtocolClientSideTranslatorPB.java
> L251: if (cause instanceof NotLeaderException) {
>  NotLeaderException notLeaderException = (NotLeaderException) cause;
>  omFailoverProxyProvider.performFailoverIfRequired(
>  notLeaderException.getSuggestedLeaderNodeId());
>  return getRetryAction(RetryAction.RETRY, retries, failovers);
>  }
>  
> The suggested leader returned from the server is not used during failover,
> because the cause is actually a RemoteException wrapping it. So with the
> current code the suggested leader is never used for failover, and by default
> the client exhausts the maximum retries against each OM.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2323) Mem allocation: Optimise AuditMessage::build()

2019-10-17 Thread Mukul Kumar Singh (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954221#comment-16954221
 ] 

Mukul Kumar Singh commented on HDDS-2323:
-

cc: [~dchitlangia]

> Mem allocation: Optimise AuditMessage::build()
> --
>
> Key: HDDS-2323
> URL: https://issues.apache.org/jira/browse/HDDS-2323
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: performance
> Attachments: Screenshot 2019-10-18 at 8.24.52 AM.png
>
>
> String.format() allocates/processes more than OzoneAclUtil.fromProtobuf in the 
> write benchmark.
> It would be good to use + (string concatenation) instead of String.format().
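A small, self-contained sketch of the suggestion (illustrative only; it is not the actual AuditMessage::build() code and the field names are made up): String.format re-parses the pattern and allocates a varargs array on every call, while plain + concatenation compiles down to a single StringBuilder chain.
{code:java}
public final class AuditFormatExample {
  public static void main(String[] args) {
    String user = "hive", ip = "10.0.0.1", op = "CREATE_KEY", ret = "SUCCESS";

    // Current style: the format string is parsed on every invocation.
    String formatted =
        String.format("user=%s | ip=%s | op=%s | ret=%s", user, ip, op, ret);

    // Suggested style: simple concatenation, no format parsing or varargs array.
    String concatenated =
        "user=" + user + " | ip=" + ip + " | op=" + op + " | ret=" + ret;

    // Both produce the same audit line.
    System.out.println(formatted.equals(concatenated)); // true
  }
}
{code}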



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2323) Mem allocation: Optimise AuditMessage::build()

2019-10-17 Thread Rajesh Balamohan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated HDDS-2323:
---
Labels: performance  (was: )

> Mem allocation: Optimise AuditMessage::build()
> --
>
> Key: HDDS-2323
> URL: https://issues.apache.org/jira/browse/HDDS-2323
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: performance
> Attachments: Screenshot 2019-10-18 at 8.24.52 AM.png
>
>
> String.format() allocates/processes more than OzoneAclUtil.fromProtobuf in the 
> write benchmark.
> It would be good to use + (string concatenation) instead of String.format().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2323) Mem allocation: Optimise AuditMessage::build()

2019-10-17 Thread Rajesh Balamohan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated HDDS-2323:
---
Component/s: Ozone Manager

> Mem allocation: Optimise AuditMessage::build()
> --
>
> Key: HDDS-2323
> URL: https://issues.apache.org/jira/browse/HDDS-2323
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Rajesh Balamohan
>Priority: Major
> Attachments: Screenshot 2019-10-18 at 8.24.52 AM.png
>
>
> String.format() allocates/processes more than OzoneAclUtil.fromProtobuf in the 
> write benchmark.
> It would be good to use + (string concatenation) instead of String.format().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2323) Mem allocation: Optimise AuditMessage::build()

2019-10-17 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HDDS-2323:
--

 Summary: Mem allocation: Optimise AuditMessage::build()
 Key: HDDS-2323
 URL: https://issues.apache.org/jira/browse/HDDS-2323
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2019-10-18 at 8.24.52 AM.png

String.format() allocates/processes more than OzoneAclUtil.fromProtobuf in the 
write benchmark.

It would be good to use + (string concatenation) instead of String.format().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2208) Propagate System Exceptions from OM transaction apply phase

2019-10-17 Thread Supratim Deka (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Supratim Deka updated HDDS-2208:

Status: Patch Available  (was: Open)

> Propagate System Exceptions from OM transaction apply phase
> ---
>
> Key: HDDS-2208
> URL: https://issues.apache.org/jira/browse/HDDS-2208
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: Ozone Manager
>Reporter: Supratim Deka
>Assignee: Supratim Deka
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The change for HDDS-2206 tracks system exceptions during the preExecute phase 
> of OM request handling.
> The current jira is to implement exception propagation once the OM request has 
> been submitted to Ratis, i.e. when the handler is running 
> validateAndUpdateCache for the request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Attachment: HDFS-14768.006.patch

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.006.patch, HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024K, version is hadoop 3.0.2;
> We suppose a file's block indices are [0,1,2,3,4,5,6,7,8], we decommission
> indices [3,4], and we increase the index 6 datanode's
> pendingReplicationWithoutTargets so that it is larger than
> replicationStreamsHardLimit (we set 14). Then, after the method
> chooseSourceDatanodes of BlockManager, the liveBlockIndices is
> [0,1,2,3,4,5,7,8] and the block counters are Live: 7, Decommission: 2.
> In the method scheduleReconstruction of BlockManager, additionalReplRequired
> is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it assigns an
> erasure-coding reconstruction task to them.
> When the datanode gets the task, it builds targetIndices from liveBlockIndices
> and the target length. The code is below.
> {code:java}
> // targetIndices is sized by the number of reconstruction targets
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   // an index becomes a reconstruction target if it is absent from the live bitset
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] is always 0 from its initial value.
> The StripedReader always creates readers from the first 6 indexed blocks,
> i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to build target indices [6,0] triggers the
> ISA-L bug: block index 6's data is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List decommisionNodes = new ArrayList();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode

[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Attachment: (was: HDFS-14768.006.patch)

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024K, version is hadoop 3.0.2;
> We suppose a file's block indices are [0,1,2,3,4,5,6,7,8], we decommission
> indices [3,4], and we increase the index 6 datanode's
> pendingReplicationWithoutTargets so that it is larger than
> replicationStreamsHardLimit (we set 14). Then, after the method
> chooseSourceDatanodes of BlockManager, the liveBlockIndices is
> [0,1,2,3,4,5,7,8] and the block counters are Live: 7, Decommission: 2.
> In the method scheduleReconstruction of BlockManager, additionalReplRequired
> is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it assigns an
> erasure-coding reconstruction task to them.
> When the datanode gets the task, it builds targetIndices from liveBlockIndices
> and the target length. The code is below.
> {code:java}
> // targetIndices is sized by the number of reconstruction targets
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   // an index becomes a reconstruction target if it is absent from the live bitset
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] is always 0 from its initial value.
> The StripedReader always creates readers from the first 6 indexed blocks,
> i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to build target indices [6,0] triggers the
> ISA-L bug: block index 6's data is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List decommisionNodes = new ArrayList();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode(datanodeDe

[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954202#comment-16954202
 ] 

guojh commented on HDFS-14768:
--

[~surendrasingh] I put a new path06, please review it.

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.006.patch, HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024K, version is hadoop 3.0.2;
> We suppose a file's block indices are [0,1,2,3,4,5,6,7,8], we decommission
> indices [3,4], and we increase the index 6 datanode's
> pendingReplicationWithoutTargets so that it is larger than
> replicationStreamsHardLimit (we set 14). Then, after the method
> chooseSourceDatanodes of BlockManager, the liveBlockIndices is
> [0,1,2,3,4,5,7,8] and the block counters are Live: 7, Decommission: 2.
> In the method scheduleReconstruction of BlockManager, additionalReplRequired
> is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it assigns an
> erasure-coding reconstruction task to them.
> When the datanode gets the task, it builds targetIndices from liveBlockIndices
> and the target length. The code is below.
> {code:java}
> // targetIndices is sized by the number of reconstruction targets
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   // an index becomes a reconstruction target if it is absent from the live bitset
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] is always 0 from its initial value.
> The StripedReader always creates readers from the first 6 indexed blocks,
> i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to build target indices [6,0] triggers the
> ISA-L bug: block index 6's data is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List decommisionNodes = new ArrayList();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(),

[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Attachment: HDFS-14768.006.patch

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.006.patch, HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024K, version is hadoop 3.0.2;
> We suppose a file's block indices are [0,1,2,3,4,5,6,7,8], we decommission
> indices [3,4], and we increase the index 6 datanode's
> pendingReplicationWithoutTargets so that it is larger than
> replicationStreamsHardLimit (we set 14). Then, after the method
> chooseSourceDatanodes of BlockManager, the liveBlockIndices is
> [0,1,2,3,4,5,7,8] and the block counters are Live: 7, Decommission: 2.
> In the method scheduleReconstruction of BlockManager, additionalReplRequired
> is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it assigns an
> erasure-coding reconstruction task to them.
> When the datanode gets the task, it builds targetIndices from liveBlockIndices
> and the target length. The code is below.
> {code:java}
> // targetIndices is sized by the number of reconstruction targets
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   // an index becomes a reconstruction target if it is absent from the live bitset
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] is always 0 from its initial value.
> The StripedReader always creates readers from the first 6 indexed blocks,
> i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to build target indices [6,0] triggers the
> ISA-L bug: block index 6's data is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List decommisionNodes = new ArrayList();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode

[jira] [Commented] (HDFS-14699) Erasure Coding: Storage not considered in live replica when replication streams hard limit reached to threshold

2019-10-17 Thread guojh (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954193#comment-16954193
 ] 

guojh commented on HDFS-14699:
--

[~zhaoyim] [~surendrasingh] In your test case, why do you set the priority to 
QUEUE_HIGHEST_PRIORITY? If the priority is not equal to 0, the test case does 
not pass.

{code:java}
bm.chooseSourceDatanodes(
aBlockInfoStriped,
cntNodes,
liveNodes,
numReplicas, liveBlockIndices,
liveBusyBlockIndices,
// If the priority is set to QUEUE_VERY_LOW_REDUNDANCY, this does not work.
LowRedundancyBlocks.QUEUE_HIGHEST_PRIORITY);
{code}
 

 

 

> Erasure Coding: Storage not considered in live replica when replication 
> streams hard limit reached to threshold
> ---
>
> Key: HDFS-14699
> URL: https://issues.apache.org/jira/browse/HDFS-14699
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.2.0, 3.1.1, 3.3.0
>Reporter: Zhao Yi Ming
>Assignee: Zhao Yi Ming
>Priority: Critical
>  Labels: patch
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14699.00.patch, HDFS-14699.01.patch, 
> HDFS-14699.02.patch, HDFS-14699.03.patch, HDFS-14699.04.patch, 
> HDFS-14699.05.patch, image-2019-08-20-19-58-51-872.png, 
> image-2019-09-02-17-51-46-742.png
>
>
> We tried the EC function on an 80-node cluster with hadoop 3.1.1 and hit the 
> same scenario as described in https://issues.apache.org/jira/browse/HDFS-8881. 
> Following are our testing steps; hope they are helpful. (The following DNs hold 
> the testing internal blocks.)
>  # we customized a new 10-2-1024k policy and used it on a path; now we have 12 
> internal blocks (12 live blocks)
>  # decommission one DN; after the decommission completes, we have 13 
> internal blocks (12 live blocks and 1 decommissioned block)
>  # then shut down one DN which does not hold the same block id as the 
> decommissioned block; now we have 12 internal blocks (11 live blocks and 1 
> decommissioned block)
>  # after waiting about 600s (before the heartbeat comes), recommission the 
> decommissioned DN; now we have 12 internal blocks (11 live blocks and 1 
> duplicate block)
>  # then EC does not reconstruct the missed block
> We think this is a critical issue for using the EC function in a production 
> env. Could you help? Thanks a lot!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Attachment: HDFS-14768.005.patch

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's internal block indices are [0,1,2,3,4,5,6,7,8]. We decommission 
> the DataNodes holding indices [3,4] and increase the index-6 DataNode's 
> pendingReplicationWithoutTargets so that it exceeds replicationStreamsHardLimit 
> (we set it to 14). After BlockManager#chooseSourceDatanodes runs, 
> liveBlockIndices is [0,1,2,3,4,5,7,8] and the block counter reads Live: 7, 
> Decommission: 2.
> In BlockManager#scheduleReconstruction, additionalReplRequired is 9 - 7 = 2, so 
> after the NameNode chooses two target DataNodes it assigns an erasure-coding 
> reconstruction task to them.
> When a DataNode receives the task, it builds targetIndices from liveBlockIndices 
> and the number of targets. The code is below.
> {code:java}
> // code placeholder
> targetIndices = new short[targets.length];
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] is 6, while targetIndices[1] always keeps its initial value 0.
> The StripedReader always creates readers from the first six live indices, i.e. 
> [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers the 
> ISA-L bug: the data of block index 6 is corrupted (all zeros).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode(datanodeDescriptor);

[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Attachment: (was: HDFS-14768.004.patch)

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's internal block indices are [0,1,2,3,4,5,6,7,8]. We decommission 
> the DataNodes holding indices [3,4] and increase the index-6 DataNode's 
> pendingReplicationWithoutTargets so that it exceeds replicationStreamsHardLimit 
> (we set it to 14). After BlockManager#chooseSourceDatanodes runs, 
> liveBlockIndices is [0,1,2,3,4,5,7,8] and the block counter reads Live: 7, 
> Decommission: 2.
> In BlockManager#scheduleReconstruction, additionalReplRequired is 9 - 7 = 2, so 
> after the NameNode chooses two target DataNodes it assigns an erasure-coding 
> reconstruction task to them.
> When a DataNode receives the task, it builds targetIndices from liveBlockIndices 
> and the number of targets. The code is below.
> {code:java}
> // code placeholder
> targetIndices = new short[targets.length];
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] is 6, while targetIndices[1] always keeps its initial value 0.
> The StripedReader always creates readers from the first six live indices, i.e. 
> [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers the 
> ISA-L bug: the data of block index 6 is corrupted (all zeros).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode(datanodeDe

[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Attachment: HDFS-14768.004.patch

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.004.patch, 
> HDFS-14768.jpg, guojh_UT_after_deomission.txt, 
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, 
> zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's internal block indices are [0,1,2,3,4,5,6,7,8]. We decommission 
> the DataNodes holding indices [3,4] and increase the index-6 DataNode's 
> pendingReplicationWithoutTargets so that it exceeds replicationStreamsHardLimit 
> (we set it to 14). After BlockManager#chooseSourceDatanodes runs, 
> liveBlockIndices is [0,1,2,3,4,5,7,8] and the block counter reads Live: 7, 
> Decommission: 2.
> In BlockManager#scheduleReconstruction, additionalReplRequired is 9 - 7 = 2, so 
> after the NameNode chooses two target DataNodes it assigns an erasure-coding 
> reconstruction task to them.
> When a DataNode receives the task, it builds targetIndices from liveBlockIndices 
> and the number of targets. The code is below.
> {code:java}
> // code placeholder
> targetIndices = new short[targets.length];
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] is 6, while targetIndices[1] always keeps its initial value 0.
> The StripedReader always creates readers from the first six live indices, i.e. 
> [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers the 
> ISA-L bug: the data of block index 6 is corrupted (all zeros).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode(datanodeDescriptor);

[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.

2019-10-17 Thread guojh (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954183#comment-16954183
 ] 

guojh commented on HDFS-14768:
--

[~surendrasingh] Thanks for your review. I don't understand why the code you 
pointed out would be unnecessary; liveBusyBlockIndices should be populated under 
the condition below:
{code:java}
if (priority != LowRedundancyBlocks.QUEUE_HIGHEST_PRIORITY
&& (!node.isDecommissionInProgress() && !node.isEnteringMaintenance())
&& node.getNumberOfBlocksToBeReplicated() >= maxReplicationStreams) {
  if (isStriped && state == StoredReplicaState.LIVE) {
liveBusyBlockIndices.add(blockIndex);
  }
  continue; // already reached replication limit
}
{code}
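For context, a minimal, self-contained illustration (not the DataNode code) of 
what goes wrong downstream when a busy-but-live index is not reported: under the 
scenario in the description, only index 6 is absent from the live bitset, yet two 
reconstruction targets are scheduled, so the over-sized target array keeps a 
default 0 entry.
{code:java}
// Stand-alone illustration of the scenario in the description.
public class TargetIndicesSketch {
  public static void main(String[] args) {
    boolean[] liveBitSet =
        {true, true, true, true, true, true, false, true, true};
    int scheduledTargets = 2;                 // NameNode asked for 2 targets
    short[] targetIndices = new short[scheduledTargets];
    int m = 0;
    for (short i = 0; i < liveBitSet.length; i++) {
      if (!liveBitSet[i] && m < targetIndices.length) {
        targetIndices[m++] = i;               // only index 6 is really missing
      }
    }
    // Prints [6, 0]: the second slot silently points at healthy block index 0,
    // so the decode writes reconstructed data over a block that was never lost.
    System.out.println(java.util.Arrays.toString(targetIndices));
  }
}
{code}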

> In some cases, erasure blocks are corruption  when they are reconstruct.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.jpg, 
> guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, 
> zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's internal block indices are [0,1,2,3,4,5,6,7,8]. We decommission 
> the DataNodes holding indices [3,4] and increase the index-6 DataNode's 
> pendingReplicationWithoutTargets so that it exceeds replicationStreamsHardLimit 
> (we set it to 14). After BlockManager#chooseSourceDatanodes runs, 
> liveBlockIndices is [0,1,2,3,4,5,7,8] and the block counter reads Live: 7, 
> Decommission: 2.
> In BlockManager#scheduleReconstruction, additionalReplRequired is 9 - 7 = 2, so 
> after the NameNode chooses two target DataNodes it assigns an erasure-coding 
> reconstruction task to them.
> When a DataNode receives the task, it builds targetIndices from liveBlockIndices 
> and the number of targets. The code is below.
> {code:java}
> // code placeholder
> targetIndices = new short[targets.length];
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] is 6, while targetIndices[1] always keeps its initial value 0.
> The StripedReader always creates readers from the first six live indices, i.e. 
> [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers the 
> ISA-L bug: the data of block index 6 is corrupted (all zeros).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageI

[jira] [Assigned] (HDDS-2322) DoubleBuffer flush termination and OM is shutdown

2019-10-17 Thread Bharat Viswanadham (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharat Viswanadham reassigned HDDS-2322:


Assignee: Bharat Viswanadham

> DoubleBuffer flush termination and OM is shutdown
> -
>
> Key: HDDS-2322
> URL: https://issues.apache.org/jira/browse/HDDS-2322
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>
> om1_1       | 2019-10-18 00:34:45,317 [OMDoubleBufferFlushThread] ERROR      
> - Terminating with exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> om1_1       | java.util.ConcurrentModificationException
> om1_1       | at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1660)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> om1_1       | at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> om1_1       | at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
> om1_1       | at 
> org.apache.hadoop.ozone.om.helpers.OmKeyLocationInfoGroup.getProtobuf(OmKeyLocationInfoGroup.java:65)
> om1_1       | at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> om1_1       | at 
> java.base/java.util.Collections$2.tryAdvance(Collections.java:4745)
> om1_1       | at 
> java.base/java.util.Collections$2.forEachRemaining(Collections.java:4753)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> om1_1       | at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> om1_1       | at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
> om1_1       | at 
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getProtobuf(OmKeyInfo.java:362)
> om1_1       | at 
> org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:37)
> om1_1       | at 
> org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:31)
> om1_1       | at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
> om1_1       | at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
> om1_1       | at 
> org.apache.hadoop.ozone.om.response.key.OMKeyCreateResponse.addToDBBatch(OMKeyCreateResponse.java:58)
> om1_1       | at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:139)
> om1_1       | at 
> java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
> om1_1       | at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:137)
> om1_1       | at java.base/java.lang.Thread.run(Thread.java:834)
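The trace points at OmKeyLocationInfoGroup#getProtobuf streaming over a location 
list while another thread mutates it. A minimal, self-contained sketch of that 
failure mode and the usual remedy follows; this is generic Java, not the Ozone 
classes: ArrayList's fail-fast spliterator throws ConcurrentModificationException 
when the list changes during traversal, and streaming over a snapshot avoids it 
on the reader side.
{code:java}
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.stream.Collectors;

public class CmeSketch {
  public static void main(String[] args) {
    List<Integer> locations = new ArrayList<>(List.of(1, 2, 3));
    try {
      // Mutating the list while its stream is being traversed trips the
      // fail-fast check in ArrayListSpliterator.forEachRemaining.
      locations.stream()
          .peek(x -> { if (x == 2) locations.add(99); })
          .forEach(x -> { });
    } catch (ConcurrentModificationException e) {
      System.out.println("same failure mode as the OM log: " + e);
    }
    // Remedy for the reader: stream over a snapshot instead of the live list
    // (the writer side still needs its own synchronization for correctness).
    List<String> copy = new ArrayList<>(locations).stream()
        .map(String::valueOf).collect(Collectors.toList());
    System.out.println(copy);
  }
}
{code}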



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2322) DoubleBuffer flush termination and OM shutdown's after that.

2019-10-17 Thread Bharat Viswanadham (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharat Viswanadham updated HDDS-2322:
-
Summary: DoubleBuffer flush termination and OM shutdown's after that.  
(was: DoubleBuffer flush termination and OM is shutdown)

> DoubleBuffer flush termination and OM shutdown's after that.
> 
>
> Key: HDDS-2322
> URL: https://issues.apache.org/jira/browse/HDDS-2322
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>
> om1_1       | 2019-10-18 00:34:45,317 [OMDoubleBufferFlushThread] ERROR      
> - Terminating with exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> om1_1       | java.util.ConcurrentModificationException
> om1_1       | at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1660)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> om1_1       | at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> om1_1       | at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
> om1_1       | at 
> org.apache.hadoop.ozone.om.helpers.OmKeyLocationInfoGroup.getProtobuf(OmKeyLocationInfoGroup.java:65)
> om1_1       | at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> om1_1       | at 
> java.base/java.util.Collections$2.tryAdvance(Collections.java:4745)
> om1_1       | at 
> java.base/java.util.Collections$2.forEachRemaining(Collections.java:4753)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> om1_1       | at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> om1_1       | at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
> om1_1       | at 
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getProtobuf(OmKeyInfo.java:362)
> om1_1       | at 
> org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:37)
> om1_1       | at 
> org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:31)
> om1_1       | at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
> om1_1       | at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
> om1_1       | at 
> org.apache.hadoop.ozone.om.response.key.OMKeyCreateResponse.addToDBBatch(OMKeyCreateResponse.java:58)
> om1_1       | at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:139)
> om1_1       | at 
> java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
> om1_1       | at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:137)
> om1_1       | at java.base/java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2322) DoubleBuffer flush termination and OM is shutdown

2019-10-17 Thread Bharat Viswanadham (Jira)
Bharat Viswanadham created HDDS-2322:


 Summary: DoubleBuffer flush termination and OM is shutdown
 Key: HDDS-2322
 URL: https://issues.apache.org/jira/browse/HDDS-2322
 Project: Hadoop Distributed Data Store
  Issue Type: Task
Reporter: Bharat Viswanadham


om1_1       | 2019-10-18 00:34:45,317 [OMDoubleBufferFlushThread] ERROR      - 
Terminating with exit status 2: OMDoubleBuffer flush 
threadOMDoubleBufferFlushThreadencountered Throwable error

om1_1       | java.util.ConcurrentModificationException

om1_1       | at 
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1660)

om1_1       | at 
java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)

om1_1       | at 
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)

om1_1       | at 
java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)

om1_1       | at 
java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)

om1_1       | at 
java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)

om1_1       | at 
org.apache.hadoop.ozone.om.helpers.OmKeyLocationInfoGroup.getProtobuf(OmKeyLocationInfoGroup.java:65)

om1_1       | at 
java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)

om1_1       | at 
java.base/java.util.Collections$2.tryAdvance(Collections.java:4745)

om1_1       | at 
java.base/java.util.Collections$2.forEachRemaining(Collections.java:4753)

om1_1       | at 
java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)

om1_1       | at 
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)

om1_1       | at 
java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)

om1_1       | at 
java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)

om1_1       | at 
java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)

om1_1       | at 
org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getProtobuf(OmKeyInfo.java:362)

om1_1       | at 
org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:37)

om1_1       | at 
org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:31)

om1_1       | at 
org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)

om1_1       | at 
org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)

om1_1       | at 
org.apache.hadoop.ozone.om.response.key.OMKeyCreateResponse.addToDBBatch(OMKeyCreateResponse.java:58)

om1_1       | at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:139)

om1_1       | at 
java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)

om1_1       | at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:137)

om1_1       | at java.base/java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2131) Optimize replication type and creation time calculation in S3 MPU list call

2019-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2131?focusedWorklogId=330179&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330179
 ]

ASF GitHub Bot logged work on HDDS-2131:


Author: ASF GitHub Bot
Created on: 17/Oct/19 23:03
Start Date: 17/Oct/19 23:03
Worklog Time Spent: 10m 
  Work Description: swagle commented on pull request #50: HDDS-2131. 
Optimize replication type and creation time calculation in S3 MPU list call.
URL: https://github.com/apache/hadoop-ozone/pull/50
 
 
   ## What changes were proposed in this pull request?
   Optimize listMultiPartUpload to not read from openKeyTable.
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-2131
   
   ## How was this patch tested?
   Ran related unit tests.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 330179)
Remaining Estimate: 0h
Time Spent: 10m

> Optimize replication type and creation time calculation in S3 MPU list call
> ---
>
> Key: HDDS-2131
> URL: https://issues.apache.org/jira/browse/HDDS-2131
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Marton Elek
>Assignee: Siddharth Wagle
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Based on the review from [~bharatviswa]:
> {code}
>  
> hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java
>   metadataManager.getOpenKeyTable();
>   OmKeyInfo omKeyInfo =
>   openKeyTable.get(upload.getDbKey());
> {code}
> {quote}Here we are reading openKeyTable only to get the creation time. If we 
> had this information in OmMultipartKeyInfo, we could avoid the DB calls to 
> openKeyTable.
> To do this, we can set creationTime in OmMultipartKeyInfo during 
> initiateMultipartUpload. That way, all the required information can be read 
> from the MultipartKeyInfo table.
> StorageClass is also missing from the returned OmMultipartUpload, even though 
> listMultipartUploads shows StorageClass information. For this, we can return 
> replicationType and, depending on its value, set StorageClass in the 
> listMultipartUploads response.
> {quote}
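A minimal, stand-alone sketch of the proposal; the types below are hypothetical 
stand-ins, not the real Ozone classes, and the mapping from replication type to 
StorageClass is an assumption for illustration. The idea is to carry creationTime 
and replicationType on the multipart-upload record written at initiate time, so 
the list call never touches the open-key table.
{code:java}
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MpuListSketch {
  // Hypothetical stand-ins for OmMultipartKeyInfo and the S3 list response entry.
  record MultipartKeyInfo(String uploadId, String replicationType, long creationTime) {}
  record MpuListEntry(String uploadId, String storageClass, long creationTime) {}

  static List<MpuListEntry> listUploads(Map<String, MultipartKeyInfo> multipartInfoTable) {
    return multipartInfoTable.values().stream()
        // Everything needed comes from the multipart-info record itself; no
        // second lookup in an open-key table is required.
        .map(info -> new MpuListEntry(
            info.uploadId(),
            // Assumed mapping: RATIS -> STANDARD, anything else -> REDUCED_REDUNDANCY.
            "RATIS".equals(info.replicationType()) ? "STANDARD" : "REDUCED_REDUNDANCY",
            info.creationTime()))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Map<String, MultipartKeyInfo> table = Map.of(
        "bucket/key/upload-1",
        new MultipartKeyInfo("upload-1", "RATIS", Instant.now().toEpochMilli()));
    listUploads(table).forEach(System.out::println);
  }
}
{code}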



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2131) Optimize replication type and creation time calculation in S3 MPU list call

2019-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-2131:
-
Labels: pull-request-available  (was: )

> Optimize replication type and creation time calculation in S3 MPU list call
> ---
>
> Key: HDDS-2131
> URL: https://issues.apache.org/jira/browse/HDDS-2131
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Marton Elek
>Assignee: Siddharth Wagle
>Priority: Major
>  Labels: pull-request-available
>
> Based on the review from [~bharatviswa]:
> {code}
>  
> hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java
>   metadataManager.getOpenKeyTable();
>   OmKeyInfo omKeyInfo =
>   openKeyTable.get(upload.getDbKey());
> {code}
> {quote}Here we are reading openKeyTable only to get the creation time. If we 
> had this information in OmMultipartKeyInfo, we could avoid the DB calls to 
> openKeyTable.
> To do this, we can set creationTime in OmMultipartKeyInfo during 
> initiateMultipartUpload. That way, all the required information can be read 
> from the MultipartKeyInfo table.
> StorageClass is also missing from the returned OmMultipartUpload, even though 
> listMultipartUploads shows StorageClass information. For this, we can return 
> replicationType and, depending on its value, set StorageClass in the 
> listMultipartUploads response.
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-2320) Negative value seen for OM NumKeys Metric in JMX.

2019-10-17 Thread Aravindan Vijayan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aravindan Vijayan reassigned HDDS-2320:
---

Assignee: Aravindan Vijayan

> Negative value seen for OM NumKeys Metric in JMX.
> -
>
> Key: HDDS-2320
> URL: https://issues.apache.org/jira/browse/HDDS-2320
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Aravindan Vijayan
>Assignee: Aravindan Vijayan
>Priority: Major
> Attachments: Screen Shot 2019-10-17 at 11.31.08 AM.png
>
>
> While running teragen/terasort on a cluster and verifying the number of keys 
> created on the Ozone Manager, I noticed that the value of the NumKeys counter 
> metric was negative !Screen Shot 2019-10-17 at 11.31.08 AM.png!.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDDS-2254) Fix flaky unit testTestContainerStateMachine#testRatisSnapshotRetention

2019-10-17 Thread Anu Engineer (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anu Engineer resolved HDDS-2254.

Fix Version/s: 0.5.0
   Resolution: Fixed

Committed to the master branch.

> Fix flaky unit testTestContainerStateMachine#testRatisSnapshotRetention
> ---
>
> Key: HDDS-2254
> URL: https://issues.apache.org/jira/browse/HDDS-2254
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.5.0
>Reporter: Siddharth Wagle
>Assignee: Aravindan Vijayan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Test always fails with assertion error:
> {code}
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.ozone.client.rpc.TestContainerStateMachine.testRatisSnapshotRetention(TestContainerStateMachine.java:188)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2254) Fix flaky unit testTestContainerStateMachine#testRatisSnapshotRetention

2019-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2254?focusedWorklogId=330168&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330168
 ]

ASF GitHub Bot logged work on HDDS-2254:


Author: ASF GitHub Bot
Created on: 17/Oct/19 22:12
Start Date: 17/Oct/19 22:12
Worklog Time Spent: 10m 
  Work Description: anuengineer commented on pull request #31: HDDS-2254. 
Fix flaky unit test TestContainerStateMachine#testRatisSn…
URL: https://github.com/apache/hadoop-ozone/pull/31
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 330168)
Time Spent: 1h 50m  (was: 1h 40m)

> Fix flaky unit testTestContainerStateMachine#testRatisSnapshotRetention
> ---
>
> Key: HDDS-2254
> URL: https://issues.apache.org/jira/browse/HDDS-2254
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.5.0
>Reporter: Siddharth Wagle
>Assignee: Aravindan Vijayan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Test always fails with assertion error:
> {code}
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.ozone.client.rpc.TestContainerStateMachine.testRatisSnapshotRetention(TestContainerStateMachine.java:188)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14912) Set dfs.image.string-tables.expanded default to false in branch-2.7

2019-10-17 Thread Siyao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng updated HDFS-14912:
--
Summary: Set dfs.image.string-tables.expanded default to false in 
branch-2.7  (was: Set dfs.image.string-tables.expanded default config value to 
false in branch-2.7)

> Set dfs.image.string-tables.expanded default to false in branch-2.7
> ---
>
> Key: HDFS-14912
> URL: https://issues.apache.org/jira/browse/HDFS-14912
> Project: Hadoop HDFS
>  Issue Type: Task
>Affects Versions: 2.7.8
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Major
> Attachments: HDFS-14912.001.branch-2.7.patch
>
>
> In the branch-2.7 patch for CVE-2018-11768 HDFS FSImage Corruption, 
> dfs.image.string-tables.expanded is set to true by default: 
> https://github.com/apache/hadoop/commit/109d44604ca843212bdf22b50e86a5a41e1d21da#diff-36b19e9d8816002ed9dff8580055d3fbR627
> This is different from all other branches, which set it to false by default.
> For instance, branch-2.8: 
> https://github.com/apache/hadoop/commit/f697f3c4fc0067bb82494e445900d86942685b09#diff-36b19e9d8816002ed9dff8580055d3fbR629
> Goal: Flip the dfs.image.string-tables.expanded default in branch-2.7 to 
> false to make it consistent with other branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14912) Set dfs.image.string-tables.expanded default config value to false in branch-2.7

2019-10-17 Thread Siyao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng updated HDFS-14912:
--
Summary: Set dfs.image.string-tables.expanded default config value to false 
in branch-2.7  (was: Default dfs.image.string-tables.expanded to false in 
branch-2.7)

> Set dfs.image.string-tables.expanded default config value to false in 
> branch-2.7
> 
>
> Key: HDFS-14912
> URL: https://issues.apache.org/jira/browse/HDFS-14912
> Project: Hadoop HDFS
>  Issue Type: Task
>Affects Versions: 2.7.8
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Major
> Attachments: HDFS-14912.001.branch-2.7.patch
>
>
> In the branch-2.7 patch for CVE-2018-11768 HDFS FSImage Corruption, 
> dfs.image.string-tables.expanded is set to true by default: 
> https://github.com/apache/hadoop/commit/109d44604ca843212bdf22b50e86a5a41e1d21da#diff-36b19e9d8816002ed9dff8580055d3fbR627
> This is different from all other branches, which set it to false by default.
> For instance, branch-2.8: 
> https://github.com/apache/hadoop/commit/f697f3c4fc0067bb82494e445900d86942685b09#diff-36b19e9d8816002ed9dff8580055d3fbR629
> Goal: Flip the dfs.image.string-tables.expanded default in branch-2.7 to 
> false to make it consistent with other branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14912) Default dfs.image.string-tables.expanded to false in branch-2.7

2019-10-17 Thread Siyao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng updated HDFS-14912:
--
Attachment: HDFS-14912.001.branch-2.7.patch
Status: Patch Available  (was: Open)

> Default dfs.image.string-tables.expanded to false in branch-2.7
> ---
>
> Key: HDFS-14912
> URL: https://issues.apache.org/jira/browse/HDFS-14912
> Project: Hadoop HDFS
>  Issue Type: Task
>Affects Versions: 2.7.8
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Major
> Attachments: HDFS-14912.001.branch-2.7.patch
>
>
> In the branch-2.7 patch for CVE-2018-11768 HDFS FSImage Corruption, 
> dfs.image.string-tables.expanded is set to true by default: 
> https://github.com/apache/hadoop/commit/109d44604ca843212bdf22b50e86a5a41e1d21da#diff-36b19e9d8816002ed9dff8580055d3fbR627
> This is different from all other branches, which set it to false by default.
> For instance, branch-2.8: 
> https://github.com/apache/hadoop/commit/f697f3c4fc0067bb82494e445900d86942685b09#diff-36b19e9d8816002ed9dff8580055d3fbR629
> Goal: Flip the dfs.image.string-tables.expanded default in branch-2.7 to 
> false to make it consistent with other branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14912) Default dfs.image.string-tables.expanded to false in branch-2.7

2019-10-17 Thread Siyao Meng (Jira)
Siyao Meng created HDFS-14912:
-

 Summary: Default dfs.image.string-tables.expanded to false in 
branch-2.7
 Key: HDFS-14912
 URL: https://issues.apache.org/jira/browse/HDFS-14912
 Project: Hadoop HDFS
  Issue Type: Task
Affects Versions: 2.7.8
Reporter: Siyao Meng
Assignee: Siyao Meng


In the branch-2.7 patch for CVE-2018-11768 HDFS FSImage Corruption, 
dfs.image.string-tables.expanded is set to true by default: 
https://github.com/apache/hadoop/commit/109d44604ca843212bdf22b50e86a5a41e1d21da#diff-36b19e9d8816002ed9dff8580055d3fbR627

This is different from all other branches, which set it to false by default.
For instance, branch-2.8: 
https://github.com/apache/hadoop/commit/f697f3c4fc0067bb82494e445900d86942685b09#diff-36b19e9d8816002ed9dff8580055d3fbR629

Goal: Flip the dfs.image.string-tables.expanded default in branch-2.7 to false 
to make it consistent with other branches.
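A minimal sketch of what the default flip means for readers of this key; this is 
generic Hadoop Configuration usage, with the literal key name taken from the 
issue rather than a named constant from the code: with the branch-2.7 default 
changed to false, the expanded string-table behaviour must be opted into 
explicitly, matching the other branches.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class StringTablesDefaultSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // After the change, this returns false unless the cluster opts in.
    boolean expanded =
        conf.getBoolean("dfs.image.string-tables.expanded", false);
    System.out.println("expanded string tables: " + expanded);
  }
}
{code}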



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2310) Add support to add ozone ranger plugin to Ozone Manager classpath

2019-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2310?focusedWorklogId=330126&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330126
 ]

ASF GitHub Bot logged work on HDDS-2310:


Author: ASF GitHub Bot
Created on: 17/Oct/19 20:35
Start Date: 17/Oct/19 20:35
Worklog Time Spent: 10m 
  Work Description: vivekratnavel commented on pull request #49: HDDS-2310. 
Add support to add ozone ranger plugin to Ozone Manager cl…
URL: https://github.com/apache/hadoop-ozone/pull/49
 
 
   …asspath.
   
   ## What changes were proposed in this pull request?
   
   This PR adds a new Hadoop shell profile for Ozone Manager so that its 
classpath can be extended. It is implemented in a generic way and is not 
exclusive to the Ranger plugin: any path can be added to the Ozone Manager 
classpath by setting the env variable `OZONE_MANAGER_CLASSPATH`.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-2310
   
   ## How was this patch tested?
   
   Tested by running Ozone Manager locally with the above-mentioned env variable 
set to a path:
   ```
   export OZONE_MANAGER_CLASSPATH=/tmp/*
   ./hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/bin/ozone --debug om
   ```
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 330126)
Remaining Estimate: 0h
Time Spent: 10m

> Add support to add ozone ranger plugin to Ozone Manager classpath
> -
>
> Key: HDDS-2310
> URL: https://issues.apache.org/jira/browse/HDDS-2310
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>  Components: Ozone Manager
>Affects Versions: 0.5.0
>Reporter: Vivek Ratnavel Subramanian
>Assignee: Vivek Ratnavel Subramanian
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, there is no way to add Ozone Ranger plugin to Ozone Manager 
> classpath. 
> We should be able to set an environment variable that will be respected by 
> ozone and added to Ozone Manager classpath.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2310) Add support to add ozone ranger plugin to Ozone Manager classpath

2019-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-2310:
-
Labels: pull-request-available  (was: )

> Add support to add ozone ranger plugin to Ozone Manager classpath
> -
>
> Key: HDDS-2310
> URL: https://issues.apache.org/jira/browse/HDDS-2310
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>  Components: Ozone Manager
>Affects Versions: 0.5.0
>Reporter: Vivek Ratnavel Subramanian
>Assignee: Vivek Ratnavel Subramanian
>Priority: Major
>  Labels: pull-request-available
>
> Currently, there is no way to add Ozone Ranger plugin to Ozone Manager 
> classpath. 
> We should be able to set an environment variable that will be respected by 
> ozone and added to Ozone Manager classpath.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2181) Ozone Manager should send correct ACL type in ACL requests to Authorizer

2019-10-17 Thread Vivek Ratnavel Subramanian (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Ratnavel Subramanian updated HDDS-2181:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Ozone Manager should send correct ACL type in ACL requests to Authorizer
> 
>
> Key: HDDS-2181
> URL: https://issues.apache.org/jira/browse/HDDS-2181
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
>Reporter: Vivek Ratnavel Subramanian
>Assignee: Vivek Ratnavel Subramanian
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> Currently, Ozone Manager sends "WRITE" as the ACLType for key create, key 
> delete, and bucket create operations. Fix the ACL type in all requests to the 
> authorizer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2320) Negative value seen for OM NumKeys Metric in JMX.

2019-10-17 Thread Aravindan Vijayan (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954076#comment-16954076
 ] 

Aravindan Vijayan commented on HDDS-2320:
-

cc [~hanishakoneru] / [~bharat]

> Negative value seen for OM NumKeys Metric in JMX.
> -
>
> Key: HDDS-2320
> URL: https://issues.apache.org/jira/browse/HDDS-2320
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Aravindan Vijayan
>Priority: Major
> Attachments: Screen Shot 2019-10-17 at 11.31.08 AM.png
>
>
> While running teragen/terasort on a cluster and verifying the number of keys 
> created on the Ozone Manager, I noticed that the value of the NumKeys counter 
> metric was negative !Screen Shot 2019-10-17 at 11.31.08 AM.png!.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2321) Ozone Block Token verify should not apply to all datanode cmd

2019-10-17 Thread Anu Engineer (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954062#comment-16954062
 ] 

Anu Engineer commented on HDDS-2321:


Since SCM has the root certificate, it might be interesting to have it send a 
token over as well; that way these commands are also verified.

In the long run, or even the short run, these SCM commands to DNs will go away.

> Ozone Block Token verify should not apply to all datanode cmd
> -
>
> Key: HDDS-2321
> URL: https://issues.apache.org/jira/browse/HDDS-2321
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.4.1
>Reporter: Nilotpal Nandi
>Assignee: Xiaoyu Yao
>Priority: Major
>
> The DN container protocol receives commands sent from SCM or other DNs, which 
> do not carry an OM block token the way OM clients do. We should restrict the OM 
> block token check to requests issued from OM clients.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14911) hdfs oiv "Delimited" processer failing with java.lang.UnsatisfiedLinkError

2019-10-17 Thread bb (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bb updated HDFS-14911:
--
Issue Type: Bug  (was: Wish)

> hdfs oiv "Delimited" processer failing with java.lang.UnsatisfiedLinkError
> --
>
> Key: HDFS-14911
> URL: https://issues.apache.org/jira/browse/HDFS-14911
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, namenode
>Affects Versions: 2.6.0
>Reporter: bb
>Priority: Major
>
> The hdfs oiv "Delimited" processor is failing as follows:
> {code:java}
> root@hostname [/dirname] $ hdfs oiv --inputFile 
> /dirname/user1/fsimage_12908988825 --temp /dirname/user1 --outputFile 
> /dirname/user1/fsimage.dsv --processor Delimited -delimiter "|"
>  Exception in thread "main" java.lang.UnsatisfiedLinkError: Could not load 
> library. Reasons: [no leveldbjni64-1.8 in java.library.path, no 
> leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, 
> /tmp/libleveldbjni-64-1-1996775695699915007.8: 
> /tmp/libleveldbjni-64-1-1996775695699915007.8: failed to map segment from 
> shared object: Operation not permitted]
>  at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182)
>  at org.fusesource.hawtjni.runtime.Library.load(Library.java:140)
>  at org.fusesource.leveldbjni.JniDBFactory.(JniDBFactory.java:48)
>  at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap$LevelDBStore.(PBImageTextWriter.java:235)
>  at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap.(PBImageTextWriter.java:306)
>  at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.(PBImageTextWriter.java:409)
>  at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageDelimitedTextWriter.(PBImageDelimitedTextWriter.java:56)
>  at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:210)
>  at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:138){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14911) hdfs oiv "Delimited" processer failing with java.lang.UnsatisfiedLinkError

2019-10-17 Thread bb (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bb updated HDFS-14911:
--
Description: 
The hdfs oiv "Delimited" processor is failing as follows:
{code:java}
root@hostname [/dirname] $ hdfs oiv --inputFile 
/dirname/user1/fsimage_12908988825 --temp /dirname/user1 --outputFile 
/dirname/user1/fsimage.dsv --processor Delimited -delimiter "|"
 Exception in thread "main" java.lang.UnsatisfiedLinkError: Could not load 
library. Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 
in java.library.path, no leveldbjni in java.library.path, 
/tmp/libleveldbjni-64-1-1996775695699915007.8: 
/tmp/libleveldbjni-64-1-1996775695699915007.8: failed to map segment from 
shared object: Operation not permitted]
 at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182)
 at org.fusesource.hawtjni.runtime.Library.load(Library.java:140)
 at org.fusesource.leveldbjni.JniDBFactory.(JniDBFactory.java:48)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap$LevelDBStore.(PBImageTextWriter.java:235)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap.(PBImageTextWriter.java:306)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.(PBImageTextWriter.java:409)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageDelimitedTextWriter.(PBImageDelimitedTextWriter.java:56)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:210)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:138){code}

  was:
The hdfs oiv "Delimited" processor is failing as follows:

root@hostname [/dirname] $ hdfs oiv --inputFile 
/dirname/user1/fsimage_12908988825 --temp /dirname/user1 --outputFile 
/dirname/user1/fsimage.dsv --processor Delimited -delimiter "|"
Exception in thread "main" java.lang.UnsatisfiedLinkError: Could not load 
library. Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 
in java.library.path, no leveldbjni in java.library.path, 
/tmp/libleveldbjni-64-1-1996775695699915007.8: 
/tmp/libleveldbjni-64-1-1996775695699915007.8: failed to map segment from 
shared object: Operation not permitted]
 at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182)
 at org.fusesource.hawtjni.runtime.Library.load(Library.java:140)
 at org.fusesource.leveldbjni.JniDBFactory.(JniDBFactory.java:48)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap$LevelDBStore.(PBImageTextWriter.java:235)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap.(PBImageTextWriter.java:306)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.(PBImageTextWriter.java:409)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageDelimitedTextWriter.(PBImageDelimitedTextWriter.java:56)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:210)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:138)


> hdfs oiv "Delimited" processor failing with java.lang.UnsatisfiedLinkError
> --
>
> Key: HDFS-14911
> URL: https://issues.apache.org/jira/browse/HDFS-14911
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: hdfs, namenode
>Affects Versions: 2.6.0
>Reporter: bb
>Priority: Major
>
> The hdfs oiv "Delimited" processor is failing as follows:
> {code:java}
> root@hostname [/dirname] $ hdfs oiv --inputFile 
> /dirname/user1/fsimage_12908988825 --temp /dirname/user1 --outputFile 
> /dirname/user1/fsimage.dsv --processor Delimited -delimiter "|"
>  Exception in thread "main" java.lang.UnsatisfiedLinkError: Could not load 
> library. Reasons: [no leveldbjni64-1.8 in java.library.path, no 
> leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, 
> /tmp/libleveldbjni-64-1-1996775695699915007.8: 
> /tmp/libleveldbjni-64-1-1996775695699915007.8: failed to map segment from 
> shared object: Operation not permitted]
>  at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182)
>  at org.fusesource.hawtjni.runtime.Library.load(Library.java:140)
>  at org.fusesource.leveldbjni.JniDBFactory.(JniDBFactory.java:48)
>  at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap$LevelDBStore.(PBImageTextWriter.java:235)
>  at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap.(PBImageTextWriter.java:306)
>  at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.(PBImageTextWriter.java:409)
>  at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageDeli

[jira] [Created] (HDDS-2321) Ozone Block Token verify should not apply to all datanode cmd

2019-10-17 Thread Xiaoyu Yao (Jira)
Xiaoyu Yao created HDDS-2321:


 Summary: Ozone Block Token verify should not apply to all datanode 
cmd
 Key: HDDS-2321
 URL: https://issues.apache.org/jira/browse/HDDS-2321
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
Affects Versions: 0.4.1
Reporter: Nilotpal Nandi
Assignee: Xiaoyu Yao


The DN container protocol receives commands sent from SCM or from other DNs, which 
do not carry an OM block token the way OM client requests do. We should restrict the 
OM block token check to commands issued by OM clients. 
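A minimal sketch of the intended restriction, using illustrative names only (the real 
datanode dispatcher and command types differ):
{code:java}
// Hedged sketch of the idea: verify the OM block token only for commands that
// originate from OM clients; SCM- and peer-DN-issued commands do not carry one.
enum DnCmdOrigin { OM_CLIENT, SCM, PEER_DATANODE }

boolean requiresOmBlockTokenCheck(DnCmdOrigin origin) {
  return origin == DnCmdOrigin.OM_CLIENT;
}
{code}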



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2181) Ozone Manager should send correct ACL type in ACL requests to Authorizer

2019-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2181?focusedWorklogId=330077&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330077
 ]

ASF GitHub Bot logged work on HDDS-2181:


Author: ASF GitHub Bot
Created on: 17/Oct/19 19:13
Start Date: 17/Oct/19 19:13
Worklog Time Spent: 10m 
  Work Description: xiaoyuyao commented on pull request #43: HDDS-2181. 
Ozone Manager should send correct ACL type in ACL requests…
URL: https://github.com/apache/hadoop-ozone/pull/43
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 330077)
Time Spent: 11h 10m  (was: 11h)

> Ozone Manager should send correct ACL type in ACL requests to Authorizer
> 
>
> Key: HDDS-2181
> URL: https://issues.apache.org/jira/browse/HDDS-2181
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
>Reporter: Vivek Ratnavel Subramanian
>Assignee: Vivek Ratnavel Subramanian
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> Currently, Ozone Manager sends "WRITE" as the ACLType for key create, key delete 
> and bucket create operations. Fix the ACL type in all requests to the 
> authorizer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2320) Negative value seen for OM NumKeys Metric in JMX.

2019-10-17 Thread Aravindan Vijayan (Jira)
Aravindan Vijayan created HDDS-2320:
---

 Summary: Negative value seen for OM NumKeys Metric in JMX.
 Key: HDDS-2320
 URL: https://issues.apache.org/jira/browse/HDDS-2320
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
  Components: Ozone Manager
Reporter: Aravindan Vijayan
 Attachments: Screen Shot 2019-10-17 at 11.31.08 AM.png

While running teragen/terasort on a cluster and verifying the number of keys 
created on the Ozone Manager, I noticed that the value of the NumKeys counter 
metric was negative: !Screen Shot 2019-10-17 at 11.31.08 AM.png!





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14911) hdfs oiv "Delimited" processor failing with java.lang.UnsatisfiedLinkError

2019-10-17 Thread bb (Jira)
bb created HDFS-14911:
-

 Summary: hdfs oiv "Delimited" processor failing with 
java.lang.UnsatisfiedLinkError
 Key: HDFS-14911
 URL: https://issues.apache.org/jira/browse/HDFS-14911
 Project: Hadoop HDFS
  Issue Type: Wish
  Components: hdfs, namenode
Affects Versions: 2.6.0
Reporter: bb


The hdfs oiv "Delimited" processor is failing as follows:

root@hostname [/dirname] $ hdfs oiv --inputFile 
/dirname/user1/fsimage_12908988825 --temp /dirname/user1 --outputFile 
/dirname/user1/fsimage.dsv --processor Delimited -delimiter "|"
Exception in thread "main" java.lang.UnsatisfiedLinkError: Could not load 
library. Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 
in java.library.path, no leveldbjni in java.library.path, 
/tmp/libleveldbjni-64-1-1996775695699915007.8: 
/tmp/libleveldbjni-64-1-1996775695699915007.8: failed to map segment from 
shared object: Operation not permitted]
 at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182)
 at org.fusesource.hawtjni.runtime.Library.load(Library.java:140)
 at org.fusesource.leveldbjni.JniDBFactory.(JniDBFactory.java:48)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap$LevelDBStore.(PBImageTextWriter.java:235)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap.(PBImageTextWriter.java:306)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.(PBImageTextWriter.java:409)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageDelimitedTextWriter.(PBImageDelimitedTextWriter.java:56)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:210)
 at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:138)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corrupted when they are reconstructed.

2019-10-17 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953990#comment-16953990
 ] 

Surendra Singh Lilhore commented on HDFS-14768:
---

Thanks [~gjhkael] for the patch.

Let's handle the other scenario in HDFS-14847.

Changes look good to me. Minor comments:
 # Please fix the checkstyle warning.
 # There is no functional change in {{StripedWriter.java}}; please remove it from the patch.
 # There are some unnecessary changes that are not required; please revert them, for example:

{code:java}
-blockIndex = ((BlockInfoStriped) block)
-.getStorageBlockIndex(storage);
+blockIndex = ((BlockInfoStriped) block).
+getStorageBlockIndex(storage);
 if (state == StoredReplicaState.LIVE) {
   if (!bitSet.get(blockIndex)) {
 bitSet.set(blockIndex);
-  } else {
+  } else  { {code}
{code:java}
   (getSrcNodes()[i].isEnteringMaintenance() &&
-  getSrcNodes()[i].isAlive())) {
-srcIndices.add(i);
+  getSrcNodes()[i].isAlive())) {
+  srcIndices.add(i);
   } {code}

> In some cases, erasure blocks are corrupted when they are reconstructed.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.jpg, 
> guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, 
> zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt
>
>
> The policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission 
> indices [3,4] and increase the index-6 datanode's 
> pendingReplicationWithoutTargets so that it exceeds 
> replicationStreamsHardLimit (we set it to 14). Then, after 
> BlockManager#chooseSourceDatanodes, the liveBlockIndices are 
> [0,1,2,3,4,5,7,8] and the block counters are Live: 7, Decommission: 2. 
> In BlockManager#scheduleReconstruction, additionalReplRequired 
> is 9 - 7 = 2. After the Namenode chooses two target datanodes, it assigns an 
> erasure-coding reconstruction task to a target datanode.
> When the datanode gets the task, it builds targetIndices from liveBlockIndices 
> and the target length. The code is below.
> {code:java}
> // code placeholder
> targetIndices = new short[targets.length];
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] stays 0 (its initial value).
> The StripedReader always creates readers from the first 6 block indices, which 
> are [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to build target indices [6,0] triggers the 
> ISA-L bug: block index 6's data is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanode

[jira] [Assigned] (HDDS-2145) Optimize client read path by reading multiple chunks along with block info in a single rpc call.

2019-10-17 Thread Mukul Kumar Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Kumar Singh reassigned HDDS-2145:
---

Assignee: Hanisha Koneru  (was: Shashikant Banerjee)

> Optimize client read path by reading multiple chunks along with block info in 
> a single rpc call.
> 
>
> Key: HDDS-2145
> URL: https://issues.apache.org/jira/browse/HDDS-2145
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client, Ozone Datanode
>Reporter: Shashikant Banerjee
>Assignee: Hanisha Koneru
>Priority: Major
> Fix For: 0.5.0
>
>
> Currently, the Ozone client issues a getBlock call to read the metadata from 
> RocksDB on the DN to get the chunk info, and then the chunks are read one by one 
> in separate RPC calls on the read path. This can be optimized by 
> piggybacking readChunk calls along with getBlock in a single RPC call to the DN. 
> This Jira aims to address that.
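A rough sketch of what the combined reply could carry; the field and type names below 
are illustrative, not the actual ContainerProtos messages:
{code:java}
import java.nio.ByteBuffer;
import java.util.List;

// Hedged sketch only: return the chunk payloads together with the block metadata
// so a short read costs one RPC instead of getBlock followed by per-chunk readChunk.
class GetBlockWithChunksResponse {
  List<ChunkInfo> chunkList;    // block metadata, as getBlock returns today
  List<ByteBuffer> chunkData;   // eagerly read chunk payloads piggybacked on the reply

  // Placeholder for the real chunk descriptor type.
  static class ChunkInfo { String chunkName; long offset; long len; }
}
{code}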



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14854) Create improved decommission monitor implementation

2019-10-17 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953972#comment-16953972
 ] 

Hadoop QA commented on HDFS-14854:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
37s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 40s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
16s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
13s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 45s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 11 new + 462 unchanged - 5 fixed = 473 total (was 467) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 1s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 19s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
8s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 99m 53s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}170m 32s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDistributedFileSystem |
|   | hadoop.hdfs.tools.TestDFSZKFailoverController |
|   | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.3 Server=19.03.3 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14854 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12983289/HDFS-14854.011.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux a1d0fabd0d16 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 3990ffa |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28104/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| u

[jira] [Commented] (HDFS-14384) When the lastLocatedBlock token expires, it will take 1~3 seconds to refetch it.

2019-10-17 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953971#comment-16953971
 ] 

Surendra Singh Lilhore commented on HDFS-14384:
---

Thanks [~vinayakumarb] for the review. Will wait for [~daryn]'s comments.

> When the lastLocatedBlock token expires, it will take 1~3 seconds to refetch it.
> ---
>
> Key: HDFS-14384
> URL: https://issues.apache.org/jira/browse/HDFS-14384
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.7.2
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-14384.001.patch, HDFS-14384.002.patch
>
>
> Scenario:
>  1. Write a file with one block which is in progress.
>  2. Open an input stream and close the output stream.
>  3. Wait for block token expiration and read the data.
>  4. The last block read takes 1~3 seconds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2318) Avoid proto::tostring in preconditions to save CPU cycles

2019-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-2318:
-
Labels: performance pull-request-available  (was: performance)

> Avoid proto::tostring in preconditions to save CPU cycles
> -
>
> Key: HDDS-2318
> URL: https://issues.apache.org/jira/browse/HDDS-2318
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Rajesh Balamohan
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: performance, pull-request-available
> Attachments: Screenshot 2019-10-17 at 6.10.22 PM.png
>
>
> [https://github.com/apache/hadoop-ozone/blob/61f4aa30f502b34fd778d9b37b1168721abafb2f/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerProtocolServerSideTranslatorPB.java#L117]
>  
> This ends up calling the proto's toString() in precondition checks and burns CPU 
> cycles. {{request.toString()}} can be added to a debug log on an as-needed basis.
>  
>  
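A minimal sketch of the fix pattern, assuming the check uses Guava's {{Preconditions}}; 
the method and field names below are illustrative, not the exact 
OzoneManagerProtocolServerSideTranslatorPB code:
{code:java}
import com.google.common.base.Preconditions;

void validate(OMRequest request) {
  // Eager: the proto's debug string is built on every request, even when the check passes.
  Preconditions.checkArgument(request.hasCmdType(),
      "Malformed request: " + request);

  // Lazy: Guava only formats the template (and so only calls toString()) when the check fails.
  Preconditions.checkArgument(request.hasCmdType(),
      "Malformed request: %s", request);
}
{code}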



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2318) Avoid proto::tostring in preconditions to save CPU cycles

2019-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2318?focusedWorklogId=330026&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330026
 ]

ASF GitHub Bot logged work on HDDS-2318:


Author: ASF GitHub Bot
Created on: 17/Oct/19 17:39
Start Date: 17/Oct/19 17:39
Worklog Time Spent: 10m 
  Work Description: bharatviswa504 commented on pull request #48: 
HDDS-2318. Avoid proto::tostring in preconditions to save CPU cycles.
URL: https://github.com/apache/hadoop-ozone/pull/48
 
 
   ## What changes were proposed in this pull request?
   
   Avoiding proto::toString in OzoneManagerServerSideTranslatorPB.java
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-2318
   
   ## How was this patch tested?
   
   Ran TestOzoneRpcClient which executes this code path.
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 330026)
Remaining Estimate: 0h
Time Spent: 10m

> Avoid proto::tostring in preconditions to save CPU cycles
> -
>
> Key: HDDS-2318
> URL: https://issues.apache.org/jira/browse/HDDS-2318
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Rajesh Balamohan
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: performance, pull-request-available
> Attachments: Screenshot 2019-10-17 at 6.10.22 PM.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/hadoop-ozone/blob/61f4aa30f502b34fd778d9b37b1168721abafb2f/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerProtocolServerSideTranslatorPB.java#L117]
>  
> This ends up calling the proto's toString() in precondition checks and burns CPU 
> cycles. {{request.toString()}} can be added to a debug log on an as-needed basis.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14909) DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage count for excluded node which is already part of excluded scope

2019-10-17 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-14909:
--
Fix Version/s: 3.2.2
   3.1.4
   3.3.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Thanks [~elgoiri] and [~brahmareddy] for review.

Committed to branch-3.1, branch-3.2 and trunk.

> DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage 
> count for excluded node which is already part of excluded scope 
> -
>
> Key: HDFS-14909
> URL: https://issues.apache.org/jira/browse/HDFS-14909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14909.001.patch, HDFS-14909.002.patch, 
> HDFS-14909.003.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14909) DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage count for excluded node which is already part of excluded scope

2019-10-17 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953943#comment-16953943
 ] 

Hudson commented on HDFS-14909:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17546 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17546/])
HDFS-14909. DFSNetworkTopology#chooseRandomWithStorageType() should not 
(surendralilhore: rev 54dc6b7d720851eb6017906d664aa0fda2698225)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/DFSNetworkTopology.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestMissingBlocksAlert.java


> DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage 
> count for excluded node which is already part of excluded scope 
> -
>
> Key: HDFS-14909
> URL: https://issues.apache.org/jira/browse/HDFS-14909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-14909.001.patch, HDFS-14909.002.patch, 
> HDFS-14909.003.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-14910) TestRenameWithSnapshots#testRename2PreDescendant failing consistently

2019-10-17 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-14910:
--

Assignee: Wei-Chiu Chuang

> TestRenameWithSnapshots#testRename2PreDescendant failing consistently
> -
>
> Key: HDFS-14910
> URL: https://issues.apache.org/jira/browse/HDFS-14910
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Íñigo Goiri
>Assignee: Wei-Chiu Chuang
>Priority: Major
>
> TestRenameWithSnapshots#testRename2PreDescendant has been failing 
> consistently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2318) Avoid proto::tostring in preconditions to save CPU cycles

2019-10-17 Thread Attila Doroszlai (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953929#comment-16953929
 ] 

Attila Doroszlai commented on HDDS-2318:


Using a supplier with the right {{toString()}} method might help:

https://github.com/apache/incubator-ratis/blob/master/ratis-common/src/main/java/org/apache/ratis/util/StringUtils.java#L130-L137
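For reference, a minimal sketch of that pattern in plain Java, independent of the Ratis 
helper linked above:
{code:java}
import java.util.function.Supplier;

// Wrap an expensive toString() in an object whose toString() is deferred, so a
// precondition or log message pays nothing unless it is actually rendered.
final class LazyString {
  static Object of(Supplier<String> s) {
    return new Object() {
      @Override
      public String toString() {
        return s.get();
      }
    };
  }
}

// usage (illustrative): LOG.debug("request = {}", LazyString.of(request::toString));
{code}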

> Avoid proto::tostring in preconditions to save CPU cycles
> -
>
> Key: HDDS-2318
> URL: https://issues.apache.org/jira/browse/HDDS-2318
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Rajesh Balamohan
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: performance
> Attachments: Screenshot 2019-10-17 at 6.10.22 PM.png
>
>
> [https://github.com/apache/hadoop-ozone/blob/61f4aa30f502b34fd778d9b37b1168721abafb2f/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerProtocolServerSideTranslatorPB.java#L117]
>  
> This ends up calling the proto's toString() in precondition checks and burns CPU 
> cycles. {{request.toString()}} can be added to a debug log on an as-needed basis.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2240) Command line tool for OM Admin

2019-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2240?focusedWorklogId=330005&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330005
 ]

ASF GitHub Bot logged work on HDDS-2240:


Author: ASF GitHub Bot
Created on: 17/Oct/19 17:06
Start Date: 17/Oct/19 17:06
Worklog Time Spent: 10m 
  Work Description: anuengineer commented on pull request #1586: HDDS-2240. 
Command line tool for OM HA.
URL: https://github.com/apache/hadoop/pull/1586#discussion_r336122720
 
 

 ##
 File path: hadoop-ozone/common/src/main/proto/OzoneManagerProtocol.proto
 ##
 @@ -1097,11 +1097,34 @@ message UpdateGetS3SecretRequest {
 required string awsSecret = 2;
 }
 
+message OMServiceId {
+required string serviceID = 1;
+}
+
+/**
+  This proto is used to define the OM node Id and its ratis server state.
+*/
+message RoleInfo {
+required string omNodeID = 1;
+required string ratisServerRole = 2;
+}
+
+/**
+  This is used to get the Server States of OMs.
+*/
+message ServiceState {
+repeated RoleInfo roleInfos = 1;
+}
+
 /**
  The OM service that takes care of Ozone namespace.
 */
 service OzoneManagerService {
 // A client-to-OM RPC to send client requests to OM Ratis server
 rpc submitRequest(OMRequest)
   returns(OMResponse);
+
+// A client-to-OM RPC to get ratis server states of OMs
+rpc getServiceState(OMServiceId)
+  returns(ServiceState);
 }
 
 Review comment:
   Sorry, I did not see this comment. How can a client communicate with the OM? 
Does it not need to send the request to the leader? 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 330005)
Time Spent: 3h 20m  (was: 3h 10m)

> Command line tool for OM Admin
> --
>
> Key: HDDS-2240
> URL: https://issues.apache.org/jira/browse/HDDS-2240
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Hanisha Koneru
>Assignee: Hanisha Koneru
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> A command line tool (*ozone omha*) to get information related to OM HA. 
> This Jira proposes to add the _getServiceState_ option for OM HA which lists 
> all the OMs in the service and their corresponding Ratis server roles 
> (LEADER/FOLLOWER). 
> We can later add more options to this tool.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2319) CLI command to perform on-demand data scan of a specific container

2019-10-17 Thread Attila Doroszlai (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Doroszlai updated HDDS-2319:
---
Description: On-demand data scan for a specific container might be a useful 
debug tool.  Thanks [~aengineer] for the idea.  (was: On-demand data scan for a 
specific container might be a useful debug tool.)

> CLI command to perform on-demand data scan of a specific container
> --
>
> Key: HDDS-2319
> URL: https://issues.apache.org/jira/browse/HDDS-2319
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: Ozone CLI
>Reporter: Attila Doroszlai
>Priority: Major
>
> On-demand data scan for a specific container might be a useful debug tool.  
> Thanks [~aengineer] for the idea.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2319) CLI command to perform on-demand data scan of a specific container

2019-10-17 Thread Attila Doroszlai (Jira)
Attila Doroszlai created HDDS-2319:
--

 Summary: CLI command to perform on-demand data scan of a specific 
container
 Key: HDDS-2319
 URL: https://issues.apache.org/jira/browse/HDDS-2319
 Project: Hadoop Distributed Data Store
  Issue Type: Sub-task
  Components: Ozone CLI
Reporter: Attila Doroszlai


On-demand data scan for a specific container might be a useful debug tool.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14910) TestRenameWithSnapshots#testRename2PreDescendant failing consistently

2019-10-17 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953916#comment-16953916
 ] 

Ayush Saxena commented on HDFS-14910:
-

this.removeFeature(..) is getting called twice in 
{{INodeDirectory.java}}. 
The only fix I found was something like the following in 
{{INodeDirectory.java}}, around L843:

{code:java}
-  if (sf.getDiffs().isEmpty() &&
-  !(sf instanceof DirectorySnapshottableFeature)) {
+  if (sf.getDiffs().isEmpty()
+  && !(sf instanceof DirectorySnapshottableFeature)
+  && this.getDirectorySnapshottableFeature() != null) {
{code}

[~weichiu] [~shashikant], let me know if you have any other solution. 
Otherwise, I can put up the code written above, which adds a check that the 
feature actually exists before removing it.

> TestRenameWithSnapshots#testRename2PreDescendant failing consistently
> -
>
> Key: HDFS-14910
> URL: https://issues.apache.org/jira/browse/HDFS-14910
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Íñigo Goiri
>Priority: Major
>
> TestRenameWithSnapshots#testRename2PreDescendant has been failing 
> consistently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14810) Review FSNameSystem editlog sync

2019-10-17 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953914#comment-16953914
 ] 

Hudson commented on HDFS-14810:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17545 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17545/])
HDFS-14810. Review FSNameSystem editlog sync. Contributed by Xiaoqiao 
(ayushsaxena: rev 5527d79adb9b1e2f2779c283f81d6a3d5447babc)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java


> Review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch, 
> HDFS-14810.003.patch, HDFS-14810.004.patch
>
>
> Refactor and unify the edit log sync calls in FSNamesystem, as HDFS-11246 
> mentioned.
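For context, the pattern being unified looks roughly like this; a sketch of the usual 
FSNamesystem convention, not the exact contents of the patch:
{code:java}
// Mutate the namespace under the write lock, then sync the edit log after
// releasing it so other operations are not blocked on the fsync.
void someFsOperation() throws IOException {
  writeLock();
  try {
    // ... namespace mutation that logs an edit ...
  } finally {
    writeUnlock("someFsOperation");
  }
  getEditLog().logSync();
}
{code}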



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14887) RBF: In Router Web UI, Observer Namenode Information displaying as Unavailable

2019-10-17 Thread Takanobu Asanuma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953909#comment-16953909
 ] 

Takanobu Asanuma commented on HDFS-14887:
-

Thanks for the discussion and updating the patch, [~hemanthboyina] and 
[~elgoiri].

I tested with the 008 patch and found that the observer icon is used in the 
Nameservice Information when there are an ActiveNN and an ObserverNN in the same 
nameservice. Please see [^14887.008.png].
 * This is because the enum order in {{FederationNamenodeServiceState}} is used 
for the comparator. I think the order should be ACTIVE > OBSERVER > STANDBY (see 
the sketch after this list).
 * I want to use the same order (active > observer > standby) in 
{{FederationNamenodeServiceState#getState()}} and {{augment_namenodes()}} in 
federationhealth.js.
 * We may also need to add the observer icon to the 
federationhealth-namenode-legend in the Nameservice Information.
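A sketch of the ordering idea only; the real enum is {{FederationNamenodeServiceState}} 
and the values below are illustrative:
{code:java}
// If the comparator relies on declaration order, placing OBSERVER between ACTIVE
// and STANDBY yields the desired ACTIVE > OBSERVER > STANDBY priority.
enum NamenodeServiceState {
  ACTIVE,
  OBSERVER,
  STANDBY,
  UNAVAILABLE,
  EXPIRED
}
{code}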

> RBF: In Router Web UI, Observer Namenode Information displaying as Unavailable
> --
>
> Key: HDFS-14887
> URL: https://issues.apache.org/jira/browse/HDFS-14887
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Attachments: 14887.008.png, 14887.after.png, 14887.before.png, 
> HDFS-14887.001.patch, HDFS-14887.002.patch, HDFS-14887.003.patch, 
> HDFS-14887.004.patch, HDFS-14887.005.patch, HDFS-14887.006.patch, 
> HDFS-14887.007.patch, HDFS-14887.008.patch
>
>
> In the Router Web UI, Observer Namenode information is displayed as Unavailable.
> We should show a proper icon for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14887) RBF: In Router Web UI, Observer Namenode Information displaying as Unavailable

2019-10-17 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953907#comment-16953907
 ] 

Hadoop QA commented on HDFS-14887:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  9s{color} 
| {color:red} HDFS-14887 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-14887 |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28105/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> RBF: In Router Web UI, Observer Namenode Information displaying as Unavailable
> --
>
> Key: HDFS-14887
> URL: https://issues.apache.org/jira/browse/HDFS-14887
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Attachments: 14887.008.png, 14887.after.png, 14887.before.png, 
> HDFS-14887.001.patch, HDFS-14887.002.patch, HDFS-14887.003.patch, 
> HDFS-14887.004.patch, HDFS-14887.005.patch, HDFS-14887.006.patch, 
> HDFS-14887.007.patch, HDFS-14887.008.patch
>
>
> In the Router Web UI, Observer Namenode information is displayed as Unavailable.
> We should show a proper icon for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14910) TestRenameWithSnapshots#testRename2PreDescendant failing consistently

2019-10-17 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-14910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953904#comment-16953904
 ] 

Íñigo Goiri commented on HDFS-14910:


[~ayushsaxena] narrowed it down to HDFS-14492.
[~weichiu], [~shashikant], any insights?

> TestRenameWithSnapshots#testRename2PreDescendant failing consistently
> -
>
> Key: HDFS-14910
> URL: https://issues.apache.org/jira/browse/HDFS-14910
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Íñigo Goiri
>Priority: Major
>
> TestRenameWithSnapshots#testRename2PreDescendant has been failing 
> consistently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14910) TestRenameWithSnapshots#testRename2PreDescendant failing consistently

2019-10-17 Thread Jira
Íñigo Goiri created HDFS-14910:
--

 Summary: TestRenameWithSnapshots#testRename2PreDescendant failing 
consistently
 Key: HDFS-14910
 URL: https://issues.apache.org/jira/browse/HDFS-14910
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Íñigo Goiri


TestRenameWithSnapshots#testRename2PreDescendant has been failing consistently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14887) RBF: In Router Web UI, Observer Namenode Information displaying as Unavailable

2019-10-17 Thread Takanobu Asanuma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takanobu Asanuma updated HDFS-14887:

Attachment: 14887.008.png

> RBF: In Router Web UI, Observer Namenode Information displaying as Unavailable
> --
>
> Key: HDFS-14887
> URL: https://issues.apache.org/jira/browse/HDFS-14887
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Attachments: 14887.008.png, 14887.after.png, 14887.before.png, 
> HDFS-14887.001.patch, HDFS-14887.002.patch, HDFS-14887.003.patch, 
> HDFS-14887.004.patch, HDFS-14887.005.patch, HDFS-14887.006.patch, 
> HDFS-14887.007.patch, HDFS-14887.008.patch
>
>
> In the Router Web UI, Observer Namenode information is displayed as Unavailable.
> We should show a proper icon for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14909) DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage count for excluded node which is already part of excluded scope

2019-10-17 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-14909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953902#comment-16953902
 ] 

Íñigo Goiri commented on HDFS-14909:


I see [~ayushtkn] narrowed it down to HDFS-14492.
I'll open a JIRA.

> DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage 
> count for excluded node which is already part of excluded scope 
> -
>
> Key: HDFS-14909
> URL: https://issues.apache.org/jira/browse/HDFS-14909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-14909.001.patch, HDFS-14909.002.patch, 
> HDFS-14909.003.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14854) Create improved decommission monitor implementation

2019-10-17 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953903#comment-16953903
 ] 

Stephen O'Donnell commented on HDFS-14854:
--

Let me have a look at that Configuration change too ...

[~weichiu] has told me he plans to review this when he gets some time. As we 
have had a few review cycles already it should be in good shape for him to have 
a thorough look now.

> Create improved decommission monitor implementation
> ---
>
> Key: HDFS-14854
> URL: https://issues.apache.org/jira/browse/HDFS-14854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Attachments: Decommission_Monitor_V2_001.pdf, HDFS-14854.001.patch, 
> HDFS-14854.002.patch, HDFS-14854.003.patch, HDFS-14854.004.patch, 
> HDFS-14854.005.patch, HDFS-14854.006.patch, HDFS-14854.007.patch, 
> HDFS-14854.008.patch, HDFS-14854.009.patch, HDFS-14854.010.patch, 
> HDFS-14854.011.patch
>
>
> In HDFS-13157, we discovered a series of problems with the current 
> decommission monitor implementation, such as:
>  * Blocks are replicated sequentially disk by disk and node by node, and 
> hence the load is not spread well across the cluster
>  * Adding a node for decommission can cause the namenode write lock to be 
> held for a long time.
>  * Decommissioning nodes floods the replication queue, and under-replicated 
> blocks from a future node or disk failure may wait for a long time before they 
> are replicated.
>  * Blocks pending replication are checked many times under a write lock 
> before they are sufficiently replicated, wasting resources
> In this Jira I propose to create a new implementation of the decommission 
> monitor that resolves these issues. As it will be difficult to prove one 
> implementation is better than another, the new implementation can be enabled 
> or disabled giving the option of the existing implementation or the new one.
> I will attach a pdf with some more details on the design and then a version 1 
> patch shortly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14909) DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage count for excluded node which is already part of excluded scope

2019-10-17 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-14909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953898#comment-16953898
 ] 

Íñigo Goiri commented on HDFS-14909:


+1 on  [^HDFS-14909.003.patch].

Does anybody know what's happening with 
TestRenameWithSnapshots.testRename2PreDescendant?
I've seen it failing pretty consistently in the last few days.

> DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage 
> count for excluded node which is already part of excluded scope 
> -
>
> Key: HDFS-14909
> URL: https://issues.apache.org/jira/browse/HDFS-14909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-14909.001.patch, HDFS-14909.002.patch, 
> HDFS-14909.003.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14854) Create improved decommission monitor implementation

2019-10-17 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-14854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953892#comment-16953892
 ] 

Íñigo Goiri commented on HDFS-14854:


Thanks for the changes, [^HDFS-14854.011.patch] looks good to me.
One minor thing: we can use Configuration#getClass() in 
DatanodeAdminManager#136, as it will take care of resolving the class and 
checking it against the interface, etc.
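Roughly something like the following; a sketch that assumes the class and key names 
from the patch, which may differ:
{code:java}
private DatanodeAdminMonitorInterface createMonitor(Configuration conf) {
  // Configuration#getClass resolves the configured class and checks it against the
  // expected interface; ReflectionUtils then instantiates and configures it.
  Class<? extends DatanodeAdminMonitorInterface> clazz = conf.getClass(
      MONITOR_CLASS_KEY,                     // assumed config key constant
      DatanodeAdminDefaultMonitor.class,     // assumed default implementation
      DatanodeAdminMonitorInterface.class);
  return ReflectionUtils.newInstance(clazz, conf);
}
{code}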

Anybody else up for going over this?

> Create improved decommission monitor implementation
> ---
>
> Key: HDFS-14854
> URL: https://issues.apache.org/jira/browse/HDFS-14854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Attachments: Decommission_Monitor_V2_001.pdf, HDFS-14854.001.patch, 
> HDFS-14854.002.patch, HDFS-14854.003.patch, HDFS-14854.004.patch, 
> HDFS-14854.005.patch, HDFS-14854.006.patch, HDFS-14854.007.patch, 
> HDFS-14854.008.patch, HDFS-14854.009.patch, HDFS-14854.010.patch, 
> HDFS-14854.011.patch
>
>
> In HDFS-13157, we discovered a series of problems with the current 
> decommission monitor implementation, such as:
>  * Blocks are replicated sequentially disk by disk and node by node, and 
> hence the load is not spread well across the cluster
>  * Adding a node for decommission can cause the namenode write lock to be 
> held for a long time.
>  * Decommissioning nodes floods the replication queue, and under-replicated 
> blocks from a future node or disk failure may wait for a long time before they 
> are replicated.
>  * Blocks pending replication are checked many times under a write lock 
> before they are sufficiently replicated, wasting resources
> In this Jira I propose to create a new implementation of the decommission 
> monitor that resolves these issues. As it will be difficult to prove one 
> implementation is better than another, the new implementation can be enabled 
> or disabled giving the option of the existing implementation or the new one.
> I will attach a pdf with some more details on the design and then a version 1 
> patch shortly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14810) Review FSNameSystem editlog sync

2019-10-17 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-14810:

Fix Version/s: 3.3.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch, 
> HDFS-14810.003.patch, HDFS-14810.004.patch
>
>
> Refactor and unify the edit log sync calls in FSNamesystem, as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14810) Review FSNameSystem editlog sync

2019-10-17 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953891#comment-16953891
 ] 

Ayush Saxena commented on HDFS-14810:
-

Committed to trunk.
Thanx [~hexiaoqiao] for the contribution!!!

> Review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch, 
> HDFS-14810.003.patch, HDFS-14810.004.patch
>
>
> Refactor and unify the edit log sync calls in FSNamesystem, as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14852) Removal from LowRedundancyBlocks does NOT remove the block from all queues

2019-10-17 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953862#comment-16953862
 ] 

Hadoop QA commented on HDFS-14852:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
40s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 59s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
30s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 47s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
33s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}107m 24s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}183m 15s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.server.blockmanagement.TestLowRedundancyBlockQueues |
|   | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.3 Server=19.03.3 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14852 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12983274/HDFS-14852.003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux af919f484f4f 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 3990ffa |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28103/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28103/testReport/ |
| Max. process+thread count | 2671 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommi

[jira] [Commented] (HDFS-14854) Create improved decommission monitor implementation

2019-10-17 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953828#comment-16953828
 ] 

Stephen O'Donnell commented on HDFS-14854:
--

[~elgoiri] I have addressed all the latest comments, I think.

moveBlocksToPending is tricky to make much simpler, as it needs to break out of 
the inner or outer loop on various conditions. However, I pulled some of the 
logic into a new method and added a few comments, which I think improves it a 
bit. Let me know what you think.
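For reference, a minimal, purely illustrative sketch of that pattern (a helper 
method absorbs the per-block checks so the nested loops can break out cleanly); 
the names MoveToPendingSketch, tryAddToPending and MAX_PENDING are invented for 
this example and are not the actual patch code:

{code:java}
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: pull the per-block condition checks out of the nested
// loop into a helper that signals when the outer loop should stop.
public class MoveToPendingSketch {

    private static final int MAX_PENDING = 100;

    // Returns true while more blocks may still be queued as pending.
    private static boolean tryAddToPending(String block, List<String> pending) {
        if (pending.size() >= MAX_PENDING) {
            return false;          // pending list is full; stop for now
        }
        pending.add(block);        // queue the block for later tracking
        return true;
    }

    public static void movePendingBlocks(List<List<String>> blocksPerNode,
                                         List<String> pending) {
        outer:
        for (List<String> nodeBlocks : blocksPerNode) {
            for (Iterator<String> it = nodeBlocks.iterator(); it.hasNext(); ) {
                if (!tryAddToPending(it.next(), pending)) {
                    break outer;   // helper said to stop filling entirely
                }
                it.remove();       // block is now tracked as pending
            }
        }
    }
}
{code}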

> Create improved decommission monitor implementation
> ---
>
> Key: HDFS-14854
> URL: https://issues.apache.org/jira/browse/HDFS-14854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Attachments: Decommission_Monitor_V2_001.pdf, HDFS-14854.001.patch, 
> HDFS-14854.002.patch, HDFS-14854.003.patch, HDFS-14854.004.patch, 
> HDFS-14854.005.patch, HDFS-14854.006.patch, HDFS-14854.007.patch, 
> HDFS-14854.008.patch, HDFS-14854.009.patch, HDFS-14854.010.patch, 
> HDFS-14854.011.patch
>
>
> In HDFS-13157, we discovered a series of problems with the current 
> decommission monitor implementation, such as:
>  * Blocks are replicated sequentially disk by disk and node by node, and 
> hence the load is not spread well across the cluster.
>  * Adding a node for decommission can cause the namenode write lock to be 
> held for a long time.
>  * Decommissioning nodes floods the replication queue, and under-replicated 
> blocks from a future node or disk failure may wait for a long time before 
> they are replicated.
>  * Blocks pending replication are checked many times under a write lock 
> before they are sufficiently replicated, wasting resources.
> In this Jira I propose to create a new implementation of the decommission 
> monitor that resolves these issues. As it will be difficult to prove one 
> implementation is better than another, the new implementation can be enabled 
> or disabled, giving the option of using either the existing implementation or 
> the new one.
> I will attach a pdf with some more details on the design and then a version 1 
> patch shortly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14854) Create improved decommission monitor implementation

2019-10-17 Thread Stephen O'Donnell (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen O'Donnell updated HDFS-14854:
-
Attachment: HDFS-14854.011.patch

> Create improved decommission monitor implementation
> ---
>
> Key: HDFS-14854
> URL: https://issues.apache.org/jira/browse/HDFS-14854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Attachments: Decommission_Monitor_V2_001.pdf, HDFS-14854.001.patch, 
> HDFS-14854.002.patch, HDFS-14854.003.patch, HDFS-14854.004.patch, 
> HDFS-14854.005.patch, HDFS-14854.006.patch, HDFS-14854.007.patch, 
> HDFS-14854.008.patch, HDFS-14854.009.patch, HDFS-14854.010.patch, 
> HDFS-14854.011.patch
>
>
> In HDFS-13157, we discovered a series of problems with the current 
> decommission monitor implementation, such as:
>  * Blocks are replicated sequentially disk by disk and node by node, and 
> hence the load is not spread well across the cluster.
>  * Adding a node for decommission can cause the namenode write lock to be 
> held for a long time.
>  * Decommissioning nodes floods the replication queue, and under-replicated 
> blocks from a future node or disk failure may wait for a long time before 
> they are replicated.
>  * Blocks pending replication are checked many times under a write lock 
> before they are sufficiently replicated, wasting resources.
> In this Jira I propose to create a new implementation of the decommission 
> monitor that resolves these issues. As it will be difficult to prove one 
> implementation is better than another, the new implementation can be enabled 
> or disabled, giving the option of using either the existing implementation or 
> the new one.
> I will attach a pdf with some more details on the design and then a version 1 
> patch shortly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDDS-2221) Monitor datanodes in ozoneperf compose cluster

2019-10-17 Thread Marton Elek (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marton Elek resolved HDDS-2221.
---
Fix Version/s: 0.5.0
   Resolution: Fixed

> Monitor datanodes in ozoneperf compose cluster
> --
>
> Key: HDDS-2221
> URL: https://issues.apache.org/jira/browse/HDDS-2221
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>  Components: docker
>Reporter: Marton Elek
>Assignee: Marton Elek
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The ozoneperf compose cluster contains a Prometheus instance, but as of now it 
> collects data only from SCM and OM.
> We don't know the exact number of datanodes (they can be scaled up and down), 
> therefore it's harder to configure the datanode host names. I would suggest 
> configuring the first 10 datanodes (which covers most of the use cases).
> How to test?
> {code:java}
> cd hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/compose/ozoneperf
> docker-compose up -d
> firefox http://localhost:9090/targets
>   {code}
>  
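As a usage note (not part of the patch): since only a fixed list of datanode 
host names is scraped, a quick way to exercise it is to scale the datanode 
service and then check which targets Prometheus reports as up. A minimal 
sketch, assuming the compose service is named datanode and Prometheus is 
exposed on localhost:9090 as in the snippet above:

{code:bash}
cd hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/compose/ozoneperf
# scale the datanode service; only the first 10 hostnames are pre-configured
docker-compose up -d --scale datanode=3
# inspect which scrape targets Prometheus currently sees
firefox http://localhost:9090/targets
{code}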



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2221) Monitor datanodes in ozoneperf compose cluster

2019-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2221?focusedWorklogId=329888&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-329888
 ]

ASF GitHub Bot logged work on HDDS-2221:


Author: ASF GitHub Bot
Created on: 17/Oct/19 14:00
Start Date: 17/Oct/19 14:00
Worklog Time Spent: 10m 
  Work Description: elek commented on pull request #1: HDDS-2221. Monitor 
datanodes in ozoneperf compose cluster
URL: https://github.com/apache/hadoop-ozone/pull/1
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 329888)
Time Spent: 1h 20m  (was: 1h 10m)

> Monitor datanodes in ozoneperf compose cluster
> --
>
> Key: HDDS-2221
> URL: https://issues.apache.org/jira/browse/HDDS-2221
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>  Components: docker
>Reporter: Marton Elek
>Assignee: Marton Elek
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The ozoneperf compose cluster contains a Prometheus instance, but as of now it 
> collects data only from SCM and OM.
> We don't know the exact number of datanodes (they can be scaled up and down), 
> therefore it's harder to configure the datanode host names. I would suggest 
> configuring the first 10 datanodes (which covers most of the use cases).
> How to test?
> {code:java}
> cd hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/compose/ozoneperf
> docker-compose up -d
> firefox http://localhost:9090/targets
>   {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1985) Fix listVolumes API

2019-10-17 Thread Marton Elek (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marton Elek updated HDDS-1985:
--
Fix Version/s: 0.5.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Fix listVolumes API
> ---
>
> Key: HDDS-1985
> URL: https://issues.apache.org/jira/browse/HDDS-1985
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This Jira is to fix the listVolumes API in the HA code path.
> In HA, we have an in-memory cache: we put the result into the in-memory cache 
> and return the response; later it is picked up by the double buffer thread and 
> flushed to disk. So when doing listVolumes, it should use both the in-memory 
> cache and the RocksDB volume table to list volumes for a user.
>  
> No fix is required for this, as the information is retrieved from the MPU key 
> table, not through RocksDB table iteration. (When we use get(), it checks the 
> cache first and then the table.)
>  
> Used this Jira to add an integration test to verify the behavior.
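A minimal sketch of the cache-then-table lookup pattern described above. This 
is illustrative only: CacheBackedTable and the plain Map fields are simplified 
stand-ins, not the actual Ozone table cache classes.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a get() that consults the in-memory cache written by the double
// buffer before falling back to the persisted (RocksDB-backed) table.
public class CacheBackedTable<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>(); // flushed later by the double buffer
    private final Map<K, V> persistedTable;                    // stand-in for the RocksDB table

    public CacheBackedTable(Map<K, V> persistedTable) {
        this.persistedTable = persistedTable;
    }

    // Writes go to the cache first; a background thread flushes them to disk.
    public void put(K key, V value) {
        cache.put(key, value);
    }

    // Reads check the cache first, then the table, which is why a listVolumes
    // built on get() already sees entries that are not yet flushed.
    public V get(K key) {
        V cached = cache.get(key);
        return cached != null ? cached : persistedTable.get(key);
    }
}
{code}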



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-1985) Fix listVolumes API

2019-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1985?focusedWorklogId=329887&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-329887
 ]

ASF GitHub Bot logged work on HDDS-1985:


Author: ASF GitHub Bot
Created on: 17/Oct/19 13:57
Start Date: 17/Oct/19 13:57
Worklog Time Spent: 10m 
  Work Description: elek commented on pull request #33: HDDS-1985. Fix 
listVolumes API
URL: https://github.com/apache/hadoop-ozone/pull/33
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 329887)
Time Spent: 0.5h  (was: 20m)

> Fix listVolumes API
> ---
>
> Key: HDDS-1985
> URL: https://issues.apache.org/jira/browse/HDDS-1985
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This Jira is to fix the listVolumes API in the HA code path.
> In HA, we have an in-memory cache: we put the result into the in-memory cache 
> and return the response; later it is picked up by the double buffer thread and 
> flushed to disk. So when doing listVolumes, it should use both the in-memory 
> cache and the RocksDB volume table to list volumes for a user.
>  
> No fix is required for this, as the information is retrieved from the MPU key 
> table, not through RocksDB table iteration. (When we use get(), it checks the 
> cache first and then the table.)
>  
> Used this Jira to add an integration test to verify the behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2318) Avoid proto::tostring in preconditions to save CPU cycles

2019-10-17 Thread Mukul Kumar Singh (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953743#comment-16953743
 ] 

Mukul Kumar Singh commented on HDDS-2318:
-

cc [~arp], [~nanda], [~bharat], [~hanishakoneru]

> Avoid proto::tostring in preconditions to save CPU cycles
> -
>
> Key: HDDS-2318
> URL: https://issues.apache.org/jira/browse/HDDS-2318
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Rajesh Balamohan
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: performance
> Attachments: Screenshot 2019-10-17 at 6.10.22 PM.png
>
>
> [https://github.com/apache/hadoop-ozone/blob/61f4aa30f502b34fd778d9b37b1168721abafb2f/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerProtocolServerSideTranslatorPB.java#L117]
>  
> This ends up converting the proto to a String in precondition checks and burns 
> CPU cycles. {{request.toString()}} can instead be logged at debug level on an 
> as-needed basis.
>  
>  
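A minimal sketch of the suggested change, not the actual patch: keep the 
precondition check cheap and only pay for the request's string form when debug 
logging is enabled. The class and method names here are invented for the 
example; only the slf4j calls are real library API.

{code:java}
import java.util.Objects;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch: avoid building the proto's string form on the hot path.
public class PreconditionSketch {

    private static final Logger LOG = LoggerFactory.getLogger(PreconditionSketch.class);

    public static void handle(Object request) {
        // Before (costly), roughly: checkNotNull(request, "bad request " + request.toString());
        // After: the null check itself does not stringify the request...
        Objects.requireNonNull(request, "OM request must not be null");

        // ...and the expensive toString() happens only when debug is enabled,
        // since parameterized logging defers formatting.
        if (LOG.isDebugEnabled()) {
            LOG.debug("Received OM request: {}", request);
        }
    }
}
{code}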



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-2318) Avoid proto::tostring in preconditions to save CPU cycles

2019-10-17 Thread Mukul Kumar Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Kumar Singh reassigned HDDS-2318:
---

Assignee: Bharat Viswanadham

> Avoid proto::tostring in preconditions to save CPU cycles
> -
>
> Key: HDDS-2318
> URL: https://issues.apache.org/jira/browse/HDDS-2318
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Rajesh Balamohan
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: performance
> Attachments: Screenshot 2019-10-17 at 6.10.22 PM.png
>
>
> [https://github.com/apache/hadoop-ozone/blob/61f4aa30f502b34fd778d9b37b1168721abafb2f/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerProtocolServerSideTranslatorPB.java#L117]
>  
> This ends up converting the proto to a String in precondition checks and burns 
> CPU cycles. {{request.toString()}} can instead be logged at debug level on an 
> as-needed basis.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1847) Datanode Kerberos principal and keytab config key looks inconsistent

2019-10-17 Thread Chris Teoh (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953724#comment-16953724
 ] 

Chris Teoh commented on HDDS-1847:
--

I'm unclear on this issue. The bottom two keys are HDFS-specific, not Ozone; 
changing those keys would affect the HDFS project rather than the HDDS project. 
Can I please get more clarification on what is required?

> Datanode Kerberos principal and keytab config key looks inconsistent
> 
>
> Key: HDDS-1847
> URL: https://issues.apache.org/jira/browse/HDDS-1847
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Eric Yang
>Assignee: Chris Teoh
>Priority: Major
>  Labels: newbie
>
> Ozone Kerberos configuration can be very confusing:
> | config name | Description |
> | hdds.scm.kerberos.principal | SCM service principal |
> | hdds.scm.kerberos.keytab.file | SCM service keytab file |
> | ozone.om.kerberos.principal | Ozone Manager service principal |
> | ozone.om.kerberos.keytab.file | Ozone Manager keytab file |
> | hdds.scm.http.kerberos.principal | SCM service spnego principal |
> | hdds.scm.http.kerberos.keytab.file | SCM service spnego keytab file |
> | ozone.om.http.kerberos.principal | Ozone Manager spnego principal |
> | ozone.om.http.kerberos.keytab.file | Ozone Manager spnego keytab file |
> | hdds.datanode.http.kerberos.keytab | Datanode spnego keytab file |
> | hdds.datanode.http.kerberos.principal | Datanode spnego principal |
> | dfs.datanode.kerberos.principal | Datanode service principal |
> | dfs.datanode.keytab.file | Datanode service keytab file |
> The prefixes are very different for each of the datanode configurations. It 
> would be nice to have some consistency for the datanode keys.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2301) Write path: Reduce read contention in rocksDB

2019-10-17 Thread Rajesh Balamohan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated HDDS-2301:
---
Labels: performance  (was: )

> Write path: Reduce read contention in rocksDB
> -
>
> Key: HDDS-2301
> URL: https://issues.apache.org/jira/browse/HDDS-2301
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Rajesh Balamohan
>Assignee: Nanda kumar
>Priority: Major
>  Labels: performance
> Attachments: om_write_profile.png
>
>
> Benchmark: 
>  
>  A simple benchmark which creates 100s and 1000s of keys (empty directories) 
> in OM. This is done in a tight loop with multiple threads on the client side 
> to add enough load on the CPU. Note that the intention is to understand the 
> bottlenecks in OM (intentionally avoiding interactions with SCM & DN).
> Observation:
>  -
>  During the write path, Ozone checks {{OMFileRequest.verifyFilesInPath}}. This 
> internally calls {{omMetadataManager.getKeyTable().get(dbKeyName)}} for every 
> write operation. This turns out to be expensive and chokes the write path.
> [https://github.com/apache/hadoop/blob/trunk/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/file/OMDirectoryCreateRequest.java#L155]
> [https://github.com/apache/hadoop/blob/trunk/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/file/OMFileRequest.java#L63]
> In most cases, directory creation would be a fresh entry. In such cases, it 
> would be good to try {{RocksDB::keyMayExist}}.
>  
>  
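A minimal sketch of the keyMayExist idea, assuming a RocksDB JNI version that 
exposes {{keyMayExist(byte[], Holder<byte[]>)}}; the exact signature varies 
between RocksDB releases, and the surrounding class and method names are 
invented for the example:

{code:java}
import org.rocksdb.Holder;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Sketch: on the directory-create write path, use the bloom-filter backed
// keyMayExist() to skip the full get() for keys that definitely do not exist.
public class KeyMayExistSketch {

    // Returns the stored value, or null when the key is definitely absent,
    // avoiding a full table read in the common "fresh entry" case.
    public static byte[] lookup(RocksDB db, byte[] dbKeyName) throws RocksDBException {
        Holder<byte[]> value = new Holder<>();
        if (!db.keyMayExist(dbKeyName, value)) {
            return null;                 // definitely not present: skip the get()
        }
        if (value.getValue() != null) {
            return value.getValue();     // value was already found in memory
        }
        return db.get(dbKeyName);        // may exist: confirm with a real read
    }
}
{code}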



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2318) Avoid proto::tostring in preconditions to save CPU cycles

2019-10-17 Thread Rajesh Balamohan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated HDDS-2318:
---
Labels: performance  (was: )

> Avoid proto::tostring in preconditions to save CPU cycles
> -
>
> Key: HDDS-2318
> URL: https://issues.apache.org/jira/browse/HDDS-2318
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: performance
> Attachments: Screenshot 2019-10-17 at 6.10.22 PM.png
>
>
> [https://github.com/apache/hadoop-ozone/blob/61f4aa30f502b34fd778d9b37b1168721abafb2f/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerProtocolServerSideTranslatorPB.java#L117]
>  
> This ends up converting the proto to a String in precondition checks and burns 
> CPU cycles. {{request.toString()}} can instead be logged at debug level on an 
> as-needed basis.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2309) Optimise OzoneManagerDoubleBuffer::flushTransactions to flush in batches

2019-10-17 Thread Rajesh Balamohan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated HDDS-2309:
---
Labels: performance  (was: )

> Optimise OzoneManagerDoubleBuffer::flushTransactions to flush in batches
> 
>
> Key: HDDS-2309
> URL: https://issues.apache.org/jira/browse/HDDS-2309
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Rajesh Balamohan
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: performance
> Attachments: Screenshot 2019-10-15 at 4.19.13 PM.png
>
>
> When running a write-heavy benchmark, 
> {{org/apache/hadoop/ozone/om/ratis/OzoneManagerDoubleBuffer.flushTransactions}}
>  was invoked for pretty much every write.
> This forces {{cleanupCache}} to be invoked, which ends up choking the 
> single-thread executor. Attaching the profiler information, which gives more 
> details.
> Ideally, {{flushTransactions}} should batch up the work to reduce the load on 
> RocksDB.
>  
> [https://github.com/apache/hadoop-ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerDoubleBuffer.java#L130]
>  
> [https://github.com/apache/hadoop-ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerDoubleBuffer.java#L322]
>  
>  
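A minimal sketch of the batching idea (hedged; not the Ozone double buffer 
itself), using plain RocksDB JNI classes: drain whatever mutations have queued 
up and commit them with a single WriteBatch instead of one write per 
transaction. The Mutation class and flushOnce method are invented for the 
example.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

// Sketch: flush queued key/value mutations in one batched RocksDB write.
public class BatchFlushSketch {

    public static class Mutation {
        final byte[] key;
        final byte[] value;
        public Mutation(byte[] key, byte[] value) {
            this.key = key;
            this.value = value;
        }
    }

    // Drains everything currently queued and commits it as a single batch,
    // so disk writes (and any follow-up cache cleanup) happen once per batch
    // rather than once per entry.
    public static int flushOnce(RocksDB db, BlockingQueue<Mutation> queue)
            throws RocksDBException {
        List<Mutation> drained = new ArrayList<>();
        queue.drainTo(drained);
        if (drained.isEmpty()) {
            return 0;
        }
        try (WriteBatch batch = new WriteBatch();
             WriteOptions options = new WriteOptions()) {
            for (Mutation m : drained) {
                batch.put(m.key, m.value);
            }
            db.write(options, batch);    // one RocksDB write for the whole batch
        }
        return drained.size();
    }
}
{code}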



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


