[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-11-16 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13915:
-
Attachment: HDFS-13915.003.patch

> replace datanode failed because of  NameNodeRpcServer#getAdditionalDatanode 
> returning excessive datanodeInfo
> 
>
> Key: HDFS-13915
> URL: https://issues.apache.org/jira/browse/HDFS-13915
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13915.001.patch, HDFS-13915.002.patch, 
> HDFS-13915.003.patch
>
>
> Consider the following situation:
> 1. Create a file with the ALLSSD policy.
> 2. The NameNode returns [SSD,SSD,DISK] due to a lack of SSD space.
> 3. The client calls NameNodeRpcServer#getAdditionalDatanode when recovering the write 
> pipeline and replacing a bad datanode.
> 4. BlockPlacementPolicyDefault#chooseTarget calls 
> StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
> chooseStorageTypes returns [SSD,SSD].
> {code:java}
>   @Test
>   public void testAllSSDFallbackAndNonNewBlock() {
> final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD);
> List<StorageType> storageTypes = allSSD.chooseStorageTypes((short) 3,
> Arrays.asList(StorageType.DISK, StorageType.SSD),
> EnumSet.noneOf(StorageType.class), false);
> assertEquals(2, storageTypes.size());
> assertEquals(StorageType.SSD, storageTypes.get(0));
> assertEquals(StorageType.SSD, storageTypes.get(1));
>   }
> {code}
> 5. numOfReplicas = requiredStorageTypes.size() sets numOfReplicas to 2, so two 
> additional datanodes are chosen.
> 6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client.
> 7. DataStreamer#findNewDatanode finds that nodes.length != original.length + 1 and 
> throws an IOException, so the write finally fails.
> {code:java}
> private int findNewDatanode(final DatanodeInfo[] original
>   ) throws IOException {
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>   "Failed to replace a bad datanode on the existing pipeline "
>   + "due to no more good datanodes being available to try. "
>   + "(Nodes: current=" + Arrays.asList(nodes)
>   + ", original=" + Arrays.asList(original) + "). "
>   + "The current failed datanode replacement policy is "
>   + dfsClient.dtpReplaceDatanodeOnFailure
>   + ", and a client may configure this via '"
>   + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>   + "' in its configuration.");
> }
> for(int i = 0; i < nodes.length; i++) {
>   int j = 0;
>   for(; j < original.length && !nodes[i].equals(original[j]); j++);
>   if (j == original.length) {
> return i;
>   }
> }
> throw new IOException("Failed: new datanode not found: nodes="
> + Arrays.asList(nodes) + ", original=" + Arrays.asList(original));
>   }
> {code}
> The client warning log is:
>  {code:java}
> WARN [DataStreamer for file 
> /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
>  block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
>  
> original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> {code}
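As the exception message notes, the client-side replacement behavior is controlled by 'dfs.client.block.write.replace-datanode-on-failure.policy'. Below is a minimal client sketch of relaxing that policy; it is a workaround only (it does not fix the excessive datanodeInfo returned by getAdditionalDatanode), and the output path is a hypothetical example.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplacePolicyWorkaround {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Workaround only: tell the client not to replace a bad datanode on pipeline
    // failure, so DataStreamer#findNewDatanode and its length check are never reached.
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
    try (FileSystem fs = FileSystem.get(conf);
         FSDataOutputStream out = fs.create(new Path("/tmp/example"))) {
      out.writeBytes("hello");
    }
  }
}
{code}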



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-11-16 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689199#comment-16689199
 ] 

Jiandan Yang  commented on HDFS-13915:
--

There may be something wrong with Jenkins; uploading [^HDFS-13915.003.patch] to 
re-trigger Jenkins.

> replace datanode failed because of  NameNodeRpcServer#getAdditionalDatanode 
> returning excessive datanodeInfo
> 
>
> Key: HDFS-13915
> URL: https://issues.apache.org/jira/browse/HDFS-13915
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13915.001.patch, HDFS-13915.002.patch
>
>
> Consider the following situation:
> 1. Create a file with the ALLSSD policy.
> 2. The NameNode returns [SSD,SSD,DISK] due to a lack of SSD space.
> 3. The client calls NameNodeRpcServer#getAdditionalDatanode when recovering the write 
> pipeline and replacing a bad datanode.
> 4. BlockPlacementPolicyDefault#chooseTarget calls 
> StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
> chooseStorageTypes returns [SSD,SSD].
> {code:java}
>   @Test
>   public void testAllSSDFallbackAndNonNewBlock() {
> final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD);
> List<StorageType> storageTypes = allSSD.chooseStorageTypes((short) 3,
> Arrays.asList(StorageType.DISK, StorageType.SSD),
> EnumSet.noneOf(StorageType.class), false);
> assertEquals(2, storageTypes.size());
> assertEquals(StorageType.SSD, storageTypes.get(0));
> assertEquals(StorageType.SSD, storageTypes.get(1));
>   }
> {code}
> 5. numOfReplicas = requiredStorageTypes.size() sets numOfReplicas to 2, so two 
> additional datanodes are chosen.
> 6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client.
> 7. DataStreamer#findNewDatanode finds that nodes.length != original.length + 1 and 
> throws an IOException, so the write finally fails.
> {code:java}
> private int findNewDatanode(final DatanodeInfo[] original
>   ) throws IOException {
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>   "Failed to replace a bad datanode on the existing pipeline "
>   + "due to no more good datanodes being available to try. "
>   + "(Nodes: current=" + Arrays.asList(nodes)
>   + ", original=" + Arrays.asList(original) + "). "
>   + "The current failed datanode replacement policy is "
>   + dfsClient.dtpReplaceDatanodeOnFailure
>   + ", and a client may configure this via '"
>   + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>   + "' in its configuration.");
> }
> for(int i = 0; i < nodes.length; i++) {
>   int j = 0;
>   for(; j < original.length && !nodes[i].equals(original[j]); j++);
>   if (j == original.length) {
> return i;
>   }
> }
> throw new IOException("Failed: new datanode not found: nodes="
> + Arrays.asList(nodes) + ", original=" + Arrays.asList(original));
>   }
> {code}
> The client warning log is:
>  {code:java}
> WARN [DataStreamer for file 
> /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
>  block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
>  
> original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-14 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687599#comment-16687599
 ] 

Jiandan Yang  commented on HDFS-14045:
--

The failed UTs are not caused by [^HDFS-14045.011.patch], and I can run them 
successfully on my local machine.

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, 
> HDFS-14045.009.patch, HDFS-14045.010.patch, HDFS-14045.011.patch
>
>
> Currently the DataNode uses the same metrics to measure the RPC latency of the 
> NameNode, but the Active and Standby NameNodes usually perform differently at the 
> same time, especially in a large cluster. For example, the RPC latency of the 
> Standby is very long when it is catching up on the edit log, so we may 
> misunderstand the state of HDFS. Using different metrics for the Active and 
> Standby NameNodes can help us obtain more precise metric data.
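To make the proposal concrete, here is an illustrative, self-contained sketch (plain Java, not the actual patch or the Hadoop metrics2 API) of keeping a separate latency accumulator per NameNode, keyed by a hypothetical "nameservice-nnId" suffix such as "ns0-nn1":

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class PerNameNodeLatency {
  private static final class Stat {
    final AtomicLong ops = new AtomicLong();
    final AtomicLong totalMs = new AtomicLong();
  }

  // One accumulator per NameNode, e.g. "ns0-nn1" for Active and "ns0-nn2" for Standby.
  private final Map<String, Stat> stats = new ConcurrentHashMap<>();

  void addHeartbeatLatency(String nnSuffix, long millis) {
    Stat s = stats.computeIfAbsent(nnSuffix, k -> new Stat());
    s.ops.incrementAndGet();
    s.totalMs.addAndGet(millis);
  }

  double avgLatencyMs(String nnSuffix) {
    Stat s = stats.get(nnSuffix);
    long ops = (s == null) ? 0 : s.ops.get();
    return ops == 0 ? 0.0 : (double) s.totalMs.get() / ops;
  }
}
{code}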



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-11-14 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687587#comment-16687587
 ] 

Jiandan Yang  commented on HDFS-13915:
--

Uploading [^HDFS-13915.002.patch] to fix the checkstyle, whitespace, and unit test 
issues. Could anyone please help review this patch?

> replace datanode failed because of  NameNodeRpcServer#getAdditionalDatanode 
> returning excessive datanodeInfo
> 
>
> Key: HDFS-13915
> URL: https://issues.apache.org/jira/browse/HDFS-13915
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13915.001.patch, HDFS-13915.002.patch
>
>
> Consider the following situation:
> 1. Create a file with the ALLSSD policy.
> 2. The NameNode returns [SSD,SSD,DISK] due to a lack of SSD space.
> 3. The client calls NameNodeRpcServer#getAdditionalDatanode when recovering the write 
> pipeline and replacing a bad datanode.
> 4. BlockPlacementPolicyDefault#chooseTarget calls 
> StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
> chooseStorageTypes returns [SSD,SSD].
> {code:java}
>   @Test
>   public void testAllSSDFallbackAndNonNewBlock() {
> final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD);
> List<StorageType> storageTypes = allSSD.chooseStorageTypes((short) 3,
> Arrays.asList(StorageType.DISK, StorageType.SSD),
> EnumSet.noneOf(StorageType.class), false);
> assertEquals(2, storageTypes.size());
> assertEquals(StorageType.SSD, storageTypes.get(0));
> assertEquals(StorageType.SSD, storageTypes.get(1));
>   }
> {code}
> 5. numOfReplicas = requiredStorageTypes.size() sets numOfReplicas to 2, so two 
> additional datanodes are chosen.
> 6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client.
> 7. DataStreamer#findNewDatanode finds that nodes.length != original.length + 1 and 
> throws an IOException, so the write finally fails.
> {code:java}
> private int findNewDatanode(final DatanodeInfo[] original
>   ) throws IOException {
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>   "Failed to replace a bad datanode on the existing pipeline "
>   + "due to no more good datanodes being available to try. "
>   + "(Nodes: current=" + Arrays.asList(nodes)
>   + ", original=" + Arrays.asList(original) + "). "
>   + "The current failed datanode replacement policy is "
>   + dfsClient.dtpReplaceDatanodeOnFailure
>   + ", and a client may configure this via '"
>   + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>   + "' in its configuration.");
> }
> for(int i = 0; i < nodes.length; i++) {
>   int j = 0;
>   for(; j < original.length && !nodes[i].equals(original[j]); j++);
>   if (j == original.length) {
> return i;
>   }
> }
> throw new IOException("Failed: new datanode not found: nodes="
> + Arrays.asList(nodes) + ", original=" + Arrays.asList(original));
>   }
> {code}
> The client warning log is:
>  {code:java}
> WARN [DataStreamer for file 
> /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
>  block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
>  
> original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-11-14 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13915:
-
Attachment: HDFS-13915.002.patch

> replace datanode failed because of  NameNodeRpcServer#getAdditionalDatanode 
> returning excessive datanodeInfo
> 
>
> Key: HDFS-13915
> URL: https://issues.apache.org/jira/browse/HDFS-13915
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13915.001.patch, HDFS-13915.002.patch
>
>
> Consider the following situation:
> 1. Create a file with the ALLSSD policy.
> 2. The NameNode returns [SSD,SSD,DISK] due to a lack of SSD space.
> 3. The client calls NameNodeRpcServer#getAdditionalDatanode when recovering the write 
> pipeline and replacing a bad datanode.
> 4. BlockPlacementPolicyDefault#chooseTarget calls 
> StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
> chooseStorageTypes returns [SSD,SSD].
> {code:java}
>   @Test
>   public void testAllSSDFallbackAndNonNewBlock() {
> final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD);
> List<StorageType> storageTypes = allSSD.chooseStorageTypes((short) 3,
> Arrays.asList(StorageType.DISK, StorageType.SSD),
> EnumSet.noneOf(StorageType.class), false);
> assertEquals(2, storageTypes.size());
> assertEquals(StorageType.SSD, storageTypes.get(0));
> assertEquals(StorageType.SSD, storageTypes.get(1));
>   }
> {code}
> 5. numOfReplicas = requiredStorageTypes.size() sets numOfReplicas to 2, so two 
> additional datanodes are chosen.
> 6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client.
> 7. DataStreamer#findNewDatanode finds that nodes.length != original.length + 1 and 
> throws an IOException, so the write finally fails.
> {code:java}
> private int findNewDatanode(final DatanodeInfo[] original
>   ) throws IOException {
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>   "Failed to replace a bad datanode on the existing pipeline "
>   + "due to no more good datanodes being available to try. "
>   + "(Nodes: current=" + Arrays.asList(nodes)
>   + ", original=" + Arrays.asList(original) + "). "
>   + "The current failed datanode replacement policy is "
>   + dfsClient.dtpReplaceDatanodeOnFailure
>   + ", and a client may configure this via '"
>   + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>   + "' in its configuration.");
> }
> for(int i = 0; i < nodes.length; i++) {
>   int j = 0;
>   for(; j < original.length && !nodes[i].equals(original[j]); j++);
>   if (j == original.length) {
> return i;
>   }
> }
> throw new IOException("Failed: new datanode not found: nodes="
> + Arrays.asList(nodes) + ", original=" + Arrays.asList(original));
>   }
> {code}
> The client warning log is:
>  {code:java}
> WARN [DataStreamer for file 
> /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
>  block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
>  
> original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-14 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687437#comment-16687437
 ] 

Jiandan Yang  edited comment on HDFS-14045 at 11/15/18 3:20 AM:


Hi, [~elgoiri], I see what you mean.
{quote}
For the Unknown-Unknown, I'm not sure is worth showing them, we should just not 
store those, what we had already covered this.
{quote}
Absolutely right; there is no need to record a metric for the Unknown-Unknown case.
{quote}
For the ns0-Unknown, we should just make it ns0.
{quote}
This is a very good suggestion and thank you very much.
I've updated the code according to your comments in [^HDFS-14045.011.patch].
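For illustration only, here is a hypothetical helper reflecting the naming agreed above (the method name and exact behavior are assumptions, not the patch code):

{code:java}
public class MetricSuffix {
  // "ns0-nn1" when both ids are known, just "ns0" instead of "ns0-Unknown",
  // and null for Unknown-Unknown (no extra metric is registered).
  static String metricSuffix(String nameserviceId, String nnId) {
    boolean hasNs = nameserviceId != null && !nameserviceId.isEmpty();
    boolean hasNn = nnId != null && !nnId.isEmpty();
    if (hasNs && hasNn) {
      return nameserviceId + "-" + nnId;
    }
    return hasNs ? nameserviceId : null;
  }

  public static void main(String[] args) {
    System.out.println(metricSuffix("ns0", "nn1")); // ns0-nn1
    System.out.println(metricSuffix("ns0", null));  // ns0
    System.out.println(metricSuffix(null, null));   // null -> skip the metric
  }
}
{code}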


was (Author: yangjiandan):
Hi, [~elgoiri], I see what you mean.
{quote}
For the Unknown-Unknown, I'm not sure is worth showing them, we should just not 
store those, what we had already covered this.
{quote}
Absolutely right, there is no need to metric for the Unknown-Unknown
{quote}
For the ns0-Unknown, we should just make it ns0.
{quote}
This is a very good suggestion and thank you very much.
I've update code according to your comments in  [^HDFS-14045.011.patch] 

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, 
> HDFS-14045.009.patch, HDFS-14045.010.patch, HDFS-14045.011.patch
>
>
> Currently the DataNode uses the same metrics to measure the RPC latency of the 
> NameNode, but the Active and Standby NameNodes usually perform differently at the 
> same time, especially in a large cluster. For example, the RPC latency of the 
> Standby is very long when it is catching up on the edit log, so we may 
> misunderstand the state of HDFS. Using different metrics for the Active and 
> Standby NameNodes can help us obtain more precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-14 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-14045:
-
Attachment: HDFS-14045.011.patch

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, 
> HDFS-14045.009.patch, HDFS-14045.010.patch, HDFS-14045.011.patch
>
>
> Currently the DataNode uses the same metrics to measure the RPC latency of the 
> NameNode, but the Active and Standby NameNodes usually perform differently at the 
> same time, especially in a large cluster. For example, the RPC latency of the 
> Standby is very long when it is catching up on the edit log, so we may 
> misunderstand the state of HDFS. Using different metrics for the Active and 
> Standby NameNodes can help us obtain more precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-14 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687437#comment-16687437
 ] 

Jiandan Yang  commented on HDFS-14045:
--

Hi, [~elgoiri], I see what you mean.
{quote}
For the Unknown-Unknown, I'm not sure is worth showing them, we should just not 
store those, what we had already covered this.
{quote}
Absolutely right; there is no need to record a metric for the Unknown-Unknown case.
{quote}
For the ns0-Unknown, we should just make it ns0.
{quote}
This is a very good suggestion and thank you very much.
I've updated the code according to your comments in [^HDFS-14045.011.patch].

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, 
> HDFS-14045.009.patch, HDFS-14045.010.patch
>
>
> Currently the DataNode uses the same metrics to measure the RPC latency of the 
> NameNode, but the Active and Standby NameNodes usually perform differently at the 
> same time, especially in a large cluster. For example, the RPC latency of the 
> Standby is very long when it is catching up on the edit log, so we may 
> misunderstand the state of HDFS. Using different metrics for the Active and 
> Standby NameNodes can help us obtain more precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-11-14 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13915:
-
Attachment: (was: HDFS-13915.001.patch)

> replace datanode failed because of  NameNodeRpcServer#getAdditionalDatanode 
> returning excessive datanodeInfo
> 
>
> Key: HDFS-13915
> URL: https://issues.apache.org/jira/browse/HDFS-13915
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
> Environment: 
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13915.001.patch
>
>
> Consider the following situation:
> 1. Create a file with the ALLSSD policy.
> 2. The NameNode returns [SSD,SSD,DISK] due to a lack of SSD space.
> 3. The client calls NameNodeRpcServer#getAdditionalDatanode when recovering the write 
> pipeline and replacing a bad datanode.
> 4. BlockPlacementPolicyDefault#chooseTarget calls 
> StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
> chooseStorageTypes returns [SSD,SSD].
> {code:java}
>   @Test
>   public void testAllSSDFallbackAndNonNewBlock() {
> final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD);
> List<StorageType> storageTypes = allSSD.chooseStorageTypes((short) 3,
> Arrays.asList(StorageType.DISK, StorageType.SSD),
> EnumSet.noneOf(StorageType.class), false);
> assertEquals(2, storageTypes.size());
> assertEquals(StorageType.SSD, storageTypes.get(0));
> assertEquals(StorageType.SSD, storageTypes.get(1));
>   }
> {code}
> 5. numOfReplicas = requiredStorageTypes.size() sets numOfReplicas to 2, so two 
> additional datanodes are chosen.
> 6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client.
> 7. DataStreamer#findNewDatanode finds that nodes.length != original.length + 1 and 
> throws an IOException, so the write finally fails.
> {code:java}
> private int findNewDatanode(final DatanodeInfo[] original
>   ) throws IOException {
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>   "Failed to replace a bad datanode on the existing pipeline "
>   + "due to no more good datanodes being available to try. "
>   + "(Nodes: current=" + Arrays.asList(nodes)
>   + ", original=" + Arrays.asList(original) + "). "
>   + "The current failed datanode replacement policy is "
>   + dfsClient.dtpReplaceDatanodeOnFailure
>   + ", and a client may configure this via '"
>   + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>   + "' in its configuration.");
> }
> for(int i = 0; i < nodes.length; i++) {
>   int j = 0;
>   for(; j < original.length && !nodes[i].equals(original[j]); j++);
>   if (j == original.length) {
> return i;
>   }
> }
> throw new IOException("Failed: new datanode not found: nodes="
> + Arrays.asList(nodes) + ", original=" + Arrays.asList(original));
>   }
> {code}
> The client warning log is:
>  {code:java}
> WARN [DataStreamer for file 
> /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
>  block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
>  
> original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-11-14 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13915:
-
Attachment: HDFS-13915.001.patch

> replace datanode failed because of  NameNodeRpcServer#getAdditionalDatanode 
> returning excessive datanodeInfo
> 
>
> Key: HDFS-13915
> URL: https://issues.apache.org/jira/browse/HDFS-13915
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
> Environment: 
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13915.001.patch
>
>
> Consider the following situation:
> 1. Create a file with the ALLSSD policy.
> 2. The NameNode returns [SSD,SSD,DISK] due to a lack of SSD space.
> 3. The client calls NameNodeRpcServer#getAdditionalDatanode when recovering the write 
> pipeline and replacing a bad datanode.
> 4. BlockPlacementPolicyDefault#chooseTarget calls 
> StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
> chooseStorageTypes returns [SSD,SSD].
> {code:java}
>   @Test
>   public void testAllSSDFallbackAndNonNewBlock() {
> final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD);
> List<StorageType> storageTypes = allSSD.chooseStorageTypes((short) 3,
> Arrays.asList(StorageType.DISK, StorageType.SSD),
> EnumSet.noneOf(StorageType.class), false);
> assertEquals(2, storageTypes.size());
> assertEquals(StorageType.SSD, storageTypes.get(0));
> assertEquals(StorageType.SSD, storageTypes.get(1));
>   }
> {code}
> 5. numOfReplicas = requiredStorageTypes.size() sets numOfReplicas to 2, so two 
> additional datanodes are chosen.
> 6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client.
> 7. DataStreamer#findNewDatanode finds that nodes.length != original.length + 1 and 
> throws an IOException, so the write finally fails.
> {code:java}
> private int findNewDatanode(final DatanodeInfo[] original
>   ) throws IOException {
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>   "Failed to replace a bad datanode on the existing pipeline "
>   + "due to no more good datanodes being available to try. "
>   + "(Nodes: current=" + Arrays.asList(nodes)
>   + ", original=" + Arrays.asList(original) + "). "
>   + "The current failed datanode replacement policy is "
>   + dfsClient.dtpReplaceDatanodeOnFailure
>   + ", and a client may configure this via '"
>   + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>   + "' in its configuration.");
> }
> for(int i = 0; i < nodes.length; i++) {
>   int j = 0;
>   for(; j < original.length && !nodes[i].equals(original[j]); j++);
>   if (j == original.length) {
> return i;
>   }
> }
> throw new IOException("Failed: new datanode not found: nodes="
> + Arrays.asList(nodes) + ", original=" + Arrays.asList(original));
>   }
> {code}
> The client warning log is:
>  {code:java}
> WARN [DataStreamer for file 
> /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
>  block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
>  
> original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-11-14 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13915:
-
Assignee: Jiandan Yang 
  Status: Patch Available  (was: Open)

> replace datanode failed because of  NameNodeRpcServer#getAdditionalDatanode 
> returning excessive datanodeInfo
> 
>
> Key: HDFS-13915
> URL: https://issues.apache.org/jira/browse/HDFS-13915
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13915.001.patch
>
>
> Consider the following situation:
> 1. Create a file with the ALLSSD policy.
> 2. The NameNode returns [SSD,SSD,DISK] due to a lack of SSD space.
> 3. The client calls NameNodeRpcServer#getAdditionalDatanode when recovering the write 
> pipeline and replacing a bad datanode.
> 4. BlockPlacementPolicyDefault#chooseTarget calls 
> StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
> chooseStorageTypes returns [SSD,SSD].
> {code:java}
>   @Test
>   public void testAllSSDFallbackAndNonNewBlock() {
> final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD);
> List<StorageType> storageTypes = allSSD.chooseStorageTypes((short) 3,
> Arrays.asList(StorageType.DISK, StorageType.SSD),
> EnumSet.noneOf(StorageType.class), false);
> assertEquals(2, storageTypes.size());
> assertEquals(StorageType.SSD, storageTypes.get(0));
> assertEquals(StorageType.SSD, storageTypes.get(1));
>   }
> {code}
> 5. numOfReplicas = requiredStorageTypes.size() sets numOfReplicas to 2, so two 
> additional datanodes are chosen.
> 6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client.
> 7. DataStreamer#findNewDatanode finds that nodes.length != original.length + 1 and 
> throws an IOException, so the write finally fails.
> {code:java}
> private int findNewDatanode(final DatanodeInfo[] original
>   ) throws IOException {
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>   "Failed to replace a bad datanode on the existing pipeline "
>   + "due to no more good datanodes being available to try. "
>   + "(Nodes: current=" + Arrays.asList(nodes)
>   + ", original=" + Arrays.asList(original) + "). "
>   + "The current failed datanode replacement policy is "
>   + dfsClient.dtpReplaceDatanodeOnFailure
>   + ", and a client may configure this via '"
>   + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>   + "' in its configuration.");
> }
> for(int i = 0; i < nodes.length; i++) {
>   int j = 0;
>   for(; j < original.length && !nodes[i].equals(original[j]); j++);
>   if (j == original.length) {
> return i;
>   }
> }
> throw new IOException("Failed: new datanode not found: nodes="
> + Arrays.asList(nodes) + ", original=" + Arrays.asList(original));
>   }
> {code}
> The client warning log is:
>  {code:java}
> WARN [DataStreamer for file 
> /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
>  block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
>  
> original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-11-14 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13915:
-
Attachment: HDFS-13915.001.patch

> replace datanode failed because of  NameNodeRpcServer#getAdditionalDatanode 
> returning excessive datanodeInfo
> 
>
> Key: HDFS-13915
> URL: https://issues.apache.org/jira/browse/HDFS-13915
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
> Environment: 
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13915.001.patch
>
>
> Consider the following situation:
> 1. Create a file with the ALLSSD policy.
> 2. The NameNode returns [SSD,SSD,DISK] due to a lack of SSD space.
> 3. The client calls NameNodeRpcServer#getAdditionalDatanode when recovering the write 
> pipeline and replacing a bad datanode.
> 4. BlockPlacementPolicyDefault#chooseTarget calls 
> StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
> chooseStorageTypes returns [SSD,SSD].
> {code:java}
>   @Test
>   public void testAllSSDFallbackAndNonNewBlock() {
> final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD);
> List<StorageType> storageTypes = allSSD.chooseStorageTypes((short) 3,
> Arrays.asList(StorageType.DISK, StorageType.SSD),
> EnumSet.noneOf(StorageType.class), false);
> assertEquals(2, storageTypes.size());
> assertEquals(StorageType.SSD, storageTypes.get(0));
> assertEquals(StorageType.SSD, storageTypes.get(1));
>   }
> {code}
> 5. numOfReplicas = requiredStorageTypes.size() sets numOfReplicas to 2, so two 
> additional datanodes are chosen.
> 6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client.
> 7. DataStreamer#findNewDatanode finds that nodes.length != original.length + 1 and 
> throws an IOException, so the write finally fails.
> {code:java}
> private int findNewDatanode(final DatanodeInfo[] original
>   ) throws IOException {
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>   "Failed to replace a bad datanode on the existing pipeline "
>   + "due to no more good datanodes being available to try. "
>   + "(Nodes: current=" + Arrays.asList(nodes)
>   + ", original=" + Arrays.asList(original) + "). "
>   + "The current failed datanode replacement policy is "
>   + dfsClient.dtpReplaceDatanodeOnFailure
>   + ", and a client may configure this via '"
>   + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>   + "' in its configuration.");
> }
> for(int i = 0; i < nodes.length; i++) {
>   int j = 0;
>   for(; j < original.length && !nodes[i].equals(original[j]); j++);
>   if (j == original.length) {
> return i;
>   }
> }
> throw new IOException("Failed: new datanode not found: nodes="
> + Arrays.asList(nodes) + ", original=" + Arrays.asList(original));
>   }
> {code}
> The client warning log is:
>  {code:java}
> WARN [DataStreamer for file 
> /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
>  block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
>  
> original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-11-14 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686177#comment-16686177
 ] 

Jiandan Yang  edited comment on HDFS-13915 at 11/14/18 8:01 AM:


I added a test case in [^HDFS-13915.001.patch] based on trunk to reproduce the issue. 
Hi, [~szetszwo], BlockStoragePolicy#chooseStorageTypes may return excessive 
storageTypes, and I do not understand why after looking through the related code. 
Can we remove the excessive storageTypes?

{code:java}
if (storageTypes.size() < expectedSize) {
  LOG.warn("Failed to place enough replicas: expected size is {}"
  + " but only {} storage types can be selected (replication={},"
  + " selected={}, unavailable={}" + ", removed={}" + ", policy={}"
  + ")", expectedSize, storageTypes.size(), replication, storageTypes,
  unavailables, removed, this);
} else if (storageTypes.size() > expectedSize) {
  //should remove excess storageType to return expectedSize storageType 
  int storageTypesSize = storageTypes.size();
  int excessiveStorageTypeNum = storageTypesSize - expectedSize;
  for (int i = 0; i < excessiveStorageTypeNum; i++) {
storageTypes.remove(storageTypesSize - 1 - i);
  }
}
{code}
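As a quick standalone check of the trimming loop above (plain Java, not an HDFS test): removing indices from the last one downward keeps the earlier-chosen storage types and avoids index-shifting problems.

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TrimExcessExample {
  public static void main(String[] args) {
    List<String> storageTypes = new ArrayList<>(Arrays.asList("SSD", "SSD", "DISK"));
    int expectedSize = 1;
    int storageTypesSize = storageTypes.size();
    int excessiveStorageTypeNum = storageTypesSize - expectedSize;
    // Same removal pattern as the snippet above: walk from the last index down.
    for (int i = 0; i < excessiveStorageTypeNum; i++) {
      storageTypes.remove(storageTypesSize - 1 - i);
    }
    System.out.println(storageTypes); // prints [SSD]
  }
}
{code}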




was (Author: yangjiandan):
I add a case  in [^HDFS-13915.001.patch] based on trunk to reproduce issue. 
HI, [~szetszwo]  BlockStoragePolicy#chooseStorageTypes may return excessive 
storageType, and I do not understander why after looking through related code. 
Can we remove excessive storageType?

{code:java}
if (storageTypes.size() < expectedSize) {
  LOG.warn("Failed to place enough replicas: expected size is {}"
  + " but only {} storage types can be selected (replication={},"
  + " selected={}, unavailable={}" + ", removed={}" + ", policy={}"
  + ")", expectedSize, storageTypes.size(), replication, storageTypes,
  unavailables, removed, this);
} else if (storageTypes.size() > expectedSize) {
//should remove excess storageType to return expectedSize storageType
}
{code}



> replace datanode failed because of  NameNodeRpcServer#getAdditionalDatanode 
> returning excessive datanodeInfo
> 
>
> Key: HDFS-13915
> URL: https://issues.apache.org/jira/browse/HDFS-13915
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
> Environment: 
>Reporter: Jiandan Yang 
>Priority: Major
>
> Consider the following situation:
> 1. Create a file with the ALLSSD policy.
> 2. The NameNode returns [SSD,SSD,DISK] due to a lack of SSD space.
> 3. The client calls NameNodeRpcServer#getAdditionalDatanode when recovering the write 
> pipeline and replacing a bad datanode.
> 4. BlockPlacementPolicyDefault#chooseTarget calls 
> StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
> chooseStorageTypes returns [SSD,SSD].
> {code:java}
>   @Test
>   public void testAllSSDFallbackAndNonNewBlock() {
> final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD);
> List<StorageType> storageTypes = allSSD.chooseStorageTypes((short) 3,
> Arrays.asList(StorageType.DISK, StorageType.SSD),
> EnumSet.noneOf(StorageType.class), false);
> assertEquals(2, storageTypes.size());
> assertEquals(StorageType.SSD, storageTypes.get(0));
> assertEquals(StorageType.SSD, storageTypes.get(1));
>   }
> {code}
> 5. numOfReplicas = requiredStorageTypes.size() sets numOfReplicas to 2, so two 
> additional datanodes are chosen.
> 6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client.
> 7. DataStreamer#findNewDatanode finds that nodes.length != original.length + 1 and 
> throws an IOException, so the write finally fails.
> {code:java}
> private int findNewDatanode(final DatanodeInfo[] original
>   ) throws IOException {
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>   "Failed to replace a bad datanode on the existing pipeline "
>   + "due to no more good datanodes being available to try. "
>   + "(Nodes: current=" + Arrays.asList(nodes)
>   + ", original=" + Arrays.asList(original) + "). "
>   + "The current failed datanode replacement policy is "
>   + dfsClient.dtpReplaceDatanodeOnFailure
>   + ", and a client may configure this via '"
>   + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>   + "' in its configuration.");
> }
> for(int i = 0; i < nodes.length; i++) {
>   int j = 0;
>   for(; j < original.length && !nodes[i].equals(original[j]); j++);
>   if (j == original.length) {
> return i;
>   }
> }
> throw new IOException("Failed: new datanode not found: nodes="
> + 

[jira] [Commented] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-11-13 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686177#comment-16686177
 ] 

Jiandan Yang  commented on HDFS-13915:
--

I added a test case in [^HDFS-13915.001.patch] based on trunk to reproduce the issue. 
Hi, [~szetszwo], BlockStoragePolicy#chooseStorageTypes may return excessive 
storageTypes, and I do not understand why after looking through the related code. 
Can we remove the excessive storageTypes?

{code:java}
if (storageTypes.size() < expectedSize) {
  LOG.warn("Failed to place enough replicas: expected size is {}"
  + " but only {} storage types can be selected (replication={},"
  + " selected={}, unavailable={}" + ", removed={}" + ", policy={}"
  + ")", expectedSize, storageTypes.size(), replication, storageTypes,
  unavailables, removed, this);
} else if (storageTypes.size() > expectedSize) {
//should remove excess storageType to return expectedSize storageType
}
{code}



> replace datanode failed because of  NameNodeRpcServer#getAdditionalDatanode 
> returning excessive datanodeInfo
> 
>
> Key: HDFS-13915
> URL: https://issues.apache.org/jira/browse/HDFS-13915
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
> Environment: 
>Reporter: Jiandan Yang 
>Priority: Major
>
> Consider the following situation:
> 1. Create a file with the ALLSSD policy.
> 2. The NameNode returns [SSD,SSD,DISK] due to a lack of SSD space.
> 3. The client calls NameNodeRpcServer#getAdditionalDatanode when recovering the write 
> pipeline and replacing a bad datanode.
> 4. BlockPlacementPolicyDefault#chooseTarget calls 
> StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
> chooseStorageTypes returns [SSD,SSD].
> {code:java}
>   @Test
>   public void testAllSSDFallbackAndNonNewBlock() {
> final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD);
> List<StorageType> storageTypes = allSSD.chooseStorageTypes((short) 3,
> Arrays.asList(StorageType.DISK, StorageType.SSD),
> EnumSet.noneOf(StorageType.class), false);
> assertEquals(2, storageTypes.size());
> assertEquals(StorageType.SSD, storageTypes.get(0));
> assertEquals(StorageType.SSD, storageTypes.get(1));
>   }
> {code}
> 5. numOfReplicas = requiredStorageTypes.size() sets numOfReplicas to 2, so two 
> additional datanodes are chosen.
> 6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client.
> 7. DataStreamer#findNewDatanode finds that nodes.length != original.length + 1 and 
> throws an IOException, so the write finally fails.
> {code:java}
> private int findNewDatanode(final DatanodeInfo[] original
>   ) throws IOException {
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>   "Failed to replace a bad datanode on the existing pipeline "
>   + "due to no more good datanodes being available to try. "
>   + "(Nodes: current=" + Arrays.asList(nodes)
>   + ", original=" + Arrays.asList(original) + "). "
>   + "The current failed datanode replacement policy is "
>   + dfsClient.dtpReplaceDatanodeOnFailure
>   + ", and a client may configure this via '"
>   + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>   + "' in its configuration.");
> }
> for(int i = 0; i < nodes.length; i++) {
>   int j = 0;
>   for(; j < original.length && !nodes[i].equals(original[j]); j++);
>   if (j == original.length) {
> return i;
>   }
> }
> throw new IOException("Failed: new datanode not found: nodes="
> + Arrays.asList(nodes) + ", original=" + Arrays.asList(original));
>   }
> {code}
> The client warning log is:
>  {code:java}
> WARN [DataStreamer for file 
> /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
>  block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
>  
> original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
>  The current failed datanode 

[jira] [Comment Edited] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-13 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686140#comment-16686140
 ] 

Jiandan Yang  edited comment on HDFS-14045 at 11/14/18 6:49 AM:


There are many "[ERROR] Error occurred in starting fork, check output in log" messages 
in the test log, and I think there may be something wrong with Jenkins.

Uploading [^HDFS-14045.010.patch] to trigger Jenkins.


was (Author: yangjiandan):
There are many "[ERROR] Error occurred in starting fork, check output in log" 
in test log,  and I think there may be something wrong with Jenkins.

Uploading [^HDFS-14045] to trigger Jenkins.

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, 
> HDFS-14045.009.patch
>
>
> Currently the DataNode uses the same metrics to measure the RPC latency of the 
> NameNode, but the Active and Standby NameNodes usually perform differently at the 
> same time, especially in a large cluster. For example, the RPC latency of the 
> Standby is very long when it is catching up on the edit log, so we may 
> misunderstand the state of HDFS. Using different metrics for the Active and 
> Standby NameNodes can help us obtain more precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-13 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686140#comment-16686140
 ] 

Jiandan Yang  commented on HDFS-14045:
--

There are many "[ERROR] Error occurred in starting fork, check output in log" messages 
in the test log, and I think there may be something wrong with Jenkins.

Uploading [^HDFS-14045] to trigger Jenkins.

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, 
> HDFS-14045.009.patch
>
>
> Currently the DataNode uses the same metrics to measure the RPC latency of the 
> NameNode, but the Active and Standby NameNodes usually perform differently at the 
> same time, especially in a large cluster. For example, the RPC latency of the 
> Standby is very long when it is catching up on the edit log, so we may 
> misunderstand the state of HDFS. Using different metrics for the Active and 
> Standby NameNodes can help us obtain more precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-13 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-14045:
-
Attachment: HDFS-14045.010.patch

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, 
> HDFS-14045.009.patch, HDFS-14045.010.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-13 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686134#comment-16686134
 ] 

Jiandan Yang  edited comment on HDFS-14045 at 11/14/18 6:41 AM:


Thanks [~elgoiri] for your comments.
{quote}TestDataNodeMetrics#testNNRpcMetricsWithFederationAndHA(), 
testNNRpcMetricsWithFederation() and testNNRpcMetricsWithHA(), no need to 
extract the suffix.
{quote}
I've removed the suffix in [^HDFS-14045.009.patch].
{quote}I'm not sure about the Unknown-Unknown behavior, if we cannot determine 
the id, we may want to just leave it as it was?
{quote}
Do you mean we should not record the metric when the suffix is Unknown-Unknown? 
I do not understand what you mean.
{quote}Which unit test makes sure that HeartbeatsNumOps and HeartbeatsAvgTime 
are still showing the old values? It looks good but just to verify.
{quote}
A good suggestion; I've added verification of HeartbeatsNumOps in 
[^HDFS-14045.009.patch].


was (Author: yangjiandan):
Thanks [~elgoiri] for your comments.
{quote}
TestDataNodeMetrics#testNNRpcMetricsWithFederationAndHA(), 
testNNRpcMetricsWithFederation() and testNNRpcMetricsWithHA(), no need to 
extract the suffix.
{quote}
I've removed the suffix in [^HDFS-14045.009.patch]
{quote}
 I'm not sure about the Unknown-Unknown behavior, if we cannot determine the 
id, we may want to just leave it as it was?
{quote}
Do you mean we should not record the metric when the suffix is Unknown-Unknown? 
I do not understand what you mean.
{quote}
Which unit test makes sure that HeartbeatsNumOps and HeartbeatsAvgTime are 
still showing the old values? It looks good but just to verify.
{quote}
A good suggestion; I've added verification of HeartbeatsNumOps in 
[^HDFS-14045.009.patch]

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, 
> HDFS-14045.009.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-13 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686134#comment-16686134
 ] 

Jiandan Yang  edited comment on HDFS-14045 at 11/14/18 6:38 AM:


Thanks [~elgoiri] for your comments.
{quote}
TestDataNodeMetrics#testNNRpcMetricsWithFederationAndHA(), 
testNNRpcMetricsWithFederation() and testNNRpcMetricsWithHA(), no need to 
extract the suffix.
{quote}
I've removed the suffix in [^HDFS-14045.009.patch]
{quote}
 I'm not sure about the Unknown-Unknown behavior, if we cannot determine the 
id, we may want to just leave it as it was?
{quote}
Do you mean we should not record the metric when the suffix is Unknown-Unknown? 
I do not understand what you mean.
{quote}
Which unit test makes sure that HeartbeatsNumOps and HeartbeatsAvgTime are 
still showing the old values? It looks good but just to verify.
{quote}
A good suggestion; I've added verification of HeartbeatsNumOps in 
[^HDFS-14045.009.patch]


was (Author: yangjiandan):
Thanks [~elgoiri] for your comments.
{quote}
TestDataNodeMetrics#testNNRpcMetricsWithFederationAndHA(), 
testNNRpcMetricsWithFederation() and testNNRpcMetricsWithHA(), no need to 
extract the suffix.
{quote}
I've removed the suffix in [^HDFS-14045.009.patch]
{quote}
 I'm not sure about the Unknown-Unknown behavior, if we cannot determine the 
id, we may want to just leave it as it was?
{quote}
Do you mean we should not record the metric when the suffix is Unknown-Unknown? 
I do not understand what you mean.
{quote}
Which unit test makes sure that HeartbeatsNumOps and HeartbeatsAvgTime are 
still showing the old values? It looks good but just to verify.
{quote}
A good suggestion; I've added verification of HeartbeatsNumOps in 
[^HDFS-14045.009.patch]

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, 
> HDFS-14045.009.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-13 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686134#comment-16686134
 ] 

Jiandan Yang  commented on HDFS-14045:
--

Thanks [~elgoiri] for your comments.
{quote}
TestDataNodeMetrics#testNNRpcMetricsWithFederationAndHA(), 
testNNRpcMetricsWithFederation() and testNNRpcMetricsWithHA(), no need to 
extract the suffix.
{quote}
I've removed the suffix in [^HDFS-14045.009.patch]
{quote}
 I'm not sure about the Unknown-Unknown behavior, if we cannot determine the 
id, we may want to just leave it as it was?
{quote}
Do you mean we should not record the metric when the suffix is Unknown-Unknown? 
I do not understand what you mean.
{quote}
Which unit test makes sure that HeartbeatsNumOps and HeartbeatsAvgTime are 
still showing the old values? It looks good but just to verify.
{quote}
A good suggestion; I've added verification of HeartbeatsNumOps in 
[^HDFS-14045.009.patch]

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, 
> HDFS-14045.009.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-13 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-14045:
-
Attachment: HDFS-14045.009.patch

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, 
> HDFS-14045.009.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-12 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684619#comment-16684619
 ] 

Jiandan Yang  commented on HDFS-14045:
--

Hi, [~xkrogen] 
I have updated the patch according to your review comments; please help review 
it again.

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-10 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682726#comment-16682726
 ] 

Jiandan Yang  commented on HDFS-14045:
--

The TestNameNodeMXBean failure is not related to this patch; it runs 
successfully on my local machine.

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-10 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-14045:
-
Attachment: HDFS-14045.008.patch

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-10 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682395#comment-16682395
 ] 

Jiandan Yang  commented on HDFS-14045:
--

Thanks very much, [~xkrogen], for your comments.
{quote}
Can we change the name of the method/parameter to something indicating it is 
for metrics only, maybe like nnLatencyMetricsSuffix? It looks particularly odd 
to me in IncrementalBlockReportManager right now.
{quote}
I renamed {{nnLatencyMetricsSuffix}} to {{rpcMetricSuffix}}; what do you think 
of this name?
{quote}
I think I would prefer to see the existing methods in DataNodeMetrics changed 
to update both metrics, rather than the caller having to remember to call both 
methods. It introduces less possibility for the two metrics to get out of sync 
later.
{quote}
Very good suggestion. In patch008 I changed the existing methods to update both 
metrics in a single call; since the serviceId-nnId is needed when updating the 
per-NN metric, the existing methods now take an extra parameter used as the 
metric suffix.
{quote}
I'm not sure if you should re-use the same MutableRatesWithAggregation for all 
of the metrics. It seems cleaner to me to have one per metric type, e.g. one 
for heartbeats, one for lifeline, and so on, but let me know if you disagree. I 
think this may even make it so that, if you set up the names correctly, the 
MutableRatesWithAggregation can replace the existing MutableRate while 
maintaining the name of the metric. Not 100% sure on this.
{quote}
I prefer to re-use one MutableRatesWithAggregation for simplicity; it avoids 
adding a new field every time a new metric is added.
{quote}
You should update Metrics.md documenting these new metrics
{quote}
Thanks for the reminder about Metrics.md; the newly added metrics are 
documented in Metrics.md in patch008.
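A minimal sketch, assuming hypothetical names ({{rpcMetricSuffix}}, 
{{addHeartbeat}}, {{NnRpcLatency}}), of the approach discussed above: one 
DataNodeMetrics-style method updates both the legacy aggregate rate and a 
per-NameNode rate keyed by the serviceId-nnId suffix, so callers cannot forget 
one of the two. This is an illustration of the idea, not the actual patch.
{code:java}
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableRate;
import org.apache.hadoop.metrics2.lib.MutableRatesWithAggregation;

// Sketch only; names and structure are assumptions, not the real DataNodeMetrics.
public class DataNodeRpcLatencyMetricsSketch {
  private final MetricsRegistry registry = new MetricsRegistry("DataNodeActivity");
  // Legacy aggregate metric, kept for compatibility with existing dashboards.
  private final MutableRate heartbeats = registry.newRate("Heartbeats");
  // One shared aggregation for all per-NameNode rates, as preferred above.
  private final MutableRatesWithAggregation nnRpcLatency =
      registry.newRatesWithAggregation("NnRpcLatency");

  /** Update the legacy metric and the per-NameNode metric in one call. */
  public void addHeartbeat(long latencyMs, String rpcMetricSuffix) {
    heartbeats.add(latencyMs);                                      // old aggregate
    nnRpcLatency.add("HeartbeatsFor" + rpcMetricSuffix, latencyMs); // e.g. "HeartbeatsForNs1-nn1"
  }
}
{code}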

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-09 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681024#comment-16681024
 ] 

Jiandan Yang  commented on HDFS-14045:
--

The failed unit test is not related to this patch; it runs successfully on my 
local machine.
[~cheersyang], [~elgoiri], [~xkrogen], would you please help review again?

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-08 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-14045:
-
Attachment: HDFS-14045.006.patch

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-08 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-14045:
-
Attachment: HDFS-14045.007.patch

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-08 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-14045:
-
Attachment: HDFS-14045.005.patch

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-08 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679222#comment-16679222
 ] 

Jiandan Yang  edited comment on HDFS-14045 at 11/8/18 3:01 AM:
---

Hi [~elgoiri], I fully agree with you.
Setting the interval together with its unit is more readable; I'll update this 
in patch5.


was (Author: yangjiandan):
Hi [~elgoiri], I do agree with you.
Setting the interval together with its unit is more readable; I'll update this 
in patch5.

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-08 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679413#comment-16679413
 ] 

Jiandan Yang  commented on HDFS-14045:
--

Hi [~cheersyang] and [~xkrogen], thanks very much for your suggestions.
I add the nameServiceId and NameNodeId into the metric names dynamically, and I 
keep the old metrics for compatibility.
Please help me review patch6.

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-08 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679807#comment-16679807
 ] 

Jiandan Yang  commented on HDFS-14045:
--

There may be something wrong with Jenkins.
Fixed the checkstyle error and uploaded patch007 to retrigger Jenkins.

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, 
> HDFS-14045.006.patch, HDFS-14045.007.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-07 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679222#comment-16679222
 ] 

Jiandan Yang  commented on HDFS-14045:
--

Hi [~elgoiri], I do agree with you.
Setting the interval together with its unit is more readable; I'll update this 
in patch5.

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice

2018-11-06 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677737#comment-16677737
 ] 

Jiandan Yang  commented on HDFS-13984:
--

Uploaded patch3 to trigger the test run.

> getFileInfo of libhdfs call NameNode#getFileStatus twice
> 
>
> Key: HDFS-13984
> URL: https://issues.apache.org/jira/browse/HDFS-13984
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13984.001.patch, HDFS-13984.002.patch, 
> HDFS-13984.003.patch
>
>
> getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls 
> *FileSystem#getFileStatus*.  
> *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows:
> {code:java}
>   public boolean exists(Path f) throws IOException {
> try {
>   return getFileStatus(f) != null;
> } catch (FileNotFoundException e) {
>   return false;
> }
>   }
> {code}
> and finally this leads to call NameNodeRpcServer#getFileInfo twice.
> Actually we can implement by calling once.
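A minimal sketch of the single-call idea (illustrative names only, not the 
actual hdfs.c change): call getFileStatus() once and treat FileNotFoundException 
as "path does not exist", so only one NameNodeRpcServer#getFileInfo RPC is 
issued.
{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class GetFileInfoSketch {
  /** Returns the file status, or null if the path does not exist. */
  static FileStatus getFileInfoOnce(FileSystem fs, Path path) throws IOException {
    try {
      return fs.getFileStatus(path);   // single getFileInfo RPC to the NameNode
    } catch (FileNotFoundException e) {
      return null;                     // replaces the separate exists() check
    }
  }
}
{code}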



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice

2018-11-06 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13984:
-
Attachment: HDFS-13984.003.patch

> getFileInfo of libhdfs call NameNode#getFileStatus twice
> 
>
> Key: HDFS-13984
> URL: https://issues.apache.org/jira/browse/HDFS-13984
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13984.001.patch, HDFS-13984.002.patch, 
> HDFS-13984.003.patch
>
>
> getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls 
> *FileSystem#getFileStatus*.  
> *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows:
> {code:java}
>   public boolean exists(Path f) throws IOException {
> try {
>   return getFileStatus(f) != null;
> } catch (FileNotFoundException e) {
>   return false;
> }
>   }
> {code}
> and finally this leads to call NameNodeRpcServer#getFileInfo twice.
> Actually we can implement by calling once.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-06 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677695#comment-16677695
 ] 

Jiandan Yang  commented on HDFS-14045:
--

Thanks [~xkrogen] for your suggestion and for the reminder about multiple 
standbys.
At first I thought measuring the Active's latency was enough, because the 
Standby does not serve users; after reading your comments I realize it is also 
helpful to monitor an Observer if it serves requests. Still, I think grouping 
metrics by the NN's role is better than grouping by NameNode ID, because we 
cannot tell from the metric name alone whether it belongs to the Active, 
Standby or Observer.


> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-06 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-14045:
-
Attachment: HDFS-14045.004.patch

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch, HDFS-14045.004.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-06 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677647#comment-16677647
 ] 

Jiandan Yang  commented on HDFS-14045:
--

Hi [~elgoiri],
* Sorry for forgetting to fix the checkstyle issue; I'll fix it.
* The unit of dfs.heartbeat.interval is *seconds*, not milliseconds.
* Your guess is exactly right, and I added an assert in the unit test.

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-05 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676151#comment-16676151
 ] 

Jiandan Yang  commented on HDFS-14045:
--

Hi [~cheersyang], {{BPOfferService#updateActorStatesFromHeartbeat}} uses the 
HAServiceState carried in heartbeat responses to determine which NN is active 
and keeps the active actor in the field {{bpServiceToActive}}, so we can use 
{{bpServiceToActive}} directly.
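A simplified sketch of that mechanism (assumed names; not the actual 
BPOfferService code): the heartbeat response carries the NameNode's 
HAServiceState, and the NN reported as ACTIVE is remembered so later callers 
can consult it directly.
{code:java}
import org.apache.hadoop.ha.HAServiceProtocol.HAServiceState;

// Sketch only: tracks which NameNode is active based on heartbeat responses,
// analogous to how bpServiceToActive is maintained.
class ActiveNnTrackerSketch {
  private volatile String activeNnId;

  /** Called when a heartbeat response reports a NameNode's HA state. */
  void updateFromHeartbeat(String nnId, HAServiceState state) {
    if (state == HAServiceState.ACTIVE) {
      activeNnId = nnId;                 // remember the currently active NN
    } else if (nnId.equals(activeNnId)) {
      activeNnId = null;                 // the previously active NN stepped down
    }
  }

  /** True if the given NameNode is the one last reported as active. */
  boolean isActive(String nnId) {
    return nnId.equals(activeNnId);
  }
}
{code}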

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice

2018-11-05 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676077#comment-16676077
 ] 

Jiandan Yang  commented on HDFS-13984:
--

Hi [~jzhuge],

The failed unit test is not caused by this patch; [~anatoli.shein] is resolving 
it in HDFS-14047.

I am confused about the compiler warnings: hdfs.c:3484 and libhdfs_wrapper.c:21 
are not introduced by this patch.

> getFileInfo of libhdfs call NameNode#getFileStatus twice
> 
>
> Key: HDFS-13984
> URL: https://issues.apache.org/jira/browse/HDFS-13984
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13984.001.patch, HDFS-13984.002.patch
>
>
> getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls 
> *FileSystem#getFileStatus*.  
> *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows:
> {code:java}
>   public boolean exists(Path f) throws IOException {
> try {
>   return getFileStatus(f) != null;
> } catch (FileNotFoundException e) {
>   return false;
> }
>   }
> {code}
> and finally this leads to call NameNodeRpcServer#getFileInfo twice.
> Actually we can implement by calling once.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-05 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-14045:
-
Attachment: HDFS-14045.003.patch

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, 
> HDFS-14045.003.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-05 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676059#comment-16676059
 ] 

Jiandan Yang  commented on HDFS-14045:
--

Thanks [~elgoiri] for your review comments.
 * I've added javadoc to {{BPServiceActor#isActive()}} and removed the invalid 
comment in {{testNNRpcMetricsWithHA}} in patch003.
 * About the UT, my reasoning is as follows (a rough sketch of the flow appears 
after this list):
 ** The heartbeat interval is set to 3000 seconds to prevent the bpServiceActor 
from sending periodic heartbeats to the NN while the test case runs; without an 
explicit trigger, the bpServiceActor only sends one heartbeat after startup.
 ** The heartbeat is triggered twice so that the DN first learns the active NN 
from the heartbeat response and then updates the metrics. The first trigger 
obtains the active NN from the heartbeat response after one NN transitions to 
active, so one of the two service actors is marked as the active actor; the 
second trigger updates HeartbeatsNumOps and HeartbeatsForStandbyNumOps by 
sending heartbeats to both the Active and the Standby.
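A rough sketch of that flow, under explicit assumptions: a MiniDFSCluster with 
a simple HA topology, DataNodeTestUtils.triggerHeartbeat() to force the two 
heartbeats, and only the pre-existing HeartbeatsNumOps counter asserted (the 
per-NN counter names added by the patch are omitted here, since they depend on 
the patch itself).
{code:java}
import static org.apache.hadoop.test.MetricsAsserts.assertCounterGt;
import static org.apache.hadoop.test.MetricsAsserts.getMetrics;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.MiniDFSNNTopology;
import org.apache.hadoop.hdfs.server.datanode.DataNode;
import org.apache.hadoop.hdfs.server.datanode.DataNodeTestUtils;
import org.apache.hadoop.metrics2.MetricsRecordBuilder;
import org.junit.Test;

public class TestHeartbeatMetricsSketch {
  @Test
  public void testHeartbeatMetricsWithHA() throws Exception {
    Configuration conf = new HdfsConfiguration();
    // Very large interval (seconds) so no periodic heartbeat fires mid-test.
    conf.setLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 3000);
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .nnTopology(MiniDFSNNTopology.simpleHATopology())
        .numDataNodes(1).build();
    try {
      cluster.waitActive();
      cluster.transitionToActive(0);
      DataNode dn = cluster.getDataNodes().get(0);
      // First trigger: the heartbeat response tells the DN which NN is active.
      DataNodeTestUtils.triggerHeartbeat(dn);
      // Second trigger: heartbeats to active and standby update the split metrics.
      DataNodeTestUtils.triggerHeartbeat(dn);
      MetricsRecordBuilder rb = getMetrics(dn.getMetrics().name());
      // The pre-existing aggregate counter must still be populated.
      assertCounterGt("HeartbeatsNumOps", 0L, rb);
    } finally {
      cluster.shutdown();
    }
  }
}
{code}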

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-05 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-14045:
-
Attachment: HDFS-14045.002.patch

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-05 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675084#comment-16675084
 ] 

Jiandan Yang  commented on HDFS-14045:
--

Hi [~cheersyang], [~elgoiri] and [~xkrogen],
Thanks very much for your review and comments.

To [~cheersyang]:
This patch works under HA, non-HA and federation setups.
Different clusters have different namespaceIDs, so the namespaceID is not 
suitable as part of a metric name.
I think the most reasonable approach would be to add tags at the metric level, 
but tags can only be added at the metric-source level, so I use different 
metric names instead (see the sketch below).
Fortunately, not many metrics need to be added.

To [~xkrogen] and [~elgoiri]:
I've added some unit tests covering HA and non-HA in patch002; could you please 
help review it again?
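To make the tag-vs-name point concrete, here is a small sketch (assumed names, 
not the actual patch): a metrics2 tag is attached at the metrics-source level, 
so every record emitted by the DataNode source carries the same tag value, and 
separating per-NameNode latencies inside one source therefore has to go through 
distinct metric names.
{code:java}
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableRatesWithAggregation;

// Sketch only; illustrates source-level tags vs per-metric names.
class TagVsNameSketch {
  private final MetricsRegistry registry = new MetricsRegistry("DataNodeActivity");
  private final MutableRatesWithAggregation nnRpcLatency =
      registry.newRatesWithAggregation("NnRpcLatency");

  TagVsNameSketch(String sessionId) {
    // A tag applies to the whole metrics source: every metric emitted by this
    // registry carries it, so it cannot separate per-NameNode latencies.
    registry.tag("SessionId", "DataNode session id", sessionId);
  }

  void addHeartbeatLatency(String nnSuffix, long latencyMs) {
    // Per-NameNode separation is therefore encoded in the metric name itself
    // (the "HeartbeatsFor" prefix and suffix format are illustrative only).
    nnRpcLatency.add("HeartbeatsFor" + nnSuffix, latencyMs);
  }
}
{code}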

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-02 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672683#comment-16672683
 ] 

Jiandan Yang  commented on HDFS-14045:
--

[~cheersyang] Would you please help me review this patch?

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice

2018-11-02 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672654#comment-16672654
 ] 

Jiandan Yang  commented on HDFS-13984:
--

[~jzhuge] Could you help me review this patch?

> getFileInfo of libhdfs call NameNode#getFileStatus twice
> 
>
> Key: HDFS-13984
> URL: https://issues.apache.org/jira/browse/HDFS-13984
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13984.001.patch, HDFS-13984.002.patch
>
>
> getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls 
> *FileSystem#getFileStatus*.  
> *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows:
> {code:java}
>   public boolean exists(Path f) throws IOException {
> try {
>   return getFileStatus(f) != null;
> } catch (FileNotFoundException e) {
>   return false;
> }
>   }
> {code}
> and finally this leads to call NameNodeRpcServer#getFileInfo twice.
> Actually we can implement by calling once.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-01 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-14045:
-
Attachment: HDFS-14045.001.patch
Status: Patch Available  (was: Open)

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-14045.001.patch
>
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-01 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  reassigned HDFS-14045:


Assignee: Jiandan Yang 

> Use different metrics in DataNode to better measure latency of 
> heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
> --
>
> Key: HDFS-14045
> URL: https://issues.apache.org/jira/browse/HDFS-14045
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
>
> Currently DataNode uses same metrics to measure rpc latency of NameNode, but 
> Active and Standby usually have different performance at the same time, 
> especially in large cluster. For example, rpc latency of Standby is very long 
> when Standby is catching up editlog. We may misunderstand the state of HDFS. 
> Using different metrics for Active and standby can help us obtain more 
> precise metric data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN

2018-11-01 Thread Jiandan Yang (JIRA)
Jiandan Yang  created HDFS-14045:


 Summary: Use different metrics in DataNode to better measure 
latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
 Key: HDFS-14045
 URL: https://issues.apache.org/jira/browse/HDFS-14045
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Reporter: Jiandan Yang 


Currently the DataNode uses the same metrics to measure RPC latency to the NameNode,
but the Active and Standby NameNodes usually perform differently at the same time,
especially in a large cluster. For example, RPC latency to the Standby is very long
while the Standby is catching up on the editlog, so we may misjudge the state of HDFS.
Using separate metrics for the Active and Standby NameNodes can give us more precise
metric data.
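
A minimal, self-contained sketch of the idea (illustrative only, not the attached patch; the class and method names are hypothetical): keep the latency accumulators keyed by the NameNode address that each BPServiceActor talks to, so the Active and Standby NameNodes no longer share a single metric.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: per-NameNode latency accumulators instead of one shared metric.
public class PerNameNodeLatency {
  private static class Stat {
    final LongAdder totalMicros = new LongAdder();
    final LongAdder samples = new LongAdder();
  }

  // Keyed by the NameNode address a BPServiceActor reports to, so the Active
  // and Standby each get their own heartbeat latency numbers.
  private final Map<String, Stat> heartbeatLatency = new ConcurrentHashMap<>();

  public void recordHeartbeat(String nnAddress, long elapsedMicros) {
    Stat s = heartbeatLatency.computeIfAbsent(nnAddress, k -> new Stat());
    s.totalMicros.add(elapsedMicros);
    s.samples.increment();
  }

  public double avgHeartbeatMicros(String nnAddress) {
    Stat s = heartbeatLatency.get(nnAddress);
    return (s == null || s.samples.sum() == 0)
        ? 0.0 : (double) s.totalMicros.sum() / s.samples.sum();
  }
}
{code}

The same pattern would apply to the blockReports and incrementalBlockReports timers.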



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice

2018-11-01 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13984:
-
Attachment: HDFS-13984.002.patch

> getFileInfo of libhdfs call NameNode#getFileStatus twice
> 
>
> Key: HDFS-13984
> URL: https://issues.apache.org/jira/browse/HDFS-13984
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13984.001.patch, HDFS-13984.002.patch
>
>
> getFileInfo in hdfs.c calls *FileSystem#exists* first and then calls 
> *FileSystem#getFileStatus*.  
> *FileSystem#exists* also calls *FileSystem#getFileStatus*, as follows:
> {code:java}
>   public boolean exists(Path f) throws IOException {
> try {
>   return getFileStatus(f) != null;
> } catch (FileNotFoundException e) {
>   return false;
> }
>   }
> {code}
> and finally this leads to calling NameNodeRpcServer#getFileInfo twice.
> Actually we can implement it with a single call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice

2018-11-01 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13984:
-
Attachment: (was: HDFS-13984.002.patch)

> getFileInfo of libhdfs call NameNode#getFileStatus twice
> 
>
> Key: HDFS-13984
> URL: https://issues.apache.org/jira/browse/HDFS-13984
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13984.001.patch, HDFS-13984.002.patch
>
>
> getFileInfo in hdfs.c calls *FileSystem#exists* first and then calls 
> *FileSystem#getFileStatus*.  
> *FileSystem#exists* also calls *FileSystem#getFileStatus*, as follows:
> {code:java}
>   public boolean exists(Path f) throws IOException {
> try {
>   return getFileStatus(f) != null;
> } catch (FileNotFoundException e) {
>   return false;
> }
>   }
> {code}
> and finally this leads to calling NameNodeRpcServer#getFileInfo twice.
> Actually we can implement it with a single call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice

2018-10-11 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13984:
-
Attachment: HDFS-13984.002.patch

> getFileInfo of libhdfs call NameNode#getFileStatus twice
> 
>
> Key: HDFS-13984
> URL: https://issues.apache.org/jira/browse/HDFS-13984
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13984.001.patch, HDFS-13984.002.patch
>
>
> getFileInfo in hdfs.c calls *FileSystem#exists* first and then calls 
> *FileSystem#getFileStatus*.  
> *FileSystem#exists* also calls *FileSystem#getFileStatus*, as follows:
> {code:java}
>   public boolean exists(Path f) throws IOException {
> try {
>   return getFileStatus(f) != null;
> } catch (FileNotFoundException e) {
>   return false;
> }
>   }
> {code}
> and finally this leads to calling NameNodeRpcServer#getFileInfo twice.
> Actually we can implement it with a single call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice

2018-10-11 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13984:
-
Description: 
getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls 
*FileSystem#getFileStatus*.  *FileSystem#exists* also call 
*FileSystem#getFileStatus*, just as follows:
{code:java}
  public boolean exists(Path f) throws IOException {
try {
  return getFileStatus(f) != null;
} catch (FileNotFoundException e) {
  return false;
}
  }
{code}

and finally this leads to call NameNodeRpcServer#getFileInfo twice.
Actually we can implement by calling once.


  was:
getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls 
*FileSystem#getFileStatus*.  *FileSystem#exists* also call 
*FileSystem#getFileStatus, just as follows:
{code:java}
  public boolean exists(Path f) throws IOException {
try {
  return getFileStatus(f) != null;
} catch (FileNotFoundException e) {
  return false;
}
  }
{code}

and finally this leads to call NameNodeRpcServer#getFileInfo twice.
Actually we can implement by calling once.



> getFileInfo of libhdfs call NameNode#getFileStatus twice
> 
>
> Key: HDFS-13984
> URL: https://issues.apache.org/jira/browse/HDFS-13984
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13984.001.patch
>
>
> getFileInfo in hdfs.c calls *FileSystem#exists* first and then calls 
> *FileSystem#getFileStatus*. *FileSystem#exists* also calls 
> *FileSystem#getFileStatus*, as follows:
> {code:java}
>   public boolean exists(Path f) throws IOException {
> try {
>   return getFileStatus(f) != null;
> } catch (FileNotFoundException e) {
>   return false;
> }
>   }
> {code}
> and finally this leads to calling NameNodeRpcServer#getFileInfo twice.
> Actually we can implement it with a single call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice

2018-10-11 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13984:
-
Description: 
getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls 
*FileSystem#getFileStatus*.  

*FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows:
{code:java}
  public boolean exists(Path f) throws IOException {
try {
  return getFileStatus(f) != null;
} catch (FileNotFoundException e) {
  return false;
}
  }
{code}

and finally this leads to call NameNodeRpcServer#getFileInfo twice.
Actually we can implement by calling once.


  was:
getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls 
*FileSystem#getFileStatus*.  *FileSystem#exists* also call 
*FileSystem#getFileStatus*, just as follows:
{code:java}
  public boolean exists(Path f) throws IOException {
try {
  return getFileStatus(f) != null;
} catch (FileNotFoundException e) {
  return false;
}
  }
{code}

and finally this leads to call NameNodeRpcServer#getFileInfo twice.
Actually we can implement by calling once.



> getFileInfo of libhdfs call NameNode#getFileStatus twice
> 
>
> Key: HDFS-13984
> URL: https://issues.apache.org/jira/browse/HDFS-13984
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13984.001.patch
>
>
> getFileInfo in hdfs.c calls *FileSystem#exists* first and then calls 
> *FileSystem#getFileStatus*.  
> *FileSystem#exists* also calls *FileSystem#getFileStatus*, as follows:
> {code:java}
>   public boolean exists(Path f) throws IOException {
> try {
>   return getFileStatus(f) != null;
> } catch (FileNotFoundException e) {
>   return false;
> }
>   }
> {code}
> and finally this leads to calling NameNodeRpcServer#getFileInfo twice.
> Actually we can implement it with a single call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice

2018-10-11 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13984:
-
Status: Patch Available  (was: Open)

> getFileInfo of libhdfs call NameNode#getFileStatus twice
> 
>
> Key: HDFS-13984
> URL: https://issues.apache.org/jira/browse/HDFS-13984
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13984.001.patch
>
>
> getFileInfo in hdfs.c calls *FileSystem#exists* first and then calls 
> *FileSystem#getFileStatus*. *FileSystem#exists* also calls 
> *FileSystem#getFileStatus*, as follows:
> {code:java}
>   public boolean exists(Path f) throws IOException {
> try {
>   return getFileStatus(f) != null;
> } catch (FileNotFoundException e) {
>   return false;
> }
>   }
> {code}
> and finally this leads to calling NameNodeRpcServer#getFileInfo twice.
> Actually we can implement it with a single call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice

2018-10-11 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13984:
-
Attachment: HDFS-13984.001.patch

> getFileInfo of libhdfs call NameNode#getFileStatus twice
> 
>
> Key: HDFS-13984
> URL: https://issues.apache.org/jira/browse/HDFS-13984
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-13984.001.patch
>
>
> getFileInfo in hdfs.c calls *FileSystem#exists* first and then calls 
> *FileSystem#getFileStatus*. *FileSystem#exists* also calls 
> *FileSystem#getFileStatus*, as follows:
> {code:java}
>   public boolean exists(Path f) throws IOException {
> try {
>   return getFileStatus(f) != null;
> } catch (FileNotFoundException e) {
>   return false;
> }
>   }
> {code}
> and finally this leads to calling NameNodeRpcServer#getFileInfo twice.
> Actually we can implement it with a single call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice

2018-10-11 Thread Jiandan Yang (JIRA)
Jiandan Yang  created HDFS-13984:


 Summary: getFileInfo of libhdfs call NameNode#getFileStatus twice
 Key: HDFS-13984
 URL: https://issues.apache.org/jira/browse/HDFS-13984
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: libhdfs
Reporter: Jiandan Yang 
Assignee: Jiandan Yang 


getFileInfo in hdfs.c calls *FileSystem#exists* first and then calls 
*FileSystem#getFileStatus*. *FileSystem#exists* also calls 
*FileSystem#getFileStatus*, as follows:
{code:java}
  public boolean exists(Path f) throws IOException {
try {
  return getFileStatus(f) != null;
} catch (FileNotFoundException e) {
  return false;
}
  }
{code}

and finally this leads to calling NameNodeRpcServer#getFileInfo twice.
Actually we can implement it with a single call.
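
A minimal sketch of the single-call idea (the real change is in hdfs.c; this Java version only illustrates the pattern): call getFileStatus once and treat FileNotFoundException as "file does not exist".

{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetFileInfoOnce {
  // One getFileStatus call (one NameNodeRpcServer#getFileInfo RPC) instead of
  // exists() followed by getFileStatus().
  public static FileStatus getFileInfo(FileSystem fs, Path path) throws IOException {
    try {
      return fs.getFileStatus(path);
    } catch (FileNotFoundException e) {
      return null; // callers keep checking for null, as before
    }
  }
}
{code}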




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-09-16 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617092#comment-16617092
 ] 

Jiandan Yang  commented on HDFS-13915:
--

[~hexiaoqiao] 2.6.5, and after checking the code I found that trunk also has the 
same problem.

> replace datanode failed because of  NameNodeRpcServer#getAdditionalDatanode 
> returning excessive datanodeInfo
> 
>
> Key: HDFS-13915
> URL: https://issues.apache.org/jira/browse/HDFS-13915
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
> Environment: 
>Reporter: Jiandan Yang 
>Priority: Major
>
> Consider the following situation:
> 1. create a file with the ALLSSD policy
> 2. [SSD,SSD,DISK] is returned due to lack of SSD space
> 3. the client calls NameNodeRpcServer#getAdditionalDatanode when recovering the 
> write pipeline and replacing a bad datanode
> 4. BlockPlacementPolicyDefault#chooseTarget calls 
> StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
> chooseStorageTypes returns [SSD,SSD]
> {code:java}
>   @Test
>   public void testAllSSDFallbackAndNonNewBlock() {
> final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD);
> List storageTypes = allSSD.chooseStorageTypes((short) 3,
> Arrays.asList(StorageType.DISK, StorageType.SSD),
> EnumSet.noneOf(StorageType.class), false);
> assertEquals(2, storageTypes.size());
> assertEquals(StorageType.SSD, storageTypes.get(0));
> assertEquals(StorageType.SSD, storageTypes.get(1));
>   }
> {code}
> 5. numOfReplicas = requiredStorageTypes.size() sets numOfReplicas to 2, so two 
> additional datanodes are chosen
> 6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client
> 7. DataStreamer#findNewDatanode finds nodes.length != original.length + 1 and 
> throws an IOException, which finally leads to a write failure
> {code:java}
> private int findNewDatanode(final DatanodeInfo[] original
>   ) throws IOException {
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>   "Failed to replace a bad datanode on the existing pipeline "
>   + "due to no more good datanodes being available to try. "
>   + "(Nodes: current=" + Arrays.asList(nodes)
>   + ", original=" + Arrays.asList(original) + "). "
>   + "The current failed datanode replacement policy is "
>   + dfsClient.dtpReplaceDatanodeOnFailure
>   + ", and a client may configure this via '"
>   + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>   + "' in its configuration.");
> }
> for(int i = 0; i < nodes.length; i++) {
>   int j = 0;
>   for(; j < original.length && !nodes[i].equals(original[j]); j++);
>   if (j == original.length) {
> return i;
>   }
> }
> throw new IOException("Failed: new datanode not found: nodes="
> + Arrays.asList(nodes) + ", original=" + Arrays.asList(original));
>   }
> {code}
> The client warn log is:
>  {code:java}
> WARN [DataStreamer for file 
> /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
>  block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
>  
> original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-09-13 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  reassigned HDFS-13915:


Assignee: (was: Jiandan Yang )

> replace datanode failed because of  NameNodeRpcServer#getAdditionalDatanode 
> returning excessive datanodeInfo
> 
>
> Key: HDFS-13915
> URL: https://issues.apache.org/jira/browse/HDFS-13915
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
> Environment: 
>Reporter: Jiandan Yang 
>Priority: Major
>
> Consider the following situation:
> 1. create a file with the ALLSSD policy
> 2. [SSD,SSD,DISK] is returned due to lack of SSD space
> 3. the client calls NameNodeRpcServer#getAdditionalDatanode when recovering the 
> write pipeline and replacing a bad datanode
> 4. BlockPlacementPolicyDefault#chooseTarget calls 
> StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
> chooseStorageTypes returns [SSD,SSD]
> {code:java}
>   @Test
>   public void testAllSSDFallbackAndNonNewBlock() {
> final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD);
> List storageTypes = allSSD.chooseStorageTypes((short) 3,
> Arrays.asList(StorageType.DISK, StorageType.SSD),
> EnumSet.noneOf(StorageType.class), false);
> assertEquals(2, storageTypes.size());
> assertEquals(StorageType.SSD, storageTypes.get(0));
> assertEquals(StorageType.SSD, storageTypes.get(1));
>   }
> {code}
> 5. numOfReplicas = requiredStorageTypes.size() sets numOfReplicas to 2, so two 
> additional datanodes are chosen
> 6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client
> 7. DataStreamer#findNewDatanode finds nodes.length != original.length + 1 and 
> throws an IOException, which finally leads to a write failure
> {code:java}
> private int findNewDatanode(final DatanodeInfo[] original
>   ) throws IOException {
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>   "Failed to replace a bad datanode on the existing pipeline "
>   + "due to no more good datanodes being available to try. "
>   + "(Nodes: current=" + Arrays.asList(nodes)
>   + ", original=" + Arrays.asList(original) + "). "
>   + "The current failed datanode replacement policy is "
>   + dfsClient.dtpReplaceDatanodeOnFailure
>   + ", and a client may configure this via '"
>   + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>   + "' in its configuration.");
> }
> for(int i = 0; i < nodes.length; i++) {
>   int j = 0;
>   for(; j < original.length && !nodes[i].equals(original[j]); j++);
>   if (j == original.length) {
> return i;
>   }
> }
> throw new IOException("Failed: new datanode not found: nodes="
> + Arrays.asList(nodes) + ", original=" + Arrays.asList(original));
>   }
> {code}
> The client warn log is:
>  {code:java}
> WARN [DataStreamer for file 
> /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
>  block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
>  
> DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
>  
> original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
>  
> DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-09-13 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13915:
-
Description: 
Consider following situation:
1. create a file with ALLSSD policy

2. return [SSD,SSD,DISK] due to lack of SSD space

3. client call NameNodeRpcServer#getAdditionalDatanode when recovering write 
pipeline and replacing bad datanode

4. BlockPlacementPolicyDefault#chooseTarget will call 
StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
chooseStorageTypes return [SSD,SSD]
{code:java}
  @Test
  public void testAllSSDFallbackAndNonNewBlock() {
final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD);
List storageTypes = allSSD.chooseStorageTypes((short) 3,
Arrays.asList(StorageType.DISK, StorageType.SSD),
EnumSet.noneOf(StorageType.class), false);
assertEquals(2, storageTypes.size());
assertEquals(StorageType.SSD, storageTypes.get(0));
assertEquals(StorageType.SSD, storageTypes.get(1));
  }
{code}

5. do numOfReplicas = requiredStorageTypes.size() and numOfReplicas is set to 2 
and choose additional two datanodes

6. BlockPlacementPolicyDefault#chooseTarget return four datanodes to client

7. DataStreamer#findNewDatanode find nodes.length != original.length + 1  and 
throw IOException, and finally lead to write failed
{code:java}
private int findNewDatanode(final DatanodeInfo[] original
  ) throws IOException {
if (nodes.length != original.length + 1) {
  throw new IOException(
  "Failed to replace a bad datanode on the existing pipeline "
  + "due to no more good datanodes being available to try. "
  + "(Nodes: current=" + Arrays.asList(nodes)
  + ", original=" + Arrays.asList(original) + "). "
  + "The current failed datanode replacement policy is "
  + dfsClient.dtpReplaceDatanodeOnFailure
  + ", and a client may configure this via '"
  + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
  + "' in its configuration.");
}
for(int i = 0; i < nodes.length; i++) {
  int j = 0;
  for(; j < original.length && !nodes[i].equals(original[j]); j++);
  if (j == original.length) {
return i;
  }
}
throw new IOException("Failed: new datanode not found: nodes="
+ Arrays.asList(nodes) + ", original=" + Arrays.asList(original));
  }
{code}
client warn logs is:
 {code:java}

WARN [DataStreamer for file 
/home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
 block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception

java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
 
DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
 
DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
 
DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
 
original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
 
DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
 The current failed datanode replacement policy is DEFAULT, and a client may 
configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' 
in its configuration.
{code}

  was:
Consider following situation:
1. create a file with ALLSSD policy

2. return [SSD,SSD,DISK] due to lack of SSD space

3. client call NameNodeRpcServer#getAdditionalDatanode when recovering write 
pipeline and replacing bad datanode

4. BlockPlacementPolicyDefault#chooseTarget will call 
StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
chooseStorageTypes return [SSD,SSD]

5. do numOfReplicas = requiredStorageTypes.size() and numOfReplicas is set to 2 
and choose additional two datanodes

6. BlockPlacementPolicyDefault#chooseTarget return four datanodes to client

7. DataStreamer#findNewDatanode find nodes.length != original.length + 1  and 
throw IOException, and finally lead to write failed

client warn logs is:
 {code:java}

WARN [DataStreamer for file 
/home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
 block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception

java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
 
DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],

[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-09-13 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-13915:
-
Description: 
Consider following situation:
1. create a file with ALLSSD policy

2. return [SSD,SSD,DISK] due to lack of SSD space

3. client call NameNodeRpcServer#getAdditionalDatanode when recovering write 
pipeline and replacing bad datanode

4. BlockPlacementPolicyDefault#chooseTarget will call 
StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
chooseStorageTypes return [SSD,SSD]

5. do numOfReplicas = requiredStorageTypes.size() and numOfReplicas is set to 2 
and choose additional two datanodes

6. BlockPlacementPolicyDefault#chooseTarget return four datanodes to client

7. DataStreamer#findNewDatanode find nodes.length != original.length + 1  and 
throw IOException, and finally lead to write failed

client warn logs is:
 {code:java}

WARN [DataStreamer for file 
/home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
 block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception

java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
 
DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
 
DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
 
DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
 
original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
 
DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
 The current failed datanode replacement policy is DEFAULT, and a client may 
configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' 
in its configuration.
{code}

  was:
Consider following situation:
1. create a file with ALLSSD policy

2. return [SSD,SSD,DISK] due to lack of SSD space

3. client call NameNodeRpcServer#getAdditionalDatanode when recovering write 
pipeline and replacing bad datanode

4. BlockPlacementPolicyDefault#chooseTarget will call 
StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
chooseStorageTypes return [SSD,SSD]

5. do numOfReplicas = requiredStorageTypes.size() and numOfReplicas is set to 2 
and choose additional two datanodes

6. BlockPlacementPolicyDefault#chooseTarget return four datanodes to client

7. DataStreamer#findNewDatanode find nodes.length != original.length + 1  and 
throw IOException, and finally lead to write failed

client warn logs is:
 \{code:java}

WARN [DataStreamer for file 
/home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
 block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception

java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
 
DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
 
DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
 
DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
 
original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
 
DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
 The current failed datanode replacement policy is DEFAULT, and a client may 
configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' 
in its configuration.

{code}


> replace datanode failed because of  NameNodeRpcServer#getAdditionalDatanode 
> returning excessive datanodeInfo
> 
>
> Key: HDFS-13915
> URL: https://issues.apache.org/jira/browse/HDFS-13915
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
>
> Consider the following situation:
> 1. create a file with the ALLSSD policy
> 2. [SSD,SSD,DISK] is returned due to lack of SSD space
> 3. the client calls NameNodeRpcServer#getAdditionalDatanode when recovering the 
> write pipeline and replacing a bad datanode
> 4. BlockPlacementPolicyDefault#chooseTarget calls 
> StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
> chooseStorageTypes returns [SSD,SSD]
> 5. do numOfReplicas = 

[jira] [Created] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo

2018-09-13 Thread Jiandan Yang (JIRA)
Jiandan Yang  created HDFS-13915:


 Summary: replace datanode failed because of  
NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
 Key: HDFS-13915
 URL: https://issues.apache.org/jira/browse/HDFS-13915
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
 Environment: 

Reporter: Jiandan Yang 
Assignee: Jiandan Yang 


Consider the following situation:
1. create a file with the ALLSSD policy

2. [SSD,SSD,DISK] is returned due to lack of SSD space

3. the client calls NameNodeRpcServer#getAdditionalDatanode when recovering the 
write pipeline and replacing a bad datanode

4. BlockPlacementPolicyDefault#chooseTarget calls 
StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but 
chooseStorageTypes returns [SSD,SSD]

5. numOfReplicas = requiredStorageTypes.size() sets numOfReplicas to 2, so two 
additional datanodes are chosen

6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client

7. DataStreamer#findNewDatanode finds nodes.length != original.length + 1 and 
throws an IOException, which finally leads to a write failure

The client warn log is:
{code:java}

WARN [DataStreamer for file 
/home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545
 block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] 
org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception

java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
 
DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
 
DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD],
 
DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]],
 
original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD],
 
DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
 The current failed datanode replacement policy is DEFAULT, and a client may 
configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' 
in its configuration.

{code}
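
Until the placement policy is fixed, one possible client-side mitigation (a workaround, not a fix, and only acceptable if temporarily reduced pipeline redundancy is tolerable for the workload) is to enable the best-effort mode of the replace-datanode-on-failure feature:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class ReplaceDatanodeWorkaround {
  // With best-effort enabled the client keeps writing even when replacing the
  // bad datanode fails, instead of failing the write with the IOException above.
  public static Configuration withBestEffort(Configuration base) {
    Configuration conf = new Configuration(base);
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
    conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
    return conf;
  }
}
{code}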



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read

2018-03-04 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385686#comment-16385686
 ] 

Jiandan Yang  commented on HDFS-9666:
-

[~jzhuge] [~vinayrpet] Could you help me review it?

> Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to 
> improve random read
> -
>
> Key: HDFS-9666
> URL: https://issues.apache.org/jira/browse/HDFS-9666
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.6.0, 2.7.0
>Reporter: ade
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch, 
> HDFS-9666.002.patch, HDFS-9666.003.patch, HDFS-9666.004.patch
>
>
> We want to improve the random read performance of HDFS for HBase, so we enabled 
> heterogeneous storage in our cluster. But only ~50% of the datanode & regionserver 
> hosts have SSD, so we can only set hfiles with the ONE_SSD storage policy (not 
> ALL_SSD), and a regionserver on a non-SSD host can only read the local disk 
> replica. So we developed this feature in the hdfs client to read a remote SSD/RAM 
> replica in preference to the local disk replica.
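
A self-contained sketch of the selection idea (illustrative only, not the attached patch): order replicas by storage medium first and use network distance only as a tie-breaker.

{code:java}
import java.util.Comparator;
import java.util.List;

public class FastMediaFirst {
  enum Media { RAM_DISK, SSD, DISK, ARCHIVE }

  static class Replica {
    final String datanode;
    final Media media;
    final int networkDistance; // 0 = local, larger = farther away
    Replica(String datanode, Media media, int networkDistance) {
      this.datanode = datanode;
      this.media = media;
      this.networkDistance = networkDistance;
    }
  }

  // RAM_DISK/SSD replicas come first (enum order), even when remote; among
  // replicas on the same medium, the closest node wins.
  static void sortForRead(List<Replica> replicas) {
    replicas.sort(Comparator
        .comparingInt((Replica r) -> r.media.ordinal())
        .thenComparingInt(r -> r.networkDistance));
  }
}
{code}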



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read

2018-03-02 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383667#comment-16383667
 ] 

Jiandan Yang  commented on HDFS-9666:
-

The failed UTs are not caused by this patch, and I ran these UTs successfully on my 
local machine.
[~vinodkv] [~arpitagarwal] Please help me review this patch. Thanks.

> Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to 
> improve random read
> -
>
> Key: HDFS-9666
> URL: https://issues.apache.org/jira/browse/HDFS-9666
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.6.0, 2.7.0
>Reporter: ade
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch, 
> HDFS-9666.002.patch, HDFS-9666.003.patch, HDFS-9666.004.patch
>
>
> We want to improve the random read performance of HDFS for HBase, so we enabled 
> heterogeneous storage in our cluster. But only ~50% of the datanode & regionserver 
> hosts have SSD, so we can only set hfiles with the ONE_SSD storage policy (not 
> ALL_SSD), and a regionserver on a non-SSD host can only read the local disk 
> replica. So we developed this feature in the hdfs client to read a remote SSD/RAM 
> replica in preference to the local disk replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read

2018-03-02 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383359#comment-16383359
 ] 

Jiandan Yang  commented on HDFS-9666:
-

Upload v4 patch:
set refetchIfRequired=true when calling chooseDataNode in fetchBlockByteRange and 
fix a UT error.

> Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to 
> improve random read
> -
>
> Key: HDFS-9666
> URL: https://issues.apache.org/jira/browse/HDFS-9666
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.6.0, 2.7.0
>Reporter: ade
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch, 
> HDFS-9666.002.patch, HDFS-9666.003.patch, HDFS-9666.004.patch
>
>
> We want to improve the random read performance of HDFS for HBase, so we enabled 
> heterogeneous storage in our cluster. But only ~50% of the datanode & regionserver 
> hosts have SSD, so we can only set hfiles with the ONE_SSD storage policy (not 
> ALL_SSD), and a regionserver on a non-SSD host can only read the local disk 
> replica. So we developed this feature in the hdfs client to read a remote SSD/RAM 
> replica in preference to the local disk replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read

2018-03-02 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-9666:

Attachment: HDFS-9666.004.patch

> Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to 
> improve random read
> -
>
> Key: HDFS-9666
> URL: https://issues.apache.org/jira/browse/HDFS-9666
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.6.0, 2.7.0
>Reporter: ade
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch, 
> HDFS-9666.002.patch, HDFS-9666.003.patch, HDFS-9666.004.patch
>
>
> We want to improve the random read performance of HDFS for HBase, so we enabled 
> heterogeneous storage in our cluster. But only ~50% of the datanode & regionserver 
> hosts have SSD, so we can only set hfiles with the ONE_SSD storage policy (not 
> ALL_SSD), and a regionserver on a non-SSD host can only read the local disk 
> replica. So we developed this feature in the hdfs client to read a remote SSD/RAM 
> replica in preference to the local disk replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read

2018-03-01 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-9666:

Attachment: HDFS-9666.003.patch

> Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to 
> improve random read
> -
>
> Key: HDFS-9666
> URL: https://issues.apache.org/jira/browse/HDFS-9666
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.6.0, 2.7.0
>Reporter: ade
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch, 
> HDFS-9666.002.patch, HDFS-9666.003.patch
>
>
> We want to improve the random read performance of HDFS for HBase, so we enabled 
> heterogeneous storage in our cluster. But only ~50% of the datanode & regionserver 
> hosts have SSD, so we can only set hfiles with the ONE_SSD storage policy (not 
> ALL_SSD), and a regionserver on a non-SSD host can only read the local disk 
> replica. So we developed this feature in the hdfs client to read a remote SSD/RAM 
> replica in preference to the local disk replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read

2018-03-01 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383168#comment-16383168
 ] 

Jiandan Yang  commented on HDFS-9666:
-

Fix a compiler error and upload the v2 patch.

> Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to 
> improve random read
> -
>
> Key: HDFS-9666
> URL: https://issues.apache.org/jira/browse/HDFS-9666
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.6.0, 2.7.0
>Reporter: ade
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch, 
> HDFS-9666.002.patch
>
>
> We want to improve the random read performance of HDFS for HBase, so we enabled 
> heterogeneous storage in our cluster. But only ~50% of the datanode & regionserver 
> hosts have SSD, so we can only set hfiles with the ONE_SSD storage policy (not 
> ALL_SSD), and a regionserver on a non-SSD host can only read the local disk 
> replica. So we developed this feature in the hdfs client to read a remote SSD/RAM 
> replica in preference to the local disk replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read

2018-03-01 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-9666:

Attachment: HDFS-9666.002.patch

> Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to 
> improve random read
> -
>
> Key: HDFS-9666
> URL: https://issues.apache.org/jira/browse/HDFS-9666
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.6.0, 2.7.0
>Reporter: ade
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch, 
> HDFS-9666.002.patch
>
>
> We want to improve the random read performance of HDFS for HBase, so we enabled 
> heterogeneous storage in our cluster. But only ~50% of the datanode & regionserver 
> hosts have SSD, so we can only set hfiles with the ONE_SSD storage policy (not 
> ALL_SSD), and a regionserver on a non-SSD host can only read the local disk 
> replica. So we developed this feature in the hdfs client to read a remote SSD/RAM 
> replica in preference to the local disk replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read

2018-03-01 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-9666:

Status: Patch Available  (was: Open)

> Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to 
> improve random read
> -
>
> Key: HDFS-9666
> URL: https://issues.apache.org/jira/browse/HDFS-9666
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.7.0, 2.6.0
>Reporter: ade
>Assignee: ade
>Priority: Major
> Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch
>
>
> We want to improve the random read performance of HDFS for HBase, so we enabled 
> heterogeneous storage in our cluster. But only ~50% of the datanode & regionserver 
> hosts have SSD, so we can only set hfiles with the ONE_SSD storage policy (not 
> ALL_SSD), and a regionserver on a non-SSD host can only read the local disk 
> replica. So we developed this feature in the hdfs client to read a remote SSD/RAM 
> replica in preference to the local disk replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read

2018-03-01 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  reassigned HDFS-9666:
---

Assignee: Jiandan Yang   (was: ade)

> Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to 
> improve random read
> -
>
> Key: HDFS-9666
> URL: https://issues.apache.org/jira/browse/HDFS-9666
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.6.0, 2.7.0
>Reporter: ade
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch
>
>
> We want to improve the random read performance of HDFS for HBase, so we enabled 
> heterogeneous storage in our cluster. But only ~50% of the datanode & regionserver 
> hosts have SSD, so we can only set hfiles with the ONE_SSD storage policy (not 
> ALL_SSD), and a regionserver on a non-SSD host can only read the local disk 
> replica. So we developed this feature in the hdfs client to read a remote SSD/RAM 
> replica in preference to the local disk replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read

2018-03-01 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383083#comment-16383083
 ] 

Jiandan Yang  commented on HDFS-9666:
-

Upload v1 patch based on trunk.

> Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to 
> improve random read
> -
>
> Key: HDFS-9666
> URL: https://issues.apache.org/jira/browse/HDFS-9666
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.6.0, 2.7.0
>Reporter: ade
>Assignee: ade
>Priority: Major
> Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch
>
>
> We want to improve the random read performance of HDFS for HBase, so we enabled 
> heterogeneous storage in our cluster. But only ~50% of the datanode & regionserver 
> hosts have SSD, so we can only set hfiles with the ONE_SSD storage policy (not 
> ALL_SSD), and a regionserver on a non-SSD host can only read the local disk 
> replica. So we developed this feature in the hdfs client to read a remote SSD/RAM 
> replica in preference to the local disk replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read

2018-03-01 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-9666:

Attachment: HDFS-9666.001.patch

> Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to 
> improve random read
> -
>
> Key: HDFS-9666
> URL: https://issues.apache.org/jira/browse/HDFS-9666
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.6.0, 2.7.0
>Reporter: ade
>Assignee: ade
>Priority: Major
> Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch
>
>
> We want to improve the random read performance of HDFS for HBase, so we enabled 
> heterogeneous storage in our cluster. But only ~50% of the datanode & regionserver 
> hosts have SSD, so we can only set hfiles with the ONE_SSD storage policy (not 
> ALL_SSD), and a regionserver on a non-SSD host can only read the local disk 
> replica. So we developed this feature in the hdfs client to read a remote SSD/RAM 
> replica in preference to the local disk replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11942) make new chooseDataNode policy work in more operation like seek, fetch

2018-02-27 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379902#comment-16379902
 ] 

Jiandan Yang  commented on HDFS-11942:
--

[~whisper_deng] This patch is very important for HBase. Why not keep it going?

> make new  chooseDataNode policy  work in more operation like seek, fetch
> 
>
> Key: HDFS-11942
> URL: https://issues.apache.org/jira/browse/HDFS-11942
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: 2.6.0, 2.7.0, 3.0.0-alpha3
>Reporter: Fangyuan Deng
>Priority: Major
> Fix For: 3.0.1
>
> Attachments: HDFS-11942.0.patch, HDFS-11942.1.patch, 
> ssd-first-disable(default).png, ssd-first-enable.png
>
>
> In the default policy, if a file is ONE_SSD, the client will read the local 
> disk replica in preference to the remote SSD replica.
> But now, PCI-e SSDs and 10G ethernet make reading a remote SSD faster than 
> the local disk.
> HDFS-9666 gives us a patch, but the code is not complete and has not been 
> updated for a long time.
> This sub-task issue provides a complete patch, and 
> we have tested it on three machines [ 32 core cpu, 128G mem, 1000M network, 
> 1.2T HDD, 800G SSD (Intel P3600) ].
> With this feature, the throughput of an HBase table (ONE_SSD) is double that 
> of a table without it.
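
For reference, the kind of setup the description assumes can be expressed roughly as follows (the table path is illustrative, and on older releases the call may need DistributedFileSystem instead of the FileSystem base class):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetOneSsdPolicy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      // Hypothetical HFile directory; adjust to the actual hbase.rootdir layout.
      Path tableDir = new Path("/hbase/data/default/usertable");
      fs.setStoragePolicy(tableDir, "ONE_SSD"); // one replica on SSD, the rest on disk
    }
  }
}
{code}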



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12814) Add blockId when warning slow mirror/disk in BlockReceiver

2017-11-15 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-12814:
-
Attachment: HDFS-12814.002.patch

> Add blockId when warning slow mirror/disk in BlockReceiver
> --
>
> Key: HDFS-12814
> URL: https://issues.apache.org/jira/browse/HDFS-12814
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Minor
> Attachments: HDFS-12814.001.patch, HDFS-12814.002.patch
>
>
> HDFS-11603 added downstream DataNodeIds and the volume path.
> To make debugging easier, those warning logs should also include the blockId.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12814) Add blockId when warning slow mirror/disk in BlockReceiver

2017-11-15 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254631#comment-16254631
 ] 

Jiandan Yang  commented on HDFS-12814:
--

Thanks [~cheersyang] and [~msingh] for the review. I will add the comma delimiter 
and upload the v2 patch.

> Add blockId when warning slow mirror/disk in BlockReceiver
> --
>
> Key: HDFS-12814
> URL: https://issues.apache.org/jira/browse/HDFS-12814
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Minor
> Attachments: HDFS-12814.001.patch
>
>
> HDFS-11603 added downstream DataNodeIds and the volume path.
> To make debugging easier, those warning logs should also include the blockId.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12814) Add blockId when warning slow mirror/disk in BlockReceiver

2017-11-14 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-12814:
-
Status: Patch Available  (was: Open)

> Add blockId when warning slow mirror/disk in BlockReceiver
> --
>
> Key: HDFS-12814
> URL: https://issues.apache.org/jira/browse/HDFS-12814
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Minor
> Attachments: HDFS-12814.001.patch
>
>
> HDFS-11603 added downstream DataNodeIds and the volume path.
> To make debugging easier, those warning logs should also include the blockId.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12814) Add blockId when warning slow mirror/disk in BlockReceiver

2017-11-14 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-12814:
-
Attachment: HDFS-12814.001.patch

> Add blockId when warning slow mirror/disk in BlockReceiver
> --
>
> Key: HDFS-12814
> URL: https://issues.apache.org/jira/browse/HDFS-12814
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Minor
> Attachments: HDFS-12814.001.patch
>
>
> HDFS-11603 added downstream DataNodeIds and the volume path.
> To make debugging easier, those warning logs should also include the blockId.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-12814) Add blockId when warning slow mirror/disk in BlockReceiver

2017-11-14 Thread Jiandan Yang (JIRA)
Jiandan Yang  created HDFS-12814:


 Summary: Add blockId when warning slow mirror/disk in BlockReceiver
 Key: HDFS-12814
 URL: https://issues.apache.org/jira/browse/HDFS-12814
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Reporter: Jiandan Yang 
Assignee: Jiandan Yang 
Priority: Minor


HDFS-11603 added downstream DataNodeIds and the volume path.
To make debugging easier, those warning logs should also include the blockId.
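
An illustrative sketch of the kind of warning this asks for (the class, field, and message text here are hypothetical, not the actual BlockReceiver code):

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SlowMirrorWarning {
  private static final Logger LOG = LoggerFactory.getLogger(SlowMirrorWarning.class);

  // Including the block id lets a slow pipeline be tied back to a specific block.
  void warnSlowMirror(String blockId, String downstreamDatanodes,
      long elapsedMs, long thresholdMs) {
    if (elapsedMs > thresholdMs) {
      LOG.warn("Slow write to mirror took {}ms (threshold={}ms), block: {}, "
          + "downstream DNs: {}", elapsedMs, thresholdMs, blockId, downstreamDatanodes);
    }
  }
}
{code}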



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12754) Lease renewal can hit a deadlock

2017-11-14 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252859#comment-16252859
 ] 

Jiandan Yang  edited comment on HDFS-12754 at 11/15/17 2:41 AM:


[~xiaochen] Thank you for reviewing. 

{quote}The fix here is to close the output streams out of the lease renewer 
lock{quote}

I think that is not quite accurate. The fix is that {{LeaseRenewer#run}} no longer 
holds the {{LeaseRenewer}} object lock and the {{DFSOutputStream}} object lock at 
the same time: it moves dfsClient.closeAllFilesBeingWritten out of the synchronized 
block, so {{LeaseRenewer#run}} acquires and releases the {{LeaseRenewer}} lock 
first, and only afterwards acquires and releases the {{DFSOutputStream}} lock.

{code:java}
synchronized (this) {
  DFSClientFaultInjector.get().sleepWhenRenewLeaseTimeout();
  dfsclientsCopy = new ArrayList<>(dfsclients);
  dfsclients.clear();
  // Expire the current LeaseRenewer thread.
  emptyTime = 0;
  Factory.INSTANCE.remove(LeaseRenewer.this);
}
for (DFSClient dfsClient : dfsclientsCopy) {
  dfsClient.closeAllFilesBeingWritten(true);
}
{code}





was (Author: yangjiandan):
[~xiaochen] Thank you for reviewing. 

@ The fix here is to close the output streams out of the lease renewer lock

I think you may be wrong. The fix is {{LeaseRenewer#run}} does not hold  
{{LeaseRenewer}} object lock and {{DFSOutputStream}} object lock  at the same 
time, removes dfsClient.closeAllFilesBeingWritten out of synchronized block.  
{{LeaseRenewer#run}} gets {{LeaseRenewer}} object lock and then releases, gets 
{{DFSOutputStream}} object lock and releases.

{code:java}
synchronized (this) {
DFSClientFaultInjector.get().sleepWhenRenewLeaseTimeout();
dfsclientsCopy = new ArrayList<>(dfsclients);
dfsclients.clear();
//Expire the current LeaseRenewer thread.
emptyTime = 0;
Factory.INSTANCE.remove(LeaseRenewer.this);
  }
  for (DFSClient dfsClient : dfsclientsCopy) {
dfsClient.closeAllFilesBeingWritten(true);
  }
{code}




> Lease renewal can hit a deadlock 
> -
>
> Key: HDFS-12754
> URL: https://issues.apache.org/jira/browse/HDFS-12754
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.8.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: HDFS-12754.001.patch, HDFS-12754.002.patch, 
> HDFS-12754.003.patch, HDFS-12754.004.patch, HDFS-12754.005.patch, 
> HDFS-12754.006.patch, HDFS-12754.007.patch
>
>
> The client and the renewer can hit a deadlock during the close operation, since 
> closeFile() reaches back to DFSClient#removeFileBeingWritten. This is 
> possible if the client calls close while the renewer is renewing a lease.
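
For readers unfamiliar with this failure mode, the following is a minimal, generic illustration of the circular wait described above; the classes and locks are invented for the example and do not reproduce the actual DFSClient/LeaseRenewer code paths.

{code:java}
// Generic lock-ordering deadlock sketch (hypothetical classes, not HDFS code):
// one thread holds the stream lock and wants the renewer lock, while the other
// holds the renewer lock and wants the stream lock, so neither can proceed.
public class LockOrderingDeadlockSketch {
  private static final Object renewerLock = new Object();
  private static final Object streamLock = new Object();

  public static void main(String[] args) {
    Thread clientClose = new Thread(() -> {
      synchronized (streamLock) {          // like a client closing an output stream
        sleep(100);
        synchronized (renewerLock) {       // like reaching back into the renewer
          System.out.println("close finished");
        }
      }
    });
    Thread renewer = new Thread(() -> {
      synchronized (renewerLock) {         // like the renewer thread's main loop
        sleep(100);
        synchronized (streamLock) {        // like closing the client's open streams
          System.out.println("renew finished");
        }
      }
    });
    clientClose.start();
    renewer.start();                       // with the sleeps, this usually deadlocks
  }

  private static void sleep(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
  }
}
{code}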



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12754) Lease renewal can hit a deadlock

2017-11-14 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252859#comment-16252859
 ] 

Jiandan Yang  commented on HDFS-12754:
--

[~xiaochen] Thank you for reviewing. 

The fix here is to close the output streams out of the lease renewer lock

I think you may be wrong. The fix is that {{LeaseRenewer#run}} no longer holds the 
{{LeaseRenewer}} object lock and a {{DFSOutputStream}} object lock at the same 
time: dfsClient.closeAllFilesBeingWritten is moved out of the synchronized block, so 
{{LeaseRenewer#run}} first acquires and releases the {{LeaseRenewer}} lock, and only 
then acquires and releases each {{DFSOutputStream}} lock.

{code:java}
synchronized (this) {
  DFSClientFaultInjector.get().sleepWhenRenewLeaseTimeout();
  dfsclientsCopy = new ArrayList<>(dfsclients);
  dfsclients.clear();
  // Expire the current LeaseRenewer thread.
  emptyTime = 0;
  Factory.INSTANCE.remove(LeaseRenewer.this);
}
for (DFSClient dfsClient : dfsclientsCopy) {
  dfsClient.closeAllFilesBeingWritten(true);
}
{code}
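
With this arrangement the {{LeaseRenewer}} lock is released before any {{DFSOutputStream}} lock is taken: the client list is copied and cleared inside the synchronized block, and the per-stream close calls run on the copy afterwards, so the renewer thread and a closing client thread can no longer wait on each other's monitors.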




> Lease renewal can hit a deadlock 
> -
>
> Key: HDFS-12754
> URL: https://issues.apache.org/jira/browse/HDFS-12754
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.8.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: HDFS-12754.001.patch, HDFS-12754.002.patch, 
> HDFS-12754.003.patch, HDFS-12754.004.patch, HDFS-12754.005.patch, 
> HDFS-12754.006.patch, HDFS-12754.007.patch
>
>
> The client and the renewer can hit a deadlock during the close operation, since 
> closeFile() reaches back to DFSClient#removeFileBeingWritten. This is 
> possible if the client calls close while the renewer is renewing a lease.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12754) Lease renewal can hit a deadlock

2017-11-14 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252859#comment-16252859
 ] 

Jiandan Yang  edited comment on HDFS-12754 at 11/15/17 2:35 AM:


[~xiaochen] Thank you for reviewing. 

@ The fix here is to close the output streams out of the lease renewer lock

I think you may be wrong. The fix is that {{LeaseRenewer#run}} no longer holds the 
{{LeaseRenewer}} object lock and a {{DFSOutputStream}} object lock at the same 
time: dfsClient.closeAllFilesBeingWritten is moved out of the synchronized block, so 
{{LeaseRenewer#run}} first acquires and releases the {{LeaseRenewer}} lock, and only 
then acquires and releases each {{DFSOutputStream}} lock.

{code:java}
synchronized (this) {
  DFSClientFaultInjector.get().sleepWhenRenewLeaseTimeout();
  dfsclientsCopy = new ArrayList<>(dfsclients);
  dfsclients.clear();
  // Expire the current LeaseRenewer thread.
  emptyTime = 0;
  Factory.INSTANCE.remove(LeaseRenewer.this);
}
for (DFSClient dfsClient : dfsclientsCopy) {
  dfsClient.closeAllFilesBeingWritten(true);
}
{code}





was (Author: yangjiandan):
[~xiaochen] Thank you for reviewing. 

@The fix here is to close the output streams out of the lease renewer lock

I think you may be wrong. The fix is {{LeaseRenewer#run}} does not hold  
{{LeaseRenewer}} object lock and {{DFSOutputStream}} object lock  at the same 
time, removes dfsClient.closeAllFilesBeingWritten out of synchronized block.  
{{LeaseRenewer#run}} gets {{LeaseRenewer}} object lock and then releases, gets 
{{DFSOutputStream}} object lock and releases.

{code:java}
synchronized (this) {
DFSClientFaultInjector.get().sleepWhenRenewLeaseTimeout();
dfsclientsCopy = new ArrayList<>(dfsclients);
dfsclients.clear();
//Expire the current LeaseRenewer thread.
emptyTime = 0;
Factory.INSTANCE.remove(LeaseRenewer.this);
  }
  for (DFSClient dfsClient : dfsclientsCopy) {
dfsClient.closeAllFilesBeingWritten(true);
  }
{code}




> Lease renewal can hit a deadlock 
> -
>
> Key: HDFS-12754
> URL: https://issues.apache.org/jira/browse/HDFS-12754
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.8.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: HDFS-12754.001.patch, HDFS-12754.002.patch, 
> HDFS-12754.003.patch, HDFS-12754.004.patch, HDFS-12754.005.patch, 
> HDFS-12754.006.patch, HDFS-12754.007.patch
>
>
> The client and the renewer can hit a deadlock during the close operation, since 
> closeFile() reaches back to DFSClient#removeFileBeingWritten. This is 
> possible if the client calls close while the renewer is renewing a lease.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12754) Lease renewal can hit a deadlock

2017-11-14 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252859#comment-16252859
 ] 

Jiandan Yang  edited comment on HDFS-12754 at 11/15/17 2:34 AM:


[~xiaochen] Thank you for reviewing. 

@The fix here is to close the output streams out of the lease renewer lock

I think you may be wrong. The fix is that {{LeaseRenewer#run}} no longer holds the 
{{LeaseRenewer}} object lock and a {{DFSOutputStream}} object lock at the same 
time: dfsClient.closeAllFilesBeingWritten is moved out of the synchronized block, so 
{{LeaseRenewer#run}} first acquires and releases the {{LeaseRenewer}} lock, and only 
then acquires and releases each {{DFSOutputStream}} lock.

{code:java}
synchronized (this) {
  DFSClientFaultInjector.get().sleepWhenRenewLeaseTimeout();
  dfsclientsCopy = new ArrayList<>(dfsclients);
  dfsclients.clear();
  // Expire the current LeaseRenewer thread.
  emptyTime = 0;
  Factory.INSTANCE.remove(LeaseRenewer.this);
}
for (DFSClient dfsClient : dfsclientsCopy) {
  dfsClient.closeAllFilesBeingWritten(true);
}
{code}





was (Author: yangjiandan):
[~xiaochen] Thank you for reviewing. 

The fix here is to close the output streams out of the lease renewer lock

I think you may be wrong. The fix is {{LeaseRenewer#run}} does not hold  
{{LeaseRenewer}} object lock and {{DFSOutputStream}} object lock  at the same 
time, removes dfsClient.closeAllFilesBeingWritten out of synchronized block.  
{{LeaseRenewer#run}} gets {{LeaseRenewer}} object lock and then releases, gets 
{{DFSOutputStream}} object lock and releases.

{code:java}
synchronized (this) {
DFSClientFaultInjector.get().sleepWhenRenewLeaseTimeout();
dfsclientsCopy = new ArrayList<>(dfsclients);
dfsclients.clear();
//Expire the current LeaseRenewer thread.
emptyTime = 0;
Factory.INSTANCE.remove(LeaseRenewer.this);
  }
  for (DFSClient dfsClient : dfsclientsCopy) {
dfsClient.closeAllFilesBeingWritten(true);
  }
{code}




> Lease renewal can hit a deadlock 
> -
>
> Key: HDFS-12754
> URL: https://issues.apache.org/jira/browse/HDFS-12754
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.8.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: HDFS-12754.001.patch, HDFS-12754.002.patch, 
> HDFS-12754.003.patch, HDFS-12754.004.patch, HDFS-12754.005.patch, 
> HDFS-12754.006.patch, HDFS-12754.007.patch
>
>
> The client and the renewer can hit a deadlock during the close operation, since 
> closeFile() reaches back to DFSClient#removeFileBeingWritten. This is 
> possible if the client calls close while the renewer is renewing a lease.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12754) Lease renewal can hit a deadlock

2017-11-08 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243633#comment-16243633
 ] 

Jiandan Yang  commented on HDFS-12754:
--

The v2 patch fixes the deadlock and looks good to me. [~elgoiri] [~cheersyang] 
[~kihwal], do you have any opinion?

> Lease renewal can hit a deadlock 
> -
>
> Key: HDFS-12754
> URL: https://issues.apache.org/jira/browse/HDFS-12754
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.8.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: HDFS-12754.001.patch, HDFS-12754.002.patch
>
>
> The client and the renewer can hit a deadlock during the close operation, since 
> closeFile() reaches back to DFSClient#removeFileBeingWritten. This is 
> possible if the client calls close while the renewer is renewing a lease.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12754) Lease renewal can hit a deadlock

2017-11-02 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237063#comment-16237063
 ] 

Jiandan Yang  commented on HDFS-12754:
--

Hi [~kshukla], thanks for your patch. This patch can resolve the deadlock, but I 
have two comments:
1. DFSOutputStream#clientClosed can be removed; I think it duplicates 
DFSOutputStream#closed.
2. When LeaseRenewer#run(int) catches a SocketTimeoutException and runs the following 
code after clearing LeaseRenewer#dfsclients, a dfsClient may still create a file and add a 
new entry to LeaseRenewer#dfsclients.
{code:java}
for (DFSClient dfsClient : dfsclientsCopy) {
  dfsClient.closeAllFilesBeingWritten(true);
}
{code}


> Lease renewal can hit a deadlock 
> -
>
> Key: HDFS-12754
> URL: https://issues.apache.org/jira/browse/HDFS-12754
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.8.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Major
> Attachments: HDFS-12754.001.patch
>
>
> The client and the renewer can hit a deadlock during the close operation, since 
> closeFile() reaches back to DFSClient#removeFileBeingWritten. This is 
> possible if the client calls close while the renewer is renewing a lease.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12757) DeadLock Happened Between DFSOutputStream and LeaseRenewer when LeaseRenewer#renew SocketTimeException

2017-11-02 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-12757:
-
Attachment: HDFS-12757.patch

Uploaded a unit test to reproduce the issue.

> DeadLock Happened Between DFSOutputStream and LeaseRenewer when 
> LeaseRenewer#renew SocketTimeException
> --
>
> Key: HDFS-12757
> URL: https://issues.apache.org/jira/browse/HDFS-12757
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: HDFS-12757.patch
>
>
> Java stack is :
> {code:java}
> Found one Java-level deadlock:
> =
> "Topology-2 (735/2000)":
>   waiting to lock monitor 0x7fff4523e6e8 (object 0x0005d3521078, a 
> org.apache.hadoop.hdfs.client.impl.LeaseRenewer),
>   which is held by "LeaseRenewer:admin@na61storage"
> "LeaseRenewer:admin@na61storage":
>   waiting to lock monitor 0x7fff5d41e838 (object 0x0005ec0dfa88, a 
> org.apache.hadoop.hdfs.DFSOutputStream),
>   which is held by "Topology-2 (735/2000)"
> Java stack information for the threads listed above:
> ===
> "Topology-2 (735/2000)":
> at 
> org.apache.hadoop.hdfs.client.impl.LeaseRenewer.addClient(LeaseRenewer.java:227)
> - waiting to lock <0x0005d3521078> (a 
> org.apache.hadoop.hdfs.client.impl.LeaseRenewer)
> at 
> org.apache.hadoop.hdfs.client.impl.LeaseRenewer.getInstance(LeaseRenewer.java:86)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:467)
> at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:479)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.setClosed(DFSOutputStream.java:776)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeThreads(DFSOutputStream.java:791)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:848)
> - locked <0x0005ec0dfa88> (a 
> org.apache.hadoop.hdfs.DFSOutputStream)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:805)
> - locked <0x0005ec0dfa88> (a 
> org.apache.hadoop.hdfs.DFSOutputStream)
> at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
> at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
> ..
> "LeaseRenewer:admin@na61storage":
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:750)
> - waiting to lock <0x0005ec0dfa88> (a 
> org.apache.hadoop.hdfs.DFSOutputStream)
> at 
> org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:586)
> at 
> org.apache.hadoop.hdfs.client.impl.LeaseRenewer.run(LeaseRenewer.java:453)
> - locked <0x0005d3521078> (a 
> org.apache.hadoop.hdfs.client.impl.LeaseRenewer)
> at 
> org.apache.hadoop.hdfs.client.impl.LeaseRenewer.access$700(LeaseRenewer.java:76)
> at 
> org.apache.hadoop.hdfs.client.impl.LeaseRenewer$1.run(LeaseRenewer.java:310)
> at java.lang.Thread.run(Thread.java:834)
> Found 1 deadlock.
> {code}
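
Reading the stack above: the client thread {{Topology-2}} holds the {{DFSOutputStream}} monitor ({{0x0005ec0dfa88}}) inside close() and waits for the {{LeaseRenewer}} monitor ({{0x0005d3521078}}) in addClient(), while the {{LeaseRenewer:admin@na61storage}} thread holds the {{LeaseRenewer}} monitor inside run() and waits for the same {{DFSOutputStream}} monitor in abort(); that circular wait is exactly what jstack reports as the Java-level deadlock.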



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12757) DeadLock Happened Between DFSOutputStream and LeaseRenewer when LeaseRenewer#renew SocketTimeException

2017-11-02 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-12757:
-
Description: 
Java stack is :

{code:java}
Found one Java-level deadlock:
=
"Topology-2 (735/2000)":
  waiting to lock monitor 0x7fff4523e6e8 (object 0x0005d3521078, a 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer),
  which is held by "LeaseRenewer:admin@na61storage"
"LeaseRenewer:admin@na61storage":
  waiting to lock monitor 0x7fff5d41e838 (object 0x0005ec0dfa88, a 
org.apache.hadoop.hdfs.DFSOutputStream),
  which is held by "Topology-2 (735/2000)"

Java stack information for the threads listed above:
===
"Topology-2 (735/2000)":
at 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer.addClient(LeaseRenewer.java:227)
- waiting to lock <0x0005d3521078> (a 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer)
at 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer.getInstance(LeaseRenewer.java:86)
at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:467)
at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:479)
at 
org.apache.hadoop.hdfs.DFSOutputStream.setClosed(DFSOutputStream.java:776)
at 
org.apache.hadoop.hdfs.DFSOutputStream.closeThreads(DFSOutputStream.java:791)
at 
org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:848)
- locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
at 
org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:805)
- locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
..
"LeaseRenewer:admin@na61storage":
at 
org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:750)
- waiting to lock <0x0005ec0dfa88> (a 
org.apache.hadoop.hdfs.DFSOutputStream)
at 
org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:586)
at 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer.run(LeaseRenewer.java:453)
- locked <0x0005d3521078> (a 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer)
at 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer.access$700(LeaseRenewer.java:76)
at 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer$1.run(LeaseRenewer.java:310)
at java.lang.Thread.run(Thread.java:834)

Found 1 deadlock.
{code}

  was:
Java stack is :
Found one Java-level deadlock:
=
"Topology-2 (735/2000)":
  waiting to lock monitor 0x7fff4523e6e8 (object 0x0005d3521078, a 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer),
  which is held by "LeaseRenewer:admin@na61storage"
"LeaseRenewer:admin@na61storage":
  waiting to lock monitor 0x7fff5d41e838 (object 0x0005ec0dfa88, a 
org.apache.hadoop.hdfs.DFSOutputStream),
  which is held by "Topology-2 (735/2000)"

Java stack information for the threads listed above:
===
"Topology-2 (735/2000)":
at 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer.addClient(LeaseRenewer.java:227)
- waiting to lock <0x0005d3521078> (a 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer)
at 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer.getInstance(LeaseRenewer.java:86)
at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:467)
at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:479)
at 
org.apache.hadoop.hdfs.DFSOutputStream.setClosed(DFSOutputStream.java:776)
at 
org.apache.hadoop.hdfs.DFSOutputStream.closeThreads(DFSOutputStream.java:791)
at 
org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:848)
- locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
at 
org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:805)
- locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
..
"LeaseRenewer:admin@na61storage":
at 
org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:750)
- waiting to lock <0x0005ec0dfa88> (a 
org.apache.hadoop.hdfs.DFSOutputStream)
at 
org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:586)
at 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer.run(LeaseRenewer.java:453)
- locked <0x0005d3521078> (a 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer)
at 

[jira] [Created] (HDFS-12757) DeadLock Happened Between DFSOutputStream and LeaseRenewer when LeaseRenewer#renew SocketTimeException

2017-11-02 Thread Jiandan Yang (JIRA)
Jiandan Yang  created HDFS-12757:


 Summary: DeadLock Happened Between DFSOutputStream and 
LeaseRenewer when LeaseRenewer#renew SocketTimeException
 Key: HDFS-12757
 URL: https://issues.apache.org/jira/browse/HDFS-12757
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Reporter: Jiandan Yang 
Priority: Major


Java stack is :
Found one Java-level deadlock:
=
"Topology-2 (735/2000)":
  waiting to lock monitor 0x7fff4523e6e8 (object 0x0005d3521078, a 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer),
  which is held by "LeaseRenewer:admin@na61storage"
"LeaseRenewer:admin@na61storage":
  waiting to lock monitor 0x7fff5d41e838 (object 0x0005ec0dfa88, a 
org.apache.hadoop.hdfs.DFSOutputStream),
  which is held by "Topology-2 (735/2000)"

Java stack information for the threads listed above:
===
"Topology-2 (735/2000)":
at 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer.addClient(LeaseRenewer.java:227)
- waiting to lock <0x0005d3521078> (a 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer)
at 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer.getInstance(LeaseRenewer.java:86)
at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:467)
at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:479)
at 
org.apache.hadoop.hdfs.DFSOutputStream.setClosed(DFSOutputStream.java:776)
at 
org.apache.hadoop.hdfs.DFSOutputStream.closeThreads(DFSOutputStream.java:791)
at 
org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:848)
- locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
at 
org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:805)
- locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
..
"LeaseRenewer:admin@na61storage":
at 
org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:750)
- waiting to lock <0x0005ec0dfa88> (a 
org.apache.hadoop.hdfs.DFSOutputStream)
at 
org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:586)
at 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer.run(LeaseRenewer.java:453)
- locked <0x0005d3521078> (a 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer)
at 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer.access$700(LeaseRenewer.java:76)
at 
org.apache.hadoop.hdfs.client.impl.LeaseRenewer$1.run(LeaseRenewer.java:310)
at java.lang.Thread.run(Thread.java:834)

Found 1 deadlock.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-12748) Standby NameNode memory leak when accessing webhdfs GETHOMEDIRECTORY

2017-10-30 Thread Jiandan Yang (JIRA)
Jiandan Yang  created HDFS-12748:


 Summary: Standby NameNode memory leak when accessing webhdfs 
GETHOMEDIRECTORY
 Key: HDFS-12748
 URL: https://issues.apache.org/jira/browse/HDFS-12748
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Affects Versions: 2.8.2
Reporter: Jiandan Yang 


In our production environment, the standby NN often does full GC. Through MAT we 
found that the largest object is FileSystem$Cache, which contains 7,844,890 
DistributedFileSystem instances.
By viewing the call hierarchy of FileSystem.get(), I found that only 
NamenodeWebHdfsMethods#get calls FileSystem.get(). I don't know why a different 
DistributedFileSystem is created every time instead of getting a FileSystem from 
the cache.

{code:java}
case GETHOMEDIRECTORY: {
  final String js = JsonUtil.toJsonString("Path",
      FileSystem.get(conf != null ? conf : new Configuration())
          .getHomeDirectory().toUri().getPath());
  return Response.ok(js).type(MediaType.APPLICATION_JSON).build();
}
{code}
When we close the FileSystem in the GETHOMEDIRECTORY handler, the NN no longer does full GC.

{code:java}
case GETHOMEDIRECTORY: {
  FileSystem fs = null;
  try {
    fs = FileSystem.get(conf != null ? conf : new Configuration());
    final String js = JsonUtil.toJsonString("Path",
        fs.getHomeDirectory().toUri().getPath());
    return Response.ok(js).type(MediaType.APPLICATION_JSON).build();
  } finally {
    if (fs != null) {
      fs.close();
    }
  }
}
{code}
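
One plausible explanation, stated here as an assumption rather than something confirmed in this issue, is that FileSystem.get() caches instances keyed by scheme, authority, and the calling UserGroupInformation, and each webhdfs request runs under a freshly created UGI, so every GETHOMEDIRECTORY call adds another DistributedFileSystem to the cache. The sketch below (with a hypothetical NameNode URI) shows how distinct UGI objects defeat the cache:

{code:java}
// Sketch under the assumption above; the NameNode URI is hypothetical.
import java.net.URI;
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class FsCacheGrowthSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    URI uri = URI.create("hdfs://localhost:8020/");
    for (int i = 0; i < 3; i++) {
      // createRemoteUser returns a distinct UGI (distinct Subject) each time,
      // so each doAs sees a different FileSystem cache key.
      UserGroupInformation ugi = UserGroupInformation.createRemoteUser("admin");
      FileSystem fs = ugi.doAs(
          (PrivilegedExceptionAction<FileSystem>) () -> FileSystem.get(uri, conf));
      // Expect three different identity hash codes, i.e. three cached instances.
      System.out.println(System.identityHashCode(fs));
    }
  }
}
{code}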




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode

2017-10-29 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-7060:

Attachment: HDFS-7060.003.patch

> Avoid taking locks when sending heartbeats from the DataNode
> 
>
> Key: HDFS-7060
> URL: https://issues.apache.org/jira/browse/HDFS-7060
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haohui Mai
>Assignee: Xinwei Qin 
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7060-002.patch, HDFS-7060.000.patch, 
> HDFS-7060.001.patch, HDFS-7060.003.patch, complete_failed_qps.png, 
> sendHeartbeat.png
>
>
> We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} 
> when the DN is under heavy load of writes:
> {noformat}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
> - locked <0x000780612fd8> (a java.lang.Object)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
> - locked <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}
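
The general pattern for avoiding this kind of blocking, sketched below with invented types rather than the actual FsDatasetImpl changes, is to snapshot the volume list under a brief lock and compute per-volume usage outside it, so the heartbeat never waits behind a long write-path operation:

{code:java}
// Generic pattern sketch (hypothetical types, not the HDFS-7060 patch):
// hold a short lock only to copy the volume list, then read usage lock-free.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class VolumeStatsSketch {
  static class Volume {
    final AtomicLong dfsUsed = new AtomicLong();  // writers update this without the dataset lock
    long getDfsUsed() { return dfsUsed.get(); }
  }

  private final List<Volume> volumes = new ArrayList<>();

  // Heartbeat path: only the list copy happens under the monitor.
  long[] getStorageReports() {
    List<Volume> snapshot;
    synchronized (this) {
      snapshot = new ArrayList<>(volumes);
    }
    long[] used = new long[snapshot.size()];
    for (int i = 0; i < snapshot.size(); i++) {
      used[i] = snapshot.get(i).getDfsUsed();     // no dataset-wide lock held here
    }
    return used;
  }
}
{code}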



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode

2017-10-29 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-7060:

Attachment: (was: HDFS-7060-003.patch)

> Avoid taking locks when sending heartbeats from the DataNode
> 
>
> Key: HDFS-7060
> URL: https://issues.apache.org/jira/browse/HDFS-7060
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haohui Mai
>Assignee: Xinwei Qin 
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7060-002.patch, HDFS-7060.000.patch, 
> HDFS-7060.001.patch, complete_failed_qps.png, sendHeartbeat.png
>
>
> We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} 
> when the DN is under heavy load of writes:
> {noformat}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
> - locked <0x000780612fd8> (a java.lang.Object)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
> - locked <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode

2017-10-29 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221789#comment-16221789
 ] 

Jiandan Yang  edited comment on HDFS-7060 at 10/30/17 2:55 AM:
---

We have deployed Hadoop including the HDFS-7060.003.patch and finished the 
rolling upgrade at 22. The sendHeartbeat latency decreased significantly, and the 
QPS of complete_failed also went down, as shown in the following graphs.

DataNode sendHeartbeat latency:
!sendHeartbeat.png|datanode_sendHeartbeat_latency!

Namenode complete failed QPS
!complete_failed_qps.png|complete_failed_qps!


was (Author: yangjiandan):
We have deploy Hadoop including patch of HDFS-7060-003.patch, sendHeartbeat 
latency decreased significantly, and QPS of complete_failed  also down.

DataNode sendHeartbeat latency:
!sendHeartbeat.png|datanode_sendHeartbeat_latency!

Namenode complete failed QPS
!complete_failed_qps.png|complete_failed_qps!

> Avoid taking locks when sending heartbeats from the DataNode
> 
>
> Key: HDFS-7060
> URL: https://issues.apache.org/jira/browse/HDFS-7060
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haohui Mai
>Assignee: Xinwei Qin 
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7060-002.patch, HDFS-7060-003.patch, 
> HDFS-7060.000.patch, HDFS-7060.001.patch, complete_failed_qps.png, 
> sendHeartbeat.png
>
>
> We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} 
> when the DN is under heavy load of writes:
> {noformat}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
> - locked <0x000780612fd8> (a java.lang.Object)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
> - locked <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode

2017-10-27 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221789#comment-16221789
 ] 

Jiandan Yang  edited comment on HDFS-7060 at 10/27/17 6:38 AM:
---

We have deployed Hadoop including the HDFS-7060-003.patch; the sendHeartbeat 
latency decreased significantly, and the QPS of complete_failed also went down.

DataNode sendHeartbeat latency:
!sendHeartbeat.png|datanode_sendHeartbeat_latency!

Namenode complete failed QPS
!complete_failed_qps.png|complete_failed_qps!


was (Author: yangjiandan):
We have deploy Hadoop including patch of HDFS-7060-003.patch, sendHeartbeat 
latency decreased significantly, and QPS of complete_failed  also down
!sendHeartbeat.png|datanode_sendHeartbeat_latency!
!complete_failed_qps.png|complete_failed_qps!

> Avoid taking locks when sending heartbeats from the DataNode
> 
>
> Key: HDFS-7060
> URL: https://issues.apache.org/jira/browse/HDFS-7060
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haohui Mai
>Assignee: Xinwei Qin 
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7060-002.patch, HDFS-7060-003.patch, 
> HDFS-7060.000.patch, HDFS-7060.001.patch, complete_failed_qps.png, 
> sendHeartbeat.png
>
>
> We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} 
> when the DN is under heavy load of writes:
> {noformat}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
> - locked <0x000780612fd8> (a java.lang.Object)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
> - locked <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For 

[jira] [Updated] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode

2017-10-27 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-7060:

Attachment: complete_failed_qps.png
sendHeartbeat.png

> Avoid taking locks when sending heartbeats from the DataNode
> 
>
> Key: HDFS-7060
> URL: https://issues.apache.org/jira/browse/HDFS-7060
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haohui Mai
>Assignee: Xinwei Qin 
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7060-002.patch, HDFS-7060-003.patch, 
> HDFS-7060.000.patch, HDFS-7060.001.patch, complete_failed_qps.png, 
> sendHeartbeat.png
>
>
> We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} 
> when the DN is under heavy load of writes:
> {noformat}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
> - locked <0x000780612fd8> (a java.lang.Object)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
> - locked <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode

2017-10-27 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221789#comment-16221789
 ] 

Jiandan Yang  edited comment on HDFS-7060 at 10/27/17 6:36 AM:
---

We have deployed Hadoop including the HDFS-7060-003.patch; the sendHeartbeat 
latency decreased significantly, and the QPS of complete_failed also went down.
!sendHeartbeat.png|datanode_sendHeartbeat_latency!
!complete_failed_qps.png|complete_failed_qps!


was (Author: yangjiandan):
We have deploy Hadoop including patch of HDFS-7060-003.patch, sendHeartbeat 
latency decreased significantly, and QPS of complete_failed  also down
!sendHeartbeat.jpg|datanode_sendHeartbeat_latency!
!complete_failed_qps.jpg|complete_failed_qps!

> Avoid taking locks when sending heartbeats from the DataNode
> 
>
> Key: HDFS-7060
> URL: https://issues.apache.org/jira/browse/HDFS-7060
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haohui Mai
>Assignee: Xinwei Qin 
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7060-002.patch, HDFS-7060-003.patch, 
> HDFS-7060.000.patch, HDFS-7060.001.patch
>
>
> We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} 
> when the DN is under heavy load of writes:
> {noformat}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
> - locked <0x000780612fd8> (a java.lang.Object)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
> - locked <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode

2017-10-27 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221789#comment-16221789
 ] 

Jiandan Yang  commented on HDFS-7060:
-

We have deployed Hadoop including the HDFS-7060-003.patch; the sendHeartbeat 
latency decreased significantly, and the QPS of complete_failed also went down.
!sendHeartbeat.jpg|datanode_sendHeartbeat_latency!
!complete_failed_qps.jpg|complete_failed_qps!

> Avoid taking locks when sending heartbeats from the DataNode
> 
>
> Key: HDFS-7060
> URL: https://issues.apache.org/jira/browse/HDFS-7060
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haohui Mai
>Assignee: Xinwei Qin 
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7060-002.patch, HDFS-7060-003.patch, 
> HDFS-7060.000.patch, HDFS-7060.001.patch
>
>
> We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} 
> when the DN is under heavy load of writes:
> {noformat}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
> - locked <0x000780612fd8> (a java.lang.Object)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
> - locked <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode

2017-10-27 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221765#comment-16221765
 ] 

Jiandan Yang  commented on HDFS-7060:
-

Uploaded HDFS-7060-003.patch.

> Avoid taking locks when sending heartbeats from the DataNode
> 
>
> Key: HDFS-7060
> URL: https://issues.apache.org/jira/browse/HDFS-7060
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haohui Mai
>Assignee: Xinwei Qin 
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7060-002.patch, HDFS-7060-003.patch, 
> HDFS-7060.000.patch, HDFS-7060.001.patch
>
>
> We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} 
> when the DN is under heavy load of writes:
> {noformat}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
> - locked <0x000780612fd8> (a java.lang.Object)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
> - locked <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode

2017-10-27 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-7060:

Attachment: HDFS-7060-003.patch

> Avoid taking locks when sending heartbeats from the DataNode
> 
>
> Key: HDFS-7060
> URL: https://issues.apache.org/jira/browse/HDFS-7060
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haohui Mai
>Assignee: Xinwei Qin 
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7060-002.patch, HDFS-7060-003.patch, 
> HDFS-7060.000.patch, HDFS-7060.001.patch
>
>
> We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} 
> when the DN is under heavy load of writes:
> {noformat}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
> - locked <0x000780612fd8> (a java.lang.Object)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
> - locked <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode

2017-10-20 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212333#comment-16212333
 ] 

Jiandan Yang  commented on HDFS-7060:
-

[~xinwei] [~brahmareddy] [~jojochuang] We encountered the same problem 
(branch-2.8.2): BPServiceActor#offerService was blocked because sendHeartBeat 
was waiting for the FsDataset lock, blockReceivedAndDeleted was delayed by 
about 60s, and eventually the client could not close the file and threw the 
exception "Unable to close file because the last blockxxx does not have enough 
number of replicas".

I think HDFS-7060 would solve our problem very well. Is there any problem with 
this patch? Why hasn't it been merged into trunk?
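
To make the failure mode concrete, here is a minimal, self-contained sketch of the contention described above: one shared monitor standing in for {{FsDatasetImpl}}, a writer thread that holds it while doing slow I/O (as createRbw does), and a heartbeat thread that blocks on it (as getStorageReports/getDfsUsed do). The class and thread names are illustrative only, not the actual Hadoop code, and the hold time is shortened from the roughly 60s seen in the real incident.

{code:java}
import java.util.concurrent.TimeUnit;

public class HeartbeatBlockedDemo {
  private static final Object datasetLock = new Object();

  public static void main(String[] args) throws InterruptedException {
    Thread writer = new Thread(() -> {
      synchronized (datasetLock) {    // like createRbw taking the dataset monitor
        sleep(2_000);                 // slow disk I/O while holding the lock (~60s in the real incident)
      }
    }, "DataXceiver-writeBlock");

    Thread heartbeat = new Thread(() -> {
      long start = System.nanoTime();
      synchronized (datasetLock) {    // like getStorageReports / getDfsUsed
        long waitedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("heartbeat waited " + waitedMs + " ms for the dataset lock");
      }
    }, "BPServiceActor-sendHeartBeat");

    writer.start();
    sleep(100);          // make sure the writer grabs the lock first
    heartbeat.start();   // this thread now shows BLOCKED, as in the jstack output above
    writer.join();
    heartbeat.join();
  }

  private static void sleep(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}
{code}

Running this prints a wait roughly equal to the writer's hold time, which is exactly the BLOCKED state visible in the stack traces quoted below.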

> Avoid taking locks when sending heartbeats from the DataNode
> 
>
> Key: HDFS-7060
> URL: https://issues.apache.org/jira/browse/HDFS-7060
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haohui Mai
>Assignee: Xinwei Qin 
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7060-002.patch, HDFS-7060.000.patch, 
> HDFS-7060.001.patch
>
>
> We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} 
> when the DN is under heavy load of writes:
> {noformat}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
> - locked <0x000780612fd8> (a java.lang.Object)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
> - locked <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-16 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated HDFS-12638:
-
Status: Patch Available  (was: Open)

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
> Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> Active NameNode exited due to an NPE. I can confirm that the BlockCollection 
> passed in when creating ReplicationWork is null, but I do not know why it is 
> null. By reviewing the history I found that 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check 
> for whether BlockCollection is null.
> The NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}
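
A minimal sketch of the kind of defensive check the attached patch presumably restores in ReplicationWork#chooseTargets is shown below. All types and method signatures here are simplified and hypothetical, not the exact Hadoop source; the point is only that a null BlockCollection (e.g., the file was deleted between scheduling and executing the replication work) should skip the work item rather than crash the ReplicationMonitor thread.

{code:java}
// Simplified, hypothetical sketch (not the exact Hadoop source): skip the
// work item when the BlockCollection has disappeared instead of letting
// ReplicationMonitor die on a NullPointerException.
class ReplicationWorkSketch {
  private final BlockCollection bc;          // may be null if the file was deleted
  private DatanodeStorageInfo[] targets;

  ReplicationWorkSketch(BlockCollection bc) {
    this.bc = bc;
  }

  void chooseTargets(PlacementPolicy policy, int additionalReplRequired) {
    if (bc == null) {
      targets = new DatanodeStorageInfo[0];  // nothing to do; the block's file is gone
      return;
    }
    targets = policy.chooseTargets(bc, additionalReplRequired);
  }

  interface BlockCollection { }
  interface DatanodeStorageInfo { }
  interface PlacementPolicy {
    DatanodeStorageInfo[] chooseTargets(BlockCollection bc, int numTargets);
  }
}
{code}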



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


