[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated HDFS-13915:
--------------------------------
    Attachment: HDFS-13915.003.patch

                Key: HDFS-13915
                URL: https://issues.apache.org/jira/browse/HDFS-13915
            Project: Hadoop HDFS
         Issue Type: Bug
         Components: hdfs
           Reporter: Jiandan Yang
           Assignee: Jiandan Yang
           Priority: Major
        Attachments: HDFS-13915.001.patch, HDFS-13915.002.patch, HDFS-13915.003.patch

Consider the following situation:
1. A file is created with the ALLSSD policy.
2. The NameNode returns [SSD,SSD,DISK] due to a lack of SSD space.
3. The client calls NameNodeRpcServer#getAdditionalDatanode while recovering the write pipeline and replacing a bad datanode.
4. BlockPlacementPolicyDefault#chooseTarget calls StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but chooseStorageTypes returns [SSD,SSD], as the following test shows:
{code:java}
@Test
public void testAllSSDFallbackAndNonNewBlock() {
  final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD);
  List<StorageType> storageTypes = allSSD.chooseStorageTypes((short) 3,
      Arrays.asList(StorageType.DISK, StorageType.SSD),
      EnumSet.noneOf(StorageType.class), false);
  assertEquals(2, storageTypes.size());
  assertEquals(StorageType.SSD, storageTypes.get(0));
  assertEquals(StorageType.SSD, storageTypes.get(1));
}
{code}
5. numOfReplicas = requiredStorageTypes.size() then sets numOfReplicas to 2, so two additional datanodes are chosen.
6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client.
7. DataStreamer#findNewDatanode sees nodes.length != original.length + 1, throws an IOException, and the write finally fails:
{code:java}
private int findNewDatanode(final DatanodeInfo[] original
    ) throws IOException {
  if (nodes.length != original.length + 1) {
    throw new IOException(
        "Failed to replace a bad datanode on the existing pipeline "
            + "due to no more good datanodes being available to try. "
            + "(Nodes: current=" + Arrays.asList(nodes)
            + ", original=" + Arrays.asList(original) + "). "
            + "The current failed datanode replacement policy is "
            + dfsClient.dtpReplaceDatanodeOnFailure
            + ", and a client may configure this via '"
            + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
            + "' in its configuration.");
  }
  for(int i = 0; i < nodes.length; i++) {
    int j = 0;
    for(; j < original.length && !nodes[i].equals(original[j]); j++);
    if (j == original.length) {
      return i;
    }
  }
  throw new IOException("Failed: new datanode not found: nodes="
      + Arrays.asList(nodes) + ", original=" + Arrays.asList(original));
}
{code}
The client warn log is:
{code:java}
WARN [DataStreamer for file /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545 block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD], DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD], DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]], original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
{code}
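For context, here is the NameNode-side flow behind steps 4-5, condensed into a short sketch. The call shape follows BlockPlacementPolicyDefault#chooseTarget, but names such as storagePolicy, chosen, and the getStorageTypes helper are illustrative assumptions, not the literal upstream code:
{code:java}
// Condensed illustration of the pipeline-recovery path (not the literal
// upstream code). 'chosen' holds the two surviving pipeline replicas,
// stored as [SSD, DISK]; the client asked for exactly one additional
// datanode, i.e. numOfReplicas == 1 on entry.
List<StorageType> requiredStorageTypes = storagePolicy.chooseStorageTypes(
    (short) 3,                          // total replication
    getStorageTypes(chosen),            // [SSD, DISK] already in the pipeline
    EnumSet.noneOf(StorageType.class),  // no unavailable storage types
    false);                             // not a new block

// For ALL_SSD with a DISK fallback replica already present, this returns
// [SSD, SSD], and the requested count is silently overwritten:
numOfReplicas = requiredStorageTypes.size();  // 2 instead of 1

// chooseTarget therefore picks two extra nodes, getAdditionalDatanode
// returns four nodes in total, and DataStreamer#findNewDatanode rejects
// the result because nodes.length != original.length + 1.
{code}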
[jira] [Commented] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689199#comment-16689199 ]

Jiandan Yang commented on HDFS-13915:
--------------------------------------
There may be something wrong with Jenkins; uploading [^HDFS-13915.003.patch] to retrigger it.
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687599#comment-16687599 ]

Jiandan Yang commented on HDFS-14045:
--------------------------------------
The failed UTs are not caused by [^HDFS-14045.011.patch]; I can run them successfully on my local machine.

                Key: HDFS-14045
                URL: https://issues.apache.org/jira/browse/HDFS-14045
            Project: Hadoop HDFS
         Issue Type: Improvement
         Components: datanode
           Reporter: Jiandan Yang
           Assignee: Jiandan Yang
           Priority: Major
        Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, HDFS-14045.009.patch, HDFS-14045.010.patch, HDFS-14045.011.patch

Currently the DataNode uses the same metrics to measure the RPC latency of every NameNode, but the Active and Standby NameNodes usually perform differently at the same time, especially in a large cluster. For example, the RPC latency of the Standby is very long while it is catching up on the editlog, so we may misjudge the state of HDFS. Using different metrics for Active and Standby helps us obtain more precise metric data.
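A minimal sketch of the proposed idea, assuming the hadoop-common metrics2 library: keep the existing aggregate rate and additionally record each heartbeat under a per-NameNode key, skipping unresolvable ids as discussed in the review below. The class name, helper, and metric names are hypothetical, not necessarily those used by the patch:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableRate;

// Hypothetical illustration of per-NameNode latency metrics; the names here
// are assumptions, not the ones used by HDFS-14045.011.patch.
class PerNameNodeHeartbeatMetrics {
  private final MetricsRegistry registry = new MetricsRegistry("datanode");
  // Existing aggregate metric, kept unchanged for compatibility.
  private final MutableRate heartbeats = registry.newRate("Heartbeats");
  // One extra rate per "<nameservice>[-<nnId>]" suffix, created lazily.
  private final Map<String, MutableRate> perNnHeartbeats =
      new ConcurrentHashMap<>();

  void addHeartbeat(long latencyMs, String serviceId, String nnId) {
    heartbeats.add(latencyMs);
    if (serviceId == null) {
      return;  // per the review discussion: no "Unknown-Unknown" series
    }
    // "ns0-nn1" normally; just "ns0" when the NN id cannot be resolved.
    String suffix = (nnId == null) ? serviceId : serviceId + "-" + nnId;
    perNnHeartbeats
        .computeIfAbsent(suffix, s -> registry.newRate("HeartbeatsFor" + s))
        .add(latencyMs);
  }
}
{code}
With this shape, a metrics consumer would see, e.g., HeartbeatsForns0-nn1NumOps and HeartbeatsForns0-nn1AvgTime alongside the unchanged HeartbeatsNumOps/HeartbeatsAvgTime.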
[jira] [Commented] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687587#comment-16687587 ]

Jiandan Yang commented on HDFS-13915:
--------------------------------------
Uploading [^HDFS-13915.002.patch] to fix the checkstyle, whitespace, and unit-test issues. Could anyone please review this patch?
[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated HDFS-13915:
--------------------------------
    Attachment: HDFS-13915.002.patch
[jira] [Comment Edited] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687437#comment-16687437 ]

Jiandan Yang edited comment on HDFS-14045 at 11/15/18 3:20 AM:
---------------------------------------------------------------
Hi, [~elgoiri], I see what you mean.
{quote}For the Unknown-Unknown, I'm not sure is worth showing them, we should just not store those, what we had already covered this.{quote}
Absolutely right, there is no need for a metric for the Unknown-Unknown.
{quote}For the ns0-Unknown, we should just make it ns0.{quote}
This is a very good suggestion, thank you very much. I've updated code according to your comments in [^HDFS-14045.011.patch]

was (Author: yangjiandan):
Hi, [~elgoiri], I see what you mean.
{quote}For the Unknown-Unknown, I'm not sure is worth showing them, we should just not store those, what we had already covered this.{quote}
Absolutely right, there is no need for a metric for the Unknown-Unknown.
{quote}For the ns0-Unknown, we should just make it ns0.{quote}
This is a very good suggestion, thank you very much. I've update code according to your comments in [^HDFS-14045.011.patch]
[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated HDFS-14045:
--------------------------------
    Attachment: HDFS-14045.011.patch
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687437#comment-16687437 ]

Jiandan Yang commented on HDFS-14045:
--------------------------------------
Hi, [~elgoiri], I see what you mean.
{quote}For the Unknown-Unknown, I'm not sure is worth showing them, we should just not store those, what we had already covered this.{quote}
Absolutely right, there is no need for a metric for the Unknown-Unknown.
{quote}For the ns0-Unknown, we should just make it ns0.{quote}
This is a very good suggestion, thank you very much. I've update code according to your comments in [^HDFS-14045.011.patch]
[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated HDFS-13915:
--------------------------------
    Attachment: (was: HDFS-13915.001.patch)
[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated HDFS-13915:
--------------------------------
    Attachment: HDFS-13915.001.patch
[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated HDFS-13915:
--------------------------------
    Assignee: Jiandan Yang
      Status: Patch Available  (was: Open)
[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated HDFS-13915:
--------------------------------
    Attachment: HDFS-13915.001.patch
[jira] [Comment Edited] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686177#comment-16686177 ]

Jiandan Yang edited comment on HDFS-13915 at 11/14/18 8:01 AM:
---------------------------------------------------------------
I added a case in [^HDFS-13915.001.patch], based on trunk, to reproduce the issue.
Hi, [~szetszwo], BlockStoragePolicy#chooseStorageTypes may return excessive storage types, and I do not understand why after looking through the related code. Can we remove the excessive storage types?
{code:java}
if (storageTypes.size() < expectedSize) {
  LOG.warn("Failed to place enough replicas: expected size is {}"
      + " but only {} storage types can be selected (replication={},"
      + " selected={}, unavailable={}, removed={}, policy={})",
      expectedSize, storageTypes.size(), replication, storageTypes,
      unavailables, removed, this);
} else if (storageTypes.size() > expectedSize) {
  // should remove excess storageType to return expectedSize storageType
  int storageTypesSize = storageTypes.size();
  int excessiveStorageTypeNum = storageTypesSize - expectedSize;
  for (int i = 0; i < excessiveStorageTypeNum; i++) {
    storageTypes.remove(storageTypesSize - 1 - i);
  }
}
{code}

was (Author: yangjiandan):
I added a case in [^HDFS-13915.001.patch], based on trunk, to reproduce the issue.
Hi, [~szetszwo], BlockStoragePolicy#chooseStorageTypes may return excessive storage types, and I do not understand why after looking through the related code. Can we remove the excessive storage types?
{code:java}
if (storageTypes.size() < expectedSize) {
  LOG.warn("Failed to place enough replicas: expected size is {}"
      + " but only {} storage types can be selected (replication={},"
      + " selected={}, unavailable={}, removed={}, policy={})",
      expectedSize, storageTypes.size(), replication, storageTypes,
      unavailables, removed, this);
} else if (storageTypes.size() > expectedSize) {
  // should remove excess storageType to return expectedSize storageType
}
{code}
[jira] [Commented] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686177#comment-16686177 ]

Jiandan Yang commented on HDFS-13915:
--------------------------------------
I added a case in [^HDFS-13915.001.patch], based on trunk, to reproduce the issue.
Hi, [~szetszwo], BlockStoragePolicy#chooseStorageTypes may return excessive storage types, and I do not understand why after looking through the related code. Can we remove the excessive storage types?
{code:java}
if (storageTypes.size() < expectedSize) {
  LOG.warn("Failed to place enough replicas: expected size is {}"
      + " but only {} storage types can be selected (replication={},"
      + " selected={}, unavailable={}, removed={}, policy={})",
      expectedSize, storageTypes.size(), replication, storageTypes,
      unavailables, removed, this);
} else if (storageTypes.size() > expectedSize) {
  // should remove excess storageType to return expectedSize storageType
}
{code}
[jira] [Comment Edited] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686140#comment-16686140 ]

Jiandan Yang edited comment on HDFS-14045 at 11/14/18 6:49 AM:
---------------------------------------------------------------
There are many "[ERROR] Error occurred in starting fork, check output in log" entries in the test log, and I think there may be something wrong with Jenkins. Uploading [^HDFS-14045.010.patch] to trigger Jenkins.

was (Author: yangjiandan):
There are many "[ERROR] Error occurred in starting fork, check output in log" entries in the test log, and I think there may be something wrong with Jenkins. Uploading [^HDFS-14045] to trigger Jenkins.
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686140#comment-16686140 ]

Jiandan Yang commented on HDFS-14045:
--------------------------------------
There are many "[ERROR] Error occurred in starting fork, check output in log" entries in the test log, and I think there may be something wrong with Jenkins. Uploading [^HDFS-14045] to trigger Jenkins.
[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated HDFS-14045:
--------------------------------
    Attachment: HDFS-14045.010.patch
[jira] [Comment Edited] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686134#comment-16686134 ]

Jiandan Yang edited comment on HDFS-14045 at 11/14/18 6:41 AM:
---------------------------------------------------------------
Thanks [~elgoiri] for your comments.
{quote}TestDataNodeMetrics#testNNRpcMetricsWithFederationAndHA(), testNNRpcMetricsWithFederation() and testNNRpcMetricsWithHA(), no need to extract the suffix.{quote}
I've removed the suffix in [^HDFS-14045.009.patch]
{quote}I'm not sure about the Unknown-Unknown behavior, if we cannot determine the id, we may want to just leave it as it was?{quote}
Do you mean do not make metrics when the suffix is Unknown-Unknown? I do not understand what you mean.
{quote}Which unit test makes sure that HeartbeatsNumOps and HeartbeatsAvgTime are still showing the old values? It looks good but just to verify.{quote}
A good suggestion; I've added verification of HeartbeatsNumOps in [^HDFS-14045.009.patch]

was (Author: yangjiandan):
Thanks [~elgoiri] for your comments.
{quote}TestDataNodeMetrics#testNNRpcMetricsWithFederationAndHA(), testNNRpcMetricsWithFederation() and testNNRpcMetricsWithHA(), no need to extract the suffix.{quote}
I've removed the suffix in [^HDFS-14045.009.patch]
{quote}I'm not sure about the Unknown-Unknown behavior, if we cannot determine the id, we may want to just leave it as it was?{quote}
Do you mean do not make metrics when the suffix is Unknown-Unknown? I do not understand what your mean.
{quote}Which unit test makes sure that HeartbeatsNumOps and HeartbeatsAvgTime are still showing the old values? It looks good but just to verify.{quote}
A good suggestion; I've added verification of HeartbeatsNumOps in [^HDFS-14045.009.patch]
[jira] [Comment Edited] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686134#comment-16686134 ] Jiandan Yang edited comment on HDFS-14045 at 11/14/18 6:38 AM: Thanks [~elgoiri] for your comments. {quote} TestDataNodeMetrics#testNNRpcMetricsWithFederationAndHA(), testNNRpcMetricsWithFederation() and testNNRpcMetricsWithHA(), no need to extract the suffix. {quote} I've removed the suffix in [^HDFS-14045.009.patch] {quote} I'm not sure about the Unknown-Unknown behavior, if we cannot determine the id, we may want to just leave it as it was? {quote} Do you mean we should not record metrics when the suffix is Unknown-Unknown? I do not understand what you mean. {quote} Which unit test makes sure that HeartbeatsNumOps and HeartbeatsAvgTime are still showing the old values? It looks good but just to verify. {quote} A good suggestion; I've added verification of HeartbeatsNumOps in [^HDFS-14045.009.patch] was (Author: yangjiandan): Thanks [~elgoiri] for you comments. {quota} TestDataNodeMetrics#testNNRpcMetricsWithFederationAndHA(), testNNRpcMetricsWithFederation() and testNNRpcMetricsWithHA(), no need to extract the suffix. {quota} I've remove suffix in [^HDFS-14045.009.patch] {quota} I'm not sure about the Unknown-Unknown behavior, if we cannot determine the id, we may want to just leave it as it was? {quota} Do you mean do not make metrics when suffix is Unknown-Unknown?I do not understand what your mean. {quota} Which unit test makes sure that HeartbeatsNumOps and HeartbeatsAvgTime are still showing the old values? It looks good but just to verify. {quota} A good suggestion, I've add verification about HeartbeatsNumOps in [^HDFS-14045.009.patch] > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, > HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, > HDFS-14045.009.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686134#comment-16686134 ] Jiandan Yang commented on HDFS-14045: -- Thanks [~elgoiri] for your comments. {quote} TestDataNodeMetrics#testNNRpcMetricsWithFederationAndHA(), testNNRpcMetricsWithFederation() and testNNRpcMetricsWithHA(), no need to extract the suffix. {quote} I've removed the suffix in [^HDFS-14045.009.patch] {quote} I'm not sure about the Unknown-Unknown behavior, if we cannot determine the id, we may want to just leave it as it was? {quote} Do you mean we should not record metrics when the suffix is Unknown-Unknown? I do not understand what you mean. {quote} Which unit test makes sure that HeartbeatsNumOps and HeartbeatsAvgTime are still showing the old values? It looks good but just to verify. {quote} A good suggestion; I've added verification of HeartbeatsNumOps in [^HDFS-14045.009.patch] > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, > HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, > HDFS-14045.009.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-14045: - Attachment: HDFS-14045.009.patch > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, > HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch, > HDFS-14045.009.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684619#comment-16684619 ] Jiandan Yang commented on HDFS-14045: -- Hi [~xkrogen], I have updated the patch according to your review comments; please help review it again. > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, > HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682726#comment-16682726 ] Jiandan Yang commented on HDFS-14045: -- The TestNameNodeMXBean failure is not related to this patch; it runs successfully on my local machine. > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, > HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-14045: - Attachment: HDFS-14045.008.patch > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, > HDFS-14045.006.patch, HDFS-14045.007.patch, HDFS-14045.008.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682395#comment-16682395 ] Jiandan Yang commented on HDFS-14045: -- Thanks very much, [~xkrogen], for your comments. {quote} Can we change the name of the method/parameter to something indicating it is for metrics only, maybe like nnLatencyMetricsSuffix? It looks particularly odd to me in IncrementalBlockReportManager right now. {quote} I renamed {{nnLatencyMetricsSuffix}} to {{rpcMetricSuffix}}; what do you think of this name? {quote} I think I would prefer to see the existing methods in DataNodeMetrics changed to update both metrics, rather than the caller having to remember to call both methods. It introduces less possibility for the two metrics to get out of sync later. {quote} Very good suggestion; in patch008 both metrics are now updated in one method, but serviceId-nnId is needed when updating the metric, so a metric-suffix parameter had to be added to the existing methods. {quote} I'm not sure if you should re-use the same MutableRatesWithAggregation for all of the metrics. It seems cleaner to me to have one per metric type, e.g. one for heartbeats, one for lifeline, and so on, but let me know if you disagree. I think this may even make it so that, if you set up the names correctly, the MutableRatesWithAggregation can replace the existing MutableRate while maintaining the name of the metric. Not 100% sure on this. {quote} I prefer to re-use MutableRatesWithAggregation for simplicity; it avoids adding new fields when new metrics are added. {quote} You should update Metrics.md documenting these new metrics {quote} Thanks for the reminder to update Metrics.md; the newly added metrics are documented in Metrics.md in patch008 > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, > HDFS-14045.006.patch, HDFS-14045.007.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
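To make the "update both metrics in one method" idea above concrete, here is a minimal hedged sketch of what such a DataNodeMetrics method could look like. The class, field, and metric names here are assumptions based on this discussion (only {{rpcMetricSuffix}} and MutableRatesWithAggregation come from the thread), not the actual patch code:
{code:java}
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableRate;
import org.apache.hadoop.metrics2.lib.MutableRatesWithAggregation;

// Sketch only: the real DataNodeMetrics has many more fields and annotations.
class DataNodeMetricsSketch {
  private final MetricsRegistry registry = new MetricsRegistry("datanode");
  // Existing aggregate metric; its name stays unchanged for compatibility.
  private final MutableRate heartbeats = registry.newRate("heartbeats");
  // One rate per NameNode, keyed dynamically by the serviceId-nnId suffix.
  private final MutableRatesWithAggregation heartbeatRates =
      registry.newRatesWithAggregation("heartbeatRates");

  /** Single entry point, so the aggregate and per-NN metrics cannot drift apart. */
  void addHeartbeat(long latencyMillis, String rpcMetricSuffix) {
    heartbeats.add(latencyMillis);
    heartbeatRates.add("Heartbeats" + rpcMetricSuffix, latencyMillis);
  }
}
{code}
Because MutableRatesWithAggregation accepts the metric name at record time, one shared instance can serve any number of NameNodes without adding a field per metric, which is the simplicity argument made above.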
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681024#comment-16681024 ] Jiandan Yang commented on HDFS-14045: -- The failed unit test is not related to this patch; it runs successfully on my local machine. [~cheersyang], [~elgoiri], [~xkrogen], would you please help review again? > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, > HDFS-14045.006.patch, HDFS-14045.007.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-14045: - Attachment: HDFS-14045.006.patch > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, > HDFS-14045.006.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-14045: - Attachment: HDFS-14045.007.patch > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, > HDFS-14045.006.patch, HDFS-14045.007.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-14045: - Attachment: HDFS-14045.005.patch > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679222#comment-16679222 ] Jiandan Yang edited comment on HDFS-14045 at 11/8/18 3:01 AM: --- Hi [~elgoiri], I fully agree. Setting the interval and unit has better readability; I'll update this in patch5 was (Author: yangjiandan): Hi, [~elgoiri], I do agree with you. Setting interval and unit has better readability, I'll update in patch5 > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679413#comment-16679413 ] Jiandan Yang commented on HDFS-14045: -- Hi [~cheersyang] and [~xkrogen], many thanks for your suggestions. I add nameServiceId and NameNodeId to the metric names dynamically, and I keep the old metrics for compatibility. Please help review patch6. > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679807#comment-16679807 ] Jiandan Yang commented on HDFS-14045: -- There may be something wrong with Jenkins. I fixed the checkstyle error and uploaded patch007 to retrigger Jenkins. > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch, HDFS-14045.005.patch, > HDFS-14045.006.patch, HDFS-14045.007.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679222#comment-16679222 ] Jiandan Yang commented on HDFS-14045: -- Hi [~elgoiri], I do agree with you. Setting the interval and unit has better readability; I'll update this in patch5 > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice
[ https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677737#comment-16677737 ] Jiandan Yang commented on HDFS-13984: -- Uploaded patch3 to trigger the test run. > getFileInfo of libhdfs call NameNode#getFileStatus twice > > > Key: HDFS-13984 > URL: https://issues.apache.org/jira/browse/HDFS-13984 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-13984.001.patch, HDFS-13984.002.patch, > HDFS-13984.003.patch > > > getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls > *FileSystem#getFileStatus*. > *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows: > {code:java} > public boolean exists(Path f) throws IOException { > try { > return getFileStatus(f) != null; > } catch (FileNotFoundException e) { > return false; > } > } > {code} > and finally this leads to call NameNodeRpcServer#getFileInfo twice. > Actually we can implement by calling once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
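For context on the HDFS-13984 change being tested above, here is a minimal sketch of the single-RPC idea from the issue description, expressed in Java rather than the libhdfs C code the patch actually touches; the helper name is illustrative. Instead of exists() followed by getFileStatus(), a single getFileStatus() call suffices, with FileNotFoundException standing in for "no such file":
{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// One getFileStatus() call instead of exists() + getFileStatus(),
// halving the NameNodeRpcServer#getFileInfo RPCs per lookup.
static FileStatus statusOrNull(FileSystem fs, Path path) throws IOException {
  try {
    return fs.getFileStatus(path);
  } catch (FileNotFoundException e) {
    return null; // file does not exist
  }
}
{code}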
[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice
[ https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-13984: - Attachment: HDFS-13984.003.patch > getFileInfo of libhdfs call NameNode#getFileStatus twice > > > Key: HDFS-13984 > URL: https://issues.apache.org/jira/browse/HDFS-13984 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-13984.001.patch, HDFS-13984.002.patch, > HDFS-13984.003.patch > > > getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls > *FileSystem#getFileStatus*. > *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows: > {code:java} > public boolean exists(Path f) throws IOException { > try { > return getFileStatus(f) != null; > } catch (FileNotFoundException e) { > return false; > } > } > {code} > and finally this leads to call NameNodeRpcServer#getFileInfo twice. > Actually we can implement by calling once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677695#comment-16677695 ] Jiandan Yang commented on HDFS-14045: -- Thanks [~xkrogen] for your suggestion and the reminder about multiple standbys. At first I thought measuring the Active's latency was enough, because the Standby does not serve users; after reading your comments I realize it is also helpful to monitor the Observer if it serves traffic. Still, I think grouping metrics by NN role is better than by NameNode ID, because we cannot tell which metric belongs to the Active/Standby/Observer from the metric name. A hypothetical comparison of the two naming schemes follows below. > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
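To illustrate the trade-off debated above, a hypothetical sketch contrasting the two naming schemes; all names are made up for illustration ({{HeartbeatsForStandby}} resembles a name mentioned later in this thread, and the serviceId-nnId form resembles what the later patches adopt):
{code:java}
// Hypothetical helper contrasting the two schemes; every name here is illustrative.
static String metricName(boolean isActive, String nameserviceId,
    String namenodeId, boolean byRole) {
  return byRole
      // Readable, but the NN behind the name changes on every failover.
      ? "HeartbeatsFor" + (isActive ? "Active" : "Standby")
      // Stable per NN, but the role is not visible in the name itself.
      : "Heartbeats-" + nameserviceId + "-" + namenodeId;
}
{code}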
[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-14045: - Attachment: HDFS-14045.004.patch > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch, HDFS-14045.004.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677647#comment-16677647 ] Jiandan Yang commented on HDFS-14045: -- Hi [~elgoiri], * Sorry for forgetting to fix the checkstyle issue; I'll fix it. * The unit of dfs.heartbeat.interval is *seconds*, not milliseconds. * Your guess is exactly right, and I added an assert in the unit test. A sketch of the interval setting follows below. > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
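As a hedged illustration of the unit pitfall mentioned above (the 3000 value comes from the test discussion later in this thread; whether the test sets it exactly this way is an assumption, and the class name is made up):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;

class HeartbeatIntervalExample {
  static Configuration quietHeartbeatConf() {
    Configuration conf = new HdfsConfiguration();
    // dfs.heartbeat.interval is in seconds, so 3000 means 3000s (not 3s),
    // effectively suppressing periodic heartbeats while a test runs.
    conf.setLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 3000);
    return conf;
  }
}
{code}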
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676151#comment-16676151 ] Jiandan Yang commented on HDFS-14045: -- Hi [~cheersyang], {{BPOfferService#updateActorStatesFromHeartbeat}} uses HAServiceState to determine which NN is active and keeps the active actor in the field {{bpServiceToActive}}, so we can use {{bpServiceToActive}} directly. > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
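A minimal sketch of the mechanism described above, assuming the {{bpServiceToActive}} field mentioned in the comment; the helper name matches the {{isActive()}} method discussed in later comments, but its placement and exact body here are a guess, not the patch code:
{code:java}
// Inside BPOfferService (sketch): bpServiceToActive is updated from the
// HAServiceState carried in each heartbeat response, so identifying the
// actor that talks to the active NN is a simple reference comparison.
boolean isActive(BPServiceActor actor) {
  return bpServiceToActive == actor;
}
{code}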
[jira] [Commented] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice
[ https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676077#comment-16676077 ] Jiandan Yang commented on HDFS-13984: -- Hi [~jzhuge], the failed unit test is not caused by this patch; [~anatoli.shein] is resolving it in HDFS-14047. I am confused about the compiler warnings: hdfs.c:3484 and libhdfs_wrapper.c:21 are not introduced by this patch. > getFileInfo of libhdfs call NameNode#getFileStatus twice > > > Key: HDFS-13984 > URL: https://issues.apache.org/jira/browse/HDFS-13984 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-13984.001.patch, HDFS-13984.002.patch > > > getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls > *FileSystem#getFileStatus*. > *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows: > {code:java} > public boolean exists(Path f) throws IOException { > try { > return getFileStatus(f) != null; > } catch (FileNotFoundException e) { > return false; > } > } > {code} > and finally this leads to call NameNodeRpcServer#getFileInfo twice. > Actually we can implement by calling once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-14045: - Attachment: HDFS-14045.003.patch > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch, > HDFS-14045.003.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676059#comment-16676059 ] Jiandan Yang commented on HDFS-14045: -- Thanks [~elgoiri] for your review comments. * I've added javadoc to {{BPServiceActor#isActive()}} and removed the invalid comment in {{testNNRpcMetricsWithHA}} in patch003. * Regarding the UT, here is the explanation (see the sketch after this list): ** Setting the heartbeat interval to 3000 seconds prevents the BPServiceActor from sending periodic heartbeats to the NN while the test case runs; without a manual trigger, the BPServiceActor sends only one heartbeat after startup. ** Triggering the heartbeat twice obtains the active NN from the heartbeat response and updates the metrics. The first trigger obtains the active NN from the heartbeat response after one NN transitions to active, so one of the two service actors is updated to the active actor; the second trigger updates HeartbeatsNumOps and HeartbeatsForStandbyNumOps by sending heartbeats to the Active and the Standby. > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
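A hedged sketch of that test flow using Hadoop's standard test helpers (MiniDFSCluster, MiniDFSNNTopology, DataNodeTestUtils); the real {{testNNRpcMetricsWithHA}} additionally asserts on the metric values, which is omitted here, and the class name is made up:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.MiniDFSNNTopology;
import org.apache.hadoop.hdfs.server.datanode.DataNode;
import org.apache.hadoop.hdfs.server.datanode.DataNodeTestUtils;

public class NNRpcMetricsHASketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    // Unit is seconds: 3000s keeps periodic heartbeats out of the test window.
    conf.setLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 3000);
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .nnTopology(MiniDFSNNTopology.simpleHATopology())
        .numDataNodes(1)
        .build();
    try {
      cluster.waitActive();
      cluster.transitionToActive(0);
      DataNode dn = cluster.getDataNodes().get(0);
      // First trigger: actors learn from the heartbeat response which NN is active.
      DataNodeTestUtils.triggerHeartbeat(dn);
      // Second trigger: latency lands in the active/standby-specific metrics.
      DataNodeTestUtils.triggerHeartbeat(dn);
    } finally {
      cluster.shutdown();
    }
  }
}
{code}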
[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-14045: - Attachment: HDFS-14045.002.patch > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch, HDFS-14045.002.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675084#comment-16675084 ] Jiandan Yang commented on HDFS-14045: -- Hi [~cheersyang], [~elgoiri] and [~xkrogen], thanks very much for your review and comments. To [~cheersyang]: This patch works under HA, non-HA and federation setups. Different clusters have different namespaceIDs, so the namespaceID is not suitable for use in a metric name. I think the most reasonable approach would be to add tags at the metric level, but tags are only allowed at the metric-source level, so I use different metrics; fortunately, not many metrics need to be added. To [~xkrogen] and [~elgoiri]: I've added some unit tests covering HA and non-HA in patch002; could you please help review this again? > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
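For readers unfamiliar with the metrics2 limitation mentioned above, a small hedged illustration: tags attach to a whole metrics source (registry), not to an individual metric, which is why the per-NN distinction ends up encoded in the metric name. The class name and the concrete strings are illustrative assumptions:
{code:java}
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableRate;

class TagVsNameSketch {
  static MutableRate perNnRate() {
    MetricsRegistry registry = new MetricsRegistry("datanode");
    // A tag applies to every metric this source emits; metrics2 has no
    // per-metric tags, so per-NN data must be encoded in the metric name.
    registry.tag("NameNodeId", "NameNode this source reports on", "nn1");
    return registry.newRate("HeartbeatsFornn1"); // NN encoded in the name instead
  }
}
{code}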
[jira] [Commented] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672683#comment-16672683 ] Jiandan Yang commented on HDFS-14045: -- [~cheersyang] Would you please help me review this patch? > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice
[ https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672654#comment-16672654 ] Jiandan Yang commented on HDFS-13984: -- [~jzhuge] Could you help me review this patch? > getFileInfo of libhdfs call NameNode#getFileStatus twice > > > Key: HDFS-13984 > URL: https://issues.apache.org/jira/browse/HDFS-13984 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-13984.001.patch, HDFS-13984.002.patch > > > getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls > *FileSystem#getFileStatus*. > *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows: > {code:java} > public boolean exists(Path f) throws IOException { > try { > return getFileStatus(f) != null; > } catch (FileNotFoundException e) { > return false; > } > } > {code} > and finally this leads to call NameNodeRpcServer#getFileInfo twice. > Actually we can implement by calling once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-14045: - Attachment: HDFS-14045.001.patch Status: Patch Available (was: Open) > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-14045.001.patch > > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
[ https://issues.apache.org/jira/browse/HDFS-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang reassigned HDFS-14045: Assignee: Jiandan Yang > Use different metrics in DataNode to better measure latency of > heartbeat/blockReports/incrementalBlockReports of Active/Standby NN > -- > > Key: HDFS-14045 > URL: https://issues.apache.org/jira/browse/HDFS-14045 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > > Currently DataNode uses same metrics to measure rpc latency of NameNode, but > Active and Standby usually have different performance at the same time, > especially in large cluster. For example, rpc latency of Standby is very long > when Standby is catching up editlog. We may misunderstand the state of HDFS. > Using different metrics for Active and standby can help us obtain more > precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
Jiandan Yang created HDFS-14045: Summary: Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN Key: HDFS-14045 URL: https://issues.apache.org/jira/browse/HDFS-14045 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Reporter: Jiandan Yang Currently DataNode uses same metrics to measure rpc latency of NameNode, but Active and Standby usually have different performance at the same time, especially in large cluster. For example, rpc latency of Standby is very long when Standby is catching up editlog. We may misunderstand the state of HDFS. Using different metrics for Active and standby can help us obtain more precise metric data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice
[ https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-13984: - Attachment: HDFS-13984.002.patch > getFileInfo of libhdfs call NameNode#getFileStatus twice > > > Key: HDFS-13984 > URL: https://issues.apache.org/jira/browse/HDFS-13984 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-13984.001.patch, HDFS-13984.002.patch > > > getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls > *FileSystem#getFileStatus*. > *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows: > {code:java} > public boolean exists(Path f) throws IOException { > try { > return getFileStatus(f) != null; > } catch (FileNotFoundException e) { > return false; > } > } > {code} > and finally this leads to call NameNodeRpcServer#getFileInfo twice. > Actually we can implement by calling once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice
[ https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-13984: - Attachment: (was: HDFS-13984.002.patch) > getFileInfo of libhdfs call NameNode#getFileStatus twice > > > Key: HDFS-13984 > URL: https://issues.apache.org/jira/browse/HDFS-13984 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-13984.001.patch, HDFS-13984.002.patch > > > getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls > *FileSystem#getFileStatus*. > *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows: > {code:java} > public boolean exists(Path f) throws IOException { > try { > return getFileStatus(f) != null; > } catch (FileNotFoundException e) { > return false; > } > } > {code} > and finally this leads to call NameNodeRpcServer#getFileInfo twice. > Actually we can implement by calling once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice
[ https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-13984: - Attachment: HDFS-13984.002.patch > getFileInfo of libhdfs call NameNode#getFileStatus twice > > > Key: HDFS-13984 > URL: https://issues.apache.org/jira/browse/HDFS-13984 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-13984.001.patch, HDFS-13984.002.patch > > > getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls > *FileSystem#getFileStatus*. > *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows: > {code:java} > public boolean exists(Path f) throws IOException { > try { > return getFileStatus(f) != null; > } catch (FileNotFoundException e) { > return false; > } > } > {code} > and finally this leads to call NameNodeRpcServer#getFileInfo twice. > Actually we can implement by calling once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice
[ https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-13984: - Description: getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls *FileSystem#getFileStatus*. *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows: {code:java} public boolean exists(Path f) throws IOException { try { return getFileStatus(f) != null; } catch (FileNotFoundException e) { return false; } } {code} and finally this leads to call NameNodeRpcServer#getFileInfo twice. Actually we can implement by calling once. was: getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls *FileSystem#getFileStatus*. *FileSystem#exists* also call *FileSystem#getFileStatus, just as follows: {code:java} public boolean exists(Path f) throws IOException { try { return getFileStatus(f) != null; } catch (FileNotFoundException e) { return false; } } {code} and finally this leads to call NameNodeRpcServer#getFileInfo twice. Actually we can implement by calling once. > getFileInfo of libhdfs call NameNode#getFileStatus twice > > > Key: HDFS-13984 > URL: https://issues.apache.org/jira/browse/HDFS-13984 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-13984.001.patch > > > getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls > *FileSystem#getFileStatus*. *FileSystem#exists* also call > *FileSystem#getFileStatus*, just as follows: > {code:java} > public boolean exists(Path f) throws IOException { > try { > return getFileStatus(f) != null; > } catch (FileNotFoundException e) { > return false; > } > } > {code} > and finally this leads to call NameNodeRpcServer#getFileInfo twice. > Actually we can implement by calling once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice
[ https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-13984: - Description: getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls *FileSystem#getFileStatus*. *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows: {code:java} public boolean exists(Path f) throws IOException { try { return getFileStatus(f) != null; } catch (FileNotFoundException e) { return false; } } {code} and finally this leads to call NameNodeRpcServer#getFileInfo twice. Actually we can implement by calling once. was: getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls *FileSystem#getFileStatus*. *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows: {code:java} public boolean exists(Path f) throws IOException { try { return getFileStatus(f) != null; } catch (FileNotFoundException e) { return false; } } {code} and finally this leads to call NameNodeRpcServer#getFileInfo twice. Actually we can implement by calling once. > getFileInfo of libhdfs call NameNode#getFileStatus twice > > > Key: HDFS-13984 > URL: https://issues.apache.org/jira/browse/HDFS-13984 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-13984.001.patch > > > getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls > *FileSystem#getFileStatus*. > *FileSystem#exists* also call *FileSystem#getFileStatus*, just as follows: > {code:java} > public boolean exists(Path f) throws IOException { > try { > return getFileStatus(f) != null; > } catch (FileNotFoundException e) { > return false; > } > } > {code} > and finally this leads to call NameNodeRpcServer#getFileInfo twice. > Actually we can implement by calling once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice
[ https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-13984: - Status: Patch Available (was: Open) > getFileInfo of libhdfs call NameNode#getFileStatus twice > > > Key: HDFS-13984 > URL: https://issues.apache.org/jira/browse/HDFS-13984 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-13984.001.patch > > > getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls > *FileSystem#getFileStatus*. *FileSystem#exists* also call > *FileSystem#getFileStatus, just as follows: > {code:java} > public boolean exists(Path f) throws IOException { > try { > return getFileStatus(f) != null; > } catch (FileNotFoundException e) { > return false; > } > } > {code} > and finally this leads to call NameNodeRpcServer#getFileInfo twice. > Actually we can implement by calling once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice
[ https://issues.apache.org/jira/browse/HDFS-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-13984: - Attachment: HDFS-13984.001.patch > getFileInfo of libhdfs call NameNode#getFileStatus twice > > > Key: HDFS-13984 > URL: https://issues.apache.org/jira/browse/HDFS-13984 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-13984.001.patch > > > getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls > *FileSystem#getFileStatus*. *FileSystem#exists* also call > *FileSystem#getFileStatus, just as follows: > {code:java} > public boolean exists(Path f) throws IOException { > try { > return getFileStatus(f) != null; > } catch (FileNotFoundException e) { > return false; > } > } > {code} > and finally this leads to call NameNodeRpcServer#getFileInfo twice. > Actually we can implement by calling once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice
Jiandan Yang created HDFS-13984: Summary: getFileInfo of libhdfs call NameNode#getFileStatus twice Key: HDFS-13984 URL: https://issues.apache.org/jira/browse/HDFS-13984 Project: Hadoop HDFS Issue Type: Improvement Components: libhdfs Reporter: Jiandan Yang Assignee: Jiandan Yang getFileInfo in hdfs.c calls *FileSystem#exists* first, then calls *FileSystem#getFileStatus*. *FileSystem#exists* also calls *FileSystem#getFileStatus*, just as follows: {code:java} public boolean exists(Path f) throws IOException { try { return getFileStatus(f) != null; } catch (FileNotFoundException e) { return false; } } {code} and this ultimately calls NameNodeRpcServer#getFileInfo twice. We can implement the same check with a single call. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617092#comment-16617092 ] Jiandan Yang commented on HDFS-13915: -- [~hexiaoqiao] 2.6.5, and I found trunk also has the same problem after checking code. > replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode > returning excessive datanodeInfo > > > Key: HDFS-13915 > URL: https://issues.apache.org/jira/browse/HDFS-13915 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs > Environment: >Reporter: Jiandan Yang >Priority: Major > > Consider following situation: > 1. create a file with ALLSSD policy > 2. return [SSD,SSD,DISK] due to lack of SSD space > 3. client call NameNodeRpcServer#getAdditionalDatanode when recovering write > pipeline and replacing bad datanode > 4. BlockPlacementPolicyDefault#chooseTarget will call > StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but > chooseStorageTypes return [SSD,SSD] > {code:java} > @Test > public void testAllSSDFallbackAndNonNewBlock() { > final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD); > List storageTypes = allSSD.chooseStorageTypes((short) 3, > Arrays.asList(StorageType.DISK, StorageType.SSD), > EnumSet.noneOf(StorageType.class), false); > assertEquals(2, storageTypes.size()); > assertEquals(StorageType.SSD, storageTypes.get(0)); > assertEquals(StorageType.SSD, storageTypes.get(1)); > } > {code} > 5. do numOfReplicas = requiredStorageTypes.size() and numOfReplicas is set to > 2 and choose additional two datanodes > 6. BlockPlacementPolicyDefault#chooseTarget return four datanodes to client > 7. DataStreamer#findNewDatanode find nodes.length != original.length + 1 and > throw IOException, and finally lead to write failed > {code:java} > private int findNewDatanode(final DatanodeInfo[] original > ) throws IOException { > if (nodes.length != original.length + 1) { > throw new IOException( > "Failed to replace a bad datanode on the existing pipeline " > + "due to no more good datanodes being available to try. " > + "(Nodes: current=" + Arrays.asList(nodes) > + ", original=" + Arrays.asList(original) + "). " > + "The current failed datanode replacement policy is " > + dfsClient.dtpReplaceDatanodeOnFailure > + ", and a client may configure this via '" > + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY > + "' in its configuration."); > } > for(int i = 0; i < nodes.length; i++) { > int j = 0; > for(; j < original.length && !nodes[i].equals(original[j]); j++); > if (j == original.length) { > return i; > } > } > throw new IOException("Failed: new datanode not found: nodes=" > + Arrays.asList(nodes) + ", original=" + Arrays.asList(original)); > } > {code} > client warn logs is: > {code:java} > WARN [DataStreamer for file > /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545 > block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] > org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception > java.io.IOException: Failed to replace a bad datanode on the existing > pipeline due to no more good datanodes being available to try.
(Nodes: > current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], > > DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD], > > DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD], > > DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]], > > original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], > > DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]). > The current failed datanode replacement policy is DEFAULT, and a client may > configure this via > 'dfs.client.block.write.replace-datanode-on-failure.policy' in its > configuration. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
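For readers following the sequence above: the invariant that fails in step 7 can be reduced to the arithmetic below. This is a self-contained illustration, not HDFS code; node names are made up.
{code:java}
public class PipelineInvariantDemo {
  // Mirrors the check in DataStreamer#findNewDatanode: after replacing one
  // bad datanode, the new pipeline must hold exactly original.length + 1 nodes.
  static void checkReplacement(String[] nodes, String[] original) {
    if (nodes.length != original.length + 1) {
      throw new IllegalStateException("Failed to replace a bad datanode: current="
          + nodes.length + " nodes, original=" + original.length + " nodes");
    }
  }

  public static void main(String[] args) {
    String[] original = {"dn-ssd-1", "dn-disk-1"};   // surviving pipeline
    String[] returned = {"dn-ssd-1", "dn-ssd-2", "dn-ssd-3", "dn-disk-1"};
    checkReplacement(returned, original);  // throws: 4 != 2 + 1, so the write fails
  }
}
{code}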
[jira] [Assigned] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang reassigned HDFS-13915: Assignee: (was: Jiandan Yang ) > replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode > returning excessive datanodeInfo > > > Key: HDFS-13915 > URL: https://issues.apache.org/jira/browse/HDFS-13915 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs > Environment: >Reporter: Jiandan Yang >Priority: Major > > Consider following situation: > 1. create a file with ALLSSD policy > 2. return [SSD,SSD,DISK] due to lack of SSD space > 3. client call NameNodeRpcServer#getAdditionalDatanode when recovering write > pipeline and replacing bad datanode > 4. BlockPlacementPolicyDefault#chooseTarget will call > StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but > chooseStorageTypes return [SSD,SSD] > {code:java} > @Test > public void testAllSSDFallbackAndNonNewBlock() { > final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD); > List storageTypes = allSSD.chooseStorageTypes((short) 3, > Arrays.asList(StorageType.DISK, StorageType.SSD), > EnumSet.noneOf(StorageType.class), false); > assertEquals(2, storageTypes.size()); > assertEquals(StorageType.SSD, storageTypes.get(0)); > assertEquals(StorageType.SSD, storageTypes.get(1)); > } > {code} > 5. do numOfReplicas = requiredStorageTypes.size() and numOfReplicas is set to > 2 and choose additional two datanodes > 6. BlockPlacementPolicyDefault#chooseTarget return four datanodes to client > 7. DataStreamer#findNewDatanode find nodes.length != original.length + 1 and > throw IOException, and finally lead to write failed > {code:java} > private int findNewDatanode(final DatanodeInfo[] original > ) throws IOException { > if (nodes.length != original.length + 1) { > throw new IOException( > "Failed to replace a bad datanode on the existing pipeline " > + "due to no more good datanodes being available to try. " > + "(Nodes: current=" + Arrays.asList(nodes) > + ", original=" + Arrays.asList(original) + "). " > + "The current failed datanode replacement policy is " > + dfsClient.dtpReplaceDatanodeOnFailure > + ", and a client may configure this via '" > + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY > + "' in its configuration."); > } > for(int i = 0; i < nodes.length; i++) { > int j = 0; > for(; j < original.length && !nodes[i].equals(original[j]); j++); > if (j == original.length) { > return i; > } > } > throw new IOException("Failed: new datanode not found: nodes=" > + Arrays.asList(nodes) + ", original=" + Arrays.asList(original)); > } > {code} > client warn logs is: > {code:java} > WARN [DataStreamer for file > /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545 > block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] > org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception > java.io.IOException: Failed to replace a bad datanode on the existing > pipeline due to no more good datanodes being available to try.
(Nodes: > current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], > > DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD], > > DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD], > > DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]], > > original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], > > DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]). > The current failed datanode replacement policy is DEFAULT, and a client may > configure this via > 'dfs.client.block.write.replace-datanode-on-failure.policy' in its > configuration. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-13915: - Description: Consider following situation: 1. create a file with ALLSSD policy 2. return [SSD,SSD,DISK] due to lack of SSD space 3. client call NameNodeRpcServer#getAdditionalDatanode when recovering write pipeline and replacing bad datanode 4. BlockPlacementPolicyDefault#chooseTarget will call StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but chooseStorageTypes return [SSD,SSD] {code:java} @Test public void testAllSSDFallbackAndNonNewBlock() { final BlockStoragePolicy allSSD = POLICY_SUITE.getPolicy(ALLSSD); List storageTypes = allSSD.chooseStorageTypes((short) 3, Arrays.asList(StorageType.DISK, StorageType.SSD), EnumSet.noneOf(StorageType.class), false); assertEquals(2, storageTypes.size()); assertEquals(StorageType.SSD, storageTypes.get(0)); assertEquals(StorageType.SSD, storageTypes.get(1)); } {code} 5. do numOfReplicas = requiredStorageTypes.size() and numOfReplicas is set to 2 and choose additional two datanodes 6. BlockPlacementPolicyDefault#chooseTarget return four datanodes to client 7. DataStreamer#findNewDatanode find nodes.length != original.length + 1 and throw IOException, and finally lead to write failed {code:java} private int findNewDatanode(final DatanodeInfo[] original ) throws IOException { if (nodes.length != original.length + 1) { throw new IOException( "Failed to replace a bad datanode on the existing pipeline " + "due to no more good datanodes being available to try. " + "(Nodes: current=" + Arrays.asList(nodes) + ", original=" + Arrays.asList(original) + "). " + "The current failed datanode replacement policy is " + dfsClient.dtpReplaceDatanodeOnFailure + ", and a client may configure this via '" + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY + "' in its configuration."); } for(int i = 0; i < nodes.length; i++) { int j = 0; for(; j < original.length && !nodes[i].equals(original[j]); j++); if (j == original.length) { return i; } } throw new IOException("Failed: new datanode not found: nodes=" + Arrays.asList(nodes) + ", original=" + Arrays.asList(original)); } {code} client warn logs is: {code:java} WARN [DataStreamer for file /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545 block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD], DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD], DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]], original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration. {code} was: Consider following situation: 1. create a file with ALLSSD policy 2. return [SSD,SSD,DISK] due to lack of SSD space 3.
client call NameNodeRpcServer#getAdditionalDatanode when recovering write pipeline and replacing bad datanode 4. BlockPlacementPolicyDefault#chooseTarget will call StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but chooseStorageTypes return [SSD,SSD] 5. do numOfReplicas = requiredStorageTypes.size() and numOfReplicas is set to 2 and choose additional two datanodes 6. BlockPlacementPolicyDefault#chooseTarget return four datanodes to client 7. DataStreamer#findNewDatanode find nodes.length != original.length + 1 and throw IOException, and finally lead to write failed client warn logs is: {code:java} WARN [DataStreamer for file /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545 block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD],
[jira] [Updated] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
[ https://issues.apache.org/jira/browse/HDFS-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-13915: - Description: Consider following situation: 1. create a file with ALLSSD policy 2. return [SSD,SSD,DISK] due to lack of SSD space 3. client call NameNodeRpcServer#getAdditionalDatanode when recovering write pipeline and replacing bad datanode 4. BlockPlacementPolicyDefault#chooseTarget will call StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but chooseStorageTypes return [SSD,SSD] 5. do numOfReplicas = requiredStorageTypes.size() and numOfReplicas is set to 2 and choose additional two datanodes 6. BlockPlacementPolicyDefault#chooseTarget return four datanodes to client 7. DataStreamer#findNewDatanode find nodes.length != original.length + 1 and throw IOException, and finally lead to write failed client warn logs is: {code:java} WARN [DataStreamer for file /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545 block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD], DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD], DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]], original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration. {code} was: Consider following situation: 1. create a file with ALLSSD policy 2. return [SSD,SSD,DISK] due to lack of SSD space 3. client call NameNodeRpcServer#getAdditionalDatanode when recovering write pipeline and replacing bad datanode 4. BlockPlacementPolicyDefault#chooseTarget will call StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but chooseStorageTypes return [SSD,SSD] 5. do numOfReplicas = requiredStorageTypes.size() and numOfReplicas is set to 2 and choose additional two datanodes 6. BlockPlacementPolicyDefault#chooseTarget return four datanodes to client 7. DataStreamer#findNewDatanode find nodes.length != original.length + 1 and throw IOException, and finally lead to write failed client warn logs is: \{code:java} WARN [DataStreamer for file /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545 block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
(Nodes: current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD], DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD], DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]], original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration. {code} > replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode > returning excessive datanodeInfo > > > Key: HDFS-13915 > URL: https://issues.apache.org/jira/browse/HDFS-13915 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs > Environment: >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > > Consider following situation: > 1. create a file with ALLSSD policy > 2. return [SSD,SSD,DISK] due to lack of SSD space > 3. client call NameNodeRpcServer#getAdditionalDatanode when recovering write > pipeline and replacing bad datanode > 4. BlockPlacementPolicyDefault#chooseTarget will call > StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but > chooseStorageTypes return [SSD,SSD] > 5. do numOfReplicas =
[jira] [Created] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
Jiandan Yang created HDFS-13915: Summary: replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo Key: HDFS-13915 URL: https://issues.apache.org/jira/browse/HDFS-13915 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Environment: Reporter: Jiandan Yang Assignee: Jiandan Yang Consider following situation: 1. create a file with ALLSSD policy 2. return [SSD,SSD,DISK] due to lack of SSD space 3. client call NameNodeRpcServer#getAdditionalDatanode when recovering write pipeline and replacing bad datanode 4. BlockPlacementPolicyDefault#chooseTarget will call StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but chooseStorageTypes return [SSD,SSD] 5. do numOfReplicas = requiredStorageTypes.size() and numOfReplicas is set to 2 and choose additional two datanodes 6. BlockPlacementPolicyDefault#chooseTarget return four datanodes to client 7. DataStreamer#findNewDatanode find nodes.length != original.length + 1 and throw IOException, and finally lead to write failed client warn logs is: \{code:java} WARN [DataStreamer for file /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545 block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD], DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD], DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]], original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration. {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
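The excess in steps 4 and 5 comes from counting storage types rather than missing datanodes: after seeing [SSD,DISK], ALL_SSD still "requires" two SSDs, while the recovering pipeline only has room for one replacement node. A toy recomputation of both quantities, under the assumption that the policy simply subtracts matching storage types (illustrative, not the BlockStoragePolicy implementation):
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AllSsdFallbackDemo {
  public static void main(String[] args) {
    List<String> wanted = new ArrayList<>(Arrays.asList("SSD", "SSD", "SSD"));
    List<String> chosen = Arrays.asList("SSD", "DISK"); // fallback result of step 2

    // Remove one wanted entry per matching chosen storage type.
    for (String t : chosen) {
      wanted.remove(t); // DISK matches nothing under ALL_SSD
    }
    System.out.println("requiredStorageTypes = " + wanted);          // [SSD, SSD]
    System.out.println("numOfReplicas (step 5) = " + wanted.size()); // 2

    // What the client can actually accept: one replacement datanode.
    int needed = 3 - chosen.size();
    System.out.println("datanodes the pipeline can accept = " + needed); // 1
  }
}
{code}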
[jira] [Commented] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read
[ https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385686#comment-16385686 ] Jiandan Yang commented on HDFS-9666: - [~jzhuge] [~vinayrpet] Could you help me review it. > Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to > improve random read > - > > Key: HDFS-9666 > URL: https://issues.apache.org/jira/browse/HDFS-9666 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.6.0, 2.7.0 >Reporter: ade >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch, > HDFS-9666.002.patch, HDFS-9666.003.patch, HDFS-9666.004.patch > > > We want to improve random read performance of HDFS for HBase, so enabled the > heterogeneous storage in our cluster. But there are only ~50% of datanode & > regionserver hosts with SSD. we can set hfile with only ONE_SSD not ALL_SSD > storagepolicy and the regionserver on none-SSD host can only read the local > disk replica . So we developed this feature in hdfs client to read even > remote SSD/RAM prior to local disk replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
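The read ordering this issue proposes amounts to ranking replicas by storage medium before locality. A sketch of that comparator idea, with a made-up Replica class standing in for DatanodeInfoWithStorage (this is not the attached patch):
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class StorageTypeFirstOrdering {
  enum Storage { RAM_DISK, SSD, DISK, ARCHIVE } // declaration order = read preference

  // Hypothetical stand-in for a located replica.
  static class Replica {
    final String host; final Storage storage; final boolean local;
    Replica(String host, Storage storage, boolean local) {
      this.host = host; this.storage = storage; this.local = local;
    }
    public String toString() { return host + "/" + storage; }
  }

  public static void main(String[] args) {
    List<Replica> replicas = new ArrayList<>();
    replicas.add(new Replica("dn-local", Storage.DISK, true));
    replicas.add(new Replica("dn-remote-1", Storage.SSD, false));
    replicas.add(new Replica("dn-remote-2", Storage.DISK, false));

    // Faster media win outright; locality only breaks ties within one medium.
    replicas.sort(Comparator
        .<Replica>comparingInt(r -> r.storage.ordinal())
        .thenComparingInt(r -> r.local ? 0 : 1));

    System.out.println(replicas); // [dn-remote-1/SSD, dn-local/DISK, dn-remote-2/DISK]
  }
}
{code}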
[jira] [Commented] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read
[ https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383667#comment-16383667 ] Jiandan Yang commented on HDFS-9666: - Failed UTs are not caused by this patch; I ran these UTs successfully on my local machine. [~vinodkv] [~arpitagarwal] Please help review this patch, thanks. > Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to > improve random read > - > > Key: HDFS-9666 > URL: https://issues.apache.org/jira/browse/HDFS-9666 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.6.0, 2.7.0 >Reporter: ade >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch, > HDFS-9666.002.patch, HDFS-9666.003.patch, HDFS-9666.004.patch > > > We want to improve random read performance of HDFS for HBase, so enabled the > heterogeneous storage in our cluster. But there are only ~50% of datanode & > regionserver hosts with SSD. we can set hfile with only ONE_SSD not ALL_SSD > storagepolicy and the regionserver on none-SSD host can only read the local > disk replica . So we developed this feature in hdfs client to read even > remote SSD/RAM prior to local disk replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read
[ https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383359#comment-16383359 ] Jiandan Yang commented on HDFS-9666: - upload v4 patch: set refetchIfRequired=true when chooseDataNode in fetchBlockByteRange and fix ut error > Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to > improve random read > - > > Key: HDFS-9666 > URL: https://issues.apache.org/jira/browse/HDFS-9666 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.6.0, 2.7.0 >Reporter: ade >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch, > HDFS-9666.002.patch, HDFS-9666.003.patch, HDFS-9666.004.patch > > > We want to improve random read performance of HDFS for HBase, so enabled the > heterogeneous storage in our cluster. But there are only ~50% of datanode & > regionserver hosts with SSD. we can set hfile with only ONE_SSD not ALL_SSD > storagepolicy and the regionserver on none-SSD host can only read the local > disk replica . So we developed this feature in hdfs client to read even > remote SSD/RAM prior to local disk replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read
[ https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-9666: Attachment: HDFS-9666.004.patch > Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to > improve random read > - > > Key: HDFS-9666 > URL: https://issues.apache.org/jira/browse/HDFS-9666 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.6.0, 2.7.0 >Reporter: ade >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch, > HDFS-9666.002.patch, HDFS-9666.003.patch, HDFS-9666.004.patch > > > We want to improve random read performance of HDFS for HBase, so enabled the > heterogeneous storage in our cluster. But there are only ~50% of datanode & > regionserver hosts with SSD. we can set hfile with only ONE_SSD not ALL_SSD > storagepolicy and the regionserver on none-SSD host can only read the local > disk replica . So we developed this feature in hdfs client to read even > remote SSD/RAM prior to local disk replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read
[ https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-9666: Attachment: HDFS-9666.003.patch > Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to > improve random read > - > > Key: HDFS-9666 > URL: https://issues.apache.org/jira/browse/HDFS-9666 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.6.0, 2.7.0 >Reporter: ade >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch, > HDFS-9666.002.patch, HDFS-9666.003.patch > > > We want to improve random read performance of HDFS for HBase, so enabled the > heterogeneous storage in our cluster. But there are only ~50% of datanode & > regionserver hosts with SSD. we can set hfile with only ONE_SSD not ALL_SSD > storagepolicy and the regionserver on none-SSD host can only read the local > disk replica . So we developed this feature in hdfs client to read even > remote SSD/RAM prior to local disk replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read
[ https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383168#comment-16383168 ] Jiandan Yang commented on HDFS-9666: - fix compiler error and upload v2 patch > Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to > improve random read > - > > Key: HDFS-9666 > URL: https://issues.apache.org/jira/browse/HDFS-9666 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.6.0, 2.7.0 >Reporter: ade >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch, > HDFS-9666.002.patch > > > We want to improve random read performance of HDFS for HBase, so enabled the > heterogeneous storage in our cluster. But there are only ~50% of datanode & > regionserver hosts with SSD. we can set hfile with only ONE_SSD not ALL_SSD > storagepolicy and the regionserver on none-SSD host can only read the local > disk replica . So we developed this feature in hdfs client to read even > remote SSD/RAM prior to local disk replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read
[ https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-9666: Attachment: HDFS-9666.002.patch > Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to > improve random read > - > > Key: HDFS-9666 > URL: https://issues.apache.org/jira/browse/HDFS-9666 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.6.0, 2.7.0 >Reporter: ade >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch, > HDFS-9666.002.patch > > > We want to improve random read performance of HDFS for HBase, so enabled the > heterogeneous storage in our cluster. But there are only ~50% of datanode & > regionserver hosts with SSD. we can set hfile with only ONE_SSD not ALL_SSD > storagepolicy and the regionserver on none-SSD host can only read the local > disk replica . So we developed this feature in hdfs client to read even > remote SSD/RAM prior to local disk replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read
[ https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-9666: Status: Patch Available (was: Open) > Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to > improve random read > - > > Key: HDFS-9666 > URL: https://issues.apache.org/jira/browse/HDFS-9666 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.7.0, 2.6.0 >Reporter: ade >Assignee: ade >Priority: Major > Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch > > > We want to improve random read performance of HDFS for HBase, so enabled the > heterogeneous storage in our cluster. But there are only ~50% of datanode & > regionserver hosts with SSD. we can set hfile with only ONE_SSD not ALL_SSD > storagepolicy and the regionserver on none-SSD host can only read the local > disk replica . So we developed this feature in hdfs client to read even > remote SSD/RAM prior to local disk replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read
[ https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang reassigned HDFS-9666: --- Assignee: Jiandan Yang (was: ade) > Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to > improve random read > - > > Key: HDFS-9666 > URL: https://issues.apache.org/jira/browse/HDFS-9666 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.6.0, 2.7.0 >Reporter: ade >Assignee: Jiandan Yang >Priority: Major > Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch > > > We want to improve random read performance of HDFS for HBase, so enabled the > heterogeneous storage in our cluster. But there are only ~50% of datanode & > regionserver hosts with SSD. we can set hfile with only ONE_SSD not ALL_SSD > storagepolicy and the regionserver on none-SSD host can only read the local > disk replica . So we developed this feature in hdfs client to read even > remote SSD/RAM prior to local disk replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read
[ https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383083#comment-16383083 ] Jiandan Yang commented on HDFS-9666: - upload v1 patch based trunk > Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to > improve random read > - > > Key: HDFS-9666 > URL: https://issues.apache.org/jira/browse/HDFS-9666 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.6.0, 2.7.0 >Reporter: ade >Assignee: ade >Priority: Major > Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch > > > We want to improve random read performance of HDFS for HBase, so enabled the > heterogeneous storage in our cluster. But there are only ~50% of datanode & > regionserver hosts with SSD. we can set hfile with only ONE_SSD not ALL_SSD > storagepolicy and the regionserver on none-SSD host can only read the local > disk replica . So we developed this feature in hdfs client to read even > remote SSD/RAM prior to local disk replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read
[ https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-9666: Attachment: HDFS-9666.001.patch > Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to > improve random read > - > > Key: HDFS-9666 > URL: https://issues.apache.org/jira/browse/HDFS-9666 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.6.0, 2.7.0 >Reporter: ade >Assignee: ade >Priority: Major > Attachments: HDFS-9666.0.patch, HDFS-9666.001.patch > > > We want to improve random read performance of HDFS for HBase, so enabled the > heterogeneous storage in our cluster. But there are only ~50% of datanode & > regionserver hosts with SSD. we can set hfile with only ONE_SSD not ALL_SSD > storagepolicy and the regionserver on none-SSD host can only read the local > disk replica . So we developed this feature in hdfs client to read even > remote SSD/RAM prior to local disk replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11942) make new chooseDataNode policy work in more operation like seek, fetch
[ https://issues.apache.org/jira/browse/HDFS-11942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379902#comment-16379902 ] Jiandan Yang commented on HDFS-11942: -- [~whisper_deng] This patch is very important for HBase. Why not keep on going? > make new chooseDataNode policy work in more operation like seek, fetch > > > Key: HDFS-11942 > URL: https://issues.apache.org/jira/browse/HDFS-11942 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs-client >Affects Versions: 2.6.0, 2.7.0, 3.0.0-alpha3 >Reporter: Fangyuan Deng >Priority: Major > Fix For: 3.0.1 > > Attachments: HDFS-11942.0.patch, HDFS-11942.1.patch, > ssd-first-disable(default).png, ssd-first-enable.png > > > in default policy, if a file is ONE_SSD, client will prior read the local > disk replica rather than the remote ssd replica. > but now, the pci-e SSD and 10G ethernet make remote read SSD more faster than > the local disk. > HDFS-9666 give us a patch, but the code is not complete and not updated for > a long time. > this sub-task issue give a complete patch and > we have tested on three machines [ 32 core cpu, 128G mem , 1000M network, > 1.2T HDD, 800G SSD(intel P3600) ]. > with this feather, throughput of hbase table(ONE_SSD) is double of which > without this feather -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
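For context on the ONE_SSD setup described in this issue: a storage policy is attached per path, so an HBase HFile directory can be pinned to ONE_SSD without requiring ALL_SSD. Assuming a recent Hadoop client where FileSystem#setStoragePolicy is public, a minimal sketch (the cluster URI and path below are placeholders):
{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetOneSsdPolicy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder cluster URI; in practice taken from core-site.xml.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    // ONE_SSD: one replica on SSD, the remaining replicas on DISK.
    fs.setStoragePolicy(new Path("/hbase/data"), "ONE_SSD");
  }
}
{code}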
[jira] [Updated] (HDFS-12814) Add blockId when warning slow mirror/disk in BlockReceiver
[ https://issues.apache.org/jira/browse/HDFS-12814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-12814: - Attachment: HDFS-12814.002.patch > Add blockId when warning slow mirror/disk in BlockReceiver > -- > > Key: HDFS-12814 > URL: https://issues.apache.org/jira/browse/HDFS-12814 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Minor > Attachments: HDFS-12814.001.patch, HDFS-12814.002.patch > > > HDFS-11603 add downstream DataNodeIds and volume path. > In order to better debug, those warnning log should include blockId -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
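Since the change under review is purely about log content, it can be illustrated with a plain SLF4J warning that carries the blockId next to the fields HDFS-11603 introduced. The threshold, wording and field names below are illustrative, not the patch text:
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SlowMirrorWarnDemo {
  private static final Logger LOG = LoggerFactory.getLogger(SlowMirrorWarnDemo.class);

  static void warnIfSlow(long tookMs, long thresholdMs,
                         String downstreamDNs, long blockId) {
    if (tookMs > thresholdMs) {
      // The blockId lets the slow write be correlated with a specific block.
      LOG.warn("Slow BlockReceiver write packet to mirror took {}ms "
          + "(threshold={}ms), downstream DNs={}, blockId={}",
          tookMs, thresholdMs, downstreamDNs, blockId);
    }
  }

  public static void main(String[] args) {
    warnIfSlow(450, 300, "[dn1:50010, dn2:50010]", 1073741825L);
  }
}
{code}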
[jira] [Commented] (HDFS-12814) Add blockId when warning slow mirror/disk in BlockReceiver
[ https://issues.apache.org/jira/browse/HDFS-12814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254631#comment-16254631 ] Jiandan Yang commented on HDFS-12814: -- Thanks [~cheersyang] and [~msingh] for review. I will add the comma delimiter and upload v2 patch > Add blockId when warning slow mirror/disk in BlockReceiver > -- > > Key: HDFS-12814 > URL: https://issues.apache.org/jira/browse/HDFS-12814 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Minor > Attachments: HDFS-12814.001.patch > > > HDFS-11603 add downstream DataNodeIds and volume path. > In order to better debug, those warnning log should include blockId -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12814) Add blockId when warning slow mirror/disk in BlockReceiver
[ https://issues.apache.org/jira/browse/HDFS-12814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-12814: - Status: Patch Available (was: Open) > Add blockId when warning slow mirror/disk in BlockReceiver > -- > > Key: HDFS-12814 > URL: https://issues.apache.org/jira/browse/HDFS-12814 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Minor > Attachments: HDFS-12814.001.patch > > > HDFS-11603 add downstream DataNodeIds and volume path. > In order to better debug, those warnning log should include blockId -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12814) Add blockId when warning slow mirror/disk in BlockReceiver
[ https://issues.apache.org/jira/browse/HDFS-12814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-12814: - Attachment: HDFS-12814.001.patch > Add blockId when warning slow mirror/disk in BlockReceiver > -- > > Key: HDFS-12814 > URL: https://issues.apache.org/jira/browse/HDFS-12814 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Minor > Attachments: HDFS-12814.001.patch > > > HDFS-11603 add downstream DataNodeIds and volume path. > In order to better debug, those warnning log should include blockId -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-12814) Add blockId when warning slow mirror/disk in BlockReceiver
Jiandan Yang created HDFS-12814: Summary: Add blockId when warning slow mirror/disk in BlockReceiver Key: HDFS-12814 URL: https://issues.apache.org/jira/browse/HDFS-12814 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Reporter: Jiandan Yang Assignee: Jiandan Yang Priority: Minor HDFS-11603 added downstream DataNodeIds and the volume path. To make debugging easier, those warning logs should also include the blockId -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-12754) Lease renewal can hit a deadlock
[ https://issues.apache.org/jira/browse/HDFS-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252859#comment-16252859 ] Jiandan Yang edited comment on HDFS-12754 at 11/15/17 2:41 AM: [~xiaochen] Thank you for reviewing. {quote}The fix here is to close the output streams out of the lease renewer lock{quote} I think you may be wrong. The fix is {{LeaseRenewer#run}} does not hold {{LeaseRenewer}} object lock and {{DFSOutputStream}} object lock at the same time, removes dfsClient.closeAllFilesBeingWritten out of synchronized block. {{LeaseRenewer#run}} gets {{LeaseRenewer}} object lock and then releases, gets {{DFSOutputStream}} object lock and releases. {code:java} synchronized (this) { DFSClientFaultInjector.get().sleepWhenRenewLeaseTimeout(); dfsclientsCopy = new ArrayList<>(dfsclients); dfsclients.clear(); //Expire the current LeaseRenewer thread. emptyTime = 0; Factory.INSTANCE.remove(LeaseRenewer.this); } for (DFSClient dfsClient : dfsclientsCopy) { dfsClient.closeAllFilesBeingWritten(true); } {code} was (Author: yangjiandan): [~xiaochen] Thank you for reviewing. @ The fix here is to close the output streams out of the lease renewer lock I think you may be wrong. The fix is {{LeaseRenewer#run}} does not hold {{LeaseRenewer}} object lock and {{DFSOutputStream}} object lock at the same time, removes dfsClient.closeAllFilesBeingWritten out of synchronized block. {{LeaseRenewer#run}} gets {{LeaseRenewer}} object lock and then releases, gets {{DFSOutputStream}} object lock and releases. {code:java} synchronized (this) { DFSClientFaultInjector.get().sleepWhenRenewLeaseTimeout(); dfsclientsCopy = new ArrayList<>(dfsclients); dfsclients.clear(); //Expire the current LeaseRenewer thread. emptyTime = 0; Factory.INSTANCE.remove(LeaseRenewer.this); } for (DFSClient dfsClient : dfsclientsCopy) { dfsClient.closeAllFilesBeingWritten(true); } {code} > Lease renewal can hit a deadlock > - > > Key: HDFS-12754 > URL: https://issues.apache.org/jira/browse/HDFS-12754 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.8.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: HDFS-12754.001.patch, HDFS-12754.002.patch, > HDFS-12754.003.patch, HDFS-12754.004.patch, HDFS-12754.005.patch, > HDFS-12754.006.patch, HDFS-12754.007.patch > > > The Client and the renewer can hit a deadlock during close operation since > closeFile() reaches back to the DFSClient#removeFileBeingWritten. This is > possible if the client class close when the renewer is renewing a lease. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
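The deadlock under discussion is the classic two-lock ordering cycle: the renewer holds the LeaseRenewer monitor and then needs a stream monitor, while a closing client holds the stream monitor and reaches back into the renewer. A self-contained toy reproduction of that ordering (plain Java, not the HDFS classes; the sleeps only make the race deterministic):
{code:java}
public class LockOrderDeadlockDemo {
  static final Object renewerLock = new Object(); // stands in for LeaseRenewer
  static final Object streamLock  = new Object(); // stands in for DFSOutputStream

  public static void main(String[] args) {
    Thread renewer = new Thread(() -> {
      synchronized (renewerLock) {          // like LeaseRenewer#run
        sleep(100);
        synchronized (streamLock) {         // like DFSOutputStream#abort
          System.out.println("renewer aborted stream");
        }
      }
    }, "LeaseRenewer");

    Thread closer = new Thread(() -> {
      synchronized (streamLock) {           // like DFSOutputStream#close
        sleep(100);
        synchronized (renewerLock) {        // like endFileLease -> getLeaseRenewer
          System.out.println("closer ended lease");
        }
      }
    }, "ClientClose");

    renewer.start();
    closer.start();                         // with the sleeps, this program hangs
  }

  static void sleep(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException ignored) {}
  }
}
{code}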
[jira] [Commented] (HDFS-12754) Lease renewal can hit a deadlock
[ https://issues.apache.org/jira/browse/HDFS-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252859#comment-16252859 ] Jiandan Yang commented on HDFS-12754: -- [~xiaochen] Thank you for reviewing. The fix here is to close the output streams out of the lease renewer lock I think you may be wrong. The fix is {{LeaseRenewer#run}} does not hold {{LeaseRenewer}} object lock and {{DFSOutputStream}} object lock at the same time, removes dfsClient.closeAllFilesBeingWritten out of synchronized block. {{LeaseRenewer#run}} gets {{LeaseRenewer}} object lock and then releases, gets {{DFSOutputStream}} object lock and releases. {code:java} synchronized (this) { DFSClientFaultInjector.get().sleepWhenRenewLeaseTimeout(); dfsclientsCopy = new ArrayList<>(dfsclients); dfsclients.clear(); //Expire the current LeaseRenewer thread. emptyTime = 0; Factory.INSTANCE.remove(LeaseRenewer.this); } for (DFSClient dfsClient : dfsclientsCopy) { dfsClient.closeAllFilesBeingWritten(true); } {code} > Lease renewal can hit a deadlock > - > > Key: HDFS-12754 > URL: https://issues.apache.org/jira/browse/HDFS-12754 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.8.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: HDFS-12754.001.patch, HDFS-12754.002.patch, > HDFS-12754.003.patch, HDFS-12754.004.patch, HDFS-12754.005.patch, > HDFS-12754.006.patch, HDFS-12754.007.patch > > > The Client and the renewer can hit a deadlock during close operation since > closeFile() reaches back to the DFSClient#removeFileBeingWritten. This is > possible if the client class close when the renewer is renewing a lease. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-12754) Lease renewal can hit a deadlock
[ https://issues.apache.org/jira/browse/HDFS-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252859#comment-16252859 ] Jiandan Yang edited comment on HDFS-12754 at 11/15/17 2:35 AM: [~xiaochen] Thank you for reviewing. @ The fix here is to close the output streams out of the lease renewer lock I think you may be wrong. The fix is {{LeaseRenewer#run}} does not hold {{LeaseRenewer}} object lock and {{DFSOutputStream}} object lock at the same time, removes dfsClient.closeAllFilesBeingWritten out of synchronized block. {{LeaseRenewer#run}} gets {{LeaseRenewer}} object lock and then releases, gets {{DFSOutputStream}} object lock and releases. {code:java} synchronized (this) { DFSClientFaultInjector.get().sleepWhenRenewLeaseTimeout(); dfsclientsCopy = new ArrayList<>(dfsclients); dfsclients.clear(); //Expire the current LeaseRenewer thread. emptyTime = 0; Factory.INSTANCE.remove(LeaseRenewer.this); } for (DFSClient dfsClient : dfsclientsCopy) { dfsClient.closeAllFilesBeingWritten(true); } {code} was (Author: yangjiandan): [~xiaochen] Thank you for reviewing. @The fix here is to close the output streams out of the lease renewer lock I think you may be wrong. The fix is {{LeaseRenewer#run}} does not hold {{LeaseRenewer}} object lock and {{DFSOutputStream}} object lock at the same time, removes dfsClient.closeAllFilesBeingWritten out of synchronized block. {{LeaseRenewer#run}} gets {{LeaseRenewer}} object lock and then releases, gets {{DFSOutputStream}} object lock and releases. {code:java} synchronized (this) { DFSClientFaultInjector.get().sleepWhenRenewLeaseTimeout(); dfsclientsCopy = new ArrayList<>(dfsclients); dfsclients.clear(); //Expire the current LeaseRenewer thread. emptyTime = 0; Factory.INSTANCE.remove(LeaseRenewer.this); } for (DFSClient dfsClient : dfsclientsCopy) { dfsClient.closeAllFilesBeingWritten(true); } {code} > Lease renewal can hit a deadlock > - > > Key: HDFS-12754 > URL: https://issues.apache.org/jira/browse/HDFS-12754 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.8.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: HDFS-12754.001.patch, HDFS-12754.002.patch, > HDFS-12754.003.patch, HDFS-12754.004.patch, HDFS-12754.005.patch, > HDFS-12754.006.patch, HDFS-12754.007.patch > > > The Client and the renewer can hit a deadlock during close operation since > closeFile() reaches back to the DFSClient#removeFileBeingWritten. This is > possible if the client class close when the renewer is renewing a lease. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-12754) Lease renewal can hit a deadlock
[ https://issues.apache.org/jira/browse/HDFS-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252859#comment-16252859 ] Jiandan Yang edited comment on HDFS-12754 at 11/15/17 2:34 AM: [~xiaochen] Thank you for reviewing. @The fix here is to close the output streams out of the lease renewer lock I think you may be wrong. The fix is {{LeaseRenewer#run}} does not hold {{LeaseRenewer}} object lock and {{DFSOutputStream}} object lock at the same time, removes dfsClient.closeAllFilesBeingWritten out of synchronized block. {{LeaseRenewer#run}} gets {{LeaseRenewer}} object lock and then releases, gets {{DFSOutputStream}} object lock and releases. {code:java} synchronized (this) { DFSClientFaultInjector.get().sleepWhenRenewLeaseTimeout(); dfsclientsCopy = new ArrayList<>(dfsclients); dfsclients.clear(); //Expire the current LeaseRenewer thread. emptyTime = 0; Factory.INSTANCE.remove(LeaseRenewer.this); } for (DFSClient dfsClient : dfsclientsCopy) { dfsClient.closeAllFilesBeingWritten(true); } {code} was (Author: yangjiandan): [~xiaochen] Thank you for reviewing. The fix here is to close the output streams out of the lease renewer lock I think you may be wrong. The fix is {{LeaseRenewer#run}} does not hold {{LeaseRenewer}} object lock and {{DFSOutputStream}} object lock at the same time, removes dfsClient.closeAllFilesBeingWritten out of synchronized block. {{LeaseRenewer#run}} gets {{LeaseRenewer}} object lock and then releases, gets {{DFSOutputStream}} object lock and releases. {code:java} synchronized (this) { DFSClientFaultInjector.get().sleepWhenRenewLeaseTimeout(); dfsclientsCopy = new ArrayList<>(dfsclients); dfsclients.clear(); //Expire the current LeaseRenewer thread. emptyTime = 0; Factory.INSTANCE.remove(LeaseRenewer.this); } for (DFSClient dfsClient : dfsclientsCopy) { dfsClient.closeAllFilesBeingWritten(true); } {code} > Lease renewal can hit a deadlock > - > > Key: HDFS-12754 > URL: https://issues.apache.org/jira/browse/HDFS-12754 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.8.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: HDFS-12754.001.patch, HDFS-12754.002.patch, > HDFS-12754.003.patch, HDFS-12754.004.patch, HDFS-12754.005.patch, > HDFS-12754.006.patch, HDFS-12754.007.patch > > > The Client and the renewer can hit a deadlock during close operation since > closeFile() reaches back to the DFSClient#removeFileBeingWritten. This is > possible if the client class close when the renewer is renewing a lease. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12754) Lease renewal can hit a deadlock
[ https://issues.apache.org/jira/browse/HDFS-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243633#comment-16243633 ] Jiandan Yang commented on HDFS-12754: --
The v2 patch fixes the deadlock and looks good to me. [~elgoiri] [~cheersyang] [~kihwal], do you have any opinions?
> Lease renewal can hit a deadlock
> -
>
> Key: HDFS-12754
> URL: https://issues.apache.org/jira/browse/HDFS-12754
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.8.1
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Attachments: HDFS-12754.001.patch, HDFS-12754.002.patch
>
>
> The Client and the renewer can hit a deadlock during the close operation since closeFile() reaches back to DFSClient#removeFileBeingWritten. This is possible if the client calls close while the renewer is renewing a lease.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12754) Lease renewal can hit a deadlock
[ https://issues.apache.org/jira/browse/HDFS-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237063#comment-16237063 ] Jiandan Yang commented on HDFS-12754: --
Hi [~kshukla], thanks for your patch. This patch resolves the deadlock, but I have two comments:
1. DFSOutputStream#clientClosed can be removed; I think it duplicates DFSOutputStream#closed.
2. When LeaseRenewer#run(int) catches a SocketTimeoutException and runs the following code after clearing LeaseRenewer#dfsclients, a DFSClient may create a file and add a new entry to LeaseRenewer#dfsclients (see the sketch after this message):
{code:java}
for (DFSClient dfsClient : dfsclientsCopy) {
  dfsClient.closeAllFilesBeingWritten(true);
}
{code}
> Lease renewal can hit a deadlock
> -
>
> Key: HDFS-12754
> URL: https://issues.apache.org/jira/browse/HDFS-12754
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.8.1
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Major
> Attachments: HDFS-12754.001.patch
>
>
> The Client and the renewer can hit a deadlock during the close operation since closeFile() reaches back to DFSClient#removeFileBeingWritten. This is possible if the client calls close while the renewer is renewing a lease.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
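Editorial aside: a runnable Java sketch of the window the second comment describes (all names hypothetical; this is not HDFS code). A client that registers between the clear and the close is left behind in a renewer that already considers itself expired:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class RenewerRaceSketch {
  static final List<String> clients = new ArrayList<>();
  static final Object renewerLock = new Object();

  public static void main(String[] args) throws InterruptedException {
    clients.add("client-1");
    CountDownLatch cleared = new CountDownLatch(1);

    Thread renewer = new Thread(() -> {
      List<String> copy;
      synchronized (renewerLock) {
        copy = new ArrayList<>(clients);
        clients.clear();            // renewer considers itself expired here
      }
      cleared.countDown();          // lock released: the window opens
      try { Thread.sleep(100); } catch (InterruptedException ignored) { }
      copy.forEach(c -> System.out.println("closing " + c));
    });

    Thread app = new Thread(() -> {
      try { cleared.await(); } catch (InterruptedException ignored) { }
      synchronized (renewerLock) {
        clients.add("client-2");    // lands in an already-expired renewer
      }
    });

    renewer.start();
    app.start();
    renewer.join();
    app.join();
    System.out.println("left behind after expiry: " + clients);
  }
}
{code}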
[jira] [Updated] (HDFS-12757) DeadLock Happened Between DFSOutputStream and LeaseRenewer when LeaseRenewer#renew SocketTimeException
[ https://issues.apache.org/jira/browse/HDFS-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-12757: - Attachment: HDFS-12757.patch
Uploaded a UT to reproduce the issue.
> DeadLock Happened Between DFSOutputStream and LeaseRenewer when LeaseRenewer#renew SocketTimeException
> --
>
> Key: HDFS-12757
> URL: https://issues.apache.org/jira/browse/HDFS-12757
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Reporter: Jiandan Yang
> Priority: Major
> Attachments: HDFS-12757.patch
>
>
> The Java stack is:
> {code:java}
> Found one Java-level deadlock:
> =
> "Topology-2 (735/2000)":
>   waiting to lock monitor 0x7fff4523e6e8 (object 0x0005d3521078, a org.apache.hadoop.hdfs.client.impl.LeaseRenewer),
>   which is held by "LeaseRenewer:admin@na61storage"
> "LeaseRenewer:admin@na61storage":
>   waiting to lock monitor 0x7fff5d41e838 (object 0x0005ec0dfa88, a org.apache.hadoop.hdfs.DFSOutputStream),
>   which is held by "Topology-2 (735/2000)"
> Java stack information for the threads listed above:
> ===
> "Topology-2 (735/2000)":
>   at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.addClient(LeaseRenewer.java:227)
>   - waiting to lock <0x0005d3521078> (a org.apache.hadoop.hdfs.client.impl.LeaseRenewer)
>   at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.getInstance(LeaseRenewer.java:86)
>   at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:467)
>   at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:479)
>   at org.apache.hadoop.hdfs.DFSOutputStream.setClosed(DFSOutputStream.java:776)
>   at org.apache.hadoop.hdfs.DFSOutputStream.closeThreads(DFSOutputStream.java:791)
>   at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:848)
>   - locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
>   at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:805)
>   - locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
>   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>   at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
>   ..
> "LeaseRenewer:admin@na61storage":
>   at org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:750)
>   - waiting to lock <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
>   at org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:586)
>   at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.run(LeaseRenewer.java:453)
>   - locked <0x0005d3521078> (a org.apache.hadoop.hdfs.client.impl.LeaseRenewer)
>   at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.access$700(LeaseRenewer.java:76)
>   at org.apache.hadoop.hdfs.client.impl.LeaseRenewer$1.run(LeaseRenewer.java:310)
>   at java.lang.Thread.run(Thread.java:834)
> Found 1 deadlock.
> {code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12757) DeadLock Happened Between DFSOutputStream and LeaseRenewer when LeaseRenewer#renew SocketTimeException
[ https://issues.apache.org/jira/browse/HDFS-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-12757: - Description: Java stack is : {code:java} Found one Java-level deadlock: = "Topology-2 (735/2000)": waiting to lock monitor 0x7fff4523e6e8 (object 0x0005d3521078, a org.apache.hadoop.hdfs.client.impl.LeaseRenewer), which is held by "LeaseRenewer:admin@na61storage" "LeaseRenewer:admin@na61storage": waiting to lock monitor 0x7fff5d41e838 (object 0x0005ec0dfa88, a org.apache.hadoop.hdfs.DFSOutputStream), which is held by "Topology-2 (735/2000)" Java stack information for the threads listed above: === "Topology-2 (735/2000)": at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.addClient(LeaseRenewer.java:227) - waiting to lock <0x0005d3521078> (a org.apache.hadoop.hdfs.client.impl.LeaseRenewer) at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.getInstance(LeaseRenewer.java:86) at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:467) at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:479) at org.apache.hadoop.hdfs.DFSOutputStream.setClosed(DFSOutputStream.java:776) at org.apache.hadoop.hdfs.DFSOutputStream.closeThreads(DFSOutputStream.java:791) at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:848) - locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:805) - locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106) .. "LeaseRenewer:admin@na61storage": at org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:750) - waiting to lock <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream) at org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:586) at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.run(LeaseRenewer.java:453) - locked <0x0005d3521078> (a org.apache.hadoop.hdfs.client.impl.LeaseRenewer) at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.access$700(LeaseRenewer.java:76) at org.apache.hadoop.hdfs.client.impl.LeaseRenewer$1.run(LeaseRenewer.java:310) at java.lang.Thread.run(Thread.java:834) Found 1 deadlock. 
{code} was: Java stack is : Found one Java-level deadlock: = "Topology-2 (735/2000)": waiting to lock monitor 0x7fff4523e6e8 (object 0x0005d3521078, a org.apache.hadoop.hdfs.client.impl.LeaseRenewer), which is held by "LeaseRenewer:admin@na61storage" "LeaseRenewer:admin@na61storage": waiting to lock monitor 0x7fff5d41e838 (object 0x0005ec0dfa88, a org.apache.hadoop.hdfs.DFSOutputStream), which is held by "Topology-2 (735/2000)" Java stack information for the threads listed above: === "Topology-2 (735/2000)": at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.addClient(LeaseRenewer.java:227) - waiting to lock <0x0005d3521078> (a org.apache.hadoop.hdfs.client.impl.LeaseRenewer) at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.getInstance(LeaseRenewer.java:86) at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:467) at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:479) at org.apache.hadoop.hdfs.DFSOutputStream.setClosed(DFSOutputStream.java:776) at org.apache.hadoop.hdfs.DFSOutputStream.closeThreads(DFSOutputStream.java:791) at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:848) - locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:805) - locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106) .. "LeaseRenewer:admin@na61storage": at org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:750) - waiting to lock <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream) at org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:586) at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.run(LeaseRenewer.java:453) - locked <0x0005d3521078> (a org.apache.hadoop.hdfs.client.impl.LeaseRenewer) at
[jira] [Created] (HDFS-12757) DeadLock Happened Between DFSOutputStream and LeaseRenewer when LeaseRenewer#renew SocketTimeException
Jiandan Yang created HDFS-12757:
Summary: DeadLock Happened Between DFSOutputStream and LeaseRenewer when LeaseRenewer#renew SocketTimeException
Key: HDFS-12757
URL: https://issues.apache.org/jira/browse/HDFS-12757
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs-client
Reporter: Jiandan Yang
Priority: Major

Java stack is:
Found one Java-level deadlock:
=
"Topology-2 (735/2000)":
  waiting to lock monitor 0x7fff4523e6e8 (object 0x0005d3521078, a org.apache.hadoop.hdfs.client.impl.LeaseRenewer),
  which is held by "LeaseRenewer:admin@na61storage"
"LeaseRenewer:admin@na61storage":
  waiting to lock monitor 0x7fff5d41e838 (object 0x0005ec0dfa88, a org.apache.hadoop.hdfs.DFSOutputStream),
  which is held by "Topology-2 (735/2000)"
Java stack information for the threads listed above:
===
"Topology-2 (735/2000)":
  at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.addClient(LeaseRenewer.java:227)
  - waiting to lock <0x0005d3521078> (a org.apache.hadoop.hdfs.client.impl.LeaseRenewer)
  at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.getInstance(LeaseRenewer.java:86)
  at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:467)
  at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:479)
  at org.apache.hadoop.hdfs.DFSOutputStream.setClosed(DFSOutputStream.java:776)
  at org.apache.hadoop.hdfs.DFSOutputStream.closeThreads(DFSOutputStream.java:791)
  at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:848)
  - locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
  at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:805)
  - locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
  ..
"LeaseRenewer:admin@na61storage":
  at org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:750)
  - waiting to lock <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
  at org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:586)
  at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.run(LeaseRenewer.java:453)
  - locked <0x0005d3521078> (a org.apache.hadoop.hdfs.client.impl.LeaseRenewer)
  at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.access$700(LeaseRenewer.java:76)
  at org.apache.hadoop.hdfs.client.impl.LeaseRenewer$1.run(LeaseRenewer.java:310)
  at java.lang.Thread.run(Thread.java:834)
Found 1 deadlock.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
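Editorial aside: for readers unfamiliar with the shape of this report, the following standalone Java program (purely illustrative, unrelated to the HDFS code paths) reproduces the same two-monitor lock inversion deterministically; running it and taking a jstack yields an equivalent "Found one Java-level deadlock" report:
{code:java}
import java.util.concurrent.CountDownLatch;

// Two monitors acquired in opposite orders by two threads -- the same shape
// as the LeaseRenewer/DFSOutputStream deadlock above.
public class LockInversionDemo {
  static final Object renewerLock = new Object(); // stands in for LeaseRenewer
  static final Object streamLock = new Object();  // stands in for DFSOutputStream

  public static void main(String[] args) {
    CountDownLatch bothHeld = new CountDownLatch(2);

    new Thread(() -> {                 // like close(): stream -> renewer
      synchronized (streamLock) {
        bothHeld.countDown();
        await(bothHeld);
        synchronized (renewerLock) { }
      }
    }, "closer").start();

    new Thread(() -> {                 // like LeaseRenewer#run: renewer -> stream
      synchronized (renewerLock) {
        bothHeld.countDown();
        await(bothHeld);
        synchronized (streamLock) { }
      }
    }, "renewer").start();
  }

  static void await(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}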
[jira] [Created] (HDFS-12748) Standby NameNode memory leak when accessing webhdfs GETHOMEDIRECTORY
Jiandan Yang created HDFS-12748:
Summary: Standby NameNode memory leak when accessing webhdfs GETHOMEDIRECTORY
Key: HDFS-12748
URL: https://issues.apache.org/jira/browse/HDFS-12748
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs
Affects Versions: 2.8.2
Reporter: Jiandan Yang

In our production environment, the standby NN often does full GC. Through MAT we found that the largest object is FileSystem$Cache, which contains 7,844,890 DistributedFileSystem instances. By viewing the call hierarchy of FileSystem.get(), I found that only NamenodeWebHdfsMethods#get calls FileSystem.get(). I don't know why a different DistributedFileSystem is created every time instead of getting a FileSystem from the cache.
{code:java}
case GETHOMEDIRECTORY: {
  final String js = JsonUtil.toJsonString("Path",
      FileSystem.get(conf != null ? conf : new Configuration())
          .getHomeDirectory().toUri().getPath());
  return Response.ok(js).type(MediaType.APPLICATION_JSON).build();
}
{code}
When we close the FileSystem in the GETHOMEDIRECTORY handler, the NN no longer does full GC.
{code:java}
case GETHOMEDIRECTORY: {
  FileSystem fs = null;
  try {
    fs = FileSystem.get(conf != null ? conf : new Configuration());
    final String js = JsonUtil.toJsonString("Path",
        fs.getHomeDirectory().toUri().getPath());
    return Response.ok(js).type(MediaType.APPLICATION_JSON).build();
  } finally {
    if (fs != null) {
      fs.close();
    }
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
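Editorial aside: since org.apache.hadoop.fs.FileSystem implements java.io.Closeable, the proposed try/finally fix can equivalently be written with try-with-resources. A minimal sketch follows (the JSON and HTTP response wrapping of the real handler is omitted; the helper name is hypothetical); closing a FileSystem also evicts it from FileSystem$Cache, which is what stops the cache from growing here:
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HomeDirLookup {
  // Hypothetical helper showing the same close-after-use shape.
  static String homeDirPath(Configuration conf) throws IOException {
    try (FileSystem fs =
        FileSystem.get(conf != null ? conf : new Configuration())) {
      return fs.getHomeDirectory().toUri().getPath();
    }
  }
}
{code}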
[jira] [Updated] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode
[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-7060: Attachment: HDFS-7060.003.patch > Avoid taking locks when sending heartbeats from the DataNode > > > Key: HDFS-7060 > URL: https://issues.apache.org/jira/browse/HDFS-7060 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Xinwei Qin > Labels: BB2015-05-TBR > Attachments: HDFS-7060-002.patch, HDFS-7060.000.patch, > HDFS-7060.001.patch, HDFS-7060.003.patch, complete_failed_qps.png, > sendHeartbeat.png > > > We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} > when the DN is under heavy load of writes: > {noformat} >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91) > - locked <0x000780612fd8> (a java.lang.Object) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: RUNNABLE > at java.io.UnixFileSystem.createFileExclusively(Native Method) > at java.io.File.createNewFile(File.java:1006) > at > org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753) > - locked <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode
[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-7060: Attachment: (was: HDFS-7060-003.patch) > Avoid taking locks when sending heartbeats from the DataNode > > > Key: HDFS-7060 > URL: https://issues.apache.org/jira/browse/HDFS-7060 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Xinwei Qin > Labels: BB2015-05-TBR > Attachments: HDFS-7060-002.patch, HDFS-7060.000.patch, > HDFS-7060.001.patch, complete_failed_qps.png, sendHeartbeat.png > > > We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} > when the DN is under heavy load of writes: > {noformat} >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91) > - locked <0x000780612fd8> (a java.lang.Object) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: RUNNABLE > at java.io.UnixFileSystem.createFileExclusively(Native Method) > at java.io.File.createNewFile(File.java:1006) > at > org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753) > - locked <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode
[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221789#comment-16221789 ] Jiandan Yang edited comment on HDFS-7060 at 10/30/17 2:55 AM: ---
We have deployed Hadoop with the HDFS-7060.003.patch patch and finished the rolling upgrade at 22. The sendHeartbeat latency decreased significantly, and the QPS of complete_failed also went down, as the following graphs show.
DataNode sendHeartbeat latency:
!sendHeartbeat.png|datanode_sendHeartbeat_latency!
Namenode complete failed QPS
!complete_failed_qps.png|complete_failed_qps!

was (Author: yangjiandan):
We have deployed Hadoop with the HDFS-7060-003.patch patch; sendHeartbeat latency decreased significantly, and the QPS of complete_failed also went down.
DataNode sendHeartbeat latency:
!sendHeartbeat.png|datanode_sendHeartbeat_latency!
Namenode complete failed QPS
!complete_failed_qps.png|complete_failed_qps!

> Avoid taking locks when sending heartbeats from the DataNode
> 
>
> Key: HDFS-7060
> URL: https://issues.apache.org/jira/browse/HDFS-7060
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Haohui Mai
> Assignee: Xinwei Qin
> Labels: BB2015-05-TBR
> Attachments: HDFS-7060-002.patch, HDFS-7060-003.patch, HDFS-7060.000.patch, HDFS-7060.001.patch, complete_failed_qps.png, sendHeartbeat.png
>
>
> We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} when the DN is under heavy load of writes:
> {noformat}
> java.lang.Thread.State: BLOCKED (on object monitor)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
> - waiting to lock <0x000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
> - locked <0x000780612fd8> (a java.lang.Object)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
> at java.lang.Thread.run(Thread.java:744)
> java.lang.Thread.State: BLOCKED (on object monitor)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
> - waiting to lock <0x000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
> - locked <0x000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode
[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221789#comment-16221789 ] Jiandan Yang edited comment on HDFS-7060 at 10/27/17 6:38 AM: --- We have deploy Hadoop including patch of HDFS-7060-003.patch, sendHeartbeat latency decreased significantly, and QPS of complete_failed also down. DataNode sendHeartbeat latency: !sendHeartbeat.png|datanode_sendHeartbeat_latency! Namenode complete failed QPS !complete_failed_qps.png|complete_failed_qps! was (Author: yangjiandan): We have deploy Hadoop including patch of HDFS-7060-003.patch, sendHeartbeat latency decreased significantly, and QPS of complete_failed also down !sendHeartbeat.png|datanode_sendHeartbeat_latency! !complete_failed_qps.png|complete_failed_qps! > Avoid taking locks when sending heartbeats from the DataNode > > > Key: HDFS-7060 > URL: https://issues.apache.org/jira/browse/HDFS-7060 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Xinwei Qin > Labels: BB2015-05-TBR > Attachments: HDFS-7060-002.patch, HDFS-7060-003.patch, > HDFS-7060.000.patch, HDFS-7060.001.patch, complete_failed_qps.png, > sendHeartbeat.png > > > We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} > when the DN is under heavy load of writes: > {noformat} >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91) > - locked <0x000780612fd8> (a java.lang.Object) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: RUNNABLE > at java.io.UnixFileSystem.createFileExclusively(Native Method) > at java.io.File.createNewFile(File.java:1006) > at > org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753) > - locked 
<0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For
[jira] [Updated] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode
[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-7060: Attachment: complete_failed_qps.png sendHeartbeat.png > Avoid taking locks when sending heartbeats from the DataNode > > > Key: HDFS-7060 > URL: https://issues.apache.org/jira/browse/HDFS-7060 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Xinwei Qin > Labels: BB2015-05-TBR > Attachments: HDFS-7060-002.patch, HDFS-7060-003.patch, > HDFS-7060.000.patch, HDFS-7060.001.patch, complete_failed_qps.png, > sendHeartbeat.png > > > We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} > when the DN is under heavy load of writes: > {noformat} >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91) > - locked <0x000780612fd8> (a java.lang.Object) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: RUNNABLE > at java.io.UnixFileSystem.createFileExclusively(Native Method) > at java.io.File.createNewFile(File.java:1006) > at > org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753) > - locked <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode
[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221789#comment-16221789 ] Jiandan Yang edited comment on HDFS-7060 at 10/27/17 6:36 AM: --- We have deploy Hadoop including patch of HDFS-7060-003.patch, sendHeartbeat latency decreased significantly, and QPS of complete_failed also down !sendHeartbeat.png|datanode_sendHeartbeat_latency! !complete_failed_qps.png|complete_failed_qps! was (Author: yangjiandan): We have deploy Hadoop including patch of HDFS-7060-003.patch, sendHeartbeat latency decreased significantly, and QPS of complete_failed also down !sendHeartbeat.jpg|datanode_sendHeartbeat_latency! !complete_failed_qps.jpg|complete_failed_qps! > Avoid taking locks when sending heartbeats from the DataNode > > > Key: HDFS-7060 > URL: https://issues.apache.org/jira/browse/HDFS-7060 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Xinwei Qin > Labels: BB2015-05-TBR > Attachments: HDFS-7060-002.patch, HDFS-7060-003.patch, > HDFS-7060.000.patch, HDFS-7060.001.patch > > > We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} > when the DN is under heavy load of writes: > {noformat} >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91) > - locked <0x000780612fd8> (a java.lang.Object) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: RUNNABLE > at java.io.UnixFileSystem.createFileExclusively(Native Method) > at java.io.File.createNewFile(File.java:1006) > at > org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753) > - locked <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode
[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221789#comment-16221789 ] Jiandan Yang commented on HDFS-7060: - We have deploy Hadoop including patch of HDFS-7060-003.patch, sendHeartbeat latency decreased significantly, and QPS of complete_failed also down !sendHeartbeat.jpg|datanode_sendHeartbeat_latency! !complete_failed_qps.jpg|complete_failed_qps! > Avoid taking locks when sending heartbeats from the DataNode > > > Key: HDFS-7060 > URL: https://issues.apache.org/jira/browse/HDFS-7060 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Xinwei Qin > Labels: BB2015-05-TBR > Attachments: HDFS-7060-002.patch, HDFS-7060-003.patch, > HDFS-7060.000.patch, HDFS-7060.001.patch > > > We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} > when the DN is under heavy load of writes: > {noformat} >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91) > - locked <0x000780612fd8> (a java.lang.Object) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: RUNNABLE > at java.io.UnixFileSystem.createFileExclusively(Native Method) > at java.io.File.createNewFile(File.java:1006) > at > org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753) > - locked <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode
[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221765#comment-16221765 ] Jiandan Yang commented on HDFS-7060: - upload HDFS-7060-003.patch > Avoid taking locks when sending heartbeats from the DataNode > > > Key: HDFS-7060 > URL: https://issues.apache.org/jira/browse/HDFS-7060 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Xinwei Qin > Labels: BB2015-05-TBR > Attachments: HDFS-7060-002.patch, HDFS-7060-003.patch, > HDFS-7060.000.patch, HDFS-7060.001.patch > > > We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} > when the DN is under heavy load of writes: > {noformat} >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91) > - locked <0x000780612fd8> (a java.lang.Object) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: RUNNABLE > at java.io.UnixFileSystem.createFileExclusively(Native Method) > at java.io.File.createNewFile(File.java:1006) > at > org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753) > - locked <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode
[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-7060: Attachment: HDFS-7060-003.patch > Avoid taking locks when sending heartbeats from the DataNode > > > Key: HDFS-7060 > URL: https://issues.apache.org/jira/browse/HDFS-7060 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Xinwei Qin > Labels: BB2015-05-TBR > Attachments: HDFS-7060-002.patch, HDFS-7060-003.patch, > HDFS-7060.000.patch, HDFS-7060.001.patch > > > We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} > when the DN is under heavy load of writes: > {noformat} >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91) > - locked <0x000780612fd8> (a java.lang.Object) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: RUNNABLE > at java.io.UnixFileSystem.createFileExclusively(Native Method) > at java.io.File.createNewFile(File.java:1006) > at > org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753) > - locked <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at 
java.lang.Thread.run(Thread.java:744) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode
[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212333#comment-16212333 ] Jiandan Yang commented on HDFS-7060: -
[~xinwei] [~brahmareddy] [~jojochuang] We encountered the same problem (branch-2.8.2): BPServiceActor#offerService was blocked because sendHeartBeat waited for the FSDataset lock, blockReceivedAndDeleted was delayed by about 60s, and eventually the client could not close the file and threw the exception "Unable to close file because the last blockxxx does not have enough number of replicas". I think HDFS-7060 solves our problem very well. Does this patch have any problem? Why has it not been merged into trunk?
> Avoid taking locks when sending heartbeats from the DataNode
> 
>
> Key: HDFS-7060
> URL: https://issues.apache.org/jira/browse/HDFS-7060
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Haohui Mai
> Assignee: Xinwei Qin
> Labels: BB2015-05-TBR
> Attachments: HDFS-7060-002.patch, HDFS-7060.000.patch, HDFS-7060.001.patch
>
>
> We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} when the DN is under heavy load of writes:
> {noformat}
> java.lang.Thread.State: BLOCKED (on object monitor)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
> - waiting to lock <0x000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
> - locked <0x000780612fd8> (a java.lang.Object)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
> at java.lang.Thread.run(Thread.java:744)
> java.lang.Thread.State: BLOCKED (on object monitor)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
> - waiting to lock <0x000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
> - locked <0x000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
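Editorial aside: the HDFS-7060 patch itself is not reproduced in this thread. Purely as an illustrative sketch of the general technique such a change relies on — keeping the statistics the heartbeat needs in lock-free counters so the heartbeat path never waits on the monitor that block creation holds — here is a minimal Java example (class and method names are hypothetical, not the real FsVolumeImpl API):
{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: usage is tracked in an AtomicLong, so a heartbeat
// thread reading getDfsUsed() never blocks behind a writer that holds
// the dataset monitor.
class VolumeStats {
  private final AtomicLong dfsUsed = new AtomicLong();

  // Called from the write path (possibly under the dataset lock).
  void addBytesWritten(long bytes) {
    dfsUsed.addAndGet(bytes);
  }

  // Called from the heartbeat path -- no monitor acquisition needed.
  long getDfsUsed() {
    return dfsUsed.get();
  }
}
{code}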
[jira] [Updated] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets
[ https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated HDFS-12638: - Status: Patch Available (was: Open)
> NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 2.8.2
> Reporter: Jiandan Yang
> Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> The active NameNode exited due to an NPE. I can confirm that the BlockCollection passed in when creating ReplicationWork is null, but I do not know why it is null. By reviewing the history I found that [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check of whether the BlockCollection is null.
> The NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
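Editorial aside: the attached branch-2.8.2 patch is not reproduced in this thread. As a hedged, self-contained sketch of the kind of defensive check the stack trace points at (hypothetical types; the real HDFS-12638 change may differ):
{code:java}
// A replication work item whose owning file may have been deleted between
// scheduling and execution must re-check before dereferencing.
class ReplicationWorkSketch {
  interface BlockCollection { String getName(); }

  private final BlockCollection bc; // may be null if the file was deleted

  ReplicationWorkSketch(BlockCollection bc) {
    this.bc = bc;
  }

  void chooseTargets() {
    if (bc == null) {
      return; // skip stale work instead of throwing NullPointerException
    }
    System.out.println("choosing targets for " + bc.getName());
  }
}
{code}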