[jira] [Commented] (HDFS-16875) Erasure Coding: data access proxy to allow old clients to read EC data
[ https://issues.apache.org/jira/browse/HDFS-16875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654209#comment-17654209 ]

Jing Zhao commented on HDFS-16875:
----------------------------------

Posted the design doc for the EC access proxy.

> Erasure Coding: data access proxy to allow old clients to read EC data
> ----------------------------------------------------------------------
>
>          Key: HDFS-16875
>          URL: https://issues.apache.org/jira/browse/HDFS-16875
>      Project: Hadoop HDFS
>   Issue Type: New Feature
>   Components: ec, erasure-coding
>     Reporter: Jing Zhao
>     Assignee: Jing Zhao
>     Priority: Major
>  Attachments: Erasure Coding Access Proxy.pdf
>
> Erasure Coding is only supported by Hadoop 3, while many production deployments still depend on Hadoop 2. Upgrading the whole data tech stack to Hadoop 3 may involve significant migration effort and even reliability risks, considering the incompatibilities between the two major Hadoop releases as well as potential undiscovered issues hidden in newer releases. Therefore, we need a solution, with the least migration effort and risk, that adopts Erasure Coding for cost efficiency while still allowing HDFS clients on old versions (Hadoop 2.x) to access EC data transparently.
>
> Internally we have developed an EC access proxy which translates EC data for old clients. We also extend the NameNode RPC so it can recognize HDFS clients with/without EC support and redirect the old clients to the proxy. With the proxy we set up separate Erasure Coding clusters storing hundreds of PB of data, while leaving other production clusters and all upper-layer applications untouched.
>
> Considering that some changes are made to fundamental components of HDFS (e.g., the client-NN RPC header), we do not aim to merge the change to trunk. We will use this ticket to share the design and implementation details (including the code) and collect feedback. We may use a separate GitHub repo to open-source the implementation later.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
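The redirect mechanism described above can be sketched as follows. This is an illustrative, hedged sketch, not the actual internal implementation: the class, method, and field names (`EcRedirectSketch`, `resolveReadEndpoint`, `MIN_EC_CLIENT_VERSION`) are hypothetical, standing in for the real client-NN RPC header check.

```java
import java.util.List;

// Hypothetical sketch: the NameNode inspects the client's advertised version
// (carried in the extended RPC header) and, for clients without EC support
// reading an erasure-coded file, substitutes the EC access proxy's address.
public class EcRedirectSketch {
    // Assumption for this sketch: Hadoop 3.x clients understand EC striping.
    static final int MIN_EC_CLIENT_VERSION = 3;

    public static String resolveReadEndpoint(int clientMajorVersion,
                                             boolean fileIsErasureCoded,
                                             List<String> dataNodes,
                                             String proxyAddress) {
        // Old (Hadoop 2.x) clients cannot decode striped blocks, so they are
        // redirected to the proxy, which reads and translates the EC data.
        if (fileIsErasureCoded && clientMajorVersion < MIN_EC_CLIENT_VERSION) {
            return proxyAddress;
        }
        // EC-aware clients (or replicated files) read from DataNodes directly.
        return dataNodes.isEmpty() ? proxyAddress : dataNodes.get(0);
    }
}
```

The point of the design is that the decision lives entirely on the server side, so upper-layer applications using old clients need no change at all.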
[jira] [Updated] (HDFS-16875) Erasure Coding: data access proxy to allow old clients to read EC data
[ https://issues.apache.org/jira/browse/HDFS-16875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-16875:
-----------------------------
    Attachment: Erasure Coding Access Proxy.pdf

> Erasure Coding: data access proxy to allow old clients to read EC data
> ----------------------------------------------------------------------
>
>          Key: HDFS-16875
>          URL: https://issues.apache.org/jira/browse/HDFS-16875
>      Project: Hadoop HDFS
>   Issue Type: New Feature
>   Components: ec, erasure-coding
>     Reporter: Jing Zhao
>     Assignee: Jing Zhao
>     Priority: Major
>  Attachments: Erasure Coding Access Proxy.pdf
>
> Erasure Coding is only supported by Hadoop 3, while many production deployments still depend on Hadoop 2. Upgrading the whole data tech stack to Hadoop 3 may involve significant migration effort and even reliability risks, considering the incompatibilities between the two major Hadoop releases as well as potential undiscovered issues hidden in newer releases. Therefore, we need a solution, with the least migration effort and risk, that adopts Erasure Coding for cost efficiency while still allowing HDFS clients on old versions (Hadoop 2.x) to access EC data transparently.
>
> Internally we have developed an EC access proxy which translates EC data for old clients. We also extend the NameNode RPC so it can recognize HDFS clients with/without EC support and redirect the old clients to the proxy. With the proxy we set up separate Erasure Coding clusters storing hundreds of PB of data, while leaving other production clusters and all upper-layer applications untouched.
>
> Considering that some changes are made to fundamental components of HDFS (e.g., the client-NN RPC header), we do not aim to merge the change to trunk. We will use this ticket to share the design and implementation details (including the code) and collect feedback. We may use a separate GitHub repo to open-source the implementation later.
[jira] [Created] (HDFS-16875) Erasure Coding: data access proxy to allow old clients to read EC data
Jing Zhao created HDFS-16875:
-----------------------------

         Summary: Erasure Coding: data access proxy to allow old clients to read EC data
             Key: HDFS-16875
             URL: https://issues.apache.org/jira/browse/HDFS-16875
         Project: Hadoop HDFS
      Issue Type: New Feature
      Components: ec, erasure-coding
        Reporter: Jing Zhao
        Assignee: Jing Zhao

Erasure Coding is only supported by Hadoop 3, while many production deployments still depend on Hadoop 2. Upgrading the whole data tech stack to Hadoop 3 may involve significant migration effort and even reliability risks, considering the incompatibilities between the two major Hadoop releases as well as potential undiscovered issues hidden in newer releases. Therefore, we need a solution, with the least migration effort and risk, that adopts Erasure Coding for cost efficiency while still allowing HDFS clients on old versions (Hadoop 2.x) to access EC data transparently.

Internally we have developed an EC access proxy which translates EC data for old clients. We also extend the NameNode RPC so it can recognize HDFS clients with/without EC support and redirect the old clients to the proxy. With the proxy we set up separate Erasure Coding clusters storing hundreds of PB of data, while leaving other production clusters and all upper-layer applications untouched.

Considering that some changes are made to fundamental components of HDFS (e.g., the client-NN RPC header), we do not aim to merge the change to trunk. We will use this ticket to share the design and implementation details (including the code) and collect feedback. We may use a separate GitHub repo to open-source the implementation later.
[jira] [Updated] (HDFS-16874) Improve DataNode decommission for Erasure Coding
[ https://issues.apache.org/jira/browse/HDFS-16874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-16874:
-----------------------------
    Description:
There are a couple of issues with the current DataNode decommission implementation when large amounts of Erasure Coding data are involved in the data re-replication/reconstruction process:
# Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode decommission if the internal EC block is still available. While this strategy reduces the CPU cost caused by EC reconstruction, it greatly limits the overall data recovery bandwidth, since there is only a single DataNode as the source. As high-density HDD hosts become more widely used by HDFS, especially along with Erasure Coding for the warm-data use case, this becomes a big pain for cluster management. In our production, decommissioning a DataNode storing several hundred TB of EC data can take several days. HDFS-16613 provides an optimization based on the existing mechanism, but more fundamentally we may want to allow EC reconstruction for DataNode decommission so as to achieve much larger recovery bandwidth.
# The semantics of the existing EC reconstruction command (the BlockECReconstructionInfoProto msg sent from NN to DN) are not clear. The existing reconstruction command depends on the holes in the srcNodes/liveBlockIndices arrays to indicate the target internal blocks for recovery, while the holes can also be caused by the corresponding DataNode being too busy to serve as a reconstruction source. As a result, the later DataNode-side reconstruction may not be consistent with the original intention. E.g., if the index of the missing block is 6, and the DataNode storing block 0 is busy, the src nodes in the reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 8]. The target DataNode may reconstruct internal block 0 instead of 6.

HDFS-16566 addresses this issue by indicating an excluded-index list. More fundamentally, we can follow the same path but go a step further by adding an optional field that explicitly indicates the target block indices in the command protobuf msg. With that extension the DataNode no longer uses the holes in the src node array to "guess" the reconstruction targets.

Internally we have developed and applied fixes following the above directions. We have seen significant improvement (100+ times speedup) in DataNode decommission speed for EC data. The clearer semantics of the reconstruction command protobuf msg also help prevent potential data corruption during EC reconstruction. We will use this ticket to track similar fixes for the Apache releases.
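The ambiguity described in item 2 can be made concrete with a small sketch. This is illustrative code, not Hadoop's actual `BlockECReconstructionInfoProto` handling; the class and method names are hypothetical. It assumes an RS(6,3) layout with internal block indices 0 through 8.

```java
// Hypothetical sketch of the legacy DataNode-side inference: the targets are
// "guessed" as every index absent from liveBlockIndices, so an index that is
// merely busy is indistinguishable from one that is actually missing.
public class EcTargetSketch {
    static final int TOTAL_BLOCKS = 9; // RS(6,3): 6 data + 3 parity

    public static int[] guessTargets(int[] liveBlockIndices) {
        boolean[] live = new boolean[TOTAL_BLOCKS];
        for (int i : liveBlockIndices) {
            live[i] = true;
        }
        // Every hole in the live-index array is treated as a recovery target.
        return java.util.stream.IntStream.range(0, TOTAL_BLOCKS)
                .filter(i -> !live[i])
                .toArray();
    }
}
```

With the example from the description (block 6 missing, block 0 busy, so liveBlockIndices covers [1, 2, 3, 4, 5, 7, 8]), this guess yields both 0 and 6 as candidate targets, and the DataNode may reconstruct 0 instead of 6. An explicit optional `targetIndices` field in the command would carry `[6]` and remove the guesswork entirely.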
[jira] [Created] (HDFS-16874) Improve DataNode decommission for Erasure Coding
Jing Zhao created HDFS-16874:
-----------------------------

         Summary: Improve DataNode decommission for Erasure Coding
             Key: HDFS-16874
             URL: https://issues.apache.org/jira/browse/HDFS-16874
         Project: Hadoop HDFS
      Issue Type: Improvement
      Components: ec, erasure-coding
        Reporter: Jing Zhao
        Assignee: Jing Zhao

There are a couple of issues with the current DataNode decommission implementation when large amounts of Erasure Coding data are involved in the data re-replication/reconstruction process:
# Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode decommission if the internal EC block is still available. While this strategy reduces the CPU cost caused by EC reconstruction, it greatly limits the overall data recovery bandwidth, since there is only a single DataNode as the source. As high-density HDD hosts become more widely used by HDFS, especially along with Erasure Coding for the warm-data use case, this becomes a big pain for cluster management. In our production, decommissioning a DataNode storing several hundred TB of EC data can take several days. HDFS-16613 provides an optimization based on the existing mechanism, but more fundamentally we may want to allow EC reconstruction for DataNode decommission so as to achieve much larger recovery bandwidth.
# The semantics of the existing EC reconstruction command (the BlockECReconstructionInfoProto msg sent from NN to DN) are not clear. The existing reconstruction command depends on the holes in the srcNodes/liveBlockIndices arrays to indicate the target internal blocks for recovery, while the holes can also be caused by the corresponding DataNode being too busy to serve as a reconstruction source. As a result, the later DataNode-side reconstruction may not be consistent with the original intention. E.g., if the index of the missing block is 6, and the DataNode storing block 0 is busy, the src nodes in the reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 8]. The target DataNode may reconstruct internal block 0 instead of 6.

HDFS-16566 addresses this issue by indicating an excluded-index list. More fundamentally, we can follow the same path but go a step further by adding an optional field that explicitly indicates the target block indices in the command protobuf msg. With that extension the DataNode no longer uses the holes in the src node array to "guess" the reconstruction targets.

Internally we have developed and applied fixes following the above directions. We have seen significant improvement (100+ times speedup) in DataNode decommission speed for EC data. The clearer semantics of the reconstruction command protobuf msg also help prevent potential data corruption during EC reconstruction. We will use this ticket to track similar fixes for the Apache releases.
[jira] [Commented] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads
[ https://issues.apache.org/jira/browse/HDFS-16422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503858#comment-17503858 ]

Jing Zhao commented on HDFS-16422:
----------------------------------

It looks like we already have r/w lock protection in AbstractNativeRawDecoder and its subclasses (NativeRSRawDecoder and NativeXORRawDecoder). Does that mean the extra protection is only necessary for the other decoder implementations (such as RSRawDecoder)?

HADOOP-15499 used a r/w lock to replace the original object monitor (i.e. synchronized) so as to improve performance. Now it looks like we're adding "synchronized" back to the APIs defined in the parent class. I guess instead of updating the decode APIs in RawErasureDecoder, we may want to fix only the subclasses without lock protection. What do you think, [~weichiu] [~cndaimin]?

> Fix thread safety of EC decoding during concurrent preads
> ---------------------------------------------------------
>
>              Key: HDFS-16422
>              URL: https://issues.apache.org/jira/browse/HDFS-16422
>          Project: Hadoop HDFS
>       Issue Type: Bug
>       Components: dfsclient, ec, erasure-coding
> Affects Versions: 3.3.0, 3.3.1
>         Reporter: daimin
>         Assignee: daimin
>         Priority: Critical
>           Labels: pull-request-available
>          Fix For: 3.4.0, 3.2.3, 3.3.3
>
>       Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> Reading data on an erasure-coded file with missing replicas (internal blocks of a block group) causes online reconstruction: the data units are read and decoded into the missing target data. Each DFSStripedInputStream object has a RawErasureDecoder object, and when we do preads concurrently, RawErasureDecoder.decode is invoked concurrently too. RawErasureDecoder.decode is not thread safe, and as a result we occasionally get wrong data from pread.
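The locking trade-off discussed in the comment can be sketched in miniature. This is illustrative only, not the actual RawErasureDecoder API: the class name and the shared `scratch` buffer are hypothetical stand-ins for the mutable decoder state that makes concurrent `decode` calls unsafe.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: put the lock in the subclass that owns the shared
// mutable state, rather than synchronizing the API in the parent class and
// penalizing subclasses (like the native decoders) that already protect
// themselves with a r/w lock.
public class LockedDecoderSketch {
    private final ReentrantLock decodeLock = new ReentrantLock();
    // Shared mutable scratch state: the reason decode() is not thread safe.
    private final int[] scratch = new int[16];

    public int decode(int input) {
        decodeLock.lock();
        try {
            scratch[0] = input;    // concurrent callers would trample this buffer
            return scratch[0] * 2; // stand-in for the real decode computation
        } finally {
            decodeLock.unlock();
        }
    }
}
```

The alternative (a decoder instance per stream, or per call) avoids locking entirely at the cost of extra allocation; either way the invariant is the same: one thread in `decode`'s scratch state at a time.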
[jira] [Commented] (HDFS-16283) RBF: improve renewLease() to call only a specific NameNode rather than make fan-out calls
[ https://issues.apache.org/jira/browse/HDFS-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436168#comment-17436168 ]

Jing Zhao commented on HDFS-16283:
----------------------------------

Looking into the code, it seems that besides the router-side performance issue, the whole lease mechanism may need to be updated to support the HDFS router. Currently the DFSClient uses a map (INodeID, DFSOutputStream) to track all the being-written files. The assumption is that all the being-written files are in the same nameservice, thus there is no INodeID conflict. Now with router support, we may have two files belonging to two different nameservices sharing the same INodeID (though the possibility is very low in production). So theoretically we should update the being-written-file map to ((nameservice, INodeID) -> DFSOutputStream). I understand the concern that with the router we do not want the client to know about individual nameservices; it would be better if we can still hide them. We can discuss different solutions in this ticket. In summary, I guess we need a new mechanism to align the current INode-ID-based lease renewal approach with the new router architecture.

> RBF: improve renewLease() to call only a specific NameNode rather than make fan-out calls
> -----------------------------------------------------------------------------------------
>
>              Key: HDFS-16283
>              URL: https://issues.apache.org/jira/browse/HDFS-16283
>          Project: Hadoop HDFS
>       Issue Type: Sub-task
>       Components: rbf
>         Reporter: Aihua Xu
>         Assignee: Aihua Xu
>         Priority: Major
>           Labels: pull-request-available
>       Time Spent: 1.5h
> Remaining Estimate: 0h
>
> Currently renewLease() against a router fans out to all the NameNodes. Since the renewLease() call is so frequent, if one of the NameNodes is slow, the router queues are eventually blocked by renewLease() calls, degrading the router.
> We will make a change on the client side to keep track of the NameNode ID in addition to the current fileId, so routers understand which NameNode the client is renewing its lease against.
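The composite-key idea from the comment above can be sketched as follows. This is an illustrative sketch, not the DFSClient's actual data structure; `FileKey`, `track`, and the use of a String to stand in for DFSOutputStream are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: key the being-written-file map by (nameservice, inodeId)
// instead of inodeId alone, so two files from different nameservices that
// happen to share an INode ID no longer collide in the lease-renewal tracking.
public class LeaseKeySketch {
    record FileKey(String nameservice, long inodeId) {}

    private final Map<FileKey, String> filesBeingWritten = new HashMap<>();

    public void track(String nameservice, long inodeId, String stream) {
        filesBeingWritten.put(new FileKey(nameservice, inodeId), stream);
    }

    public int size() {
        return filesBeingWritten.size();
    }
}
```

With a plain `Map<Long, …>` the second `track` call for the same INode ID would silently overwrite the first; the composite key keeps both entries, at the cost of the client now knowing the nameservice, which is exactly the tension the comment raises.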
[jira] [Commented] (HDFS-16268) Balancer stuck when moving striped blocks due to NPE
[ https://issues.apache.org/jira/browse/HDFS-16268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17428552#comment-17428552 ]

Jing Zhao commented on HDFS-16268:
----------------------------------

I've committed the fix. Thanks for the contribution, [~LeonG]!

> Balancer stuck when moving striped blocks due to NPE
> ----------------------------------------------------
>
>              Key: HDFS-16268
>              URL: https://issues.apache.org/jira/browse/HDFS-16268
>          Project: Hadoop HDFS
>       Issue Type: Bug
>       Components: balancer mover, erasure-coding
> Affects Versions: 3.2.2
>         Reporter: Leon Gao
>         Assignee: Leon Gao
>         Priority: Major
>           Labels: pull-request-available
>       Time Spent: 40m
> Remaining Estimate: 0h
>
> {code:java}
> 21/10/11 06:11:26 WARN balancer.Dispatcher: Dispatcher thread failed
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.markMovedIfGoodBlock(Dispatcher.java:289)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.chooseBlockAndProxy(Dispatcher.java:272)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.access$2500(Dispatcher.java:236)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$Source.chooseNextMove(Dispatcher.java:899)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$Source.dispatchBlocks(Dispatcher.java:958)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$Source.access$3300(Dispatcher.java:757)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$2.run(Dispatcher.java:1226)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> Due to the NPE in the middle, pending moves are left in the queue, so the balancer will be stuck forever.
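The failure mode in the trace above (an unchecked null killing the dispatcher thread and stranding pending moves) suggests the general shape of the fix. This is only a minimal hedged sketch of that pattern, not the actual Dispatcher code; the class and method names are hypothetical.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: when choosing the next block to move, skip and discard
// null entries (e.g. a striped block whose group info could not be resolved)
// instead of dereferencing them and letting a NullPointerException kill the
// dispatcher thread, which leaves pending moves queued forever.
public class SafeChooseSketch {
    public static String chooseNext(List<String> candidates) {
        for (Iterator<String> it = candidates.iterator(); it.hasNext(); ) {
            String block = it.next();
            if (block == null) {
                it.remove(); // drop the bad entry so it cannot stall the queue
                continue;
            }
            return block;
        }
        return null; // nothing movable right now
    }
}
```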
[jira] [Resolved] (HDFS-16268) Balancer stuck when moving striped blocks due to NPE
[ https://issues.apache.org/jira/browse/HDFS-16268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao resolved HDFS-16268.
------------------------------
    Fix Version/s: 3.4.0
     Hadoop Flags: Reviewed
       Resolution: Fixed

> Balancer stuck when moving striped blocks due to NPE
> ----------------------------------------------------
>
>              Key: HDFS-16268
>              URL: https://issues.apache.org/jira/browse/HDFS-16268
>          Project: Hadoop HDFS
>       Issue Type: Bug
>       Components: balancer mover, erasure-coding
> Affects Versions: 3.2.2
>         Reporter: Leon Gao
>         Assignee: Leon Gao
>         Priority: Major
>           Labels: pull-request-available
>          Fix For: 3.4.0
>
>       Time Spent: 40m
> Remaining Estimate: 0h
>
> {code:java}
> 21/10/11 06:11:26 WARN balancer.Dispatcher: Dispatcher thread failed
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.markMovedIfGoodBlock(Dispatcher.java:289)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.chooseBlockAndProxy(Dispatcher.java:272)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.access$2500(Dispatcher.java:236)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$Source.chooseNextMove(Dispatcher.java:899)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$Source.dispatchBlocks(Dispatcher.java:958)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$Source.access$3300(Dispatcher.java:757)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$2.run(Dispatcher.java:1226)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> Due to the NPE in the middle, pending moves are left in the queue, so the balancer will be stuck forever.
[jira] [Resolved] (HDFS-10648) Expose Balancer metrics through Metrics2
[ https://issues.apache.org/jira/browse/HDFS-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao resolved HDFS-10648.
------------------------------
    Fix Version/s: 3.4.0
     Hadoop Flags: Reviewed
       Resolution: Resolved

Thanks for the work, [~LeonG]! I've committed the diff.

> Expose Balancer metrics through Metrics2
> ----------------------------------------
>
>              Key: HDFS-10648
>              URL: https://issues.apache.org/jira/browse/HDFS-10648
>          Project: Hadoop HDFS
>       Issue Type: New Feature
>       Components: balancer mover, metrics
>         Reporter: Mark Wagner
>         Assignee: Leon Gao
>         Priority: Major
>           Labels: metrics, pull-request-available
>          Fix For: 3.4.0
>
>       Time Spent: 1.5h
> Remaining Estimate: 0h
>
> The Balancer currently prints progress information to the console. For deployments that run the balancer frequently, it would be helpful to collect those metrics for publishing to the available sinks.
[jira] [Resolved] (HDFS-16224) testBalancerWithObserverWithFailedNode times out
[ https://issues.apache.org/jira/browse/HDFS-16224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao resolved HDFS-16224.
------------------------------
    Fix Version/s: 3.4.0
     Hadoop Flags: Reviewed
       Resolution: Fixed

I've committed the fix. Thanks [~LeonG]!

> testBalancerWithObserverWithFailedNode times out
> ------------------------------------------------
>
>              Key: HDFS-16224
>              URL: https://issues.apache.org/jira/browse/HDFS-16224
>          Project: Hadoop HDFS
>       Issue Type: Test
>       Components: test
>         Reporter: Leon Gao
>         Assignee: Leon Gao
>         Priority: Trivial
>           Labels: pull-request-available
>          Fix For: 3.4.0
>
>       Time Spent: 50m
> Remaining Estimate: 0h
>
> testBalancerWithObserverWithFailedNode fails intermittently.
>
> It seems this is because the datanodes cannot shut down while they are still retrying against the failed observer.
>
> Jenkins report:
>
> [ERROR] testBalancerWithObserverWithFailedNode(org.apache.hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes)  Time elapsed: 180.144 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 180000 milliseconds
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1252)
>         at java.lang.Thread.join(Thread.java:1326)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.join(BPServiceActor.java:632)
>         at org.apache.hadoop.hdfs.server.datanode.BPOfferService.join(BPOfferService.java:360)
>         at org.apache.hadoop.hdfs.server.datanode.BlockPoolManager.shutDownAll(BlockPoolManager.java:119)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdown(DataNode.java:2169)
>         at org.apache.hadoop.hdfs.MiniDFSCluster.shutdownDataNode(MiniDFSCluster.java:2166)
>         at org.apache.hadoop.hdfs.MiniDFSCluster.shutdownDataNodes(MiniDFSCluster.java:2156)
>         at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2135)
>         at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2109)
>         at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2102)
>         at org.apache.hadoop.hdfs.qjournal.MiniQJMHACluster.shutdown(MiniQJMHACluster.java:189)
>         at org.apache.hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes.testBalancerWithObserver(TestBalancerWithHANameNodes.java:240)
>         at org.apache.hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes.testBalancerWithObserverWithFailedNode(TestBalancerWithHANameNodes.java:197)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>         at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>         at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>         at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>         at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
>         at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.lang.Thread.run(Thread.java:748)
[jira] [Commented] (HDFS-16133) Support refresh of IP addresses behind DNS for clients
[ https://issues.apache.org/jira/browse/HDFS-16133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384672#comment-17384672 ]

Jing Zhao commented on HDFS-16133:
----------------------------------

Sure, done

> Support refresh of IP addresses behind DNS for clients
> ------------------------------------------------------
>
> Key: HDFS-16133
> URL: https://issues.apache.org/jira/browse/HDFS-16133
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Srinidhi V K
> Priority: Minor
>
> Support for using a single DNS name for clients was added as part of
> HDFS-14118. The Java client does the resolution once and caches it, which
> causes a problem whenever a node is added or removed behind the DNS name.
> The idea of this task is to handle this scenario and refresh the IP
> addresses automatically in the Java client.
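The refresh described above can be sketched with plain `java.net` (a hypothetical helper for illustration, not the actual HDFS client code): re-resolve the DNS name on demand and report whether the set of addresses behind it changed, instead of trusting the first cached lookup forever.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper illustrating the idea behind HDFS-16133: re-resolve a
// DNS name and detect membership changes behind it.
public class DnsRefresher {
    private final String host;
    private Set<String> lastResolved = new HashSet<>();

    public DnsRefresher(String host) {
        this.host = host;
    }

    // Re-resolves the name; returns true if the address set changed since the
    // previous call (always true on the first call, since nothing was cached).
    public boolean refresh() throws UnknownHostException {
        Set<String> current = new HashSet<>();
        for (InetAddress addr : InetAddress.getAllByName(host)) {
            current.add(addr.getHostAddress());
        }
        boolean changed = !current.equals(lastResolved);
        lastResolved = current;
        return changed;
    }

    public Set<String> addresses() {
        return lastResolved;
    }
}
```

A client could call `refresh()` periodically, or on connection failure, and rebuild its proxy list only when a change is reported.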
[jira] [Resolved] (HDFS-15842) HDFS mover to emit metrics
[ https://issues.apache.org/jira/browse/HDFS-15842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao resolved HDFS-15842.
------------------------------
Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Resolution: Fixed

+1. Thank you for the contribution, [~LeonG]!

> HDFS mover to emit metrics
> --------------------------
>
> Key: HDFS-15842
> URL: https://issues.apache.org/jira/browse/HDFS-15842
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: balancer & mover
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> We can emit metrics through metrics2 when running the HDFS mover, which can
> help to monitor the progress and tune mover parameters.
[jira] [Resolved] (HDFS-15781) Add metrics for how blocks are moved in replaceBlock
[ https://issues.apache.org/jira/browse/HDFS-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao resolved HDFS-15781.
------------------------------
Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Resolution: Fixed

I've committed the patch. Thank you for the contribution, [~LeonG]!

> Add metrics for how blocks are moved in replaceBlock
> ----------------------------------------------------
>
> Key: HDFS-15781
> URL: https://issues.apache.org/jira/browse/HDFS-15781
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Minor
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> We can add some metrics to track how the blocks are being moved, to get a
> sense of the locality of movements:
> * How many blocks are copied to the local host?
> * How many blocks are moved to a local disk through a hardlink?
> * How many blocks are copied out of the host?
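The three cases listed in the description boil down to per-kind counters. A toy, stdlib-only sketch (the class, enum, and method names are made up; the real patch uses Hadoop's metrics2 framework on the DataNode):

```java
import java.util.EnumMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative counters for how a replica arrived during replaceBlock,
// mirroring the three locality cases above. Not the actual DataNode code.
public class BlockMoveMetrics {
    public enum MoveKind { COPY_FROM_LOCAL_HOST, HARDLINK_SAME_HOST, COPY_FROM_REMOTE_HOST }

    private final EnumMap<MoveKind, AtomicLong> counters = new EnumMap<>(MoveKind.class);

    public BlockMoveMetrics() {
        for (MoveKind k : MoveKind.values()) {
            counters.put(k, new AtomicLong());
        }
    }

    // Called once per completed block move with the kind that was used.
    public void record(MoveKind kind) {
        counters.get(kind).incrementAndGet();
    }

    public long count(MoveKind kind) {
        return counters.get(kind).get();
    }
}
```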
[jira] [Updated] (HDFS-15683) Allow configuring DISK/ARCHIVE capacity for individual volumes
[ https://issues.apache.org/jira/browse/HDFS-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-15683:
-----------------------------
Release Note: Add a new configuration "dfs.datanode.same-disk-tiering.capacity-ratio.percentage" to allow admins to configure capacity for individual volumes on the same mount.

> Allow configuring DISK/ARCHIVE capacity for individual volumes
> --------------------------------------------------------------
>
> Key: HDFS-15683
> URL: https://issues.apache.org/jira/browse/HDFS-15683
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 4h
> Remaining Estimate: 0h
>
> This is a follow-up task for https://issues.apache.org/jira/browse/HDFS-15548
> In case the datanode disks are not uniform, we should allow admins to
> configure the capacity for individual volumes on top of the default one.
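The capacity-ratio setting above is, at its core, a split of one mount's capacity between the DISK and ARCHIVE volumes that share it. A toy sketch of that arithmetic (class and method names are made up; the real logic lives in the DataNode volume code and honors the named configuration key):

```java
// Illustrative arithmetic only, not HDFS code: divide a single mount's
// capacity between a DISK volume and an ARCHIVE volume by a configured ratio.
public class CapacitySplit {
    // diskRatio = fraction of the mount reserved for the DISK volume, in (0, 1).
    // Returns {diskCapacityBytes, archiveCapacityBytes}.
    public static long[] split(long mountCapacityBytes, double diskRatio) {
        if (diskRatio <= 0 || diskRatio >= 1) {
            throw new IllegalArgumentException("ratio must be in (0, 1)");
        }
        long disk = (long) (mountCapacityBytes * diskRatio);
        // The remainder goes to ARCHIVE, so the two volumes never double-count.
        return new long[] { disk, mountCapacityBytes - disk };
    }
}
```

The key property the sketch preserves is that the two volume capacities always sum to the mount capacity, which is what lets the datanode usage report stay consistent.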
[jira] [Resolved] (HDFS-15683) Allow configuring DISK/ARCHIVE capacity for individual volumes
[ https://issues.apache.org/jira/browse/HDFS-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao resolved HDFS-15683.
------------------------------
Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Resolution: Fixed

> Allow configuring DISK/ARCHIVE capacity for individual volumes
> --------------------------------------------------------------
>
> Key: HDFS-15683
> URL: https://issues.apache.org/jira/browse/HDFS-15683
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 4h
> Remaining Estimate: 0h
>
> This is a follow-up task for https://issues.apache.org/jira/browse/HDFS-15548
> In case the datanode disks are not uniform, we should allow admins to
> configure the capacity for individual volumes on top of the default one.
[jira] [Commented] (HDFS-15683) Allow configuring DISK/ARCHIVE capacity for individual volumes
[ https://issues.apache.org/jira/browse/HDFS-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281326#comment-17281326 ]

Jing Zhao commented on HDFS-15683:
----------------------------------

+1. I've committed the patch. Thank you for the contribution, [~LeonG]!

BTW, the javac warnings in the builds look unrelated. It reported "generated 14 new + 580 unchanged - 14 fixed = 594 total (was 594)", and the warnings are not caused by this PR. But we can use this chance to fix these warnings. Please file a new jira to do that, @LeonGao91

> Allow configuring DISK/ARCHIVE capacity for individual volumes
> --------------------------------------------------------------
>
> Key: HDFS-15683
> URL: https://issues.apache.org/jira/browse/HDFS-15683
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
> Labels: pull-request-available
> Time Spent: 4h
> Remaining Estimate: 0h
>
> This is a follow-up task for https://issues.apache.org/jira/browse/HDFS-15548
> In case that the datanode disks are not unified, we should allow admins to
> configure capacity for individual volumes on top of the default one.
[jira] [Comment Edited] (HDFS-15683) Allow configuring DISK/ARCHIVE capacity for individual volumes
[ https://issues.apache.org/jira/browse/HDFS-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281326#comment-17281326 ]

Jing Zhao edited comment on HDFS-15683 at 2/8/21, 7:35 PM:
-----------------------------------------------------------

+1. I've committed the patch. Thank you for the contribution, [~LeonG]!

BTW, the javac warnings in the builds look unrelated. It reported "generated 14 new + 580 unchanged - 14 fixed = 594 total (was 594)", and the warnings are not caused by this PR. But we can use this chance to fix these warnings. Please file a new jira to do that, [~LeonG]

was (Author: jingzhao):
+1. I've committed the patch. Thank you for the contribution, [~LeonG]!

BTW, the javac warnings in the builds look unrelated. It reported "generated 14 new + 580 unchanged - 14 fixed = 594 total (was 594)", and the warnings are not caused by this PR. But we can use this chance to fix these warnings. Please file a new jira to do that, @LeonGao91

> Allow configuring DISK/ARCHIVE capacity for individual volumes
> --------------------------------------------------------------
>
> Key: HDFS-15683
> URL: https://issues.apache.org/jira/browse/HDFS-15683
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
> Labels: pull-request-available
> Time Spent: 4h
> Remaining Estimate: 0h
>
> This is a follow-up task for https://issues.apache.org/jira/browse/HDFS-15548
> In case that the datanode disks are not unified, we should allow admins to
> configure capacity for individual volumes on top of the default one.
[jira] [Resolved] (HDFS-15549) Use Hardlink to move replica between DISK and ARCHIVE storage if on same filesystem mount
[ https://issues.apache.org/jira/browse/HDFS-15549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao resolved HDFS-15549.
------------------------------
Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Resolution: Fixed

+1. Thank you for the contribution, [~LeonG]!

> Use Hardlink to move replica between DISK and ARCHIVE storage if on same
> filesystem mount
> -------------------------------------------------------------------------
>
> Key: HDFS-15549
> URL: https://issues.apache.org/jira/browse/HDFS-15549
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 3h
> Remaining Estimate: 0h
>
> When moving blocks between DISK/ARCHIVE, we should prefer the volume on the
> same underlying filesystem and use "rename" instead of "copy" to save IO.
[jira] [Updated] (HDFS-15549) Use Hardlink to move replica between DISK and ARCHIVE storage if on same filesystem mount
[ https://issues.apache.org/jira/browse/HDFS-15549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-15549:
-----------------------------
Summary: Use Hardlink to move replica between DISK and ARCHIVE storage if on same filesystem mount (was: Improve DISK/ARCHIVE movement if they are on same filesystem)

> Use Hardlink to move replica between DISK and ARCHIVE storage if on same
> filesystem mount
> -------------------------------------------------------------------------
>
> Key: HDFS-15549
> URL: https://issues.apache.org/jira/browse/HDFS-15549
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
> Labels: pull-request-available
> Time Spent: 3h
> Remaining Estimate: 0h
>
> When moving blocks between DISK/ARCHIVE, we should prefer the volume on the
> same underlying filesystem and use "rename" instead of "copy" to save IO.
[jira] [Commented] (HDFS-15549) Improve DISK/ARCHIVE movement if they are on same filesystem
[ https://issues.apache.org/jira/browse/HDFS-15549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261690#comment-17261690 ]

Jing Zhao commented on HDFS-15549:
----------------------------------

Thank you for working on this, [~LeonG]! I went through the current patch and it looks good to me in general. Please feel free to upload the complete version when it's ready.

> Improve DISK/ARCHIVE movement if they are on same filesystem
> ------------------------------------------------------------
>
> Key: HDFS-15549
> URL: https://issues.apache.org/jira/browse/HDFS-15549
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
> Labels: pull-request-available
> Time Spent: 50m
> Remaining Estimate: 0h
>
> When moving blocks between DISK/ARCHIVE, we should prefer the volume on the
> same underlying filesystem and use "rename" instead of "copy" to save IO.
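The link-vs-copy decision in this issue can be sketched with plain `java.nio.file` (illustrative only; the real DataNode code operates on replica files and volume references): try a hard link first, which moves the replica with no data IO when source and destination are on the same filesystem, and fall back to a byte copy otherwise.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper illustrating the HDFS-15549 optimization.
public class ReplicaMover {
    public static void move(Path src, Path dst) throws IOException {
        try {
            // Same filesystem: the link shares the data blocks, so no bytes
            // are rewritten. Deleting the source completes the "move".
            Files.createLink(dst, src);
            Files.delete(src);
        } catch (IOException | UnsupportedOperationException e) {
            // Different filesystem (or hard links unsupported): full copy.
            Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
            Files.delete(src);
        }
    }
}
```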
[jira] [Commented] (HDFS-14904) Add Option to let Balancer prefer highly utilized nodes in each iteration
[ https://issues.apache.org/jira/browse/HDFS-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242816#comment-17242816 ]

Jing Zhao commented on HDFS-14904:
----------------------------------

+1. I've committed the change. Thank you for the contribution, [~LeonG]!

> Add Option to let Balancer prefer highly utilized nodes in each iteration
> -------------------------------------------------------------------------
>
> Key: HDFS-14904
> URL: https://issues.apache.org/jira/browse/HDFS-14904
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: balancer & mover
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> Normally the most important purpose of the HDFS balancer is to reduce the
> top used nodes, to prevent datanode usage from being too high.
> Currently, the balancer picks source nodes almost randomly regardless of
> usage, which makes it slow to bring down the top used datanodes in the
> cluster when there are fewer underutilized nodes in the cluster (consider
> expansion).
> We can add an option to prefer the top used nodes first in each iteration,
> as suggested in HDFS-14894.
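The preference described above amounts to ordering candidate sources by utilization instead of picking them randomly. A toy sketch (names are made up; the real Balancer works with `DatanodeStorageReport`s and dispatch logic):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative source ordering for the HDFS-14904 option: highest-used
// datanodes come first, so each balancer iteration drains them first.
public class SourcePicker {
    // usedRatio maps datanode name -> used/capacity in [0, 1].
    public static List<String> orderByUtilization(Map<String, Double> usedRatio) {
        return usedRatio.entrySet().stream()
            .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

With the option off, the balancer keeps its old near-random pick; with it on, an ordering like this brings the hottest nodes down faster when few underutilized targets exist.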
[jira] [Resolved] (HDFS-14904) Add Option to let Balancer prefer highly utilized nodes in each iteration
[ https://issues.apache.org/jira/browse/HDFS-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao resolved HDFS-14904.
------------------------------
Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Resolution: Fixed

> Add Option to let Balancer prefer highly utilized nodes in each iteration
> -------------------------------------------------------------------------
>
> Key: HDFS-14904
> URL: https://issues.apache.org/jira/browse/HDFS-14904
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: balancer & mover
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> Normally the most important purpose of the HDFS balancer is to reduce the
> top used nodes, to prevent datanode usage from being too high.
> Currently, the balancer picks source nodes almost randomly regardless of
> usage, which makes it slow to bring down the top used datanodes in the
> cluster when there are fewer underutilized nodes in the cluster (consider
> expansion).
> We can add an option to prefer the top used nodes first in each iteration,
> as suggested in HDFS-14894.
[jira] [Updated] (HDFS-14904) Add Option to let Balancer prefer highly utilized nodes in each iteration
[ https://issues.apache.org/jira/browse/HDFS-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-14904:
-----------------------------
Summary: Add Option to let Balancer prefer highly utilized nodes in each iteration (was: Option to let Balancer prefer top used nodes in each iteration)

> Add Option to let Balancer prefer highly utilized nodes in each iteration
> -------------------------------------------------------------------------
>
> Key: HDFS-14904
> URL: https://issues.apache.org/jira/browse/HDFS-14904
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: balancer & mover
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> Normally the most important purpose of the HDFS balancer is to reduce the
> top used nodes, to prevent datanode usage from being too high.
> Currently, the balancer picks source nodes almost randomly regardless of
> usage, which makes it slow to bring down the top used datanodes in the
> cluster when there are fewer underutilized nodes in the cluster (consider
> expansion).
> We can add an option to prefer the top used nodes first in each iteration,
> as suggested in HDFS-14894.
[jira] [Resolved] (HDFS-15548) Allow configuring DISK/ARCHIVE storage types on same device mount
[ https://issues.apache.org/jira/browse/HDFS-15548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao resolved HDFS-15548.
------------------------------
Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Resolution: Fixed

I've committed the change. Thank you for the great work, [~LeonG]! Thanks for the review, [~hexiaoqiao]!

> Allow configuring DISK/ARCHIVE storage types on same device mount
> -----------------------------------------------------------------
>
> Key: HDFS-15548
> URL: https://issues.apache.org/jira/browse/HDFS-15548
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 9.5h
> Remaining Estimate: 0h
>
> We can allow configuring DISK/ARCHIVE storage types on the same device
> mount, on two separate directories.
> Users should be able to configure the capacity for each. Also, the datanode
> usage report should report stats correctly.
[jira] [Commented] (HDFS-15548) Allow configuring DISK/ARCHIVE storage types on same device mount
[ https://issues.apache.org/jira/browse/HDFS-15548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227065#comment-17227065 ]

Jing Zhao commented on HDFS-15548:
----------------------------------

The current PR looks good to me. Do you also want to take another look, [~hexiaoqiao]?

> Allow configuring DISK/ARCHIVE storage types on same device mount
> -----------------------------------------------------------------
>
> Key: HDFS-15548
> URL: https://issues.apache.org/jira/browse/HDFS-15548
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
> Labels: pull-request-available
> Time Spent: 8.5h
> Remaining Estimate: 0h
>
> We can allow configuring DISK/ARCHIVE storage types on the same device
> mount, on two separate directories.
> Users should be able to configure the capacity for each. Also, the datanode
> usage report should report stats correctly.
[jira] [Comment Edited] (HDFS-11797) BlockManager#createLocatedBlocks() can throw ArrayIndexOutofBoundsException when corrupt replicas are inconsistent
[ https://issues.apache.org/jira/browse/HDFS-11797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042052#comment-16042052 ]

Jing Zhao edited comment on HDFS-11797 at 6/8/17 1:29 AM:
----------------------------------------------------------

I have not checked the details, but is it related to HDFS-11445 (more specifically, this [comment|https://issues.apache.org/jira/browse/HDFS-11445?focusedCommentId=15898236&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15898236])?

was (Author: jingzhao):
I have not checked the details, but is it related to HDFS-11445 (more specifically, this [comment|https://issues.apache.org/jira/browse/HDFS-11445?focusedCommentId=15898236&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15898236]).

> BlockManager#createLocatedBlocks() can throw ArrayIndexOutofBoundsException
> when corrupt replicas are inconsistent
> ---------------------------------------------------------------------------
>
> Key: HDFS-11797
> URL: https://issues.apache.org/jira/browse/HDFS-11797
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Critical
> Attachments: HDFS-11797.001.patch
>
> The calculation of {{numMachines}} can be too small (causing an
> ArrayIndexOutOfBoundsException) or too large (causing an NPE, HDFS-9958) if
> the data structures hold an inconsistent number of corrupt replicas. This
> was earlier found to be related to failed storages. This JIRA tracks a
> change that works for all possible cases of inconsistency.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-11797) BlockManager#createLocatedBlocks() can throw ArrayIndexOutofBoundsException when corrupt replicas are inconsistent
[ https://issues.apache.org/jira/browse/HDFS-11797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042052#comment-16042052 ]

Jing Zhao edited comment on HDFS-11797 at 6/8/17 1:29 AM:
----------------------------------------------------------

I have not checked the details, but is it related to HDFS-11445 (more specifically, this [comment|https://issues.apache.org/jira/browse/HDFS-11445?focusedCommentId=15898236&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15898236]).

was (Author: jingzhao):
I have not checked the details, but is it related to HDFS-11445?

> BlockManager#createLocatedBlocks() can throw ArrayIndexOutofBoundsException
> when corrupt replicas are inconsistent
> ---------------------------------------------------------------------------
>
> Key: HDFS-11797
> URL: https://issues.apache.org/jira/browse/HDFS-11797
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Critical
> Attachments: HDFS-11797.001.patch
>
> The calculation of {{numMachines}} can be too small (causing an
> ArrayIndexOutOfBoundsException) or too large (causing an NPE, HDFS-9958) if
> the data structures hold an inconsistent number of corrupt replicas. This
> was earlier found to be related to failed storages. This JIRA tracks a
> change that works for all possible cases of inconsistency.
[jira] [Commented] (HDFS-11797) BlockManager#createLocatedBlocks() can throw ArrayIndexOutofBoundsException when corrupt replicas are inconsistent
[ https://issues.apache.org/jira/browse/HDFS-11797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042052#comment-16042052 ]

Jing Zhao commented on HDFS-11797:
----------------------------------

I have not checked the details, but is it related to HDFS-11445?

> BlockManager#createLocatedBlocks() can throw ArrayIndexOutofBoundsException
> when corrupt replicas are inconsistent
> ---------------------------------------------------------------------------
>
> Key: HDFS-11797
> URL: https://issues.apache.org/jira/browse/HDFS-11797
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Critical
> Attachments: HDFS-11797.001.patch
>
> The calculation of {{numMachines}} can be too small (causing an
> ArrayIndexOutOfBoundsException) or too large (causing an NPE, HDFS-9958) if
> the data structures hold an inconsistent number of corrupt replicas. This
> was earlier found to be related to failed storages. This JIRA tracks a
> change that works for all possible cases of inconsistency.
[jira] [Updated] (HDFS-11823) Extend TestDFSStripedIutputStream/TestDFSStripedOutputStream with a random EC policy
[ https://issues.apache.org/jira/browse/HDFS-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-11823:
-----------------------------
Resolution: Fixed
Hadoop Flags: Reviewed
Fix Version/s: 3.0.0-alpha3
Status: Resolved (was: Patch Available)

I've committed the patch.

> Extend TestDFSStripedIutputStream/TestDFSStripedOutputStream with a random EC
> policy
> -----------------------------------------------------------------------------
>
> Key: HDFS-11823
> URL: https://issues.apache.org/jira/browse/HDFS-11823
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: erasure-coding, test
> Reporter: Takanobu Asanuma
> Assignee: Takanobu Asanuma
> Labels: hdfs-ec-3.0-nice-to-have
> Fix For: 3.0.0-alpha3
>
> Attachments: HDFS-11823.1.patch
>
> From the discussion in HDFS-7866 and HDFS-9962, in addition to the default
> EC policy, it would be good if we add a random EC policy to each test.
[jira] [Commented] (HDFS-11823) Extend TestDFSStripedIutputStream/TestDFSStripedOutputStream with a random EC policy
[ https://issues.apache.org/jira/browse/HDFS-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16023376#comment-16023376 ]

Jing Zhao commented on HDFS-11823:
----------------------------------

Thanks for working on this, [~tasanuma0829]! The patch looks good to me. The failed tests are unrelated. +1

> Extend TestDFSStripedIutputStream/TestDFSStripedOutputStream with a random EC
> policy
> -----------------------------------------------------------------------------
>
> Key: HDFS-11823
> URL: https://issues.apache.org/jira/browse/HDFS-11823
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: erasure-coding, test
> Reporter: Takanobu Asanuma
> Assignee: Takanobu Asanuma
> Labels: hdfs-ec-3.0-nice-to-have
> Attachments: HDFS-11823.1.patch
>
> From the discussion in HDFS-7866 and HDFS-9962, in addition to the default
> EC policy, it would be good if we add a random EC policy to each test.
[jira] [Commented] (HDFS-11445) FSCK shows overall health stauts as corrupt even one replica is corrupt
[ https://issues.apache.org/jira/browse/HDFS-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016362#comment-16016362 ]

Jing Zhao commented on HDFS-11445:
----------------------------------

Thanks for updating the patch, [~brahmareddy]. The patch looks good to me. +1 after fixing the new checkstyle warnings.

> FSCK shows overall health stauts as corrupt even one replica is corrupt
> -----------------------------------------------------------------------
>
> Key: HDFS-11445
> URL: https://issues.apache.org/jira/browse/HDFS-11445
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Brahma Reddy Battula
> Assignee: Brahma Reddy Battula
> Attachments: HDFS-11445-002.patch, HDFS-11445-003.patch, HDFS-11445.patch
>
> In the following scenario, FSCK shows the overall health status as corrupt
> even though the file has one good replica:
> 1. Create a file with RF=2.
> 2. Shut down one DN.
> 3. Append to the file again.
> 4. Restart the DN.
> 5. After the block report, check fsck.
[jira] [Commented] (HDFS-11445) FSCK shows overall health stauts as corrupt even one replica is corrupt
[ https://issues.apache.org/jira/browse/HDFS-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013006#comment-16013006 ]

Jing Zhao commented on HDFS-11445:
----------------------------------

Thanks for the patch, [~brahmareddy]. For the current patch, instead of passing the BlockManager to {{commitBlock}} and {{setGenerationStampAndVerifyReplicas}}, why not let these methods return the list of stale replicas and then remove the stored blockInfo objects in the BlockManager?

> FSCK shows overall health stauts as corrupt even one replica is corrupt
> -----------------------------------------------------------------------
>
> Key: HDFS-11445
> URL: https://issues.apache.org/jira/browse/HDFS-11445
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Brahma Reddy Battula
> Assignee: Brahma Reddy Battula
> Attachments: HDFS-11445-002.patch, HDFS-11445.patch
>
> In the following scenario, FSCK shows the overall health status as corrupt
> even though the file has one good replica:
> 1. Create a file with RF=2.
> 2. Shut down one DN.
> 3. Append to the file again.
> 4. Restart the DN.
> 5. After the block report, check fsck.
[jira] [Commented] (HDFS-11448) JN log segment syncing should support HA upgrade
[ https://issues.apache.org/jira/browse/HDFS-11448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997365#comment-15997365 ]

Jing Zhao commented on HDFS-11448:
----------------------------------

+1 for the 03 patch. Thank you for the work, [~hanishakoneru]! Thank you for the review, [~arpitagarwal]!

> JN log segment syncing should support HA upgrade
> ------------------------------------------------
>
> Key: HDFS-11448
> URL: https://issues.apache.org/jira/browse/HDFS-11448
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Reporter: Hanisha Koneru
> Assignee: Hanisha Koneru
> Attachments: HDFS-11448.001.patch, HDFS-11448.002.patch, HDFS-11448.003.patch
>
> HDFS-4025 adds support for synchronizing past log segments to JNs that
> missed them. But, as pointed out by [~jingzhao], if the segment download
> happens when an admin tries to roll back, it might fail ([see comment|https://issues.apache.org/jira/browse/HDFS-4025?focusedCommentId=15850633&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15850633]).
[jira] [Commented] (HDFS-11448) JN log segment syncing should support HA upgrade
[ https://issues.apache.org/jira/browse/HDFS-11448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15995387#comment-15995387 ]

Jing Zhao commented on HDFS-11448:
----------------------------------

Thanks for continuing to work on this, [~hanishakoneru]. And thanks for the review, [~arpitagarwal]. I have 2 further minor comments:
# I'm not very sure if we still need to check whether the current directory exists when starting a sync iteration. We have already checked that the current directory exists while initializing the journal. Then during the upgrade/rollback, both {{moveTmpSegmentToCurrent}} and {{doRollback}} hold the Journal object's monitor, and {{moveTmpSegmentToCurrent}} also checks the {{committedTxnId}} before moving. Thus to me it is not necessary to have this current-directory check in {{canJournalSync}}:
{code}
  public boolean canJournalSync() {
    // JN should not sync if there is no current directory (during upgrade or
    // rollback).
    return storage.getCurrentDir().exists();
  }
{code}
# About the name of the temporary directory: maybe we can have a more specific name like "edits.tmp" or "edits.sync"?

Otherwise the patch looks good to me.

> JN log segment syncing should support HA upgrade
> ------------------------------------------------
>
> Key: HDFS-11448
> URL: https://issues.apache.org/jira/browse/HDFS-11448
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Reporter: Hanisha Koneru
> Assignee: Hanisha Koneru
> Attachments: HDFS-11448.001.patch
>
> HDFS-4025 adds support for synchronizing past log segments to JNs that
> missed them. But, as pointed out by [~jingzhao], if the segment download
> happens when an admin tries to roll back, it might fail ([see comment|https://issues.apache.org/jira/browse/HDFS-4025?focusedCommentId=15850633&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15850633]).
[jira] [Updated] (HDFS-11395) RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the Exception thrown from NameNode
[ https://issues.apache.org/jira/browse/HDFS-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-11395:
-----------------------------
Resolution: Fixed
Hadoop Flags: Reviewed
Fix Version/s: 2.9.0
Status: Resolved (was: Patch Available)

I've committed the patch. Thanks for the contribution, [~nandakumar131]! Thanks for the review, [~arpitagarwal]!

> RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the
> Exception thrown from NameNode
> ---------------------------------------------------------------------
>
> Key: HDFS-11395
> URL: https://issues.apache.org/jira/browse/HDFS-11395
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Reporter: Nandakumar
> Assignee: Nandakumar
> Fix For: 2.9.0
>
> Attachments: HDFS-11395.000.patch, HDFS-11395.001.patch,
> HDFS-11395.002.patch, HDFS-11395.003.patch, HDFS-11395.004.patch,
> HDFS-11395.005.patch
>
> When using RequestHedgingProxyProvider, in case of an Exception (like
> FileNotFoundException) from the active NameNode,
> {{RequestHedgingProxyProvider#RequestHedgingInvocationHandler.invoke}}
> receives an {{ExecutionException}}, since we use a {{CompletionService}} for
> the call. The ExecutionException is put into a map and wrapped with a
> {{MultiException}}.
> So for a FileNotFoundException the client receives
> {{MultiException(Map(ExecutionException(InvocationTargetException(RemoteException(FileNotFoundException)))))}}
> It will cause problems in clients which are handling RemoteExceptions.
[jira] [Commented] (HDFS-11395) RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the Exception thrown from NameNode
[ https://issues.apache.org/jira/browse/HDFS-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15922957#comment-15922957 ] Jing Zhao commented on HDFS-11395: -- +1. I will commit the patch shortly. > RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the > Exception thrown from NameNode > > > Key: HDFS-11395 > URL: https://issues.apache.org/jira/browse/HDFS-11395 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Nandakumar >Assignee: Nandakumar > Attachments: HDFS-11395.000.patch, HDFS-11395.001.patch, > HDFS-11395.002.patch, HDFS-11395.003.patch, HDFS-11395.004.patch, > HDFS-11395.005.patch > > > When using RequestHedgingProxyProvider, in case of Exception (like > FileNotFoundException) from ActiveNameNode, > {{RequestHedgingProxyProvider#RequestHedgingInvocationHandler.invoke}} > receives {{ExecutionException}} since we use {{CompletionService}} for the > call. The ExecutionException is put into a map and wrapped with > {{MultiException}}. > So for a FileNotFoundException the client receives > {{MultiException(Map(ExecutionException(InvocationTargetException(RemoteException(FileNotFoundException)}} > It will cause problem in clients which are handling RemoteExceptions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11395) RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the Exception thrown from NameNode
[ https://issues.apache.org/jira/browse/HDFS-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899906#comment-15899906 ] Jing Zhao commented on HDFS-11395: -- Hmm, looks like the test node has not been fixed yet... > RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the > Exception thrown from NameNode > > > Key: HDFS-11395 > URL: https://issues.apache.org/jira/browse/HDFS-11395 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Nandakumar >Assignee: Nandakumar > Attachments: HDFS-11395.000.patch, HDFS-11395.001.patch, > HDFS-11395.002.patch, HDFS-11395.003.patch, HDFS-11395.004.patch, > HDFS-11395.005.patch > > > When using RequestHedgingProxyProvider, in case of Exception (like > FileNotFoundException) from ActiveNameNode, > {{RequestHedgingProxyProvider#RequestHedgingInvocationHandler.invoke}} > receives {{ExecutionException}} since we use {{CompletionService}} for the > call. The ExecutionException is put into a map and wrapped with > {{MultiException}}. > So for a FileNotFoundException the client receives > {{MultiException(Map(ExecutionException(InvocationTargetException(RemoteException(FileNotFoundException)}} > It will cause problem in clients which are handling RemoteExceptions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11395) RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the Exception thrown from NameNode
[ https://issues.apache.org/jira/browse/HDFS-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899900#comment-15899900 ] Jing Zhao commented on HDFS-11395: -- The 005 patch looks good to me. Let me trigger the Jenkins. > RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the > Exception thrown from NameNode > > > Key: HDFS-11395 > URL: https://issues.apache.org/jira/browse/HDFS-11395 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Nandakumar >Assignee: Nandakumar > Attachments: HDFS-11395.000.patch, HDFS-11395.001.patch, > HDFS-11395.002.patch, HDFS-11395.003.patch, HDFS-11395.004.patch, > HDFS-11395.005.patch > > > When using RequestHedgingProxyProvider, in case of Exception (like > FileNotFoundException) from ActiveNameNode, > {{RequestHedgingProxyProvider#RequestHedgingInvocationHandler.invoke}} > receives {{ExecutionException}} since we use {{CompletionService}} for the > call. The ExecutionException is put into a map and wrapped with > {{MultiException}}. > So for a FileNotFoundException the client receives > {{MultiException(Map(ExecutionException(InvocationTargetException(RemoteException(FileNotFoundException)}} > It will cause problem in clients which are handling RemoteExceptions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11445) FSCK shows overall health status as corrupt even if one replica is corrupt
[ https://issues.apache.org/jira/browse/HDFS-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898236#comment-15898236 ] Jing Zhao commented on HDFS-11445: -- Thanks for working on this, [~brahmareddy]! I think you just found a scenario where the following inconsistency happens: {code:title=BlockManager#createLocatedBlock} NumberReplicas numReplicas = countNodes(blk); final int numCorruptNodes = numReplicas.corruptReplicas(); final int numCorruptReplicas = corruptReplicas.numCorruptReplicas(blk); if (numCorruptNodes != numCorruptReplicas) { LOG.warn("Inconsistent number of corrupt replicas for " + blk + " blockMap has " + numCorruptNodes + " but corrupt replicas map has " + numCorruptReplicas); } {code} I also did some debugging using your unit test. Looks like the root cause for this inconsistency is: {{BlockInfo#setGenerationStampAndVerifyReplicas}} may remove a datanode storage from the block's storage list, but still leave the storage in the CorruptReplicasMap. This inconsistency can later be fixed automatically, e.g., by a full block report. But maybe we should consider using {{BlockManager#removeStoredBlock(BlockInfo, DatanodeDescriptor)}} to remove all the records related to the block-dn pair. > FSCK shows overall health status as corrupt even if one replica is corrupt > --- > > Key: HDFS-11445 > URL: https://issues.apache.org/jira/browse/HDFS-11445 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Attachments: HDFS-11445.patch > > > In the following scenario, FSCK shows overall health status as corrupt even > if it has one good replica. > 1. Create file with 2 RF. > 2. Shut down one DN > 3. Append to file again. > 4. Restart the DN > 5. After block report, check Fsck -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
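The bug pattern described above — removing a replica from the block's storage list while leaving it in the corrupt-replicas map — can be sketched with a self-contained toy model (hypothetical names, not the Hadoop data structures): a single removal helper that touches both views keeps them consistent.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy stand-in for the two NN-side views of a block's replicas.
public class ReplicaMapsSketch {
    // blockId -> storages holding a replica (the block's storage list)
    private final Map<Long, Set<String>> blockStorages = new HashMap<>();
    // blockId -> storages holding a corrupt replica (the corrupt-replicas map)
    private final Map<Long, Set<String>> corruptReplicas = new HashMap<>();

    public void addReplica(long blockId, String storage, boolean corrupt) {
        blockStorages.computeIfAbsent(blockId, k -> new HashSet<>()).add(storage);
        if (corrupt) {
            corruptReplicas.computeIfAbsent(blockId, k -> new HashSet<>()).add(storage);
        }
    }

    // The bug pattern: remove from only one of the two views.
    public void removeFromStorageListOnly(long blockId, String storage) {
        Set<String> s = blockStorages.get(blockId);
        if (s != null) s.remove(storage);
    }

    // Analogous to removeStoredBlock(block, dn): drop every record of the
    // block-storage pair so the two views can never disagree.
    public void removeStoredReplica(long blockId, String storage) {
        removeFromStorageListOnly(blockId, storage);
        Set<String> c = corruptReplicas.get(blockId);
        if (c != null) c.remove(storage);
    }

    // Consistent iff every recorded corrupt replica is still a known replica.
    public boolean isConsistent(long blockId) {
        Set<String> corrupt = corruptReplicas.getOrDefault(blockId, Collections.emptySet());
        return blockStorages.getOrDefault(blockId, Collections.emptySet()).containsAll(corrupt);
    }

    public static void main(String[] args) {
        ReplicaMapsSketch maps = new ReplicaMapsSketch();
        maps.addReplica(1L, "storage-A", true);
        maps.removeFromStorageListOnly(1L, "storage-A"); // the bug pattern
        System.out.println(maps.isConsistent(1L)); // false: the views disagree
        maps.removeStoredReplica(1L, "storage-A");
        System.out.println(maps.isConsistent(1L)); // true again
    }
}
```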
[jira] [Commented] (HDFS-11395) RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the Exception thrown from NameNode
[ https://issues.apache.org/jira/browse/HDFS-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897830#comment-15897830 ] Jing Zhao commented on HDFS-11395: -- The 004 patch looks good to me. Just a few minor comments: # Here, since we're sure e is a MultiException, we can directly catch "MultiException e". {code} } catch (Exception e) { for (Exception ex : ((MultiException)e).getExceptions().values()) { {code} # The following code can be simplified as "Assert.assertTrue("..", rEx instanceof StandbyException)" {code} if (rEx instanceof StandbyException) { continue; } else { Assert.fail("Unexpected RemoteException: " + rEx.getMessage()); } {code} # "@param ex" can be removed. {code} * @param ex * @return unwrapped exception */ private Exception unwrapException(Exception ex) { {code} > RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the > Exception thrown from NameNode > > > Key: HDFS-11395 > URL: https://issues.apache.org/jira/browse/HDFS-11395 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Nandakumar >Assignee: Nandakumar > Attachments: HDFS-11395.000.patch, HDFS-11395.001.patch, > HDFS-11395.002.patch, HDFS-11395.003.patch, HDFS-11395.004.patch > > > When using RequestHedgingProxyProvider, in case of Exception (like > FileNotFoundException) from ActiveNameNode, > {{RequestHedgingProxyProvider#RequestHedgingInvocationHandler.invoke}} > receives {{ExecutionException}} since we use {{CompletionService}} for the > call. The ExecutionException is put into a map and wrapped with > {{MultiException}}. > So for a FileNotFoundException the client receives > {{MultiException(Map(ExecutionException(InvocationTargetException(RemoteException(FileNotFoundException)}} > It will cause problem in clients which are handling RemoteExceptions. 
[jira] [Updated] (HDFS-11476) Fix NPE in FsDatasetImpl#checkAndUpdate
[ https://issues.apache.org/jira/browse/HDFS-11476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-11476: - Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 2.9.0 Target Version/s: (was: 2.8.0) Status: Resolved (was: Patch Available) I've committed this to trunk and branch-2. Thanks [~xiaobingo] for the fix! Thanks [~arpitagarwal] and [~liuml07] for the review. > Fix NPE in FsDatasetImpl#checkAndUpdate > --- > > Key: HDFS-11476 > URL: https://issues.apache.org/jira/browse/HDFS-11476 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Xiaobing Zhou >Assignee: Xiaobing Zhou > Fix For: 2.9.0 > > Attachments: HDFS-11476.000.patch, HDFS-11476.001.patch, > HDFS-11476.002.patch, HDFS-11476.003.patch > > > diskMetaFile can be null and passed to compareTo which dereferences it, > causing NPE > {code} > // Compare generation stamp > if (memBlockInfo.getGenerationStamp() != diskGS) { > File memMetaFile = FsDatasetUtil.getMetaFile(diskFile, > memBlockInfo.getGenerationStamp()); > if (memMetaFile.exists()) { > if (memMetaFile.compareTo(diskMetaFile) != 0) { > LOG.warn("Metadata file in memory " > + memMetaFile.getAbsolutePath() > + " does not match file found by scan " > + (diskMetaFile == null? null: > diskMetaFile.getAbsolutePath())); > } > } else { > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
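The NPE quoted above comes from dereferencing {{diskMetaFile}} before checking it for null. A minimal sketch of the null-safe guard (a hypothetical helper, not the committed patch):

```java
import java.io.File;

// Sketch only: check diskMetaFile for null before calling compareTo on it,
// instead of dereferencing it and only handling null in the log message.
public class MetaFileCompareSketch {
    static boolean metaFilesDiffer(File memMetaFile, File diskMetaFile) {
        if (diskMetaFile == null) {
            // the scan found no meta file at all: definitely a mismatch
            return true;
        }
        // File.compareTo compares pathnames lexicographically; 0 means equal
        return memMetaFile.compareTo(diskMetaFile) != 0;
    }

    public static void main(String[] args) {
        System.out.println(metaFilesDiffer(new File("blk_1.meta"), null)); // true
    }
}
```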
[jira] [Commented] (HDFS-11395) RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the Exception thrown from NameNode
[ https://issues.apache.org/jira/browse/HDFS-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894925#comment-15894925 ] Jing Zhao commented on HDFS-11395: -- Thanks for updating the patch, [~nandakumar131]. Some further comments on the latest patch: # In RetryAction, the new member field {{exception}} is only used in the FAIL case. So maybe we can: #* rename "exception" to failException #* assign a value to this field only when the action is FAIL: {code} + Exception ex = null; {code} # Looks like we should not only get {{RemoteException}} out of {{ex}}, but more generally get the cause of other types of exceptions. Note exceptions like ConnectException and EOFException should also be exposed to retry policies. {code} RemoteException rEx = getRemoteException(ex); if (rEx != null) { badResults.put(tProxyInfo.proxyInfo, rEx); } else { badResults.put(tProxyInfo.proxyInfo, ex); } {code} # It would be helpful if we could also have a unit test for the above ConnectException/EOFException case. # Need to fix the indentation, line length, etc., for {{testHedgingWhenFileNotFoundException}}. > RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the > Exception thrown from NameNode > > > Key: HDFS-11395 > URL: https://issues.apache.org/jira/browse/HDFS-11395 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Nandakumar >Assignee: Nandakumar > Attachments: HDFS-11395.000.patch, HDFS-11395.001.patch, > HDFS-11395.002.patch > > > When using RequestHedgingProxyProvider, in case of Exception (like > FileNotFoundException) from ActiveNameNode, > {{RequestHedgingProxyProvider#RequestHedgingInvocationHandler.invoke}} > receives {{ExecutionException}} since we use {{CompletionService}} for the > call. The ExecutionException is put into a map and wrapped with > {{MultiException}}. 
> So for a FileNotFoundException the client receives > {{MultiException(Map(ExecutionException(InvocationTargetException(RemoteException(FileNotFoundException)}} > It will cause problem in clients which are handling RemoteExceptions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
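The unwrapping discussed above can be sketched with a small self-contained example (not the actual RequestHedgingProxyProvider code): peel off the {{ExecutionException}} / {{InvocationTargetException}} wrappers so that retry policies see the real cause, whether it is a RemoteException or a network error like ConnectException.

```java
import java.io.FileNotFoundException;
import java.lang.reflect.InvocationTargetException;
import java.util.concurrent.ExecutionException;

// Sketch of unwrapping the layered exception described in the issue:
// MultiException(Map(ExecutionException(InvocationTargetException(cause)))).
public class UnwrapSketch {
    static Throwable unwrap(Throwable t) {
        // keep peeling while the exception is just a wrapper with a cause
        while ((t instanceof ExecutionException
                || t instanceof InvocationTargetException)
                && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    public static void main(String[] args) {
        Throwable root = new FileNotFoundException("/missing/path");
        Throwable wrapped =
            new ExecutionException(new InvocationTargetException(root));
        System.out.println(unwrap(wrapped).getClass().getSimpleName());
        // prints FileNotFoundException
    }
}
```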
[jira] [Commented] (HDFS-11395) RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the Exception thrown from NameNode
[ https://issues.apache.org/jira/browse/HDFS-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893227#comment-15893227 ] Jing Zhao commented on HDFS-11395: -- bq. in case of a non-RemoteException from ExecutionException, what should be done That should be OK. The current retry policies are supposed to handle all exceptions, including non-RemoteExceptions. E.g., {{FailoverOnNetworkExceptionRetry}} handles {{ConnectException}}, {{EOFException}}, etc. bq. In that case, is it ok to add an additional field to RetryInvocationHandler#RetryInfo for holding Exception Yes, sounds good to me. > RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the > Exception thrown from NameNode > > > Key: HDFS-11395 > URL: https://issues.apache.org/jira/browse/HDFS-11395 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Nandakumar >Assignee: Nandakumar > Attachments: HDFS-11395.000.patch, HDFS-11395.001.patch > > > When using RequestHedgingProxyProvider, in case of Exception (like > FileNotFoundException) from ActiveNameNode, > {{RequestHedgingProxyProvider#RequestHedgingInvocationHandler.invoke}} > receives {{ExecutionException}} since we use {{CompletionService}} for the > call. The ExecutionException is put into a map and wrapped with > {{MultiException}}. > So for a FileNotFoundException the client receives > {{MultiException(Map(ExecutionException(InvocationTargetException(RemoteException(FileNotFoundException)}} > It will cause problem in clients which are handling RemoteExceptions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-11395) RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the Exception thrown from NameNode
[ https://issues.apache.org/jira/browse/HDFS-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890867#comment-15890867 ] Jing Zhao edited comment on HDFS-11395 at 3/1/17 7:27 PM: -- Thanks for working on this, [~nandakumar131]. I agree we should not directly throw a MultiException. But I have a similar concern as Arpit, i.e., we should not simply throw the first exception. I think we should # Not mix detailed exception handling logic into {{RequestHedgingProxyProvider}}. In {{RequestHedgingProxyProvider}}, we only need to get the RemoteException from {{ExecutionException}}, and put all the exceptions into {{badResults}}. No need for special handling for StandbyException, etc., there. These should be handled by {{RetryInvocationHandler#newRetryInfo}}. # Then in {{RetryInvocationHandler#newRetryInfo}}, we should let this method return both the RetryInfo and the exception to throw from the MultiException. These two pieces of information should come from the same internal exception inside the MultiException. was (Author: jingzhao): Thanks for working on this, [~nandakumar131]. I agree we should not directly throw a MultiException. But I have a similar concern as Arpit, i.e., we should not simply throw the first exception. I think we should # Not mix detailed exception handling logic into {{RequestHedgingProxyProvider}}. In {{RequestHedgingProxyProvider}}, we only need to get the RemoteException from {{ExecutionException}}, and put all the exceptions into {{badResults}}. No need for special handling for StandbyException, etc., there. These should be handled by {{RetryInvocationHandler#newRetryInfo}}. # Then in {{RetryInvocationHandler#newRetryInfo}}, we should let this method return both the RetryInfo and the exception to throw from the MultiException. These two pieces of information should come from the same internal exception inside the MultiException. 
> RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the > Exception thrown from NameNode > > > Key: HDFS-11395 > URL: https://issues.apache.org/jira/browse/HDFS-11395 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Nandakumar >Assignee: Nandakumar > Attachments: HDFS-11395.000.patch, HDFS-11395.001.patch > > > When using RequestHedgingProxyProvider, in case of Exception (like > FileNotFoundException) from ActiveNameNode, > {{RequestHedgingProxyProvider#RequestHedgingInvocationHandler.invoke}} > receives {{ExecutionException}} since we use {{CompletionService}} for the > call. The ExecutionException is put into a map and wrapped with > {{MultiException}}. > So for a FileNotFoundException the client receives > {{MultiException(Map(ExecutionException(InvocationTargetException(RemoteException(FileNotFoundException)}} > It will cause problem in clients which are handling RemoteExceptions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11395) RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the Exception thrown from NameNode
[ https://issues.apache.org/jira/browse/HDFS-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890867#comment-15890867 ] Jing Zhao commented on HDFS-11395: -- Thanks for working on this, [~nandakumar131]. I agree we should not directly throw a MultiException. But I have a similar concern as Arpit, i.e., we should not simply throw the first exception. I think we should # Not mix detailed exception handling logic into {{RequestHedgingProxyProvider}}. In {{RequestHedgingProxyProvider}}, we only need to get the RemoteException from {{ExecutionException}}, and put all the exceptions into {{badResults}}. No need for special handling for StandbyException, etc., there. These should be handled by {{RetryInvocationHandler#newRetryInfo}}. # Then in {{RetryInvocationHandler#newRetryInfo}}, we should let this method return both the RetryInfo and the exception to throw from the MultiException. These two pieces of information should come from the same internal exception inside the MultiException. > RequestHedgingProxyProvider#RequestHedgingInvocationHandler hides the > Exception thrown from NameNode > > > Key: HDFS-11395 > URL: https://issues.apache.org/jira/browse/HDFS-11395 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Nandakumar >Assignee: Nandakumar > Attachments: HDFS-11395.000.patch, HDFS-11395.001.patch > > > When using RequestHedgingProxyProvider, in case of Exception (like > FileNotFoundException) from ActiveNameNode, > {{RequestHedgingProxyProvider#RequestHedgingInvocationHandler.invoke}} > receives {{ExecutionException}} since we use {{CompletionService}} for the > call. The ExecutionException is put into a map and wrapped with > {{MultiException}}. > So for a FileNotFoundException the client receives > {{MultiException(Map(ExecutionException(InvocationTargetException(RemoteException(FileNotFoundException)}} > It will cause problem in clients which are handling RemoteExceptions. 
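The design point above — deriving both the retry decision and the surfaced exception from the same underlying exception in the failure map, rather than throwing the whole MultiException — can be sketched as follows (hypothetical names, not the Hadoop implementation):

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model: when a hedged call fails on every proxy, pick one exception
// from the per-proxy failure map and base both the retry action and the
// exception thrown to the caller on that same exception.
public class RetryDecisionSketch {
    enum Action { FAILOVER_AND_RETRY, FAIL }

    static final class Decision {
        final Action action;
        final Exception toThrow;
        Decision(Action action, Exception toThrow) {
            this.action = action;
            this.toThrow = toThrow;
        }
    }

    static Decision decide(Map<String, Exception> badResults) {
        // network errors are worth a failover and retry
        for (Exception e : badResults.values()) {
            if (e instanceof ConnectException) {
                return new Decision(Action.FAILOVER_AND_RETRY, e);
            }
        }
        // otherwise surface the first application-level failure as-is,
        // so the caller sees e.g. a FileNotFoundException, not a wrapper
        Exception first = badResults.values().iterator().next();
        return new Decision(Action.FAIL, first);
    }

    public static void main(String[] args) {
        Map<String, Exception> bad = new LinkedHashMap<>();
        bad.put("nn1", new IOException("FileNotFound"));
        bad.put("nn2", new ConnectException("refused"));
        Decision d = decide(bad);
        System.out.println(d.action + " " + d.toThrow.getMessage());
        // prints FAILOVER_AND_RETRY refused
    }
}
```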
[jira] [Updated] (HDFS-8498) Blocks can be committed with wrong size
[ https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-8498: Attachment: HDFS-8498.branch-2.001.patch Thanks for the review, [~jojochuang]. Updated the branch-2 patch to address your comments. > Blocks can be committed with wrong size > --- > > Key: HDFS-8498 > URL: https://issues.apache.org/jira/browse/HDFS-8498 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.5.0 >Reporter: Daryn Sharp >Assignee: Jing Zhao >Priority: Critical > Fix For: 3.0.0-alpha3 > > Attachments: HDFS-8498.000.patch, HDFS-8498.001.patch, > HDFS-8498.branch-2.001.patch, HDFS-8498.branch-2.patch > > > When an IBR for a UC block arrives, the NN updates the expected location's > block and replica state _only_ if it's on an unexpected storage for an > expected DN. If it's for an expected storage, only the genstamp is updated. > When the block is committed, and the expected locations are verified, only > the genstamp is checked. The size is not checked but it wasn't updated in > the expected locations anyway. > A faulty client may misreport the size when committing the block. The block > is effectively corrupted. If the NN issues replications, the received IBR is > considered corrupt, the NN invalidates the block, immediately issues another > replication. The NN eventually realizes all the original replicas are > corrupt after full BRs are received from the original DNs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-4025) QJM: Synchronize past log segments to JNs that missed them
[ https://issues.apache.org/jira/browse/HDFS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-4025: Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.0.0-alpha3 Status: Resolved (was: Patch Available) I've committed the patch to trunk. Thanks for the contribution, Hanisha! > QJM: Synchronize past log segments to JNs that missed them > - > > Key: HDFS-4025 > URL: https://issues.apache.org/jira/browse/HDFS-4025 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: ha >Affects Versions: QuorumJournalManager (HDFS-3077) >Reporter: Todd Lipcon >Assignee: Hanisha Koneru > Fix For: 3.0.0-alpha3, QuorumJournalManager (HDFS-3077) > > Attachments: HDFS-4025.000.patch, HDFS-4025.001.patch, > HDFS-4025.002.patch, HDFS-4025.003.patch, HDFS-4025.004.patch, > HDFS-4025.005.patch, HDFS-4025.006.patch, HDFS-4025.007.patch, > HDFS-4025.008.patch, HDFS-4025.009.patch, HDFS-4025.010.patch, > HDFS-4025.011.patch > > > Currently, if a JournalManager crashes and misses some segment of logs, and > then comes back, it will be re-added as a valid part of the quorum on the > next log roll. However, it will not have a complete history of log segments > (i.e., any individual JN may have gaps in its transaction history). This > mirrors the behavior of the NameNode when there are multiple local > directories specified. > However, it would be better if a background thread noticed these gaps and > "filled them in" by grabbing the segments from other JournalNodes. This > increases the resilience of the system when JournalNodes get reformatted or > otherwise lose their local disk. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-4025) QJM: Synchronize past log segments to JNs that missed them
[ https://issues.apache.org/jira/browse/HDFS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879572#comment-15879572 ] Jing Zhao commented on HDFS-4025: - The latest patch looks good to me. +1. I will commit the patch shortly. [~hanishakoneru], please create another JIRA to address the remaining issues as we discussed. > QJM: Synchronize past log segments to JNs that missed them > - > > Key: HDFS-4025 > URL: https://issues.apache.org/jira/browse/HDFS-4025 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: ha >Affects Versions: QuorumJournalManager (HDFS-3077) >Reporter: Todd Lipcon >Assignee: Hanisha Koneru > Fix For: QuorumJournalManager (HDFS-3077) > > Attachments: HDFS-4025.000.patch, HDFS-4025.001.patch, > HDFS-4025.002.patch, HDFS-4025.003.patch, HDFS-4025.004.patch, > HDFS-4025.005.patch, HDFS-4025.006.patch, HDFS-4025.007.patch, > HDFS-4025.008.patch, HDFS-4025.009.patch, HDFS-4025.010.patch, > HDFS-4025.011.patch > > > Currently, if a JournalManager crashes and misses some segment of logs, and > then comes back, it will be re-added as a valid part of the quorum on the > next log roll. However, it will not have a complete history of log segments > (i.e., any individual JN may have gaps in its transaction history). This > mirrors the behavior of the NameNode when there are multiple local > directories specified. > However, it would be better if a background thread noticed these gaps and > "filled them in" by grabbing the segments from other JournalNodes. This > increases the resilience of the system when JournalNodes get reformatted or > otherwise lose their local disk. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11402) HDFS Snapshots should capture point-in-time copies of OPEN files
[ https://issues.apache.org/jira/browse/HDFS-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872794#comment-15872794 ] Jing Zhao commented on HDFS-11402: -- bq. We have the same problem with parallel readers when there is an ongoing write. This is not true. The current DFSInputStream can talk to the datanodes and learn the current length. > HDFS Snapshots should capture point-in-time copies of OPEN files > > > Key: HDFS-11402 > URL: https://issues.apache.org/jira/browse/HDFS-11402 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 2.6.0 >Reporter: Manoj Govindassamy >Assignee: Manoj Govindassamy > Attachments: HDFS-11402.01.patch, HDFS-11402.02.patch > > > *Problem:* > 1. When there are files being written and when HDFS Snapshots are taken in > parallel, Snapshots do capture all these files, but these being written files > in Snapshots do not have the point-in-time file length captured. That is, > these open files are not frozen in HDFS Snapshots. These open files > grow/shrink in length, just like the original file, even after the snapshot > time. > 2. At the time of File close or any other meta data modification operation on > these files, HDFS reconciles the file length and records the modification in > the last taken Snapshot. All the previously taken Snapshots continue to have > those open Files with no modification recorded. So, all those previous > snapshots end up using the final modification record in the last snapshot. > Thus after the file close, file lengths in all those snapshots will end up > same. > Assume File1 is opened for write and a total of 1MB written to it. While the > writes are happening, snapshots are taken in parallel. 
> {noformat} > |---Time---T1---T2-T3T4--> > |---Snap1--Snap2-Snap3---> > |---File1.open---write-write---close-> > {noformat} > Then at time, > T2: > Snap1.File1.length = 0 > T3: > Snap1.File1.length = 0 > Snap2.File1.length = 0 > > T4: > Snap1.File1.length = 1MB > Snap2.File1.length = 1MB > Snap3.File1.length = 1MB > *Proposal* > 1. At the time of taking Snapshot, {{SnapshotManager#createSnapshot}} can > optionally request {{DirectorySnapshottableFeature#addSnapshot}} to freeze > open files. > 2. {{DirectorySnapshottableFeature#addSnapshot}} can consult with > {{LeaseManager}} and get a list INodesInPath for all open files under the > snapshot dir. > 3. {{DirectorySnapshottableFeature#addSnapshot}} after the Snapshot creation, > Diff creation and updating modification time, can invoke > {{INodeFile#recordModification}} for each of the open files. This way, the > Snapshot just taken will have a {{FileDiff}} with {{fileSize}} captured for > each of the open files. > 4. Above model follows the current Snapshot and Diff protocols and doesn't > introduce any new disk formats. So, I don't think we will be needing any new > FSImage Loader/Saver changes for Snapshots. > 5. One of the design goals of HDFS Snapshot was ability to take any number of > snapshots in O(1) time. LeaseManager though has all the open files with > leases in-memory map, an iteration is still needed to prune the needed open > files and then run recordModification on each of them. So, it will not be a > strict O(1) with the above proposal. But, it's going to be a marginal increase > only as the new order will be of O(open_files_under_snap_dir). In order to > avoid HDFS Snapshots change in behavior for open files and avoid change in > time complexity, this improvement can be made under a new config > {{"dfs.namenode.snapshot.freeze.openfiles"}} which by default can be > {{false}}. 
[jira] [Commented] (HDFS-11402) HDFS Snapshots should capture point-in-time copies of OPEN files
[ https://issues.apache.org/jira/browse/HDFS-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872739#comment-15872739 ] Jing Zhao commented on HDFS-11402: -- Thanks for working on this, [~manojg]. Yes, the length of open files is a known issue for the current snapshot implementation. As you mentioned in the description, the current semantics are to capture a length in the snapshot which is >= the real length when the snapshot was created. This behavior kind of breaks the read-only semantics. I think the key challenge here is how to let the NN know the lengths of open files. Currently the length of an open file is updated on the NN only when 1) hflush is called for the first time, or 2) hflush/hsync is called along with the UPDATE_LENGTH flag. Thus if a file is being written, the file length on the NN side (let's call it {{l_n}}) is usually a lot less than the length seen by the DN/client. If we choose to record {{l_n}} in the snapshot, then we may risk losing data later (from the client's point of view). E.g., a user wrote 100MB of data and took a snapshot. The {{l_n}} at that time might be only 1MB or even 0. Later, if the user deletes the file, she will expect ~100MB still kept in the snapshotted file, instead of 1MB or an empty file. At this time, from the safety point of view, maybe the semantics of the current snapshot implementation are better. So before we update the NN-side logic about capturing the file length in snapshots, I think we first need to solve the problem of how to report the length of open files to the NN (e.g., maybe utilizing the DN heartbeats or some other mechanism). > HDFS Snapshots should capture point-in-time copies of OPEN files > > > Key: HDFS-11402 > URL: https://issues.apache.org/jira/browse/HDFS-11402 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 2.6.0 >Reporter: Manoj Govindassamy >Assignee: Manoj Govindassamy > Attachments: HDFS-11402.01.patch, HDFS-11402.02.patch > > > *Problem:* > 1. 
When there are files being written and when HDFS Snapshots are taken in > parallel, Snapshots do capture all these files, but these being written files > in Snapshots do not have the point-in-time file length captured. That is, > these open files are not frozen in HDFS Snapshots. These open files > grow/shrink in length, just like the original file, even after the snapshot > time. > 2. At the time of File close or any other meta data modification operation on > these files, HDFS reconciles the file length and records the modification in > the last taken Snapshot. All the previously taken Snapshots continue to have > those open Files with no modification recorded. So, all those previous > snapshots end up using the final modification record in the last snapshot. > Thus after the file close, file lengths in all those snapshots will end up > same. > Assume File1 is opened for write and a total of 1MB written to it. While the > writes are happening, snapshots are taken in parallel. > {noformat} > |---Time---T1---T2-T3T4--> > |---Snap1--Snap2-Snap3---> > |---File1.open---write-write---close-> > {noformat} > Then at time, > T2: > Snap1.File1.length = 0 > T3: > Snap1.File1.length = 0 > Snap2.File1.length = 0 > > T4: > Snap1.File1.length = 1MB > Snap2.File1.length = 1MB > Snap3.File1.length = 1MB > *Proposal* > 1. At the time of taking Snapshot, {{SnapshotManager#createSnapshot}} can > optionally request {{DirectorySnapshottableFeature#addSnapshot}} to freeze > open files. > 2. {{DirectorySnapshottableFeature#addSnapshot}} can consult with > {{LeaseManager}} and get a list INodesInPath for all open files under the > snapshot dir. > 3. {{DirectorySnapshottableFeature#addSnapshot}} after the Snapshot creation, > Diff creation and updating modification time, can invoke > {{INodeFile#recordModification}} for each of the open files. This way, the > Snapshot just taken will have a {{FileDiff}} with {{fileSize}} captured for > each of the open files. > 4. 
Above model follows the current Snapshot and Diff protocols and doesn't > introduce any new disk formats. So, I don't think we will be needing any new > FSImage Loader/Saver changes for Snapshots. > 5. One of the design goals of HDFS Snapshot was the ability to take any number of > snapshots in O(1) time. Though LeaseManager has all the open files with > leases in an in-memory map, an iteration is still needed to prune the needed open > files and then run recordModification on each of them. So, it will not be a > strict O(1) with the above proposal. But, it's going to be only a marginal increase > as the new order will be of O(open_files_under_snap_dir). In order to
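The reconciliation behavior described in the problem statement — every snapshot taken while the file was open ends up reporting the final length after close — can be sketched as a toy model. This is plain Java, not Hadoop's snapshot code; all class and method names here are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the current open-file snapshot semantics (illustrative
// only): snapshots of an open file share the file's single live record,
// so after close they all resolve to the final length instead of their
// point-in-time lengths.
public class OpenFileSnapshots {
    private long liveLength = 0;       // length of the file being written
    private Long closedLength = null;  // set once the file is closed
    private final List<String> snapshots = new ArrayList<>();

    void write(long bytes) { liveLength += bytes; }
    void takeSnapshot(String name) { snapshots.add(name); }
    void close() { closedLength = liveLength; }

    /** The snapshot name is ignored on purpose: every snapshot resolves
     *  to the same shared record — this is exactly the reported bug. */
    long lengthInSnapshot(String name) {
        return closedLength != null ? closedLength : liveLength;
    }

    public static void main(String[] args) {
        OpenFileSnapshots f = new OpenFileSnapshots();
        f.takeSnapshot("Snap1");          // T1: file open, 0 bytes written
        f.write(512 * 1024);
        f.takeSnapshot("Snap2");          // T2: 512 KB written
        f.write(512 * 1024);
        f.close();                        // T4: 1 MB total
        // Both snapshots report 1 MB, not 0 and 512 KB respectively.
        System.out.println(f.lengthInSnapshot("Snap1"));
        System.out.println(f.lengthInSnapshot("Snap2"));
    }
}
```

The proposal in the description amounts to replacing the shared record with a per-snapshot {{FileDiff}} that freezes {{fileSize}} at snapshot-creation time.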
[jira] [Reopened] (HDFS-8498) Blocks can be committed with wrong size
[ https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao reopened HDFS-8498: - Reopen for the branch-2 patch. > Blocks can be committed with wrong size > --- > > Key: HDFS-8498 > URL: https://issues.apache.org/jira/browse/HDFS-8498 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.5.0 >Reporter: Daryn Sharp >Assignee: Jing Zhao >Priority: Critical > Fix For: 3.0.0-alpha3 > > Attachments: HDFS-8498.000.patch, HDFS-8498.001.patch, > HDFS-8498.branch-2.patch > > > When an IBR for a UC block arrives, the NN updates the expected location's > block and replica state _only_ if it's on an unexpected storage for an > expected DN. If it's for an expected storage, only the genstamp is updated. > When the block is committed, and the expected locations are verified, only > the genstamp is checked. The size is not checked but it wasn't updated in > the expected locations anyway. > A faulty client may misreport the size when committing the block. The block > is effectively corrupted. If the NN issues replications, the received IBR is > considered corrupt, the NN invalidates the block, immediately issues another > replication. The NN eventually realizes all the original replicas are > corrupt after full BRs are received from the original DNs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
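The commit-time gap described above can be sketched as follows. This is a simplified stand-in for the NameNode's commit check, not the actual BlockManager code; all names ({{commitGenstampOnly}}, {{commitWithLengthCheck}}, {{ibrReplicaLengths}}) are illustrative:

```java
import java.util.List;

// Simplified model of commit-time validation (illustrative only).
// Pre-fix, only the generation stamp is compared, so a faulty client
// can commit a block with a misreported size; the fix also checks the
// reported length against replica lengths learned from IBRs.
public class CommitCheck {

    /** Pre-fix behavior: a misreported length is accepted silently. */
    static boolean commitGenstampOnly(long expectedGs, long reportedGs) {
        return expectedGs == reportedGs;
    }

    /** Post-fix behavior: reject a commit whose reported length
     *  disagrees with the replica lengths the NN has recorded. */
    static boolean commitWithLengthCheck(long expectedGs, long reportedGs,
                                         long reportedLen,
                                         List<Long> ibrReplicaLengths) {
        if (expectedGs != reportedGs) {
            return false;
        }
        for (long replicaLen : ibrReplicaLengths) {
            if (replicaLen != reportedLen) {
                return false;  // client's claimed size contradicts replicas
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Long> replicas = List.of(1048576L, 1048576L, 1048576L);
        // Faulty client misreports 512 KB for a 1 MB block: the old
        // check passes, the new one rejects the commit.
        System.out.println(commitGenstampOnly(5, 5));                        // true
        System.out.println(commitWithLengthCheck(5, 5, 524288L, replicas));  // false
        System.out.println(commitWithLengthCheck(5, 5, 1048576L, replicas)); // true
    }
}
```

Without the length check, the misreported size makes every genuine replica look corrupt, triggering the invalidate/re-replicate loop described in the issue.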
[jira] [Updated] (HDFS-8498) Blocks can be committed with wrong size
[ https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-8498: Attachment: HDFS-8498.branch-2.patch Upload the patch for branch-2 > Blocks can be committed with wrong size > --- > > Key: HDFS-8498 > URL: https://issues.apache.org/jira/browse/HDFS-8498 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.5.0 >Reporter: Daryn Sharp >Assignee: Jing Zhao >Priority: Critical > Fix For: 3.0.0-alpha3 > > Attachments: HDFS-8498.000.patch, HDFS-8498.001.patch, > HDFS-8498.branch-2.patch > > > When an IBR for a UC block arrives, the NN updates the expected location's > block and replica state _only_ if it's on an unexpected storage for an > expected DN. If it's for an expected storage, only the genstamp is updated. > When the block is committed, and the expected locations are verified, only > the genstamp is checked. The size is not checked but it wasn't updated in > the expected locations anyway. > A faulty client may misreport the size when committing the block. The block > is effectively corrupted. If the NN issues replications, the received IBR is > considered corrupt, the NN invalidates the block, immediately issues another > replication. The NN eventually realizes all the original replicas are > corrupt after full BRs are received from the original DNs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-8498) Blocks can be committed with wrong size
[ https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868412#comment-15868412 ] Jing Zhao commented on HDFS-8498: - [~jojochuang], currently I do not plan to backport this change to branch 2.x. But please feel free to do it if you think it's necessary and I will be happy to review. > Blocks can be committed with wrong size > --- > > Key: HDFS-8498 > URL: https://issues.apache.org/jira/browse/HDFS-8498 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.5.0 >Reporter: Daryn Sharp >Assignee: Jing Zhao >Priority: Critical > Fix For: 3.0.0-alpha3 > > Attachments: HDFS-8498.000.patch, HDFS-8498.001.patch > > > When an IBR for a UC block arrives, the NN updates the expected location's > block and replica state _only_ if it's on an unexpected storage for an > expected DN. If it's for an expected storage, only the genstamp is updated. > When the block is committed, and the expected locations are verified, only > the genstamp is checked. The size is not checked but it wasn't updated in > the expected locations anyway. > A faulty client may misreport the size when committing the block. The block > is effectively corrupted. If the NN issues replications, the received IBR is > considered corrupt, the NN invalidates the block, immediately issues another > replication. The NN eventually realizes all the original replicas are > corrupt after full BRs are received from the original DNs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-8498) Blocks can be committed with wrong size
[ https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868333#comment-15868333 ] Jing Zhao commented on HDFS-8498: - I've committed the patch into trunk. > Blocks can be committed with wrong size > --- > > Key: HDFS-8498 > URL: https://issues.apache.org/jira/browse/HDFS-8498 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.5.0 >Reporter: Daryn Sharp >Assignee: Jing Zhao >Priority: Critical > Fix For: 3.0.0-alpha3 > > Attachments: HDFS-8498.000.patch, HDFS-8498.001.patch > > > When an IBR for a UC block arrives, the NN updates the expected location's > block and replica state _only_ if it's on an unexpected storage for an > expected DN. If it's for an expected storage, only the genstamp is updated. > When the block is committed, and the expected locations are verified, only > the genstamp is checked. The size is not checked but it wasn't updated in > the expected locations anyway. > A faulty client may misreport the size when committing the block. The block > is effectively corrupted. If the NN issues replications, the received IBR is > considered corrupt, the NN invalidates the block, immediately issues another > replication. The NN eventually realizes all the original replicas are > corrupt after full BRs are received from the original DNs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-8498) Blocks can be committed with wrong size
[ https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-8498: Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.0.0-alpha3 Status: Resolved (was: Patch Available) Thanks for the review, [~jnp]! I will commit the patch shortly. > Blocks can be committed with wrong size > --- > > Key: HDFS-8498 > URL: https://issues.apache.org/jira/browse/HDFS-8498 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.5.0 >Reporter: Daryn Sharp >Assignee: Jing Zhao >Priority: Critical > Fix For: 3.0.0-alpha3 > > Attachments: HDFS-8498.000.patch, HDFS-8498.001.patch > > > When an IBR for a UC block arrives, the NN updates the expected location's > block and replica state _only_ if it's on an unexpected storage for an > expected DN. If it's for an expected storage, only the genstamp is updated. > When the block is committed, and the expected locations are verified, only > the genstamp is checked. The size is not checked but it wasn't updated in > the expected locations anyway. > A faulty client may misreport the size when committing the block. The block > is effectively corrupted. If the NN issues replications, the received IBR is > considered corrupt, the NN invalidates the block, immediately issues another > replication. The NN eventually realizes all the original replicas are > corrupt after full BRs are received from the original DNs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-8498) Blocks can be committed with wrong size
[ https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858371#comment-15858371 ] Jing Zhao commented on HDFS-8498: - Do you also want to take a look at the patch, [~vinayrpet]? > Blocks can be committed with wrong size > --- > > Key: HDFS-8498 > URL: https://issues.apache.org/jira/browse/HDFS-8498 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.5.0 >Reporter: Daryn Sharp >Assignee: Jing Zhao >Priority: Critical > Attachments: HDFS-8498.000.patch, HDFS-8498.001.patch > > > When an IBR for a UC block arrives, the NN updates the expected location's > block and replica state _only_ if it's on an unexpected storage for an > expected DN. If it's for an expected storage, only the genstamp is updated. > When the block is committed, and the expected locations are verified, only > the genstamp is checked. The size is not checked but it wasn't updated in > the expected locations anyway. > A faulty client may misreport the size when committing the block. The block > is effectively corrupted. If the NN issues replications, the received IBR is > considered corrupt, the NN invalidates the block, immediately issues another > replication. The NN eventually realizes all the original replicas are > corrupt after full BRs are received from the original DNs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-4025) QJM: Sychronize past log segments to JNs that missed them
[ https://issues.apache.org/jira/browse/HDFS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15850633#comment-15850633 ] Jing Zhao commented on HDFS-4025: - The failed unit test should be unrelated and has been reported in HDFS-10644. In the meanwhile, the current patch may still hit an issue while HA upgrade is going on. If the segment downloading is happening while the admin tries to rollback, the deletion of the {{current}} directory may fail on Windows. As a fix we can disable the sync while there is {{prev}} directory on JN (which means the upgrade is still going on). Or we can download the segment first into another directory. Currently I'm thinking maybe we can disable this feature in the configuration by default, then use separate jiras to track remaining issues. This also allows us to do more testing. Thoughts? > QJM: Sychronize past log segments to JNs that missed them > - > > Key: HDFS-4025 > URL: https://issues.apache.org/jira/browse/HDFS-4025 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: ha >Affects Versions: QuorumJournalManager (HDFS-3077) >Reporter: Todd Lipcon >Assignee: Hanisha Koneru > Fix For: QuorumJournalManager (HDFS-3077) > > Attachments: HDFS-4025.000.patch, HDFS-4025.001.patch, > HDFS-4025.002.patch, HDFS-4025.003.patch, HDFS-4025.004.patch, > HDFS-4025.005.patch, HDFS-4025.006.patch, HDFS-4025.007.patch, > HDFS-4025.008.patch, HDFS-4025.009.patch, HDFS-4025.010.patch > > > Currently, if a JournalManager crashes and misses some segment of logs, and > then comes back, it will be re-added as a valid part of the quorum on the > next log roll. However, it will not have a complete history of log segments > (i.e any individual JN may have gaps in its transaction history). This > mirrors the behavior of the NameNode when there are multiple local > directories specified. 
> However, it would be better if a background thread noticed these gaps and > "filled them in" by grabbing the segments from other JournalNodes. This > increases the resilience of the system when JournalNodes get reformatted or > otherwise lose their local disk. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-4025) QJM: Sychronize past log segments to JNs that missed them
[ https://issues.apache.org/jira/browse/HDFS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849052#comment-15849052 ] Jing Zhao commented on HDFS-4025: - Thanks for updating the patch, [~hanishakoneru]. The latest patch looks pretty good to me. Some minor comments: # In hdfs-default.xml, "i" --> "if" {code} + dfs.journalnode.enable.sync + true + +If true, the journal nodes wil sync with each other. The journal nodes +will periodically gossip with other journal nodes to compare edit log +manifests and i they detect any missing log segment, they will download +it from the other journal nodes. + + {code} # In JournalNodeSyncer.java, the following code will generate an {{UnsupportedOperationException}} since thisJournalEditLogs is an immutable list. In fact this add op can be skipped. {code} if (success) { thisJournalEditLogs.add(missingLog); } {code} # Maybe "Transferring" can be changed to "Downloading"? {code} LOG.info("Transferring Missing Edit Log from " + url + " to " + jnStorage .getRoot()); {code} # {{finalEditsFile}} should be {{tmpEditsFile}}. {code} LOG.info("Downloaded file " + tmpEditsFile.getName() + " size " + finalEditsFile.length() + " bytes."); {code} # In {{TestJournalNodeSync}}, {{jid}} can be declared as final, and {{editLogExists}} can be private. # For {{deleteEditLog}}, we can either change the while loop to an if, or refresh the logFile instance within the while loop. {code} + while (logFile.isInProgress()) { + dfsCluster.getNameNode(0).getRpcServer().rollEditLog(); {code} # The following code can be simplified as "Assert.assertTrue("Couldn't delete edit log file", deleteFile.delete());" {code} +if (!deleteFile.delete()) { + assert false: "Couldn't delete edit log file"; + return null; +} {code} # In {{generateEditLog}}, let's also check the result of {{doAnEdit}}.
I.e., we do "Assert.assertTrue(doAnEdit());" > QJM: Sychronize past log segments to JNs that missed them > - > > Key: HDFS-4025 > URL: https://issues.apache.org/jira/browse/HDFS-4025 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: ha >Affects Versions: QuorumJournalManager (HDFS-3077) >Reporter: Todd Lipcon >Assignee: Hanisha Koneru > Fix For: QuorumJournalManager (HDFS-3077) > > Attachments: HDFS-4025.000.patch, HDFS-4025.001.patch, > HDFS-4025.002.patch, HDFS-4025.003.patch, HDFS-4025.004.patch, > HDFS-4025.005.patch, HDFS-4025.006.patch, HDFS-4025.007.patch, > HDFS-4025.008.patch, HDFS-4025.009.patch > > > Currently, if a JournalManager crashes and misses some segment of logs, and > then comes back, it will be re-added as a valid part of the quorum on the > next log roll. However, it will not have a complete history of log segments > (i.e any individual JN may have gaps in its transaction history). This > mirrors the behavior of the NameNode when there are multiple local > directories specified. > However, it would be better if a background thread noticed these gaps and > "filled them in" by grabbing the segments from other JournalNodes. This > increases the resilience of the system when JournalNodes get reformatted or > otherwise lose their local disk. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
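Review comment 2 above — adding to an immutable list throws {{UnsupportedOperationException}} — can be reproduced in isolation. This is plain Java independent of the patch, with made-up names:

```java
import java.util.Collections;
import java.util.List;

// Demonstrates the failure mode flagged in the review: mutating an
// unmodifiable list fails only at runtime, not at compile time, so the
// add() in the patch would blow up the first time the sync path runs.
// As the review notes, the add can simply be skipped.
public class ImmutableAddDemo {

    /** Returns true if adding to the list was rejected at runtime. */
    static boolean addRejected(List<String> editLogs, String missingLog) {
        try {
            editLogs.add(missingLog);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Analogous to thisJournalEditLogs in the patch under review.
        List<String> thisJournalEditLogs =
            Collections.unmodifiableList(List.of("edits_1-100", "edits_101-200"));
        System.out.println(addRejected(thisJournalEditLogs, "edits_201-300")); // true
    }
}
```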
[jira] [Commented] (HDFS-11370) Optimize NamenodeFsck#getReplicaInfo
[ https://issues.apache.org/jira/browse/HDFS-11370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848812#comment-15848812 ] Jing Zhao commented on HDFS-11370: -- The latest patch looks good to me. And the failed test should be unrelated. +1 I will commit the patch shortly. Thanks for the work, [~tasanuma0829]! Thanks for the review, [~manojg]! > Optimize NamenodeFsck#getReplicaInfo > > > Key: HDFS-11370 > URL: https://issues.apache.org/jira/browse/HDFS-11370 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma >Priority: Minor > Attachments: HDFS-11370.1.patch, HDFS-11370.2.patch, > HDFS-11370.3.patch, HDFS-11370.4.patch, HDFS-11370.5.patch, HDFS-11370.6.patch > > > We can optimize the logic of calculating the number of storages in > {{NamenodeFsck#getReplicaInfo}}. This is a follow-on task of HDFS-11124. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11370) Optimize NamenodeFsck#getReplicaInfo
[ https://issues.apache.org/jira/browse/HDFS-11370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-11370: - Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.0.0-alpha3 Status: Resolved (was: Patch Available) I've committed this patch to trunk. > Optimize NamenodeFsck#getReplicaInfo > > > Key: HDFS-11370 > URL: https://issues.apache.org/jira/browse/HDFS-11370 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma >Priority: Minor > Fix For: 3.0.0-alpha3 > > Attachments: HDFS-11370.1.patch, HDFS-11370.2.patch, > HDFS-11370.3.patch, HDFS-11370.4.patch, HDFS-11370.5.patch, HDFS-11370.6.patch > > > We can optimize the logic of calculating the number of storages in > {{NamenodeFsck#getReplicaInfo}}. This is a follow-on task of HDFS-11124. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11370) Optimize NamenodeFsck#getReplicaInfo
[ https://issues.apache.org/jira/browse/HDFS-11370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847847#comment-15847847 ] Jing Zhao commented on HDFS-11370: -- Thanks for the discussion, [~manojg] and [~tasanuma0829]. I agree it's better to achieve thread safety for the new {{getExpectedStorageLocationsIterator}}. But currently almost all block related classes, from Block to BlockInfo to BlockUnderConstructionFeature, do not provide thread-safety guarantees and depend on external mechanisms such as the FSNamesystem lock for protection. So I do not think we need to address this issue here, but maybe we can add a javadoc for {{getExpectedStorageLocationsIterator}} explaining that the method is not thread-safe by itself. > Optimize NamenodeFsck#getReplicaInfo > > > Key: HDFS-11370 > URL: https://issues.apache.org/jira/browse/HDFS-11370 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma >Priority: Minor > Attachments: HDFS-11370.1.patch, HDFS-11370.2.patch, > HDFS-11370.3.patch, HDFS-11370.4.patch, HDFS-11370.5.patch > > > We can optimize the logic of calculating the number of storages in > {{NamenodeFsck#getReplicaInfo}}. This is a follow-on task of HDFS-11124. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11370) Optimize NamenodeFsck#getReplicaInfo
[ https://issues.apache.org/jira/browse/HDFS-11370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840282#comment-15840282 ] Jing Zhao commented on HDFS-11370: -- Thanks for working on this, [~tasanuma0829]. The current patch looks good to me. Some further thoughts: # In {{getReplicaInfo}} what we need is actually an iterator/iterable of storages (used by the for loop). However, currently we're using a storage[], and for completed blockInfo we always need to 1) allocate a storage[], 2) get an iterator of the storages, and 3) copy all the storages into the array. This is unnecessary. # So how about we provide an iterator/iterable in the UC feature to get all the expected locations? Then for completed blocks we can avoid the unnecessary copy. What do you think? > Optimize NamenodeFsck#getReplicaInfo > > > Key: HDFS-11370 > URL: https://issues.apache.org/jira/browse/HDFS-11370 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma >Priority: Minor > Attachments: HDFS-11370.1.patch > > > We can optimize the logic of calculating the number of storages in > {{NamenodeFsck#getReplicaInfo}}. This is a follow-on task of HDFS-11124. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
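The iterator/iterable idea above can be sketched as follows. This is plain Java with placeholder types ({{StorageInfo}} stands in for DatanodeStorageInfo), not the real HDFS classes:

```java
import java.util.Iterator;
import java.util.List;

// Sketch of the proposal: expose the expected locations as an Iterator
// so callers like getReplicaInfo can loop directly, instead of first
// allocating and filling a StorageInfo[] for completed blocks.
public class ExpectedLocations {

    static final class StorageInfo {   // placeholder for DatanodeStorageInfo
        final String id;
        StorageInfo(String id) { this.id = id; }
    }

    private final List<StorageInfo> expected;

    ExpectedLocations(List<StorageInfo> expected) {
        this.expected = expected;
    }

    /** Old shape: forces an array allocation plus a full copy. */
    StorageInfo[] getExpectedStorageLocations() {
        return expected.toArray(new StorageInfo[0]);
    }

    /** Proposed shape: callers iterate without any copy. Note: not
     *  thread-safe by itself; like the other block classes it relies on
     *  external locking (e.g., the FSNamesystem lock). */
    Iterator<StorageInfo> getExpectedStorageLocationsIterator() {
        return expected.iterator();
    }

    public static void main(String[] args) {
        ExpectedLocations locs = new ExpectedLocations(
            List.of(new StorageInfo("DS-1"), new StorageInfo("DS-2")));
        locs.getExpectedStorageLocationsIterator()
            .forEachRemaining(s -> System.out.println(s.id));
    }
}
```

The win is exactly the one described in the comment: for completed blocks the three-step allocate/iterate/copy sequence collapses into handing the caller the iterator directly.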
[jira] [Updated] (HDFS-11124) Report blockIds of internal blocks for EC files in Fsck
[ https://issues.apache.org/jira/browse/HDFS-11124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-11124: - Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: (was: 3.0.0-alpha2) 3.0.0-alpha3 Target Version/s: (was: 3.0.0-alpha3) Status: Resolved (was: Patch Available) The failed tests also passed in my local machine. The patch looks good to me. +1. I've committed it to trunk. Thanks a lot for the contribution, [~tasanuma0829]! > Report blockIds of internal blocks for EC files in Fsck > --- > > Key: HDFS-11124 > URL: https://issues.apache.org/jira/browse/HDFS-11124 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding >Affects Versions: 3.0.0-alpha1 >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma > Labels: hdfs-ec-3.0-nice-to-have > Fix For: 3.0.0-alpha3 > > Attachments: HDFS-11124.1.patch, HDFS-11124.2.patch, > HDFS-11124.3.patch > > > At the moment, when we do fsck for an EC file which has corrupt blocks and > missing blocks, the result of fsck is like this: > {quote} > /data/striped 393216 bytes, erasure-coded: policy=RS-DEFAULT-6-3-64k, 1 > block(s): > /data/striped: CORRUPT blockpool BP-1204772930-172.16.165.209-1478761131832 > block blk_-9223372036854775792 > CORRUPT 1 blocks of total size 393216 B > 0. 
BP-1204772930-172.16.165.209-1478761131832:blk_-9223372036854775792_1001 > len=393216 Live_repl=4 > [DatanodeInfoWithStorage[127.0.0.1:61617,DS-bcfebe1f-ff54-4d57-9258-ff5bdfde01b5,DISK](CORRUPT), > > DatanodeInfoWithStorage[127.0.0.1:61601,DS-9abf64d0-bb6b-434c-8c5e-de8e3b278f91,DISK](CORRUPT), > > DatanodeInfoWithStorage[127.0.0.1:61596,DS-62698e61-c13f-44f2-9da5-614945960221,DISK](CORRUPT), > > DatanodeInfoWithStorage[127.0.0.1:61605,DS-bbce6708-16fe-44ca-9f1c-506cf00f7e0d,DISK](LIVE), > > DatanodeInfoWithStorage[127.0.0.1:61592,DS-9cdd4afd-2dc8-40da-8805-09712e2afcc4,DISK](LIVE), > > DatanodeInfoWithStorage[127.0.0.1:61621,DS-f2a72d28-c880-4ffe-a70f-0f403e374504,DISK](LIVE), > > DatanodeInfoWithStorage[127.0.0.1:61629,DS-fa6ac558-2c38-41fe-9ef8-222b3f6b2b3c,DISK](LIVE)] > {quote} > It would be useful for admins if it reports the blockIds of the internal > blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11124) Report blockIds of internal blocks for EC files in Fsck
[ https://issues.apache.org/jira/browse/HDFS-11124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836616#comment-15836616 ] Jing Zhao commented on HDFS-11124: -- Thanks for updating the patch, [~tasanuma0829]! The patch looks pretty good to me. Some nits: # The new code in {{getReplicaInfo}} can be further simplified by changing the map from storage->index to storage->internal_id. Something like this: {code} final boolean isStriped = storedBlock.isStriped(); Map<DatanodeStorageInfo, Long> storage2Id = new HashMap<>(); if (isStriped && isComplete) { long blockId = storedBlock.getBlockId(); Iterable<StorageAndBlockIndex> sis = ((BlockInfoStriped)storedBlock).getStorageAndIndexInfos(); for (StorageAndBlockIndex si: sis){ storage2Id.put(si.getStorage(), blockId + si.getBlockIndex()); } } {code} # I just noticed {{testFsckOpenECFiles}} is writing a very large file to generate 2 blocks. Let's use this chance to avoid writing too much data by changing the block size in the configuration (maybe 2 stripes per block). # In the test we can also check if the output for the last open block is correct. # {{getReplicaInfo}} may be further optimized: if we change the "for" loop to go through an {{Iterable}}, we can avoid scanning the storages multiple times in {{blockManager#getStorages}} and {{BlockInfoStriped#getStorageAndIndexInfos}}. We can do this in a separate jira.
> Report blockIds of internal blocks for EC files in Fsck > --- > > Key: HDFS-11124 > URL: https://issues.apache.org/jira/browse/HDFS-11124 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding >Affects Versions: 3.0.0-alpha1 >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma > Labels: hdfs-ec-3.0-nice-to-have > Fix For: 3.0.0-alpha2 > > Attachments: HDFS-11124.1.patch, HDFS-11124.2.patch > > > At the moment, when we do fsck for an EC file which has corrupt blocks and > missing blocks, the result of fsck is like this: > {quote} > /data/striped 393216 bytes, erasure-coded: policy=RS-DEFAULT-6-3-64k, 1 > block(s): > /data/striped: CORRUPT blockpool BP-1204772930-172.16.165.209-1478761131832 > block blk_-9223372036854775792 > CORRUPT 1 blocks of total size 393216 B > 0. BP-1204772930-172.16.165.209-1478761131832:blk_-9223372036854775792_1001 > len=393216 Live_repl=4 > [DatanodeInfoWithStorage[127.0.0.1:61617,DS-bcfebe1f-ff54-4d57-9258-ff5bdfde01b5,DISK](CORRUPT), > > DatanodeInfoWithStorage[127.0.0.1:61601,DS-9abf64d0-bb6b-434c-8c5e-de8e3b278f91,DISK](CORRUPT), > > DatanodeInfoWithStorage[127.0.0.1:61596,DS-62698e61-c13f-44f2-9da5-614945960221,DISK](CORRUPT), > > DatanodeInfoWithStorage[127.0.0.1:61605,DS-bbce6708-16fe-44ca-9f1c-506cf00f7e0d,DISK](LIVE), > > DatanodeInfoWithStorage[127.0.0.1:61592,DS-9cdd4afd-2dc8-40da-8805-09712e2afcc4,DISK](LIVE), > > DatanodeInfoWithStorage[127.0.0.1:61621,DS-f2a72d28-c880-4ffe-a70f-0f403e374504,DISK](LIVE), > > DatanodeInfoWithStorage[127.0.0.1:61629,DS-fa6ac558-2c38-41fe-9ef8-222b3f6b2b3c,DISK](LIVE)] > {quote} > It would be useful for admins if it reports the blockIds of the internal > blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
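The storage->internal_id suggestion in nit 1 relies on the striped block id layout, where an internal block's id is the block group id plus its block index. A standalone sketch — plain Java with string keys standing in for DatanodeStorageInfo; the real code iterates BlockInfoStriped's StorageAndBlockIndex entries:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of building storage -> internal block id for a striped block
// group, following the review suggestion (simplified; not the actual
// NamenodeFsck code).
public class StripedBlockIds {

    /** Internal block id = block group id + block index (EC id layout,
     *  where the group id's low bits are reserved for the index). */
    static long internalBlockId(long blockGroupId, int blockIndex) {
        return blockGroupId + blockIndex;
    }

    static Map<String, Long> buildStorageToId(long blockGroupId,
                                              Map<String, Integer> storageToIndex) {
        Map<String, Long> storageToId = new HashMap<>();
        storageToIndex.forEach((storage, index) ->
            storageToId.put(storage, internalBlockId(blockGroupId, index)));
        return storageToId;
    }

    public static void main(String[] args) {
        // Block group id taken from the fsck output quoted in this issue.
        long groupId = -9223372036854775792L;
        System.out.println(internalBlockId(groupId, 0)); // data block 0
        System.out.println(internalBlockId(groupId, 6)); // first parity block (RS-6-3)
    }
}
```

This is what lets fsck report a concrete blk_* id per internal block instead of only the group id.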
[jira] [Updated] (HDFS-8498) Blocks can be committed with wrong size
[ https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-8498: Attachment: HDFS-8498.001.patch Update the patch to fix bug when block is initialized as null. Also slightly changed TestDFSOutputStream.java to trigger tests in hadoop-hdfs. > Blocks can be committed with wrong size > --- > > Key: HDFS-8498 > URL: https://issues.apache.org/jira/browse/HDFS-8498 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.5.0 >Reporter: Daryn Sharp >Assignee: Jing Zhao >Priority: Critical > Attachments: HDFS-8498.000.patch, HDFS-8498.001.patch > > > When an IBR for a UC block arrives, the NN updates the expected location's > block and replica state _only_ if it's on an unexpected storage for an > expected DN. If it's for an expected storage, only the genstamp is updated. > When the block is committed, and the expected locations are verified, only > the genstamp is checked. The size is not checked but it wasn't updated in > the expected locations anyway. > A faulty client may misreport the size when committing the block. The block > is effectively corrupted. If the NN issues replications, the received IBR is > considered corrupt, the NN invalidates the block, immediately issues another > replication. The NN eventually realizes all the original replicas are > corrupt after full BRs are received from the original DNs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-8498) Blocks can be committed with wrong size
[ https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-8498: Assignee: Jing Zhao (was: Daryn Sharp) Status: Patch Available (was: Reopened) > Blocks can be committed with wrong size > --- > > Key: HDFS-8498 > URL: https://issues.apache.org/jira/browse/HDFS-8498 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.5.0 >Reporter: Daryn Sharp >Assignee: Jing Zhao >Priority: Critical > Attachments: HDFS-8498.000.patch > > > When an IBR for a UC block arrives, the NN updates the expected location's > block and replica state _only_ if it's on an unexpected storage for an > expected DN. If it's for an expected storage, only the genstamp is updated. > When the block is committed, and the expected locations are verified, only > the genstamp is checked. The size is not checked but it wasn't updated in > the expected locations anyway. > A faulty client may misreport the size when committing the block. The block > is effectively corrupted. If the NN issues replications, the received IBR is > considered corrupt, the NN invalidates the block, immediately issues another > replication. The NN eventually realizes all the original replicas are > corrupt after full BRs are received from the original DNs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-4025) QJM: Sychronize past log segments to JNs that missed them
[ https://issues.apache.org/jira/browse/HDFS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828903#comment-15828903 ] Jing Zhao commented on HDFS-4025: - Thanks for the update, [~hkoneru]. The current patch looks better. Further comments: # Journal segment transfer timeout should not share the same configuration with image transfer timeout, since a log segment is usually smaller than the fsimage. Let's create a new configuration property for it. # Accordingly we do not need a public method Util#getImageTransferTimeout. # In Storage.java, the following is a good change. But the code needs a reformat so that code like "sd.getStorageDirType" does not break into two lines. Besides, I think the patch will not use DirIterator anymore, so this change can also be done in a separate jira. {code} private boolean shouldReturnNextDir() { StorageDirectory sd = getStorageDir(nextIndex); - return (dirType == null || sd.getStorageDirType().isOfType(dirType)) && - (includeShared || !sd.isShared()); + return (dirType == null || (sd.getStorageDirType() != null && sd +.getStorageDirType().isOfType(dirType))) && (includeShared || !sd +.isShared()); } {code} # No need to define EDITS/EDITS_INPROGRESS etc. again in JNStorage. Actually currently JournalNode shares the same storage layout as NameNode, and directly uses FileJournalManager which is defined in the namenode package. So it's ok to use EDITS/EDITS_INPROGRESS defined in NNStorage. We can do further code cleanup as a follow-on task. # Similarly please see if we still need JNStorage#getTemporaryEditsFile and JNStorage#getFinalizedEditsFile. # getNamespaceInfo can be defined in Storage and let NNStorage override it. JNStorage can directly use the base version. # Journal#renameTemporarySegments can be renamed to renameTmpSegment since we're renaming a single segment here. Also no need to call Util#deleteTmpFiles. Just simply call File#delete and check its result. 
# In JournalNodeSyncer, some fields (e.g., journal, jn, jnStorage, conf) can be declared as final. "NULL_CONTROLLER" can be skipped.
# Maybe we do not need two lists: otherJournalNodeAddrs and journalNodeProxiesList. We can create a wrapper class to wrap both InetSocketAddress and QJournalProtocolPB inside. In this way we only need one list.
# "syncJournalDaemon.setDaemon(true);" is unnecessary since syncJournalDaemon is already a Daemon.
# getMissingLogList cannot guarantee the returned ArrayList is sorted according to the transaction id, since the ArrayList is created based on a HashSet. Therefore 1) we cannot guarantee we're downloading older segments first, and 2) the getNextContinuousTxId logic can be wrong.
# The whole getMissingLogSegments flow may need to be redesigned:
#* getMissingLogList can utilize merge-sort-like logic to generate the missing list.
#* Each time we download a missing segment successfully, we should update lastSyncedTxId accordingly.
#* Once we hit any exception while downloading from the remote JN, we can stop the current syncing and continue downloading in the next sync session from another JN.
#* Once lastSyncedTxId has reached the last finalized segment, normally the current JN has caught up. We can reset lastSyncedTxId back.
# Some further optimization can also be done on getMissingLogList:
#* The remote JN http URLs can be stored in JNSyncer.
#* If we know some segments are missing but did not download them in the previous sync, we can directly download them from a new JN without calling the getEditLogManifest RPC.
These can be done separately as follow-ons.
# We also need to add a DataTransferThrottler for the downloading to avoid occupying too much network bandwidth. See TransferFsImage for an example.
# downloadEditLogFromJournalHttpServer can have a shorter name, maybe downloadSegment?
# In downloadEditLogFromJournalHttpServer, no need to call jnStorage.getFiles since we do not require any special storage dir type here.
You can directly check if finalEditsFile exists.
{code}
File finalEditsFile = jnStorage.getFinalizedEditsFile(log.getStartTxId(),
    log.getEndTxId());
List finalFiles = jnStorage.getFiles(null, finalEditsFile.getName());
assert !(finalFiles.isEmpty()) : "No checkpoint targets.";
{code}
# Similarly, before calling doGetUrl, no need to generate the tmpFiles list. Instead use ImmutableList.of(tmpEditsFile). We also need to handle Exceptions other than IOException. See Journal#syncLog as an example.
# renameTemporarySegments can be renamed to renameTmpSegment. We can let this method return a boolean: in this way, if the rename fails, the tmp file deletion can be done outside the lock.
# For the test we can set a smaller sync interval so that the test can be faster.
# We need to add more tests to cover different scenarios:
#* multiple segments are missing
#* discontinuous segments are missing
#* more than
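The dedicated journal segment transfer timeout suggested above (separate from the image transfer timeout) could be sketched roughly as below. This is illustrative only: the key name, default value, and class are assumptions for this sketch, not the configuration actually added by the patch.

```java
import java.util.Map;

// Illustrative sketch: a dedicated timeout key for edit log segment transfer,
// separate from the image transfer timeout, since a log segment is usually
// much smaller than an fsimage. Key name and default are assumptions.
public final class JournalSyncConfig {
  public static final String EDIT_LOG_TRANSFER_TIMEOUT_KEY =
      "dfs.edit.log.transfer.timeout";              // hypothetical key
  public static final int EDIT_LOG_TRANSFER_TIMEOUT_DEFAULT = 30_000; // ms

  private JournalSyncConfig() {}

  /** Reads the segment-transfer timeout, falling back to the default. */
  public static int getTransferTimeout(Map<String, String> conf) {
    String v = conf.get(EDIT_LOG_TRANSFER_TIMEOUT_KEY);
    return v == null ? EDIT_LOG_TRANSFER_TIMEOUT_DEFAULT : Integer.parseInt(v);
  }
}
```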
[jira] [Updated] (HDFS-8498) Blocks can be committed with wrong size
[ https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-8498: Attachment: HDFS-8498.000.patch Vinay's proposed solution looks good to me. For the implementation, instead of directly changing ExtendedBlock which is used everywhere, maybe we can create a Block-similar structure which is thread safe and only used by DFSOutputStream/DataStreamer internally. Upload a patch to demo the idea. Please comment. > Blocks can be committed with wrong size > --- > > Key: HDFS-8498 > URL: https://issues.apache.org/jira/browse/HDFS-8498 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.5.0 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Attachments: HDFS-8498.000.patch > > > When an IBR for a UC block arrives, the NN updates the expected location's > block and replica state _only_ if it's on an unexpected storage for an > expected DN. If it's for an expected storage, only the genstamp is updated. > When the block is committed, and the expected locations are verified, only > the genstamp is checked. The size is not checked but it wasn't updated in > the expected locations anyway. > A faulty client may misreport the size when committing the block. The block > is effectively corrupted. If the NN issues replications, the received IBR is > considered corrupt, the NN invalidates the block, immediately issues another > replication. The NN eventually realizes all the original replicas are > corrupt after full BRs are received from the original DNs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
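The thread-safe, Block-similar structure floated in the HDFS-8498 comment above might look like the sketch below. The class and method names are hypothetical; this is not the attached patch, only an illustration of the idea that the writer path mutates a private synchronized holder instead of a shared ExtendedBlock.

```java
// Hypothetical sketch: a small, internally synchronized block holder used
// only inside the writer path (DFSOutputStream/DataStreamer). The commit
// path takes an atomic snapshot, avoiding the race where the streamer
// updates the byte count while the client commits the block.
public class SyncedBlock {
  private final long blockId;
  private long genStamp;
  private long numBytes;

  public SyncedBlock(long blockId, long genStamp, long numBytes) {
    this.blockId = blockId;
    this.genStamp = genStamp;
    this.numBytes = numBytes;
  }

  public synchronized void update(long genStamp, long numBytes) {
    this.genStamp = genStamp;
    this.numBytes = numBytes;
  }

  /** Atomically snapshots (id, genstamp, bytes) so commit sees one consistent view. */
  public synchronized long[] snapshot() {
    return new long[] { blockId, genStamp, numBytes };
  }
}
```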
[jira] [Commented] (HDFS-4025) QJM: Synchronize past log segments to JNs that missed them
[ https://issues.apache.org/jira/browse/HDFS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819779#comment-15819779 ] Jing Zhao commented on HDFS-4025: - Thanks for updating the patch, [~hkoneru]. Some further comments:
# We do not need to move getAddressesList() to DatanodeUtil.
# getAddressList() and getOtherJournalNodeAddrs can be combined into one util method: getLoggerAddresses(URI uri, Set toExclude).
# Need to clean up the unused imports and unused variables in JournalNodeSyncer.java.
# sync_journals_timeout should not be retrieved from a newly created configuration in a static code block. It should be initialized based on the configuration passed to the JournalNodeSyncer constructor.
# We need to make sure syncJournalDaemon is always running while the JN is alive. So syncJournals should be in a try-catch block which catches Throwables. Please see BlockManager.RedundancyMonitor#run as an example.
# Need to stop the syncers when stopping the JN.
# The temp log segment files should always be downloaded into the current directory. Thus downloadEditLogFromJournalHttpServer can be further simplified.
# The current code may hit a race during the rolling-upgrade rollback. If the rollback happens, some log segments may be deleted while a syncer may download them from a remote JN which gets delayed in the rollback. Thus renaming temp journal files needs to be protected by the Journal's monitor, and we need to make sure the segment's end txid is smaller than the current committedTxnId.
# We can consider adding a configuration flag to turn off this feature.
# We do not need to get the local log manifest for each syncing round. The local log segment manifest can be reused.
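The "keep the sync daemon alive" point above (wrap each sync round in a catch of Throwable, and stop the syncer cleanly on JN shutdown) is the pattern sketched here. The names and the stop flag are assumptions for illustration, not the patch's actual code.

```java
// Illustrative sketch: each sync round is wrapped in a catch of Throwable so
// an unexpected error never kills the thread, and a volatile flag lets the
// JournalNode stop the syncer on shutdown.
public class SyncLoop {
  private volatile boolean shouldStop = false;
  private final Runnable syncOnce;   // one journal sync round
  private final long intervalMs;

  public SyncLoop(Runnable syncOnce, long intervalMs) {
    this.syncOnce = syncOnce;
    this.intervalMs = intervalMs;
  }

  public Thread start() {
    Thread daemon = new Thread(() -> {
      while (!shouldStop) {
        try {
          syncOnce.run();
        } catch (Throwable t) {
          // Log and continue: the daemon must outlive individual failures.
        }
        try {
          Thread.sleep(intervalMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    });
    daemon.setDaemon(true);  // already a daemon; no extra setDaemon needed elsewhere
    daemon.start();
    return daemon;
  }

  public void stop() { shouldStop = true; }
}
```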
> QJM: Sychronize past log segments to JNs that missed them > - > > Key: HDFS-4025 > URL: https://issues.apache.org/jira/browse/HDFS-4025 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: ha >Affects Versions: QuorumJournalManager (HDFS-3077) >Reporter: Todd Lipcon >Assignee: Hanisha Koneru > Fix For: QuorumJournalManager (HDFS-3077) > > Attachments: HDFS-4025.000.patch, HDFS-4025.001.patch, > HDFS-4025.002.patch, HDFS-4025.003.patch, HDFS-4025.004.patch, > HDFS-4025.005.patch, HDFS-4025.006.patch > > > Currently, if a JournalManager crashes and misses some segment of logs, and > then comes back, it will be re-added as a valid part of the quorum on the > next log roll. However, it will not have a complete history of log segments > (i.e any individual JN may have gaps in its transaction history). This > mirrors the behavior of the NameNode when there are multiple local > directories specified. > However, it would be better if a background thread noticed these gaps and > "filled them in" by grabbing the segments from other JournalNodes. This > increases the resilience of the system when JournalNodes get reformatted or > otherwise lose their local disk.
[jira] [Updated] (HDFS-8498) Blocks can be committed with wrong size
[ https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-8498: Target Version/s: 2.9.0, 3.0.0-alpha2 (was: 2.7.3)
[jira] [Reopened] (HDFS-8498) Blocks can be committed with wrong size
[ https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao reopened HDFS-8498: -
[jira] [Commented] (HDFS-8498) Blocks can be committed with wrong size
[ https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816262#comment-15816262 ] Jing Zhao commented on HDFS-8498: - We also saw the same scenario as described by [~vinayrpet]. Maybe we can reopen this jira and explore the solution proposed by [~vinayrpet].
[jira] [Updated] (HDFS-11273) Move TransferFsImage#doGetUrl function to a Util class
[ https://issues.apache.org/jira/browse/HDFS-11273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-11273: - Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.0.0-alpha2 Status: Resolved (was: Patch Available) The latest patch looks good to me. The failed tests should be unrelated and they passed on my local machine. +1 I've committed the patch to trunk. Thanks for the contribution, [~hkoneru]! > Move TransferFsImage#doGetUrl function to a Util class > -- > > Key: HDFS-11273 > URL: https://issues.apache.org/jira/browse/HDFS-11273 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Hanisha Koneru >Assignee: Hanisha Koneru > Fix For: 3.0.0-alpha2 > > Attachments: HDFS-11273.000.patch, HDFS-11273.001.patch, > HDFS-11273.002.patch, HDFS-11273.003.patch, HDFS-11273.004.patch > > > TransferFsImage#doGetUrl downloads files from the specified url and stores > them in the specified storage location. HDFS-4025 plans to synchronize the > log segments in JournalNodes. If a log segment is missing from a JN, the JN > downloads it from another JN which has the required log segment. We need > TransferFsImage#doGetUrl and TransferFsImage#receiveFile to accomplish this. > So we propose to move the said functions to a Utility class so as to be able > to use it for JournalNode syncing as well, without duplication of code.
[jira] [Commented] (HDFS-11273) Move TransferFsImage#doGetUrl function to a Util class
[ https://issues.apache.org/jira/browse/HDFS-11273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15799513#comment-15799513 ] Jing Zhao commented on HDFS-11273: -- Thanks for updating the patch, [~hkoneru]! The updated patch is almost there. I have several extra minor comments (sorry I did not mention them in my last review...):
# The new Util#setTimeout method may no longer only load the timeout value from "DFS_IMAGE_TRANSFER_TIMEOUT_KEY". Thus the code loading the timeout from the configuration can be left in TransferFsImage#doGetUrl.
{code}
+  /**
+   * Sets a timeout value in milliseconds for the Http connection.
+   * @param connection the Http connection for which timeout needs to be set
+   * @param timeout value to be set as timeout in milliseconds
+   */
+  public static void setTimeout(HttpURLConnection connection, int timeout) {
+    if (timeout <= 0) {
+      Configuration conf = new HdfsConfiguration();
+      timeout = conf.getInt(
+          DFSConfigKeys.DFS_IMAGE_TRANSFER_TIMEOUT_KEY,
+          DFSConfigKeys.DFS_IMAGE_TRANSFER_TIMEOUT_DEFAULT);
+      LOG.info("Image Transfer timeout configured to " + timeout +
+          " milliseconds");
+    }
+
+    if (timeout > 0) {
+      connection.setConnectTimeout(timeout);
+      connection.setReadTimeout(timeout);
+    }
+  }
{code}
# HttpGetFailedException can be defined as an upper-level class and be moved to the o.a.h.hdfs.server.common package.
# The following code can be reformatted.
{code}
+  public static MD5Hash doGetUrl(URL url, List localPaths,
+      Storage dstStorage, boolean getChecksum, URLConnectionFactory
+      connectionFactory, int ioFileBufferSize, boolean isSpnegoEnabled, int
+      timeout) throws IOException {
{code}
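The Util#setTimeout refactor suggested above (apply only a caller-supplied timeout; leave configuration loading at the call site in TransferFsImage#doGetUrl) would reduce the helper to roughly the sketch below. The class name is an assumption; this is not the committed code.

```java
import java.net.HttpURLConnection;

// Illustrative sketch of the suggested refactor: the helper only applies a
// caller-supplied timeout. Loading the value from the configuration stays
// at the call site, so the helper no longer depends on DFSConfigKeys.
public final class TimeoutUtil {
  private TimeoutUtil() {}

  /** Sets connect/read timeouts (milliseconds) when timeout is positive. */
  public static void setTimeout(HttpURLConnection connection, int timeout) {
    if (timeout > 0) {
      connection.setConnectTimeout(timeout);
      connection.setReadTimeout(timeout);
    }
  }
}
```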
[jira] [Commented] (HDFS-11273) Move TransferFsImage#doGetUrl function to a Util class
[ https://issues.apache.org/jira/browse/HDFS-11273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15796829#comment-15796829 ] Jing Zhao commented on HDFS-11273: -- Thanks for the patch, [~hkoneru]! The patch looks good to me. Just two nits:
# We can use this chance to clean up the imports of TransferFsImage and Util.
# In Util.java, CONTENT_LENGTH, MD5_HEADER, and deleteTmpFiles do not need to be public.
Besides, we can add a little more detail in the description to explain why moving the code is necessary.
> Move TransferFsImage#doGetUrl function to a Util class > -- > > Key: HDFS-11273 > URL: https://issues.apache.org/jira/browse/HDFS-11273 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Hanisha Koneru >Assignee: Hanisha Koneru > Attachments: HDFS-11273.000.patch > > > TransferFsImage#doGetUrl function is required for JournalNode syncing as > well. We can move the code to a Utility class to avoid duplication of code.
[jira] [Updated] (HDFS-11273) Move TransferFsImage#doGetUrl function to a Util class
[ https://issues.apache.org/jira/browse/HDFS-11273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-11273: - Issue Type: Improvement (was: Bug)
[jira] [Updated] (HDFS-11273) Move TransferFsImage#doGetUrl function to a Util class
[ https://issues.apache.org/jira/browse/HDFS-11273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-11273: - Component/s: (was: hdfs)
[jira] [Commented] (HDFS-4025) QJM: Synchronize past log segments to JNs that missed them
[ https://issues.apache.org/jira/browse/HDFS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655302#comment-15655302 ] Jing Zhao commented on HDFS-4025: - Thanks for the patch, [~hanishakoneru]! The patch looks good to me in general. Please see comments below:
# In JournalNodeSyncer#startSyncJournalsThread, the following sleep may be unnecessary: in most cases the journal is formatted before we start the sync thread.
{code}
try {
  // Wait for the JournalNodes to get formatted before attempting sync
  Thread.sleep(SYNC_JOURNALS_TIMEOUT / 2);
} catch (InterruptedException e) {
  LOG.error(e);
}
{code}
# The syncJournalThread should be a daemon. Also we can add a flag to control when the thread should exit the while loop.
# {{getAllJournalNodeAddrs}} shares the same functionality with {{QuorumJournalManager#getLoggerAddresses}}. We can convert it into a utility function and use it in these two places.
# Since currently we do not support changing the JournalNode configuration while the JN is running, we can initialize all the other JN proxies at the very beginning. Then later we can randomly pick a proxy instead of an InetSocketAddress.
# We usually only deploy 3 or 5 JNs in practice, thus we may also choose a round-robin way to pick the sync target. Also, if an error/exception happens during the sync, we can wait till the next run (instead of retrying another JN immediately).
# Typo: getMisingLogList --> getMissingLogList
# {{getMisingLogList}} can use a merge-sort style to compare the two lists.
# Let's see if we can avoid copying code from {{TransferFsImage}} and instead reuse its methods.
# We need to make sure we eventually purge old tmp editlog files left behind by failures during the downloading/renaming.
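The merge-sort-style comparison suggested above can be sketched as a single linear pass over two sorted manifests. For brevity this sketch compares bare start txids; real log segments carry start/end txids, so this is an illustration of the technique, not the patch's code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: both manifests are sorted ascending by start txid, so
// one merge-style pass finds every remote segment the local JN is missing.
public final class MissingSegments {
  private MissingSegments() {}

  public static List<Long> missing(List<Long> local, List<Long> remote) {
    List<Long> result = new ArrayList<>();
    int i = 0, j = 0;
    while (j < remote.size()) {
      if (i >= local.size() || local.get(i) > remote.get(j)) {
        result.add(remote.get(j));  // remote has a segment local lacks
        j++;
      } else if (local.get(i) < remote.get(j)) {
        i++;                        // local-only segment; nothing to fetch
      } else {
        i++; j++;                   // present on both sides
      }
    }
    return result;                  // ascending, so older segments download first
  }
}
```

Because the result stays sorted, the syncer naturally downloads older segments first, which also keeps a lastSyncedTxId-style cursor consistent.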
[jira] [Commented] (HDFS-11095) BlockManagerSafeMode should respect extension period default config value (30s)
[ https://issues.apache.org/jira/browse/HDFS-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15630482#comment-15630482 ] Jing Zhao commented on HDFS-11095: -- +1 pending Jenkins. > BlockManagerSafeMode should respect extension period default config value > (30s) > --- > > Key: HDFS-11095 > URL: https://issues.apache.org/jira/browse/HDFS-11095 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Attachments: HDFS-11095.000.patch, HDFS-11095.001.patch > > > {code:title=BlockManagerSafeMode.java} > this.extension = conf.getInt(DFS_NAMENODE_SAFEMODE_EXTENSION_KEY, 0); > {code} > Though the default value (30s) is loaded from {{hdfs-default.xml}}, we should > also respect this in the code by using > {{DFSConfigKeys#DFS_NAMENODE_SAFEMODE_EXTENSION_DEFAULT}}.
[jira] [Commented] (HDFS-11090) Leave safemode immediately if all blocks have reported in
[ https://issues.apache.org/jira/browse/HDFS-11090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15629948#comment-15629948 ] Jing Zhao commented on HDFS-11090: -- If the 100% block threshold has been met, this means all blocks have achieved the minimum replication requirement (usually 1 replica). It is therefore still possible that the NN has not received some full block reports (FBRs). Having the safemode extension can still avoid unnecessary replication work. But in the meanwhile, the number of pending FBRs in the above scenario should be limited, considering we're using random replication. Also, we already have extra logic to initialize the replication queues earlier ({{initializeReplQueuesIfNecessary}}). My main concern about the approach in the current patch is whether it is that useful in practice. For a large cluster, it is not rare to have a few missing blocks, or at least we have to wait a long time to reach 100% of blocks safe, so people usually set the safemode threshold to <1. For a small cluster, we can directly set the safemode extension to 0 in the configuration. So do we want to add an extra check to the safemode code, which is already very complicated? > Leave safemode immediately if all blocks have reported in > - > > Key: HDFS-11090 > URL: https://issues.apache.org/jira/browse/HDFS-11090 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.7.3 >Reporter: Andrew Wang >Assignee: Yiqun Lin > Attachments: HDFS-11090.001.patch > > > Startup safemode is triggered by two thresholds: % blocks reported in, and > min # datanodes. It's extended by an interval (default 30s) until these two > thresholds are met. > Safemode extension is helpful when the cluster has data, and the default % > blocks threshold (0.99) is used. It gives DNs a little extra time to report > in and thus avoid unnecessary replication work. > However, we can leave startup safemode early if 100% of blocks have reported > in.
> Note that operators sometimes change the % blocks threshold to > 1 to never > automatically leave safemode. We should maintain this behavior.
[jira] [Updated] (HDFS-10827) When there are unrecoverable ec block groups, Namenode Web UI shows "There are X missing blocks." but doesn't show the block names.
[ https://issues.apache.org/jira/browse/HDFS-10827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-10827: - Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.0.0-alpha2 Status: Resolved (was: Patch Available) +1 on the latest patch. I've committed it into trunk. Thanks for the contribution, [~tasanuma0829]! > When there are unrecoverable ec block groups, Namenode Web UI shows "There > are X missing blocks." but doesn't show the block names. > --- > > Key: HDFS-10827 > URL: https://issues.apache.org/jira/browse/HDFS-10827 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma > Fix For: 3.0.0-alpha2 > > Attachments: HDFS-10827.1.patch, HDFS-10827.2.patch, > HDFS-10827.3.patch, HDFS-10827.4.patch, case_2.png, case_3.png > > > For RS-6-3, when there is one ec block group and > 1) 0~3 out of 9 internal blocks are missing, NN Web UI doesn't show any warns. > 2) 4~8 out of 9 internal blocks are missing, NN Web UI shows "There are 1 > missing blocks." but doesn't show the block names. (please see case_2.png) > 3) 9 out of 9 internal blocks are missing, NN Web UI shows "There are 1 > missing blocks." and also shows the block name. (please see case_3.png) > We should fix the case 2). I think NN Web UI should show the block names > since the ec block group is unrecoverable. > The values come from JMX. "There are X missing blocks." is > {{NumberOfMissingBlocks}} and the block names are {{CorruptFiles}}.
[jira] [Commented] (HDFS-10827) When there are unrecoverable ec block groups, Namenode Web UI shows "There are X missing blocks." but doesn't show the block names.
[ https://issues.apache.org/jira/browse/HDFS-10827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569512#comment-15569512 ] Jing Zhao commented on HDFS-10827: -- [~tasanuma0829], I think you're right. We actually do not need to add that extra check there.
[jira] [Updated] (HDFS-10968) BlockManager#isInNewRack should consider decommissioning nodes
[ https://issues.apache.org/jira/browse/HDFS-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-10968: - Resolution: Fixed Fix Version/s: 3.0.0-alpha2 Status: Resolved (was: Patch Available) Thanks for the review, Nicholas! I've committed this into trunk. > BlockManager#isInNewRack should consider decommissioning nodes > -- > > Key: HDFS-10968 > URL: https://issues.apache.org/jira/browse/HDFS-10968 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding, namenode >Affects Versions: 3.0.0-alpha1 >Reporter: Jing Zhao >Assignee: Jing Zhao > Fix For: 3.0.0-alpha2 > > Attachments: HDFS-10968.000.patch > > > For an EC block, it is possible we have enough internal blocks but without > enough racks. The current reconstruction code calls > {{BlockManager#isInNewRack}} to check if the target node can increase the > total rack number for the case, which compares the target node's rack with > source node racks: > {code} > for (DatanodeDescriptor src : srcs) { > if (src.getNetworkLocation().equals(target.getNetworkLocation())) { > return false; > } > } > {code} > However here the {{srcs}} may include a decommissioning node, in which case > we should allow the target node to be in the same rack with it. > For e.g., suppose we have 11 nodes: h1 ~ h11, which are located in racks r1, > r1, r2, r2, r3, r3, r4, r4, r5, r5, r6, respectively. In case that an EC > block has 9 live internal blocks on (h1~h8 + h11), and one internal block on > h9 which is to be decommissioned. The current code will not choose h10 for > reconstruction because isInNewRack thinks h10 is on the same rack with h9.
[jira] [Commented] (HDFS-10759) Change fsimage bool isStriped from boolean to an enum
[ https://issues.apache.org/jira/browse/HDFS-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556534#comment-15556534 ] Jing Zhao commented on HDFS-10759: -- Yeah I also think the idea is good. But we need to guarantee the compatibility: the old fsimage should still be supported and new enum types should be easily added (which means we may need to add UNKNOWN_TYPE in the enum according to the link). > Change fsimage bool isStriped from boolean to an enum > - > > Key: HDFS-10759 > URL: https://issues.apache.org/jira/browse/HDFS-10759 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.0.0-alpha1, 3.0.0-beta1, 3.0.0-alpha2 >Reporter: Ewan Higgs > Labels: hdfs-ec-3.0-must-do > Attachments: HDFS-10759.0001.patch > > > The new erasure coding project has updated the protocol for fsimage such that > the {{INodeFile}} has a boolean '{{isStriped}}'. I think this is better as an > enum or integer since a boolean precludes any future block types. > For example: > {code} > enum BlockType { > CONTIGUOUS = 0, > STRIPED = 1, > } > {code} > We can also make this more robust to future changes where there are different > block types supported in a staged rollout. Here, we would use > {{UNKNOWN_BLOCK_TYPE}} as the first value since this is the default value. > See > [here|http://androiddevblog.com/protocol-buffers-pitfall-adding-enum-values/] > for more discussion. > {code} > enum BlockType { > UNKNOWN_BLOCK_TYPE = 0, > CONTIGUOUS = 1, > STRIPED = 2, > } > {code} > But I'm not convinced this is necessary since there are other enums that > don't use this approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
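The compatibility concern in the comment above can be illustrated with a hedged sketch (names are hypothetical, not the actual fsimage loader): keeping an UNKNOWN value at ordinal 0 means an unrecognized or absent value degrades to a safe default, and the legacy boolean {{isStriped}} maps cleanly onto the enum when an old fsimage is loaded.

```java
// Illustrative only: models the proposed BlockType enum with UNKNOWN first,
// mirroring the protobuf pitfall that value 0 is the implicit default.
public class BlockTypeCompat {
    enum BlockType { UNKNOWN_BLOCK_TYPE, CONTIGUOUS, STRIPED }

    // Loading an old fsimage: the legacy boolean maps onto the enum, so old
    // images remain supported.
    static BlockType fromLegacyIsStriped(boolean isStriped) {
        return isStriped ? BlockType.STRIPED : BlockType.CONTIGUOUS;
    }

    // A reader seeing an ordinal it does not know (e.g. written by a newer
    // release during a staged rollout) falls back to UNKNOWN_BLOCK_TYPE
    // rather than misclassifying the block.
    static BlockType fromOrdinal(int ordinal) {
        BlockType[] values = BlockType.values();
        return (ordinal >= 0 && ordinal < values.length)
            ? values[ordinal] : BlockType.UNKNOWN_BLOCK_TYPE;
    }

    public static void main(String[] args) {
        System.out.println(fromLegacyIsStriped(true)); // STRIPED
        System.out.println(fromOrdinal(7));            // UNKNOWN_BLOCK_TYPE
    }
}
```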
[jira] [Commented] (HDFS-10827) When there are unrecoverable ec block groups, Namenode Web UI shows "There are X missing blocks." but doesn't show the block names.
[ https://issues.apache.org/jira/browse/HDFS-10827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1949#comment-1949 ] Jing Zhao commented on HDFS-10827: -- Thanks for working on this, [~tasanuma0829]! The patch looks good to me overall. Some minors: # I think we can rename "isCorrupt" to "isMissing". In the outer "while" loop, we're scanning all the blocks with {{QUEUE_WITH_CORRUPT_BLOCKS}} priority, and this "if" logic is to select missing blocks. # We may also want to take into account the decommissioning/decommissioned internal blocks for the check. # As follow-on work, maybe we can create a utility function to check if a block is corrupted/missing, considering this is widely used in BlockManager/FSNamesystem. # For the unit test, do you think we can use {{MiniDFSCluster#corruptReplica}} to simplify the code? > When there are unrecoverable ec block groups, Namenode Web UI shows "There > are X missing blocks." but doesn't show the block names. > --- > > Key: HDFS-10827 > URL: https://issues.apache.org/jira/browse/HDFS-10827 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma > Attachments: HDFS-10827.1.patch, HDFS-10827.2.patch, case_2.png, > case_3.png > > > For RS-6-3, when there is one ec block group and > 1) 0~3 out of 9 internal blocks are missing, NN Web UI doesn't show any warns. > 2) 4~8 out of 9 internal blocks are missing, NN Web UI shows "There are 1 > missing blocks." but doesn't show the block names. (please see case_2.png) > 3) 9 out of 9 internal blocks are missing, NN Web UI shows "There are 1 > missing blocks." and also shows the block name. (please see case_3.png) > We should fix the case 2). I think NN Web UI should show the block names > since the ec block group is unrecoverable. > The values come from JMX. "There are X missing blocks." is > {{NumberOfMissingBlocks}} and the block names are {{CorruptFiles}}. 
[jira] [Updated] (HDFS-10968) BlockManager#isNewRack should consider decommissioning nodes
[ https://issues.apache.org/jira/browse/HDFS-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-10968: - Status: Patch Available (was: Open) > BlockManager#isNewRack should consider decommissioning nodes > > > Key: HDFS-10968 > URL: https://issues.apache.org/jira/browse/HDFS-10968 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding, namenode >Affects Versions: 3.0.0-alpha1 >Reporter: Jing Zhao >Assignee: Jing Zhao > Attachments: HDFS-10968.000.patch > > > For an EC block, it is possible we have enough internal blocks but without > enough racks. The current reconstruction code calls > {{BlockManager#isNewRack}} to check if the target node can increase the total > rack number for the case, which compares the target node's rack with source > node racks: > {code} > for (DatanodeDescriptor src : srcs) { > if (src.getNetworkLocation().equals(target.getNetworkLocation())) { > return false; > } > } > {code} > However here the {{srcs}} may include a decommissioning node, in which case > we should allow the target node to be in the same rack with it. > For e.g., suppose we have 11 nodes: h1 ~ h11, which are located in racks r1, > r1, r2, r2, r3, r3, r4, r4, r5, r5, r6, respectively. In case that an EC > block has 9 live internal blocks on (h1~h8 + h11), and one internal block on > h9 which is to be decommissioned. The current code will not choose h10 for > reconstruction because isNewRack thinks h10 is on the same rack with h9. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-10968) BlockManager#isNewRack should consider decommissioning nodes
[ https://issues.apache.org/jira/browse/HDFS-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-10968: - Attachment: HDFS-10968.000.patch Upload a patch to fix the issue. The patch also includes a unit test that reproduces the example in the description. > BlockManager#isNewRack should consider decommissioning nodes > > > Key: HDFS-10968 > URL: https://issues.apache.org/jira/browse/HDFS-10968 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding, namenode >Affects Versions: 3.0.0-alpha1 >Reporter: Jing Zhao >Assignee: Jing Zhao > Attachments: HDFS-10968.000.patch > > > For an EC block, it is possible we have enough internal blocks but without > enough racks. The current reconstruction code calls > {{BlockManager#isNewRack}} to check if the target node can increase the total > rack number for the case, which compares the target node's rack with source > node racks: > {code} > for (DatanodeDescriptor src : srcs) { > if (src.getNetworkLocation().equals(target.getNetworkLocation())) { > return false; > } > } > {code} > However here the {{srcs}} may include a decommissioning node, in which case > we should allow the target node to be in the same rack with it. > For e.g., suppose we have 11 nodes: h1 ~ h11, which are located in racks r1, > r1, r2, r2, r3, r3, r4, r4, r5, r5, r6, respectively. In case that an EC > block has 9 live internal blocks on (h1~h8 + h11), and one internal block on > h9 which is to be decommissioned. The current code will not choose h10 for > reconstruction because isNewRack thinks h10 is on the same rack with h9. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-10968) BlockManager#isNewRack should consider decommissioning nodes
Jing Zhao created HDFS-10968: Summary: BlockManager#isNewRack should consider decommissioning nodes Key: HDFS-10968 URL: https://issues.apache.org/jira/browse/HDFS-10968 Project: Hadoop HDFS Issue Type: Sub-task Components: erasure-coding, namenode Affects Versions: 3.0.0-alpha1 Reporter: Jing Zhao Assignee: Jing Zhao For an EC block, it is possible we have enough internal blocks but without enough racks. The current reconstruction code calls {{BlockManager#isNewRack}} to check if the target node can increase the total rack number for the case, which compares the target node's rack with source node racks: {code} for (DatanodeDescriptor src : srcs) { if (src.getNetworkLocation().equals(target.getNetworkLocation())) { return false; } } {code} However here the {{srcs}} may include a decommissioning node, in which case we should allow the target node to be in the same rack with it. For e.g., suppose we have 11 nodes: h1 ~ h11, which are located in racks r1, r1, r2, r2, r3, r3, r4, r4, r5, r5, r6, respectively. In case that an EC block has 9 live internal blocks on (h1~h8 + h11), and one internal block on h9 which is to be decommissioned. The current code will not choose h10 for reconstruction because isNewRack thinks h10 is on the same rack with h9. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-10826) Correctly report missing EC blocks in FSCK
[ https://issues.apache.org/jira/browse/HDFS-10826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-10826: - Summary: Correctly report missing EC blocks in FSCK (was: The result of fsck should be CRITICAL when there are unrecoverable ec block groups.) > Correctly report missing EC blocks in FSCK > -- > > Key: HDFS-10826 > URL: https://issues.apache.org/jira/browse/HDFS-10826 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma > Attachments: HDFS-10826.2.patch, HDFS-10826.3.patch, > HDFS-10826.4.patch, HDFS-10826.5.patch, HDFS-10826.WIP.1.patch > > > For RS-6-3, when there is one ec block group and > 1) 0~3 out of 9 internal blocks are missing, the result of fsck is HEALTY. > 2) 4~8 out of 9 internal blocks are missing, the result of fsck is HEALTY. > {noformat} > Erasure Coded Block Groups: > Total size:536870912 B > Total files: 1 > Total block groups (validated):1 (avg. block group size 536870912 B) > > UNRECOVERABLE BLOCK GROUPS: 1 (100.0 %) > > Minimally erasure-coded block groups: 0 (0.0 %) > Over-erasure-coded block groups: 0 (0.0 %) > Under-erasure-coded block groups: 1 (100.0 %) > Unsatisfactory placement block groups: 0 (0.0 %) > Default ecPolicy: RS-DEFAULT-6-3-64k > Average block group size: 5.0 > Missing block groups: 0 > Corrupt block groups: 0 > Missing internal blocks: 4 (44.43 %) > FSCK ended at Wed Aug 31 13:42:05 JST 2016 in 4 milliseconds > The filesystem under path '/' is HEALTHY > {noformat} > 3) 9 out of 9 internal blocks are missing, the result of fsck is CRITICAL. > (Because it is regarded as a missing block group.) > In case 2), the result should be CRITICAL since the ec block group is > unrecoverable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-10826) The result of fsck should be CRITICAL when there are unrecoverable ec block groups.
[ https://issues.apache.org/jira/browse/HDFS-10826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549491#comment-15549491 ] Jing Zhao commented on HDFS-10826: -- I've committed the patch into trunk. Thanks for the contribution, [~tasanuma0829]. Thanks for the review, [~ajisakaa] and [~jojochuang]. > The result of fsck should be CRITICAL when there are unrecoverable ec block > groups. > --- > > Key: HDFS-10826 > URL: https://issues.apache.org/jira/browse/HDFS-10826 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma > Attachments: HDFS-10826.2.patch, HDFS-10826.3.patch, > HDFS-10826.4.patch, HDFS-10826.5.patch, HDFS-10826.WIP.1.patch > > > For RS-6-3, when there is one ec block group and > 1) 0~3 out of 9 internal blocks are missing, the result of fsck is HEALTY. > 2) 4~8 out of 9 internal blocks are missing, the result of fsck is HEALTY. > {noformat} > Erasure Coded Block Groups: > Total size:536870912 B > Total files: 1 > Total block groups (validated):1 (avg. block group size 536870912 B) > > UNRECOVERABLE BLOCK GROUPS: 1 (100.0 %) > > Minimally erasure-coded block groups: 0 (0.0 %) > Over-erasure-coded block groups: 0 (0.0 %) > Under-erasure-coded block groups: 1 (100.0 %) > Unsatisfactory placement block groups: 0 (0.0 %) > Default ecPolicy: RS-DEFAULT-6-3-64k > Average block group size: 5.0 > Missing block groups: 0 > Corrupt block groups: 0 > Missing internal blocks: 4 (44.43 %) > FSCK ended at Wed Aug 31 13:42:05 JST 2016 in 4 milliseconds > The filesystem under path '/' is HEALTHY > {noformat} > 3) 9 out of 9 internal blocks are missing, the result of fsck is CRITICAL. > (Because it is regarded as a missing block group.) > In case 2), the result should be CRITICAL since the ec block group is > unrecoverable. 
[jira] [Commented] (HDFS-10826) The result of fsck should be CRITICAL when there are unrecoverable ec block groups.
[ https://issues.apache.org/jira/browse/HDFS-10826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546996#comment-15546996 ] Jing Zhao commented on HDFS-10826: -- The latest patch looks good to me. To do further code refactoring in HDFS-10933 also sounds good to me. Do you have further comments, [~jojochuang]? > The result of fsck should be CRITICAL when there are unrecoverable ec block > groups. > --- > > Key: HDFS-10826 > URL: https://issues.apache.org/jira/browse/HDFS-10826 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma > Attachments: HDFS-10826.2.patch, HDFS-10826.3.patch, > HDFS-10826.4.patch, HDFS-10826.5.patch, HDFS-10826.WIP.1.patch > > > For RS-6-3, when there is one ec block group and > 1) 0~3 out of 9 internal blocks are missing, the result of fsck is HEALTY. > 2) 4~8 out of 9 internal blocks are missing, the result of fsck is HEALTY. > {noformat} > Erasure Coded Block Groups: > Total size:536870912 B > Total files: 1 > Total block groups (validated):1 (avg. block group size 536870912 B) > > UNRECOVERABLE BLOCK GROUPS: 1 (100.0 %) > > Minimally erasure-coded block groups: 0 (0.0 %) > Over-erasure-coded block groups: 0 (0.0 %) > Under-erasure-coded block groups: 1 (100.0 %) > Unsatisfactory placement block groups: 0 (0.0 %) > Default ecPolicy: RS-DEFAULT-6-3-64k > Average block group size: 5.0 > Missing block groups: 0 > Corrupt block groups: 0 > Missing internal blocks: 4 (44.43 %) > FSCK ended at Wed Aug 31 13:42:05 JST 2016 in 4 milliseconds > The filesystem under path '/' is HEALTHY > {noformat} > 3) 9 out of 9 internal blocks are missing, the result of fsck is CRITICAL. > (Because it is regarded as a missing block group.) > In case 2), the result should be CRITICAL since the ec block group is > unrecoverable. 
[jira] [Commented] (HDFS-10797) Disk usage summary of snapshots causes renamed blocks to get counted twice
[ https://issues.apache.org/jira/browse/HDFS-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533611#comment-15533611 ] Jing Zhao commented on HDFS-10797: -- Actually my proposal is like your .005 patch. The current semantic and approach looks good to me overall. > Disk usage summary of snapshots causes renamed blocks to get counted twice > -- > > Key: HDFS-10797 > URL: https://issues.apache.org/jira/browse/HDFS-10797 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Sean Mackrory >Assignee: Sean Mackrory > Attachments: HDFS-10797.001.patch, HDFS-10797.002.patch, > HDFS-10797.003.patch, HDFS-10797.004.patch, HDFS-10797.005.patch > > > DirectoryWithSnapshotFeature.computeContentSummary4Snapshot calculates how > much disk usage is used by a snapshot by tallying up the files in the > snapshot that have since been deleted (that way it won't overlap with regular > files whose disk usage is computed separately). However that is determined > from a diff that shows moved (to Trash or otherwise) or renamed files as a > deletion and a creation operation that may overlap with the list of blocks. > Only the deletion operation is taken into consideration, and this causes > those blocks to get represented twice in the disk usage tallying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-10797) Disk usage summary of snapshots causes renamed blocks to get counted twice
[ https://issues.apache.org/jira/browse/HDFS-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15531432#comment-15531432 ] Jing Zhao commented on HDFS-10797: -- [~mackrorysd], I agree it will be great to have a consistent and user-friendly semantic. To me a better semantic can be like this: if the renamed source (which is inside of some snapshot) and the renamed target are both under the same directory for counting, we count them once. Otherwise they will be counted separately. With this semantic maybe we only need to move your hashset to the context object passed from the beginning of the counting call, and use it to avoid duplicated counting. What do you think? > Disk usage summary of snapshots causes renamed blocks to get counted twice > -- > > Key: HDFS-10797 > URL: https://issues.apache.org/jira/browse/HDFS-10797 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Sean Mackrory >Assignee: Sean Mackrory > Attachments: HDFS-10797.001.patch, HDFS-10797.002.patch, > HDFS-10797.003.patch > > > DirectoryWithSnapshotFeature.computeContentSummary4Snapshot calculates how > much disk usage is used by a snapshot by tallying up the files in the > snapshot that have since been deleted (that way it won't overlap with regular > files whose disk usage is computed separately). However that is determined > from a diff that shows moved (to Trash or otherwise) or renamed files as a > deletion and a creation operation that may overlap with the list of blocks. > Only the deletion operation is taken into consideration, and this causes > those blocks to get represented twice in the disk usage tallying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
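The proposed semantic can be sketched like this (a toy model with hypothetical names, not the DirectoryWithSnapshotFeature code): a set of already-counted block IDs travels in a context object from the beginning of the counting call, so a block reachable both via a snapshot diff's deleted entry and via its renamed location under the same directory is counted exactly once.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of duplicate-free disk-usage counting; class and field names are
// hypothetical stand-ins for the real ContentSummary computation.
public class SnapshotDuSketch {
    // Context object passed from the root of the counting call, as suggested
    // in the comment above.
    static class CountingContext {
        final Set<Long> countedBlocks = new HashSet<>();
        long totalBytes = 0;
    }

    // Count a block only if it has not already been seen in this traversal
    // (e.g. once via the snapshot's deleted list, once via the renamed file).
    static void countBlock(CountingContext ctx, long blockId, long numBytes) {
        if (ctx.countedBlocks.add(blockId)) {
            ctx.totalBytes += numBytes;
        }
    }

    public static void main(String[] args) {
        CountingContext ctx = new CountingContext();
        countBlock(ctx, 1001L, 128L); // reached via the snapshot diff (deletion)
        countBlock(ctx, 1001L, 128L); // reached again via the renamed target
        countBlock(ctx, 1002L, 64L);
        System.out.println(ctx.totalBytes); // 192, not 320
    }
}
```

If the rename target lies outside the directory being summarized, its traversal uses a different context, so the two counts remain separate, matching the "otherwise they will be counted separately" semantic.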
[jira] [Commented] (HDFS-10897) Ozone: SCM: Add NodeManager
[ https://issues.apache.org/jira/browse/HDFS-10897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15531259#comment-15531259 ] Jing Zhao commented on HDFS-10897: -- Thanks for the reply, Anu. bq. The reason of breaking up these data structures into 3 separate maps is to reduce the single lock contention we seem to run into in the current HDFS. I think we can still avoid lock contention with a single map. Also most stale nodes are temporary and dead nodes may be directly removed. So it may not be very helpful to have separate maps for them. bq. Just want to make sure that we are both on the same page on this one. In the current scheme, we get a heartbeat and insert it into a queue – with no time stamp. Here my concern is that we may need at least two threads for the work done by the current worker. Dead node detection work may need to be separate out and done by another thread (as today's HeartbeatMonitor) considering there may be a lot of following work after a dead node is detected (e.g., triggering re-replication of containers etc.). Putting all the work, including handling heartbeat msgs and scanning all the healthy/stale nodes, into a single loop may finally lead to limit throughput for handling heartbeats. I think currently most of my concerns have been or can be addressed your future patches. So I'm +1 on the current patch and we can continue the discussion. > Ozone: SCM: Add NodeManager > --- > > Key: HDFS-10897 > URL: https://issues.apache.org/jira/browse/HDFS-10897 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: ozone >Affects Versions: HDFS-7240 >Reporter: Anu Engineer >Assignee: Anu Engineer > Attachments: HDFS-10897-HDFS-7240.001.patch, > HDFS-10897-HDFS-7240.002.patch, HDFS-10897-HDFS-7240.003.patch > > > Add a nodeManager class that will be used by Storage Controller Manager > eventually. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-10912) Ozone:SCM: Add safe mode support to NodeManager.
[ https://issues.apache.org/jira/browse/HDFS-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15530843#comment-15530843 ] Jing Zhao commented on HDFS-10912: -- Thanks for the work, [~anu]. For SCM, I think we may have two different types of "safemode": # the first one is to make sure the SCM receives enough DN registration. If we persist the container-node mapping in SCM, we do not need to wait for full container reports. Also SCM does not take the responsibility for maintaining the container states/durability, thus this type of safemode is very lightweight compared with the current NN safemode. (maybe we can rename it ...) # the second one is the manual safemode (triggered by {{forceEnterSafeMode}}). This safemode is actually against the whole SCM instead of its node manager (just like in today's HDFS the manual safemode is for the whole NN instead of the blockmanager). Therefore, to me {{forceExitSafeMode}}/{{forceEnterSafeMode}}/{{isInManualSafeMode}} can be moved to SCM level. {{forceExitSafeMode}} will reset both the manual safemode and the safemode in nodemanager. > Ozone:SCM: Add safe mode support to NodeManager. > > > Key: HDFS-10912 > URL: https://issues.apache.org/jira/browse/HDFS-10912 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: ozone >Affects Versions: HDFS-7240 >Reporter: Anu Engineer >Assignee: Anu Engineer > Attachments: HDFS-10912-HDFS-7240.001.patch > > > Add Safe mode support : That is add the ability to force exit or enter safe > mode. As well as get the current safe mode status from node manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
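The separation suggested above might look like this (a hypothetical sketch; the real Ozone/SCM classes differ): the manual safemode flag lives at the SCM level and overrides the node manager's lightweight, registration-based safemode.

```java
// Hypothetical sketch of the two safemode layers discussed above; not the
// actual Ozone/SCM API.
public class ScmSafeModeSketch {
    // Layer 1: lightweight automatic safemode in the node manager, based only
    // on how many datanodes have registered so far (no full container reports).
    static class NodeManager {
        private final int requiredRegistrations;
        private int registered = 0;
        NodeManager(int requiredRegistrations) {
            this.requiredRegistrations = requiredRegistrations;
        }
        void register() { registered++; }
        boolean isInSafeMode() { return registered < requiredRegistrations; }
    }

    // Layer 2: manual safemode tracked at the SCM level, since it applies to
    // the whole service rather than just node management.
    static class Scm {
        final NodeManager nodeManager;
        private boolean manualSafeMode = false;
        Scm(NodeManager nm) { this.nodeManager = nm; }
        void forceEnterSafeMode() { manualSafeMode = true; }
        // Per the comment above, the real version would also reset the node
        // manager's automatic safemode here.
        void forceExitSafeMode() { manualSafeMode = false; }
        boolean isInSafeMode() {
            return manualSafeMode || nodeManager.isInSafeMode();
        }
    }

    public static void main(String[] args) {
        NodeManager nm = new NodeManager(2);
        Scm scm = new Scm(nm);
        nm.register();
        nm.register();                          // enough registrations
        System.out.println(scm.isInSafeMode()); // false
        scm.forceEnterSafeMode();               // operator override at SCM level
        System.out.println(scm.isInSafeMode()); // true
    }
}
```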
[jira] [Comment Edited] (HDFS-10897) Ozone: SCM: Add NodeManager
[ https://issues.apache.org/jira/browse/HDFS-10897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15530818#comment-15530818 ] Jing Zhao edited comment on HDFS-10897 at 9/28/16 8:43 PM: --- Thanks for working on this, [~anu]. The patch looks good to me overall. Some comments: # My main concern is about the current way tracking the heartbeat time for DataNodes. Instead of using 3 String-Long maps, I think it's better to use {{DatanodeInfo}} (or a simplified version) to store the latest heartbeat/report time. Later we still need to capture other information about DataNodes (its current load and state etc.) thus {{DatanodeInfo}} can be the central place to store all the information about a DN (just like today's HDFS). Also in this way we only need to maintain a single datanode map (which is more static compared with the current 3 maps) and most of the lock protection can be put into the DatanodeInfo level. # Also with this change we can have a more fair way for heartbeat time calculation: for every heartbeat msg, we can update the corresponding datanode's latest update time before putting the heartbeat into the queue, in order to avoid the penalty on DN due to SCM's local latency. # For Node state, we may want to follow the current HDFS, i.e., we need to have AdminStates which includes NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED, ENTERING_MAINTENANCE, and IN_MAINTENANCE. Stale/dead are calculated based on the latest heartbeat time thus maybe we do not need to define them as an explicit state (and for dead nodes we may want to directly remove it). {code} 36 * 4. A node can be in any of these 4 states: {HEALTHY, STALE, DEAD, 37 * DECOMMISSIONED} 38 * 39 * HEALTHY - It is a datanode that is regularly heartbeating us. 40 * 41 * STALE - A datanode for which we have missed few heart beats. 42 * 43 * DEAD - A datanode that we have not heard from for a while. 
44 * 45 * DECOMMISSIONED - Someone told us to remove this node from the tracking 46 * list, by calling removeNode. We will throw away this nodes info soon. {code} # {{getNodes}}/{{getNodeCount}} can be defined in a metrics interface (like today's FSNamesystemMBean). # Any reason we need a NodeManager interface? > Ozone: SCM: Add NodeManager > --- > > Key: HDFS-10897 > URL: https://issues.apache.org/jira/browse/HDFS-10897 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: ozone >Affects Versions: HDFS-7240 >Reporter: Anu Engineer >Assignee: Anu Engineer > Attachments: HDFS-10897-HDFS-7240.001.patch, > HDFS-10897-HDFS-7240.002.patch, HDFS-10897-HDFS-7240.003.patch > > > Add a nodeManager class that will
[jira] [Commented] (HDFS-10897) Ozone: SCM: Add NodeManager
[ https://issues.apache.org/jira/browse/HDFS-10897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15530818#comment-15530818 ] Jing Zhao commented on HDFS-10897: -- Thanks for working on this, [~anu]. The patch looks good to me overall. Some comments: # My main concern is about the current way tracking the heartbeat time for DataNodes. Instead of using 3 String-Long maps, I think it's better to use {{DatanodeInfo}} to store the latest heartbeat/report time. Later we still need to capture other information about DataNodes (its current load and state etc.) thus {{DatanodeInfo}} can be the central place to store all the information about a DN (just like today's HDFS). Also in this way we only need to maintain a single datanode map (which is more static compared with the current 3 maps) and most of the lock protection can be put into the DatanodeInfo level. # Also with this change we can have a more fair way for heartbeat time calculation: for every heartbeat msg, we can update the corresponding datanode's latest update time before putting the heartbeat into the queue, in order to avoid the penalty on DN due to SCM's local latency. # For Node state, we may want to follow the current HDFS, i.e., we need to have AdminStates which includes NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED, ENTERING_MAINTENANCE, and IN_MAINTENANCE. Stale/dead are calculated based on the latest heartbeat time thus maybe we do not need to define them as an explicit state (and for dead nodes we may want to directly remove it). {code} 36 * 4. A node can be in any of these 4 states: {HEALTHY, STALE, DEAD, 37 * DECOMMISSIONED} 38 * 39 * HEALTHY - It is a datanode that is regularly heartbeating us. 40 * 41 * STALE - A datanode for which we have missed few heart beats. 42 * 43 * DEAD - A datanode that we have not heard from for a while. 44 * 45 * DECOMMISSIONED - Someone told us to remove this node from the tracking 46 * list, by calling removeNode. 
We will throw away this nodes info soon. {code} # {{getNodes}}/{{getNodeCount}} can be defined in a metrics interface (like today's FSNamesystemMBean). # Any reason we need a NodeManager interface? > Ozone: SCM: Add NodeManager > --- > > Key: HDFS-10897 > URL: https://issues.apache.org/jira/browse/HDFS-10897 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: ozone >Affects Versions: HDFS-7240 >Reporter: Anu Engineer >Assignee: Anu Engineer > Attachments: HDFS-10897-HDFS-7240.001.patch, > HDFS-10897-HDFS-7240.002.patch, HDFS-10897-HDFS-7240.003.patch > > > Add a nodeManager class that will be used by Storage Controller Manager > eventually. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
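The single-map design proposed in the comments might be sketched like this (hypothetical names, not the actual patch): one map from datanode ID to a per-node record holding the last heartbeat timestamp, with HEALTHY/STALE/DEAD derived from that timestamp instead of being stored in three separate maps, and the timestamp recorded before the heartbeat is queued so SCM-local latency does not penalize the node.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the single-datanode-map idea from the review comments.
public class NodeStateSketch {
    enum HealthState { HEALTHY, STALE, DEAD }

    static class DatanodeInfo {
        volatile long lastHeartbeatMs; // stamped on arrival, before queueing
        DatanodeInfo(long now) { this.lastHeartbeatMs = now; }
    }

    // Thresholds are assumptions for illustration only.
    static final long STALE_MS = 90_000;
    static final long DEAD_MS = 600_000;

    final ConcurrentHashMap<String, DatanodeInfo> nodes = new ConcurrentHashMap<>();

    void onHeartbeat(String datanodeId, long nowMs) {
        // Timestamp first, then enqueue for processing: the node is not
        // penalized for latency in the manager's own worker loop.
        nodes.computeIfAbsent(datanodeId, id -> new DatanodeInfo(nowMs))
             .lastHeartbeatMs = nowMs;
    }

    // State is derived, not stored, so nodes never migrate between maps and
    // per-node updates need no global lock.
    HealthState stateOf(String datanodeId, long nowMs) {
        DatanodeInfo info = nodes.get(datanodeId);
        long age = nowMs - info.lastHeartbeatMs;
        if (age >= DEAD_MS) return HealthState.DEAD;
        if (age >= STALE_MS) return HealthState.STALE;
        return HealthState.HEALTHY;
    }

    public static void main(String[] args) {
        NodeStateSketch mgr = new NodeStateSketch();
        mgr.onHeartbeat("dn1", 0);
        System.out.println(mgr.stateOf("dn1", 10_000));  // HEALTHY
        System.out.println(mgr.stateOf("dn1", 100_000)); // STALE
        System.out.println(mgr.stateOf("dn1", 700_000)); // DEAD
    }
}
```

This also matches the point that most stale states are transient: a fresh heartbeat simply updates the timestamp and the derived state flips back to HEALTHY with no map shuffling.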
[jira] [Commented] (HDFS-10826) The result of fsck should be CRITICAL when there are unrecoverable ec block groups.
[ https://issues.apache.org/jira/browse/HDFS-10826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15530710#comment-15530710 ] Jing Zhao commented on HDFS-10826: -- # The failure of TestLeaseRecoveryStriped seems related. Could you please take a look, [~tasanuma0829]? # {{countNodes}} has already been called in {{createLocatedBlock}}. We can reuse the result. {code} 1071final boolean isCorrupt; 1072if (blk.isStriped()) { 1073 BlockInfoStriped sblk = (BlockInfoStriped) blk; 1074 isCorrupt = numCorruptReplicas != 0 && 1075 countNodes(blk).liveReplicas() < sblk.getRealDataBlockNum(); 1076} else { 1077 isCorrupt = numCorruptReplicas != 0 && numCorruptReplicas == numNodes; 1078} {code} > The result of fsck should be CRITICAL when there are unrecoverable ec block > groups. > --- > > Key: HDFS-10826 > URL: https://issues.apache.org/jira/browse/HDFS-10826 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma > Attachments: HDFS-10826.2.patch, HDFS-10826.3.patch, > HDFS-10826.WIP.1.patch > > > For RS-6-3, when there is one ec block group and > 1) 0~3 out of 9 internal blocks are missing, the result of fsck is HEALTY. > 2) 4~8 out of 9 internal blocks are missing, the result of fsck is HEALTY. > {noformat} > Erasure Coded Block Groups: > Total size:536870912 B > Total files: 1 > Total block groups (validated):1 (avg. 
block group size 536870912 B) > > UNRECOVERABLE BLOCK GROUPS: 1 (100.0 %) > > Minimally erasure-coded block groups: 0 (0.0 %) > Over-erasure-coded block groups: 0 (0.0 %) > Under-erasure-coded block groups: 1 (100.0 %) > Unsatisfactory placement block groups: 0 (0.0 %) > Default ecPolicy: RS-DEFAULT-6-3-64k > Average block group size: 5.0 > Missing block groups: 0 > Corrupt block groups: 0 > Missing internal blocks: 4 (44.43 %) > FSCK ended at Wed Aug 31 13:42:05 JST 2016 in 4 milliseconds > The filesystem under path '/' is HEALTHY > {noformat} > 3) 9 out of 9 internal blocks are missing, the result of fsck is CRITICAL. > (Because it is regarded as a missing block group.) > In case 2), the result should be CRITICAL since the ec block group is > unrecoverable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
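The review point about reusing {{countNodes}} can be illustrated as follows (simplified stand-ins, not BlockManager itself): the replica counts are computed once by the caller and passed into the missing-block check, rather than calling the relatively expensive count a second time inside it.

```java
// Simplified stand-in for the BlockManager logic quoted above; NumberReplicas
// and the block parameters are modeled with only what this sketch needs.
public class CorruptCheckSketch {
    static class NumberReplicas {
        final int live;
        final int corrupt;
        NumberReplicas(int live, int corrupt) {
            this.live = live;
            this.corrupt = corrupt;
        }
    }

    // The countNodes(blk) result is computed once in createLocatedBlock and
    // reused here, instead of being recomputed inside the check.
    static boolean isCorrupt(boolean striped, int realDataBlockNum,
                             int numNodes, NumberReplicas counts) {
        if (striped) {
            // A striped group is effectively missing when its live internal
            // blocks drop below the number of data blocks.
            return counts.corrupt != 0 && counts.live < realDataBlockNum;
        }
        // Contiguous block: corrupt only when every replica is corrupt.
        return counts.corrupt != 0 && counts.corrupt == numNodes;
    }

    public static void main(String[] args) {
        // RS-6-3 with 4 missing internal blocks, as in case 2) above.
        NumberReplicas counts = new NumberReplicas(5, 4);
        System.out.println(isCorrupt(true, 6, 9, counts)); // true
    }
}
```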