[jira] [Commented] (HDFS-12049) Recommissioning live nodes stalls the NN
[ https://issues.apache.org/jira/browse/HDFS-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801657#comment-16801657 ]

Wei-Chiu Chuang commented on HDFS-12049:

We saw this bug recently. As a data point, recommissioning a DataNode with ~2 million blocks stalled the NN for nearly 5 minutes. Having examined the jstack output collected during the recommission, I suspect logging contributed partially to the slowdown. The cluster's version doesn't have HDFS-6860, so every invalidated block prints a line of INFO log.

HDFS-6860 should alleviate this pain, but in the end it doesn't make much sense to hold the NN lock and iterate through all replicas on all recommissioning DNs. The lock should be released periodically. HDFS-10477 reports a similar issue and a patch is pending, but there are concurrency concerns with the patch if the lock is released while iterating the block iterator.

> Recommissioning live nodes stalls the NN
>
> Key: HDFS-12049
> URL: https://issues.apache.org/jira/browse/HDFS-12049
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Daryn Sharp
> Priority: Critical
>
> A node refresh will recommission included nodes that are alive and in
> decommissioning or decommissioned state. The recommission will scan all
> blocks on the node, find over-replicated blocks, choose an excess, queue an
> invalidate.
> The process is expensive and worsened by overhead of storage types (even when
> not in use). It can be especially devastating because the write lock is held
> for the entire node refresh. _Recommissioning 67 nodes with ~500k
> blocks/node stalled rpc services for over 4 mins._

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
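The suggestion in the comment above, that the lock be released periodically rather than held for the whole scan, can be illustrated with a minimal sketch. This is not HDFS code and not the pending HDFS-10477 patch: the class `ChunkedLockScan`, the `batchSize` parameter, and `processBlock` are hypothetical names, and the sketch assumes the caller passes a stable snapshot of block IDs. Resuming by index over a snapshot is one way to avoid keeping a live iterator open across a lock release, which is the concurrency concern raised in the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Illustrative only: not HDFS code. All names here are hypothetical. */
class ChunkedLockScan {
    // Fair lock, so threads waiting for the lock get a turn between batches.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
    private final int batchSize;
    private int lockAcquisitions = 0;

    ChunkedLockScan(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    /**
     * Processes every block in the snapshot, holding the write lock for at
     * most batchSize blocks at a time. Returns the number of blocks processed.
     */
    int scan(List<Long> blockIdSnapshot) {
        int processed = 0;
        int next = 0; // resume position; survives lock release
        while (next < blockIdSnapshot.size()) {
            lock.writeLock().lock();
            lockAcquisitions++;
            try {
                int end = Math.min(next + batchSize, blockIdSnapshot.size());
                for (; next < end; next++) {
                    processBlock(blockIdSnapshot.get(next));
                    processed++;
                }
            } finally {
                // Release between batches so other lock users can make progress.
                lock.writeLock().unlock();
            }
        }
        return processed;
    }

    /** Placeholder for per-block work, e.g. checking for excess replicas. */
    private void processBlock(long blockId) {
        // Assumption: a real implementation must re-validate the block here,
        // since state may have changed while the lock was released.
    }

    int lockAcquisitions() {
        return lockAcquisitions;
    }

    public static void main(String[] args) {
        List<Long> blocks = new ArrayList<>();
        for (long i = 0; i < 2500; i++) {
            blocks.add(i);
        }
        ChunkedLockScan scan = new ChunkedLockScan(1000);
        int processed = scan.scan(blocks);
        System.out.println(processed + " blocks in "
                + scan.lockAcquisitions() + " lock holds");
    }
}
```

The trade-off in this sketch is that work done in one batch may be stale by the time the next batch runs, so each block would need to be re-checked under the lock before acting on it.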
[jira] [Commented] (HDFS-12049) Recommissioning live nodes stalls the NN
[ https://issues.apache.org/jira/browse/HDFS-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617249#comment-16617249 ]

Sunil Govindan commented on HDFS-12049:

As the code freeze for 3.2 has passed, I am moving this Jira to 3.3. Please feel free to revert if anyone has concerns. Thank you.
[jira] [Commented] (HDFS-12049) Recommissioning live nodes stalls the NN
[ https://issues.apache.org/jira/browse/HDFS-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608706#comment-16608706 ]

Sunil Govindan commented on HDFS-12049:

Hi [~daryn], could you please help check on this issue? As there is no progress and the code freeze for 3.2.0 is nearing, we can move it to 3.3.0 if there are no immediate plans.
[jira] [Commented] (HDFS-12049) Recommissioning live nodes stalls the NN
[ https://issues.apache.org/jira/browse/HDFS-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595913#comment-16595913 ]

Sunil Govindan commented on HDFS-12049:

Hi [~daryn], as this Jira is marked Critical for 3.2, could you please help take it forward, or move it out if it is not feasible to finish in the coming weeks? The 3.2 code freeze date is near. Kindly help to check the same.
[jira] [Commented] (HDFS-12049) Recommissioning live nodes stalls the NN
[ https://issues.apache.org/jira/browse/HDFS-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353288#comment-16353288 ]

Wangda Tan commented on HDFS-12049:

We plan to start the merge vote for 3.1.0 on Feb 18. Please let me know if there is any plan to finish this by Feb 18, or whether we need to move it to 3.2.0.
[jira] [Commented] (HDFS-12049) Recommissioning live nodes stalls the NN
[ https://issues.apache.org/jira/browse/HDFS-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197711#comment-16197711 ]

Subru Krishnan commented on HDFS-12049:

I am moving this to 3.1.0 given the lack of activity. Feel free to revert if anyone has concerns. Thanks!
[jira] [Commented] (HDFS-12049) Recommissioning live nodes stalls the NN
[ https://issues.apache.org/jira/browse/HDFS-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108222#comment-16108222 ]

Junping Du commented on HDFS-12049:

Moved. Thanks Daryn.
[jira] [Commented] (HDFS-12049) Recommissioning live nodes stalls the NN
[ https://issues.apache.org/jira/browse/HDFS-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098926#comment-16098926 ]

Daryn Sharp commented on HDFS-12049:

I'm ok with it.
[jira] [Commented] (HDFS-12049) Recommissioning live nodes stalls the NN
[ https://issues.apache.org/jira/browse/HDFS-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095329#comment-16095329 ]

Junping Du commented on HDFS-12049:

This seems to be a long-standing issue, and a bit risky/complicated to fix in a maintenance release. [~daryn], I am inclined to move it to the 2.9 release if you also agree.