[jira] [Commented] (HDFS-7642) NameNode should periodically log DataNode decommissioning progress
[ https://issues.apache.org/jira/browse/HDFS-7642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543411#comment-15543411 ]

Andrew Wang commented on HDFS-7642:
-----------------------------------

Thanks for working on this, Sean. One meta comment and then some code-related ones:

What normally happens is that decom gets stuck at the end because of open-for-write files. So, as an operator, often what you want to know is:

* Is this datanode still making progress?
* If not, is it blocked on open-for-write files? What are these files? Which client is keeping these files open?

I'm not sure that adding more logging really helps with this. We already have logging in logBlockReplicationInfo that gives you similar status information. The remaining gaps are in understanding the rate of decommissioning (which might be better addressed with per-DN rate metrics) and in having some debug tool that dumps the open-for-write files for a DN and the corresponding clients who own the file leases (HDFS-10480 is along those lines). What do you think?

Code related:

* Can we make the new class static?
* We can use primitives (int) rather than objects (Integer) for better efficiency.
* Recommend we change this to debug logging. Decom can take hours and be done on tens of nodes at a time, so printing like this can be spammy.
* It would also be useful to track when this node was set to "decommissioning" status, so you can judge the rate of progress.

> NameNode should periodically log DataNode decommissioning progress
> ------------------------------------------------------------------
>
>                 Key: HDFS-7642
>                 URL: https://issues.apache.org/jira/browse/HDFS-7642
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Zhe Zhang
>            Assignee: Sean Mackrory
>            Priority: Minor
>         Attachments: HDFS-7642.001.patch
>
>
> We've seen a case where decommissioning was stuck because some files had
> more replicas than DNs. HDFS-5662 fixes this particular issue, but there are
> other use cases where the decommissioning process might get stuck or slow
> down. Some monitoring / logging will help debug those issues.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
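
The code-related points in Andrew Wang's comment (a static nested class, primitive counters, debug-level logging, and recording when the node entered "decommissioning") can be illustrated with a small, self-contained sketch. This is not the code from HDFS-7642.001.patch: the names DecomProgressSketch and NodeProgress are hypothetical, and java.util.logging at FINE level stands in for whatever debug logging the NameNode actually uses.

```java
import java.util.Locale;
import java.util.logging.Level;
import java.util.logging.Logger;

public class DecomProgressSketch {
    // Assumption: java.util.logging is used only to keep the sketch free of
    // external dependencies; FINE plays the role of "debug" here.
    private static final Logger LOG =
        Logger.getLogger(DecomProgressSketch.class.getName());

    /** Static nested class: it carries no hidden reference to an outer instance. */
    static final class NodeProgress {
        private final String nodeName;
        private final long startTimeMs;   // when the node was set to "decommissioning"
        private final int initialBlocks;  // primitive int rather than Integer
        private int remainingBlocks;

        NodeProgress(String nodeName, long startTimeMs, int initialBlocks) {
            this.nodeName = nodeName;
            this.startTimeMs = startTimeMs;
            this.initialBlocks = initialBlocks;
            this.remainingBlocks = initialBlocks;
        }

        /** Records the number of blocks still waiting to be replicated. */
        void update(int remainingBlocks) {
            this.remainingBlocks = remainingBlocks;
        }

        int processedBlocks() {
            return initialBlocks - remainingBlocks;
        }

        /** Average blocks per second since decommissioning began; 0 if no time has elapsed. */
        double blocksPerSecond(long nowMs) {
            long elapsedMs = nowMs - startTimeMs;
            if (elapsedMs <= 0) {
                return 0.0;
            }
            return processedBlocks() * 1000.0 / elapsedMs;
        }

        /** Logs at debug (FINE) level only, so long decommissions do not flood the log. */
        void logProgress(long nowMs) {
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine(String.format(Locale.ROOT,
                    "%s: %d of %d blocks remaining, %.2f blocks/s over %d s",
                    nodeName, remainingBlocks, initialBlocks,
                    blocksPerSecond(nowMs), (nowMs - startTimeMs) / 1000));
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical node name and block counts, for illustration only.
        NodeProgress p = new NodeProgress("dn-1.example.com", 0L, 1000);
        p.update(400);
        p.logProgress(60_000L); // silent unless FINE logging is enabled
        System.out.println(p.blocksPerSecond(60_000L)); // 600 blocks over 60 s
    }
}
```

Keeping the start timestamp next to the counters is what makes a rate computable at all; whether that rate is better surfaced through a log line or a per-DN metric is the open question raised in the comment.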
[jira] [Commented] (HDFS-7642) NameNode should periodically log DataNode decommissioning progress
[ https://issues.apache.org/jira/browse/HDFS-7642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536954#comment-15536954 ]

Zhe Zhang commented on HDFS-7642:
---------------------------------

[~mackrorysd] Sure! Thanks for the interest. Unassigning myself now.

> NameNode should periodically log DataNode decommissioning progress
> ------------------------------------------------------------------
>
>                 Key: HDFS-7642
>                 URL: https://issues.apache.org/jira/browse/HDFS-7642
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>            Priority: Minor
>
> We've seen a case where decommissioning was stuck because some files had
> more replicas than DNs. HDFS-5662 fixes this particular issue, but there are
> other use cases where the decommissioning process might get stuck or slow
> down. Some monitoring / logging will help debug those issues.
[jira] [Commented] (HDFS-7642) NameNode should periodically log DataNode decommissioning progress
[ https://issues.apache.org/jira/browse/HDFS-7642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536701#comment-15536701 ]

Sean Mackrory commented on HDFS-7642:
-------------------------------------

Hey [~zhz], I'd like to work on this, if that's alright with you. Since it's been quite a few months, I assume nothing is actively in progress here? I would probably implement this somewhere inside DecommissionManager.Monitor.run().

> NameNode should periodically log DataNode decommissioning progress
> ------------------------------------------------------------------
>
>                 Key: HDFS-7642
>                 URL: https://issues.apache.org/jira/browse/HDFS-7642
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>            Priority: Minor
>
> We've seen a case where decommissioning was stuck because some files had
> more replicas than DNs. HDFS-5662 fixes this particular issue, but there are
> other use cases where the decommissioning process might get stuck or slow
> down. Some monitoring / logging will help debug those issues.