[jira] [Comment Edited] (HDFS-12288) Fix DataNode's xceiver count calculation
[ https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927314#comment-16927314 ]

Chen Zhang edited comment on HDFS-12288 at 9/11/19 7:26 AM:

Hi [~shahrs87] [~elgoiri], do you have time to take a look?

I changed the code according to the previous discussion and uploaded patch v3. It is not a complete patch, only a draft without tests.

{quote}
The method {{DataNode#getActiveNumberOfThreads()}} will return the sum of {{new DataNode#getXceiverCount() * 2}} + {{Num of Block recovery threads}}. We just need to have another metric or member variable to track currently running {{Block recovery threads}}.
The reason we have a multiplier of 2 is that for every {{DataXceiver}} thread, we also create a {{Packet Responder thread}}.
{quote}

Actually, not every DataXceiver thread creates a PacketResponder thread; only an xceiver processing a WRITE_BLOCK operation creates one. So I added 2 additional metrics: {{dataNodePacketResponderCount}} and {{dataNodeBlockRecoveryWorkerCount}}.

I think the variable name {{xceiverCount}} in {{HeartbeatRequestProto}} also needs to change to another name (e.g. transferThreadCount), but changing this variable name affects a lot of logic on the NameNode side, including {{FSNameSystem}}, {{DataNodeManager}}, {{BlockManager}}, {{HeartBeatManager}}, {{DataNodeStats}} and so on. This variable name is used everywhere; do you think it's necessary to change it?

was (Author: zhangchen):
Hi [~shahrs87] [~elgoiri], do you have time to take a look?

I changed the code according to the previous discussion and uploaded patch v3. It is not a complete patch, only a draft without tests.

{quote}
The method {{DataNode#getActiveNumberOfThreads()}} will return the sum of {{new DataNode#getXceiverCount() * 2}} + {{Num of Block recovery threads}}. We just need to have another metric or member variable to track currently running {{Block recovery threads}}.
The reason we have a multiplier of 2 is that for every {{DataXceiver}} thread, we also create a {{Packet Responder thread}}.
{quote}

Actually, not every DataXceiver thread creates a PacketResponder thread; only an xceiver processing a WRITE_BLOCK operation creates one. So I added 2 additional metrics: {{dataNodePacketResponderCount}} and {{dataNodeBlockRecoveryWorkerCount}}.

> Fix DataNode's xceiver count calculation
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs
> Reporter: Lukas Majercak
> Assignee: Lukas Majercak
> Priority: Major
> Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch,
> HDFS-12288.003.patch
>
> The problem with the ThreadGroup.activeCount() method is that the method is
> only a very rough estimate, and in reality returns the total number of
> threads in the thread group as opposed to the threads actually running.
> In some DNs, we saw this return ~50 for a long time, even though the
> actual number of DataXceiver threads was next to none.
> This is a big issue as we use the xceiverCount to make decisions on the NN
> for choosing replication source DN or returning DNs to clients for R/W.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value
> which only accounts for the actual number of DataXceiver threads currently
> running and thus represents the load on the DN much better.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
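
The counting scheme discussed in the comment above can be sketched roughly as follows. This is only an illustration, not the code from patch v3: the class {{TransferThreadCounters}} and all of its method names are hypothetical, and the three counters merely stand in for the xceiver, PacketResponder and BlockRecoveryWorker metrics named in the comment. It contrasts summing separate counters with the "xceivers * 2 + recovery threads" estimate from the quoted proposal.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the DataNode-side counters discussed in the thread. */
final class TransferThreadCounters {
    private final AtomicInteger xceivers = new AtomicInteger();
    private final AtomicInteger packetResponders = new AtomicInteger();
    private final AtomicInteger blockRecoveryWorkers = new AtomicInteger();

    // Each thread type bumps its own counter when it starts and when it ends.
    void xceiverStarted() { xceivers.incrementAndGet(); }
    void xceiverFinished() { xceivers.decrementAndGet(); }
    void packetResponderStarted() { packetResponders.incrementAndGet(); }
    void packetResponderFinished() { packetResponders.decrementAndGet(); }
    void blockRecoveryWorkerStarted() { blockRecoveryWorkers.incrementAndGet(); }
    void blockRecoveryWorkerFinished() { blockRecoveryWorkers.decrementAndGet(); }

    /** Sum of the three separately tracked counters. */
    int activeTransferThreads() {
        return xceivers.get() + packetResponders.get() + blockRecoveryWorkers.get();
    }

    /** The quoted estimate: every xceiver is assumed to own a PacketResponder. */
    int doubledEstimate() {
        return xceivers.get() * 2 + blockRecoveryWorkers.get();
    }

    public static void main(String[] args) {
        TransferThreadCounters c = new TransferThreadCounters();
        // Three xceivers, only one of which handles WRITE_BLOCK and
        // therefore owns a PacketResponder, plus one recovery worker.
        c.xceiverStarted();
        c.xceiverStarted();
        c.xceiverStarted();
        c.packetResponderStarted();
        c.blockRecoveryWorkerStarted();
        System.out.println("separate counters: " + c.activeTransferThreads());
        System.out.println("doubled estimate:  " + c.doubledEstimate());
    }
}
```

With mostly read traffic the doubled estimate overstates the thread count, which is the motivation given in the comment for tracking PacketResponder threads separately.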
[jira] [Comment Edited] (HDFS-12288) Fix DataNode's xceiver count calculation
[ https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124333#comment-16124333 ]

Lukas Majercak edited comment on HDFS-12288 at 8/12/17 12:19 AM:

It does not represent the number of active threads in the process either. I am currently looking at a DataNode that shows:
{{xceiverCount}} = 15
and JVM metrics show:
"ThreadsRunnable" : 64,
"ThreadsBlocked" : 0,
"ThreadsWaiting" : 13,
"ThreadsTimedWaiting" : 54,
"ThreadsTerminated" : 0,

was (Author: lukmajercak):
It does not represent the number of active threads either. I am currently looking at a DataNode that shows:
{{xceiverCount}} = 15
and JVM metrics show:
"ThreadsRunnable" : 64,
"ThreadsBlocked" : 0,
"ThreadsWaiting" : 13,
"ThreadsTimedWaiting" : 54,
"ThreadsTerminated" : 0,

> Fix DataNode's xceiver count calculation
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs
> Reporter: Lukas Majercak
> Assignee: Lukas Majercak
> Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch
>
> The problem with the ThreadGroup.activeCount() method is that the method is
> only a very rough estimate, and in reality returns the total number of
> threads in the thread group as opposed to the threads actually running.
> In some DNs, we saw this return ~50 for a long time, even though the
> actual number of DataXceiver threads was next to none.
> This is a big issue as we use the xceiverCount to make decisions on the NN
> for choosing replication source DN or returning DNs to clients for R/W.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value
> which only accounts for the actual number of DataXceiver threads currently
> running and thus represents the load on the DN much better.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
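
The point in the issue description about {{ThreadGroup.activeCount()}} can be reproduced in isolation. The sketch below is not DataNode code: the class {{ThreadGroupCountDemo}}, the group name and the {{busy}} counter are made up for illustration. It parks a few threads in a thread group and compares {{activeCount()}} with an explicit counter that would be incremented only while a thread is doing work.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

final class ThreadGroupCountDemo {
    /**
     * Starts idleThreads threads that only wait on a latch and returns
     * { group.activeCount(), explicit busy counter } sampled while they wait.
     */
    static int[] sample(int idleThreads) {
        ThreadGroup group = new ThreadGroup("demo-xceivers");
        // Would be incremented around real work; the threads below never do any.
        AtomicInteger busy = new AtomicInteger();
        CountDownLatch started = new CountDownLatch(idleThreads);
        CountDownLatch release = new CountDownLatch(1);
        Thread[] threads = new Thread[idleThreads];
        try {
            for (int i = 0; i < idleThreads; i++) {
                threads[i] = new Thread(group, () -> {
                    started.countDown();
                    try {
                        release.await(); // alive, but idle
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                threads[i].start();
            }
            started.await();
            int[] result = { group.activeCount(), busy.get() };
            release.countDown();
            for (Thread t : threads) {
                t.join();
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] r = sample(3);
        System.out.println("activeCount=" + r[0] + " busy=" + r[1]);
    }
}
```

The thread group reports the waiting threads as active even though none of them is doing work, while the explicit counter stays at zero. That is the kind of gap the description attributes to the heartbeat's xceiverCount.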
[jira] [Comment Edited] (HDFS-12288) Fix DataNode's xceiver count calculation
[ https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124039#comment-16124039 ]

Lukas Majercak edited comment on HDFS-12288 at 8/11/17 9:45 PM:

Correct me if I'm wrong, but {{BlockReceiver}} and {{PacketResponder}} are created for each DataXceiver; therefore, I don't think it makes sense to include them in the calculation.

# DataXceiver.run()
# addPeer (increase counter)
# processOp -> (write/replace)
# new BlockReceiver
# receiveBlock
# new PacketResponder

was (Author: lukmajercak):
Correct me if I'm wrong, but {{BlockReceiver}} and {{PacketResponder}} are created for each DataXceiver; therefore, I don't think it makes sense to include them in the calculation.

# DataXceiver.run()
# addPeer (increase counter)
# processOp -> (write/replace)
# new BlockReceiver
# receiveBlock / writeBlock

> Fix DataNode's xceiver count calculation
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs
> Reporter: Lukas Majercak
> Assignee: Lukas Majercak
> Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch
>
> The problem with the ThreadGroup.activeCount() method is that the method is
> only a very rough estimate, and in reality returns the total number of
> threads in the thread group as opposed to the threads actually running.
> In some DNs, we saw this return ~50 for a long time, even though the
> actual number of DataXceiver threads was next to none.
> This is a big issue as we use the xceiverCount to make decisions on the NN
> for choosing replication source DN or returning DNs to clients for R/W.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value
> which only accounts for the actual number of DataXceiver threads currently
> running and thus represents the load on the DN much better.
[jira] [Comment Edited] (HDFS-12288) Fix DataNode's xceiver count calculation
[ https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124121#comment-16124121 ]

Lukas Majercak edited comment on HDFS-12288 at 8/11/17 9:34 PM:

For {{BlockRecoveryWorker}} I agree, this should be included. But then so should {{DataNode.DataTransfer}}, in my opinion, which is currently only represented by {{DataNode.xmitsInProgress}}. Note that this one is not even included in the current calculation using {{threadGroup}}, afaik.

was (Author: lukmajercak):
For {{BlockRecoveryWorker}} I agree, this should be included. But then so should {{DataNode.DataTransfer}}, in my opinion, which is currently only represented by {{DataNode.xmitsInProgress}}.

> Fix DataNode's xceiver count calculation
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs
> Reporter: Lukas Majercak
> Assignee: Lukas Majercak
> Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch
>
> The problem with the ThreadGroup.activeCount() method is that the method is
> only a very rough estimate, and in reality returns the total number of
> threads in the thread group as opposed to the threads actually running.
> In some DNs, we saw this return ~50 for a long time, even though the
> actual number of DataXceiver threads was next to none.
> This is a big issue as we use the xceiverCount to make decisions on the NN
> for choosing replication source DN or returning DNs to clients for R/W.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value
> which only accounts for the actual number of DataXceiver threads currently
> running and thus represents the load on the DN much better.
[jira] [Comment Edited] (HDFS-12288) Fix DataNode's xceiver count calculation
[ https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124121#comment-16124121 ]

Lukas Majercak edited comment on HDFS-12288 at 8/11/17 9:34 PM:

For {{BlockRecoveryWorker}} I agree, this should be included. But then so should {{DataNode.DataTransfer}}, in my opinion, which is currently only represented by {{DataNode.xmitsInProgress}}.

was (Author: lukmajercak):
For {{BlockRecoveryWorker}} I agree, this should be included. But then so should DataTransfer, in my opinion, which is currently only represented by {{DataNode.xmitsInProgress}}.

> Fix DataNode's xceiver count calculation
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs
> Reporter: Lukas Majercak
> Assignee: Lukas Majercak
> Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch
>
> The problem with the ThreadGroup.activeCount() method is that the method is
> only a very rough estimate, and in reality returns the total number of
> threads in the thread group as opposed to the threads actually running.
> In some DNs, we saw this return ~50 for a long time, even though the
> actual number of DataXceiver threads was next to none.
> This is a big issue as we use the xceiverCount to make decisions on the NN
> for choosing replication source DN or returning DNs to clients for R/W.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value
> which only accounts for the actual number of DataXceiver threads currently
> running and thus represents the load on the DN much better.
[jira] [Comment Edited] (HDFS-12288) Fix DataNode's xceiver count calculation
[ https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123483#comment-16123483 ]

Rushabh S Shah edited comment on HDFS-12288 at 8/11/17 3:34 PM:

Thanks [~lukmajercak] for reporting the bug and providing the patch.

1.
{noformat}
    // the load for writers is 2 because both the write xceiver & packet
    // responder threads are counted in the load
    expectedTotalLoad += fileRepl;
    expectedInServiceLoad += fileRepl;
{noformat}
This comment is there for a reason. When we receive a block, it creates 2 threads: one is the DataXceiver thread and the other is the Packet Responder thread.
If we are using {{DataNodeMetrics#getDataNodeActiveXceiversCount}} as a replacement for {{activeThreadCount}}, then we need to add {{PacketResponderThread}} to {{DataNodeMetrics#dataNodeActiveXceiversCount}}; otherwise we will create twice the number of threads compared to today.

2.
{noformat}
-return threadGroup == null ? 0 : threadGroup.activeCount();
+return metrics == null ? 0 : metrics.getDataNodeActiveXceiversCount();
{noformat}
We need to check once more whether the DataNode can start without initializing the metrics. Looking at the code, I think it's not possible, but we just need to make sure.

was (Author: shahrs87):
{noformat}
    // the load for writers is 2 because both the write xceiver & packet
    // responder threads are counted in the load
    expectedTotalLoad += fileRepl;
    expectedInServiceLoad += fileRepl;
{noformat}
This comment is there for a reason. When we receive a block, it creates 2 threads: one is the DataXceiver thread and the other is the Packet Responder thread.
If we are using {{DataNodeMetrics#getDataNodeActiveXceiversCount}} as a replacement for {{activeThreadCount}}, then we need to add {{PacketResponderThread}} to {{DataNodeMetrics#dataNodeActiveXceiversCount}}; otherwise we will create twice the number of threads compared to today.

{noformat}
-return threadGroup == null ? 0 : threadGroup.activeCount();
+return metrics == null ? 0 : metrics.getDataNodeActiveXceiversCount();
{noformat}
We need to check once more whether the DataNode can start without initializing the metrics. Looking at the code, I think it's not possible, but we just need to make sure.

> Fix DataNode's xceiver count calculation
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs
> Reporter: Lukas Majercak
> Assignee: Lukas Majercak
> Attachments: HDFS-12288.001.patch
>
> The problem with the ThreadGroup.activeCount() method is that the method is
> only a very rough estimate, and in reality returns the total number of
> threads in the thread group as opposed to the threads actually running.
> In some DNs, we saw this return ~50 for a long time, even though the
> actual number of DataXceiver threads was next to none.
> This is a big issue as we use the xceiverCount to make decisions on the NN
> for choosing replication source DN or returning DNs to clients for R/W.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value
> which only accounts for the actual number of DataXceiver threads currently
> running and thus represents the load on the DN much better.
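
To make the first review point concrete, here is a rough sketch of how the expected load for open write pipelines changes depending on whether PacketResponder threads are counted. It is an assumption-laden illustration, not the test code quoted in the review: the class {{ExpectedLoadSketch}}, the method {{expectedWriterLoad}} and its parameters are all hypothetical.

```java
final class ExpectedLoadSketch {
    /**
     * Expected load contributed by open write pipelines.
     * Each replica of an open file has one write xceiver; when
     * countPacketResponders is true, its PacketResponder is counted too.
     */
    static int expectedWriterLoad(int openFiles, int replication,
                                  boolean countPacketResponders) {
        int threadsPerWriter = countPacketResponders ? 2 : 1;
        return openFiles * replication * threadsPerWriter;
    }

    public static void main(String[] args) {
        // Four open files with replication 3.
        System.out.println("with responders:    " + expectedWriterLoad(4, 3, true));
        System.out.println("without responders: " + expectedWriterLoad(4, 3, false));
    }
}
```

If the reported count switches from the first figure to the second without any other change, every load-based comparison on the NameNode side sees writers as half as expensive as before, which is the concern raised in point 1.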