[jira] [Comment Edited] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-09-11 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927314#comment-16927314
 ] 

Chen Zhang edited comment on HDFS-12288 at 9/11/19 7:26 AM:


Hi [~shahrs87] [~elgoiri], do you have time to take a look? I changed the code 
according to the previous discussion and uploaded patch v3. It's not a complete 
patch, only a draft without tests.
{quote}The method {{DataNode#getActiveNumberOfThreads()}} will return the 
sum of {{new DataNode#getXceiverCount() * 2}} + {{Num of Block recovery 
threads}}.
We just need to have another metric or member variable to track currently 
running {{Block recovery threads}}.
The reason we have a multiplier of 2 is that for every {{DataXceiver}} thread, we 
also create a {{Packet Responder thread}}
{quote}
Actually, not every DataXceiver thread creates a PacketResponder thread; only 
an xceiver processing a WRITE_BLOCK operation creates one. So I added two 
additional metrics: {{dataNodePacketResponderCount}} and 
{{dataNodeBlockRecoveryWorkerCount}}
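
To make the difference between the two accounting schemes concrete, here is a minimal sketch. It is not the code from patch v3: the class {{TransferThreadCounters}} and all of its method names are hypothetical stand-ins, and the three counters only mirror the idea of tracking xceivers, packet responders and block recovery workers separately.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the DataNode-side counters discussed above.
// This is NOT the real DataNodeMetrics class; all names are assumptions.
class TransferThreadCounters {
  private final AtomicInteger activeXceivers = new AtomicInteger();
  private final AtomicInteger packetResponders = new AtomicInteger();
  private final AtomicInteger blockRecoveryWorkers = new AtomicInteger();

  void xceiverStarted() { activeXceivers.incrementAndGet(); }
  void xceiverFinished() { activeXceivers.decrementAndGet(); }
  void packetResponderStarted() { packetResponders.incrementAndGet(); }
  void packetResponderFinished() { packetResponders.decrementAndGet(); }
  void blockRecoveryWorkerStarted() { blockRecoveryWorkers.incrementAndGet(); }
  void blockRecoveryWorkerFinished() { blockRecoveryWorkers.decrementAndGet(); }

  // Per-type accounting: add up what is actually running.
  int getActiveTransferThreadCount() {
    return activeXceivers.get() + packetResponders.get() + blockRecoveryWorkers.get();
  }

  // The estimate from the quoted proposal: xceivers * 2 + recovery threads.
  int getDoubledEstimate() {
    return activeXceivers.get() * 2 + blockRecoveryWorkers.get();
  }

  public static void main(String[] args) {
    TransferThreadCounters c = new TransferThreadCounters();
    // Three readers: one xceiver thread each, no PacketResponder.
    for (int i = 0; i < 3; i++) {
      c.xceiverStarted();
    }
    // One writer: an xceiver plus its PacketResponder.
    c.xceiverStarted();
    c.packetResponderStarted();
    // One block recovery in progress.
    c.blockRecoveryWorkerStarted();

    System.out.println("per-type sum:     " + c.getActiveTransferThreadCount()); // 4 + 1 + 1 = 6
    System.out.println("doubled estimate: " + c.getDoubledEstimate());           // 4 * 2 + 1 = 9
  }
}
```

With a read-heavy mix like the one in {{main}}, the doubled estimate overstates the thread count, which is the motivation for counting packet responders on their own.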

I think the variable name {{xceiverCount}} in {{HeartbeatRequestProto}} also 
needs to change to something else (e.g. transferThreadCount). However, renaming 
it affects a lot of logic on the NameNode side, including 
{{FSNameSystem}}, {{DataNodeManager}}, {{BlockManager}}, {{HeartBeatManager}}, 
{{DataNodeStats}} and so on, because the name is used everywhere. Do you 
think it's necessary to change it?


was (Author: zhangchen):
Hi [~shahrs87] [~elgoiri], do you have time to take a look? I changed the code 
according to the previous discussion and uploaded patch v3. It's not a complete 
patch, only a draft without tests.
{quote}The method {{DataNode#getActiveNumberOfThreads()}} will return the 
sum of {{new DataNode#getXceiverCount() * 2}} + {{Num of Block recovery 
threads}}.
We just need to have another metric or member variable to track currently 
running {{Block recovery threads}}.
The reason we have a multiplier of 2 is that for every {{DataXceiver}} thread, we 
also create a {{Packet Responder thread}}
{quote}
Actually, not every DataXceiver thread creates a PacketResponder thread; only 
an xceiver processing a WRITE_BLOCK operation creates one. So I added two 
additional metrics: {{dataNodePacketResponderCount}} and 
{{dataNodeBlockRecoveryWorkerCount}}

> Fix DataNode's xceiver count calculation
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs
> Reporter: Lukas Majercak
> Assignee: Lukas Majercak
> Priority: Major
> Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch, 
> HDFS-12288.003.patch
>
> The problem with the ThreadGroup.activeCount() method is that it gives 
> only a very rough estimate: in reality it returns the total number of 
> threads in the thread group, as opposed to the threads actually running.
> In some DNs, we saw it return ~50 for a long time, even though the 
> actual number of DataXceiver threads was next to none.
> This is a big issue because we use the xceiverCount on the NN to make decisions 
> when choosing a replication source DN or returning DNs to clients for R/W.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value, 
> which only accounts for the actual number of DataXceiver threads currently 
> running and thus represents the load on the DN much better.
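
The behaviour described in the issue summary can be reproduced outside Hadoop with a small sketch. The class {{ActiveCountDemo}} and the thread names are made up for illustration; only {{ThreadGroup.activeCount()}} itself comes from the issue. The threads below do nothing except wait on a latch, yet they are still counted.

```java
import java.util.concurrent.CountDownLatch;

// Illustration only: shows that ThreadGroup.activeCount() counts live threads,
// including ones that are parked and doing no work.
class ActiveCountDemo {

  // Starts n threads in a fresh ThreadGroup that only wait on a latch and
  // returns what activeCount() reports before they are released.
  static int activeCountWhileIdle(int n) {
    ThreadGroup group = new ThreadGroup("demo-xceivers");
    CountDownLatch release = new CountDownLatch(1);
    Thread[] threads = new Thread[n];
    for (int i = 0; i < n; i++) {
      threads[i] = new Thread(group, () -> {
        try {
          release.await(); // idle until released; no real work is done
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }, "idle-" + i);
      threads[i].start();
    }

    int reported = group.activeCount(); // an estimate of live threads in the group

    release.countDown();
    try {
      for (Thread t : threads) {
        t.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while joining demo threads", e);
    }
    return reported;
  }

  public static void main(String[] args) {
    System.out.println("activeCount with 5 idle threads: " + activeCountWhileIdle(5));
  }
}
```

A counter that is incremented and decremented explicitly around the actual work does not have this problem, which is why the issue proposes reusing the existing metric.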



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12288) Fix DataNode's xceiver count calculation

2017-08-11 Thread Lukas Majercak (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124333#comment-16124333
 ] 

Lukas Majercak edited comment on HDFS-12288 at 8/12/17 12:19 AM:

It does not represent the number of active threads in the process either. I am 
currently looking at a DataNode that shows:
{{xceiverCount}} = 15 

and JVM metrics show:

"ThreadsRunnable" : 64,
"ThreadsBlocked" : 0,
"ThreadsWaiting" : 13,
"ThreadsTimedWaiting" : 54,
"ThreadsTerminated" : 0,

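For comparison, per-state thread counts like the ones above can be obtained from inside any JVM with the standard {{java.lang.management}} API. This is only a sketch; the class name {{ThreadStateTally}} is hypothetical and this is not claimed to be how the metrics quoted above are produced.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Sketch: tally all live JVM threads by Thread.State using ThreadMXBean.
class ThreadStateTally {

  static Map<Thread.State, Integer> tally() {
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    // maxDepth 0: no stack frames are needed, only the thread state.
    ThreadInfo[] infos = bean.getThreadInfo(bean.getAllThreadIds(), 0);
    Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
    for (ThreadInfo info : infos) {
      if (info == null) {
        continue; // the thread exited between the two calls
      }
      counts.merge(info.getThreadState(), 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    tally().forEach((state, count) -> System.out.println(state + " : " + count));
  }
}
```

In the numbers quoted above, the non-zero states add up to 64 + 13 + 54 = 131 threads in the process, against an {{xceiverCount}} of 15.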

was (Author: lukmajercak):
It does not represent the number of active threads either. I am currently looking 
at a DataNode that shows:
{{xceiverCount}} = 15 

and JVM metrics show:

"ThreadsRunnable" : 64,
"ThreadsBlocked" : 0,
"ThreadsWaiting" : 13,
"ThreadsTimedWaiting" : 54,
"ThreadsTerminated" : 0,







[jira] [Comment Edited] (HDFS-12288) Fix DataNode's xceiver count calculation

2017-08-11 Thread Lukas Majercak (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124039#comment-16124039
 ] 

Lukas Majercak edited comment on HDFS-12288 at 8/11/17 9:45 PM:


Correct me if I'm wrong, but {{BlockReceiver}} and {{PacketResponder}} are 
created within each DataXceiver; therefore, I don't think it makes sense to 
include them in the calculation.

# DataXceiver.run() 
# addPeer (increase counter)
# processOp -> (write/replace)
# new BlockReceiver 
# receiveBlock 
# new PacketResponder
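
The sequence above can be modelled with a toy sketch. None of this is Hadoop code: {{XceiverFlowSketch}}, {{Op}}, {{handle}} and the two counters are hypothetical, and the real {{DataXceiver}} does far more. The sketch only shows that a write increments the peer counter once while starting two threads.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the call sequence listed above; all names are assumptions.
class XceiverFlowSketch {
  enum Op { READ_BLOCK, WRITE_BLOCK }

  private final AtomicInteger peerCounter = new AtomicInteger();    // bumped once per xceiver
  private final AtomicInteger threadsStarted = new AtomicInteger(); // every thread the flow starts

  // Models one DataXceiver handling one operation, run to completion.
  void handle(Op op) {
    Thread xceiver = new Thread(() -> {
      threadsStarted.incrementAndGet(); // step 1: DataXceiver.run()
      peerCounter.incrementAndGet();    // step 2: addPeer (increase counter)
      if (op == Op.WRITE_BLOCK) {       // step 3: processOp -> write
        receiveBlock();                 // steps 4-5: new BlockReceiver, receiveBlock
      }
    }, "xceiver");
    xceiver.start();
    joinQuietly(xceiver);
  }

  // Step 6: the write path starts a separate responder thread.
  private void receiveBlock() {
    Thread responder = new Thread(() -> threadsStarted.incrementAndGet(), "responder");
    responder.start();
    joinQuietly(responder);
  }

  private static void joinQuietly(Thread t) {
    try {
      t.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for " + t.getName(), e);
    }
  }

  int peerCount() { return peerCounter.get(); }
  int threadsStarted() { return threadsStarted.get(); }

  public static void main(String[] args) {
    XceiverFlowSketch sketch = new XceiverFlowSketch();
    sketch.handle(Op.READ_BLOCK);
    sketch.handle(Op.WRITE_BLOCK);
    System.out.println("peer counter = " + sketch.peerCount()
        + ", threads started = " + sketch.threadsStarted());
  }
}
```

Whether the responder thread should count toward the reported load is exactly the open question in this thread; the sketch takes no position on it.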
   


was (Author: lukmajercak):
Correct me if I'm wrong, but {{BlockReceiver}} and {{PacketResponder}} are 
created within each DataXceiver; therefore, I don't think it makes sense to 
include them in the calculation.

# DataXceiver.run() 
# addPeer (increase counter)
# processOp -> (write/replace)
# new BlockReceiver 
# receiveBlock / writeBlock
 
   







[jira] [Comment Edited] (HDFS-12288) Fix DataNode's xceiver count calculation

2017-08-11 Thread Lukas Majercak (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124121#comment-16124121
 ] 

Lukas Majercak edited comment on HDFS-12288 at 8/11/17 9:34 PM:


I agree that {{BlockRecoveryWorker}} should be included. But then so 
should {{DataNode.DataTransfer}}, in my opinion, which is currently only 
represented by {{DataNode.xmitsInProgress}}. Note that, as far as I know, this 
one is not even included in the current calculation using {{threadGroup}}.


was (Author: lukmajercak):
I agree that {{BlockRecoveryWorker}} should be included. But then so 
should {{DataNode.DataTransfer}}, in my opinion, which is currently only 
represented by {{DataNode.xmitsInProgress}}.







[jira] [Comment Edited] (HDFS-12288) Fix DataNode's xceiver count calculation

2017-08-11 Thread Lukas Majercak (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124121#comment-16124121
 ] 

Lukas Majercak edited comment on HDFS-12288 at 8/11/17 9:34 PM:


I agree that {{BlockRecoveryWorker}} should be included. But then so 
should {{DataNode.DataTransfer}}, in my opinion, which is currently only 
represented by {{DataNode.xmitsInProgress}}.


was (Author: lukmajercak):
I agree that {{BlockRecoveryWorker}} should be included. But then so 
should DataTransfer, in my opinion, which is currently only represented by 
{{DataNode.xmitsInProgress}}.







[jira] [Comment Edited] (HDFS-12288) Fix DataNode's xceiver count calculation

2017-08-11 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123483#comment-16123483
 ] 

Rushabh S Shah edited comment on HDFS-12288 at 8/11/17 3:34 PM:


Thanks [~lukmajercak] for reporting the bug and providing the patch.
1.
{noformat}
// the load for writers is 2 because both the write xceiver & packet
// responder threads are counted in the load
expectedTotalLoad += fileRepl;
expectedInServiceLoad += fileRepl;
{noformat}
This comment is there for a reason.
When we receive a block, two threads are created: one is the DataXceiver thread 
and the other is the Packet Responder thread.
If we use {{DataNodeMetrics#getDataNodeActiveXceiversCount}} as a 
replacement for {{activeThreadCount}}, then we need to add 
{{PacketResponderThread}} to {{DataNodeMetrics#dataNodeActiveXceiversCount}}; 
otherwise we will create twice the number of threads compared to today.

2.
{noformat}
-return threadGroup == null ? 0 : threadGroup.activeCount();
+return metrics == null ? 0 : metrics.getDataNodeActiveXceiversCount();
{noformat}
We need to check once more whether there is any possibility that a datanode can 
start without initializing the metrics.
Looking at the code, I think it's not possible, but we need to make sure.
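
As a minimal sketch of the guard in the diff above, assuming the metrics reference may still be null early in startup: {{XceiverCountGuard}}, {{Metrics}} and {{initMetrics}} are hypothetical stand-ins, and only the getter name {{getDataNodeActiveXceiversCount}} is taken from the patch.

```java
// Illustration of the null guard only; this is not the DataNode class.
class XceiverCountGuard {

  // Hypothetical stand-in for the metrics object used in the diff.
  static final class Metrics {
    private final int activeXceivers;

    Metrics(int activeXceivers) {
      this.activeXceivers = activeXceivers;
    }

    int getDataNodeActiveXceiversCount() {
      return activeXceivers;
    }
  }

  // Stays null until initMetrics() is called (assumed startup ordering).
  private volatile Metrics metrics;

  void initMetrics(int activeXceivers) {
    metrics = new Metrics(activeXceivers);
  }

  int getXceiverCount() {
    Metrics m = metrics; // read the field once so the null check and the call agree
    return m == null ? 0 : m.getDataNodeActiveXceiversCount();
  }

  public static void main(String[] args) {
    XceiverCountGuard guard = new XceiverCountGuard();
    System.out.println("before init: " + guard.getXceiverCount()); // 0
    guard.initMetrics(7);
    System.out.println("after init:  " + guard.getXceiverCount()); // 7
  }
}
```

If the metrics can never be null at that point, the guard is harmless; if they can, a reported count of 0 would make the node look idle, so it is worth confirming either way.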


was (Author: shahrs87):
{noformat}
// the load for writers is 2 because both the write xceiver & packet
// responder threads are counted in the load
expectedTotalLoad += fileRepl;
expectedInServiceLoad += fileRepl;
{noformat}
This comment is there for a reason.
When we receive a block, two threads are created: one is the DataXceiver thread 
and the other is the Packet Responder thread.
If we use {{DataNodeMetrics#getDataNodeActiveXceiversCount}} as a 
replacement for {{activeThreadCount}}, then we need to add 
{{PacketResponderThread}} to {{DataNodeMetrics#dataNodeActiveXceiversCount}}; 
otherwise we will create twice the number of threads compared to today.

{noformat}
-return threadGroup == null ? 0 : threadGroup.activeCount();
+return metrics == null ? 0 : metrics.getDataNodeActiveXceiversCount();
{noformat}
We need to check once more whether there is any possibility that a datanode can 
start without initializing the metrics.
Looking at the code, I think it's not possible, but we need to make sure.



