[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-08-20 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911501#comment-16911501
 ] 

Weiwei Yang edited comment on YARN-8995 at 8/20/19 4:08 PM:


Hi [~zhuqi]/[~Tao Yang]

Thanks for working on this. The patch LGTM. I might be a little picky about the 
configuration name, but right now it is not straightforward to me.

"The interval of queue size (in thousands) for printing the boom queue event 
type details."

How about something like the following for the description, if I understand 
this correctly:

"The threshold used to trigger the logging of event types and counts in RM's 
main event dispatcher. The default is 5000, which means RM will print events 
info every time the queue size cumulatively reaches 5000. Such info can reveal 
which kinds of events RM is mostly stuck processing, and can help to narrow 
down certain performance issues."

Also, the config name would be better as something like 
{{yarn.dispatcher.print-events-info.threshold}}. You don't need to use 
in-thousands here, as several thousand is still human-readable.

Does that make sense?

Thanks
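
For illustration only, here is a minimal Java sketch of how such a threshold could be read and checked. It uses the config name suggested above, which is a proposal and not necessarily the name in the committed patch. It also assumes that "cumulatively reaches 5000 every time" means "each time the queue size lands on a non-zero multiple of the threshold". java.util.Properties stands in for Hadoop's Configuration so that the snippet is self-contained, and the class and method names are hypothetical.

```java
import java.util.Properties;

final class PrintEventsThreshold {
    // Name proposed in this comment; the committed name may differ.
    static final String KEY = "yarn.dispatcher.print-events-info.threshold";
    static final int DEFAULT_THRESHOLD = 5000;

    // Reads the threshold, falling back to the default when the key is unset.
    // Properties is only a stand-in for Hadoop's Configuration (assumption).
    static int readThreshold(Properties conf) {
        return Integer.parseInt(
            conf.getProperty(KEY, String.valueOf(DEFAULT_THRESHOLD)));
    }

    // Assumed semantics: trigger on every non-zero multiple of the threshold.
    static boolean shouldPrintEventsInfo(int queueSize, int threshold) {
        return threshold > 0 && queueSize > 0 && queueSize % threshold == 0;
    }

    public static void main(String[] args) {
        int threshold = readThreshold(new Properties());
        for (int size : new int[] {4999, 5000, 5001, 10000}) {
            System.out.println(size + " -> " + shouldPrintEventsInfo(size, threshold));
        }
    }
}
```

Expressing the value directly in events, rather than in thousands, keeps the check a plain modulo on the queue size.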


was (Author: cheersyang):
Hi [~zhuqi]/[~Tao Yang]

Thanks for working on this. Patch LGTM, I might be just a little picky on the 
configuration name, right now it is not straightforward to me.
{noformat}
The interval of queue size (in thousands) for printing the boom queue event 
type details.
{noformat}
How about something like the following for the description, if I understand 
this correctly:
{noformat}
The threshold used to trigger the logging of event types and counts in RM's 
main event dispatcher. Default length is 5000, which means RM will print events 
info when the queue size cumulatively reaches 5000 every time.  Such info can 
be used to reveal what kind of events that RM is stuck at processing mostly, it 
can help to narrow down certain performance issues.
{noformat}
And also, the config name is better to be something like 
{{yarn.dispatcher.print-events-info.threshold}}, you don't need to use 
in-thousands here, as several thousand is still human-readable.

Does that make sense?

Thanks

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size is too big, and add the 
> information to the metrics. The queue size threshold is a parameter which can 
> be changed.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-06-13 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862866#comment-16862866
 ] 

Tao Yang edited comment on YARN-8995 at 6/13/19 9:29 AM:

I did a simple test (details in TestStreamPerf.java) comparing the performance 
of sequential and parallel streams in a similar scenario: counting the elements 
of a blocking queue with 100 distinct keys and a total length of 
10,000/100,000/1,000,000/2,000,000 (1w/10w/100w/200w in the test names, where 
"w" stands for 10,000). It seems that the parallel stream indeed leads to more 
overhead than the sequential stream. The results of this test are as follows 
(suffix "_S" refers to sequential stream and suffix "_PS" refers to parallel 
stream):
{noformat}
TestStreamPerf.test_100_100w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.03 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 1, GC.time: 0.01, time.total: 0.64, time.warmup: 0.31, time.bench: 0.32
TestStreamPerf.test_100_100w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.02 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.37, time.warmup: 0.15, time.bench: 0.22
TestStreamPerf.test_100_10w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.08, time.warmup: 0.05, time.bench: 0.04
TestStreamPerf.test_100_10w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.04, time.warmup: 0.01, time.bench: 0.03
TestStreamPerf.test_100_1w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 0.01
TestStreamPerf.test_100_1w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 0.00
TestStreamPerf.test_100_200w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.07 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 1.03, time.warmup: 0.37, time.bench: 0.66
TestStreamPerf.test_100_200w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.04 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.70, time.warmup: 0.25, time.bench: 0.45
{noformat}
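
For readers without the attachment, a comparison of this kind can be sketched roughly as below. This is not the attached TestStreamPerf.java: the class name, helper methods, and the single-shot System.nanoTime timing are all assumptions made for illustration. A real benchmark harness adds warm-up rounds and repetition, so absolute numbers from this sketch are not comparable to the results above.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;
import java.util.stream.Collectors;

final class StreamCountSketch {
    // Fills a blocking queue with `length` elements cycling over `distinctKeys` keys.
    static BlockingQueue<Integer> buildQueue(int distinctKeys, int length) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < length; i++) {
            queue.add(i % distinctKeys);
        }
        return queue;
    }

    // Counts occurrences per key with a sequential stream.
    static Map<Integer, Long> countSequential(BlockingQueue<Integer> queue) {
        return queue.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Same counting, but with a parallel stream.
    static Map<Integer, Long> countParallel(BlockingQueue<Integer> queue) {
        return queue.parallelStream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // 100 distinct keys, 1,000,000 elements (the "100w" case).
        BlockingQueue<Integer> queue = buildQueue(100, 1_000_000);
        long t0 = System.nanoTime();
        Map<Integer, Long> seq = countSequential(queue);
        long t1 = System.nanoTime();
        Map<Integer, Long> par = countParallel(queue);
        long t2 = System.nanoTime();
        System.out.printf("sequential: %d ms, parallel: %d ms, same result: %b%n",
            (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, seq.equals(par));
    }
}
```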


was (Author: tao yang):
I did a simple test on performance comparison between sequential stream and 
parallel stream in a similar scenario: count a blocking queue with 100 distinct 
keys and 1w/10w/100w/200w total length, it seems that parallel stream indeed 
lead to more overhead than sequential stream, results of this test are as 
follows (suffix "_S" refers to sequential stream and suffix "_PS" refers to 
parallel stream):
{noformat}
TestStreamPerf.test_100_1w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 
0.00
TestStreamPerf.test_100_1w_PS: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 
0.01
TestStreamPerf.test_100_10w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.04, time.warmup: 0.01, time.bench: 
0.03
TestStreamPerf.test_100_10w_PS: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.14, time.warmup: 0.09, time.bench: 
0.05
TestStreamPerf.test_100_100w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.03 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.43, time.warmup: 0.17, time.bench: 
0.26
TestStreamPerf.test_100_100w_PS: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.04 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.56, time.warmup: 0.20, time.bench: 
0.36
TestStreamPerf.test_100_200w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.05 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.75, time.warmup: 0.25, time.bench: 
0.50
TestStreamPerf.test_100_200w_PS: [measured 10 out of 

[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-06-13 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862821#comment-16862821
 ] 

Tao Yang edited comment on YARN-8995 at 6/13/19 8:27 AM:

Thanks [~zhuqi] for updating the patch.
Comments for the new patch:
* Sorry, I made a mistake in my last comment: serviceInit is a more appropriate 
place to initialize conf, and then you can remove the initial value of the 
detailsInterval field.
* There's no need to separate the name with a double "\_" as in 
"...EVENTS__INFO..."; "...EVENTS_INFO..." is ok. The annotation "The interval 
thousands of queue size" can be replaced with "The interval of queue size (in 
thousands)".
* For parallelStream, overhead is involved in splitting the work among several 
threads and joining or merging the results. I prefer using a sequential stream 
in this scenario, which has no I/O operations and only needs to count event 
types. Moreover, we can use the groupingBy API like this: 
{{eventQueue.stream().collect(Collectors.groupingBy(e -> e.getType(), 
Collectors.counting()))}}, instead of calling Collectors#toConcurrentMap or 
Collectors#toMap.
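
As a self-contained illustration of the groupingBy suggestion, the sketch below counts queued events per type with a sequential stream. The Event and EventType classes are hypothetical stand-ins for YARN's own event classes, and the enum constants are made up for the example. Only the stream expression itself comes from the comment above.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

final class EventTypeCountSketch {
    // Hypothetical event types, for illustration only.
    enum EventType { NODE_UPDATE, APP_ATTEMPT_ADDED, CONTAINER_EXPIRED }

    // Hypothetical minimal event carrying just its type.
    static final class Event {
        private final EventType type;
        Event(EventType type) { this.type = type; }
        EventType getType() { return type; }
    }

    // Groups queued events by type and counts each group, sequentially.
    static Map<EventType, Long> countByType(BlockingQueue<Event> eventQueue) {
        return eventQueue.stream()
            .collect(Collectors.groupingBy(e -> e.getType(), Collectors.counting()));
    }

    public static void main(String[] args) {
        BlockingQueue<Event> eventQueue = new LinkedBlockingQueue<>();
        eventQueue.add(new Event(EventType.NODE_UPDATE));
        eventQueue.add(new Event(EventType.NODE_UPDATE));
        eventQueue.add(new Event(EventType.CONTAINER_EXPIRED));
        // Types that never occur in the queue are simply absent from the map.
        System.out.println(countByType(eventQueue));
    }
}
```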


was (Author: tao yang):
Thanks [~zhuqi] for updating the patch.
Comments for the new patch:
* Sorry to have made a mistake in my last comment, serviceInit is a more proper 
place to initialize conf, then you can remove the initial value for 
detailsInterval field.
* There's no need to separate name with double "_" for "...EVENTS__INFO...", 
"...EVENTS_INFO..." is ok. The annotation "The interval thousands of ..." can 
be replaced as "The interval of ... (in thousands)".
* For parallelStream, overhead is involved in splitting the work among several 
threads and joining or merging the results, I prefer using sequential stream in 
this scenario which has no I/O operations and only need to count for event 
types. Moreover, we can use groupingBy API like this: 
{{eventQueue.stream().collect(Collectors.groupingBy(e -> e.getType(), 
Collectors.counting()))}}, instead of calling Collectors#toConcurrentMap or 
Collectors#toMap.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-8995.001.patch, YARN-8995.002.patch, 
> YARN-8995.003.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size is too big, and add the 
> information to the metrics. The queue size threshold is a parameter which can 
> be changed.






[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-06-04 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855280#comment-16855280
 ] 

Tao Yang edited comment on YARN-8995 at 6/4/19 7:57 AM:


Thanks [~zhuqi] for the patch.

I prefer not to maintain a global map (Map eventTypeRecord) that will be 
updated twice (in & out) for every event; after all, it's necessary only when 
something goes wrong, which should rarely happen. I think counting events in 
real time may be enough. Thoughts?

For the latest event, we can also record it only when necessary: for example, 
use a boolean flag to control whether to print the next event, and print only 
one event at a time.
{quote}now i hard code to 5000
{quote}
I suppose it should be configurable, you can set 5000 as default.
{quote}if we need print the event type size in order?
{quote}
I'm not sure what you mean. For example, "E1:3,E2:2,E1:1,..." when the event 
types in the queue are "E1,E1,E1,E2,E2,E1,..."? If so, I think it's unnecessary.
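
To make the suggestions in this comment concrete, here is a rough Java sketch. It is not the patch's code: the class, its fields, and the use of plain strings as event types are assumptions for illustration. It keeps no global per-type map. Instead it counts the queue contents only when the size hits a multiple of a configurable threshold, and uses a boolean flag so that exactly one subsequent event is printed.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

final class OnDemandEventStats {
    // Event types are plain strings here to keep the sketch small.
    private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
    private final int threshold;              // configurable; 5000 would be the default
    private volatile boolean printNextEvent;  // set on trigger, cleared after one print

    OnDemandEventStats(int threshold) {
        if (threshold <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        this.threshold = threshold;
    }

    // Producer side: enqueue, and only when the size hits a multiple of the
    // threshold, count the queued events per type. Otherwise return an empty map.
    Map<String, Long> enqueue(String eventType) {
        eventQueue.add(eventType);
        if (eventQueue.size() % threshold != 0) {
            return Collections.emptyMap();
        }
        printNextEvent = true;
        return eventQueue.stream()
            .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
    }

    // Consumer side: take one event; print it only if the flag is set, then
    // clear the flag so that only one event is printed per trigger.
    boolean dispatchOne() {
        String eventType = eventQueue.poll();
        if (eventType == null) {
            return false;
        }
        if (printNextEvent) {
            printNextEvent = false;
            System.out.println("Latest dispatched event type: " + eventType);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        OnDemandEventStats stats = new OnDemandEventStats(3);
        stats.enqueue("E1");
        stats.enqueue("E1");
        System.out.println("Counts at threshold: " + stats.enqueue("E2"));
        stats.dispatchOne(); // prints the flagged event
        stats.dispatchOne(); // prints nothing
    }
}
```

The per-type counts are unordered, which matches the view above that printing them in queue order is unnecessary.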


was (Author: tao yang):
Thanks [~zhuqi] for the patch.

I prefer not maintain a global map (Map eventTypeRecord) which will 
be updated twice (in & out) for every event, after all it's necessary only when 
something goes wrong which could rarely happen. I think count events in 
realtime may be enough, Thoughts?

For the latest event, also we can record it only when necessary, for example, 
use a boolean flag to control whether to record the next event and should 
record one event at a time.

{quote}

now i hard code to 5000

{quote}

I suppose it should be configurable, you can set 5000 as default.

{quote}

if we need print the event type size in order?

{quote}

I'm not sure what you mean, for example: "E1:3,E2:2,E1:1,..." when event types 
in queue are "E1,E1,E1,E2,E2,E1,..." ? I think it's unnecessary if it is.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-8995.001.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size is too big, and add the 
> information to the metrics. The queue size threshold is a parameter which can 
> be changed.






[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2018-11-09 Thread Wanqiang Ji (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681561#comment-16681561
 ] 

Wanqiang Ji edited comment on YARN-8995 at 11/9/18 3:03 PM:


+1

I am looking forward to seeing this patch.


was (Author: jiwq):
+1

I'm looking forward to seeing this patch.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.1.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size is too big, and add the 
> information to the metrics. The queue size threshold is a parameter which can 
> be changed.






[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2018-11-09 Thread zhuqi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681105#comment-16681105
 ] 

zhuqi edited comment on YARN-8995 at 11/9/18 9:06 AM:

Hi [~cheersyang] 

Thanks for your reply. I think that, beyond the queue size, we can also add an 
eventMetrics class to monitor the health of all of the cluster's event 
dispatchers.


was (Author: zhuqi):
Hi [~cheersyang] 

Thanks for your reply, i think not only the queue size, we can also add a 
eventMetrics class to monitor the health of cluster's all event dispachers.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.1.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size is too big, and add the 
> information to the metrics. The queue size threshold is a parameter which can 
> be changed.






[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2018-11-09 Thread zhuqi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681105#comment-16681105
 ] 

zhuqi edited comment on YARN-8995 at 11/9/18 9:05 AM:

Hi [~cheersyang] 

Thanks for your reply, i think not only the queue size, we can also add a 
eventMetrics class to monitor the health of cluster's all event dispachers.


was (Author: zhuqi):
Hi [~cheersyang] 

Thanks for your reply, i think not only the queue size, we can also add a 
eventMetrics class to monitor the health of cluster's all event dispacher.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.1.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size is too big, and add the 
> information to the metrics. The queue size threshold is a parameter which can 
> be changed.


