[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911501#comment-16911501 ]

Weiwei Yang edited comment on YARN-8995 at 8/20/19 4:08 PM:

Hi [~zhuqi]/[~Tao Yang]

Thanks for working on this. The patch LGTM. I might be a little picky about the configuration name and its description; right now the description is not straightforward to me:

"The interval of queue size (in thousands) for printing the boom queue event type details."

How about something like the following for the description, if I understand this correctly:

"The threshold used to trigger the logging of event types and counts in RM's main event dispatcher. Default length is 5000, which means RM will print events info when the queue size cumulatively reaches 5000 every time. Such info can be used to reveal what kind of events that RM is stuck at processing mostly, it can help to narrow down certain performance issues."

Also, the config name would be better as something like {{yarn.dispatcher.print-events-info.threshold}}. You don't need to use in-thousands here, as several thousand is still human-readable.

Does that make sense?

Thanks

was (Author: cheersyang):

Hi [~zhuqi]/[~Tao Yang]

Thanks for working on this. Patch LGTM, I might be just a little picky on the configuration name, right now it is not straightforward to me.

{noformat}
The interval of queue size (in thousands) for printing the boom queue event type details.
{noformat}

How about something like the following for the description, if I understand this correctly:

{noformat}
The threshold used to trigger the logging of event types and counts in RM's main event dispatcher. Default length is 5000, which means RM will print events info when the queue size cumulatively reaches 5000 every time. Such info can be used to reveal what kind of events that RM is stuck at processing mostly, it can help to narrow down certain performance issues.
{noformat}

And also, the config name is better to be something like {{yarn.dispatcher.print-events-info.threshold}}, you don't need to use in-thousands here, as several thousand is still human-readable.

Does that make sense?

Thanks

> Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: metrics, nodemanager, resourcemanager
> Affects Versions: 3.2.0, 3.3.0
> Reporter: zhuqi
> Assignee: zhuqi
> Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, YARN-8995.008.patch
>
> In our growing cluster, there are unexpected situations that cause some event queues to block the performance of the cluster, such as the bug in https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to log the event types when an event queue grows too large and to add that information to the metrics, with the queue-size threshold being a configurable parameter.

--
This message was sent by Atlassian Jira (v8.3.2#803003)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
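The threshold semantics described in the proposed description above can be sketched in plain Java. This is only an illustration under stated assumptions: `java.util.Properties` stands in for the real configuration object, the key is the name suggested in the comment (not confirmed here as the final name), and "cumulatively reaches 5000 every time" is read as "the queue size is a positive multiple of the threshold". The class and method names are hypothetical.

```java
import java.util.Properties;

// Hypothetical sketch; not taken from any YARN-8995 patch.
final class PrintEventsThreshold {
  // Key suggested in the review comment; assumed here for illustration only.
  static final String KEY = "yarn.dispatcher.print-events-info.threshold";
  // Default value mentioned in the proposed description.
  static final int DEFAULT_THRESHOLD = 5000;

  // Reads the threshold, falling back to the default when the value is
  // missing, non-numeric, or not positive.
  static int read(Properties conf) {
    String raw = conf.getProperty(KEY);
    if (raw == null) {
      return DEFAULT_THRESHOLD;
    }
    try {
      int value = Integer.parseInt(raw.trim());
      return value > 0 ? value : DEFAULT_THRESHOLD;
    } catch (NumberFormatException e) {
      return DEFAULT_THRESHOLD;
    }
  }

  // Assumed trigger rule: print whenever the queue size hits a positive
  // multiple of the threshold (5000, 10000, ... for the default).
  static boolean shouldPrint(int queueSize, int threshold) {
    return queueSize > 0 && queueSize % threshold == 0;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    int threshold = read(conf);
    System.out.println("threshold = " + threshold);
    System.out.println("print at 10000? " + shouldPrint(10000, threshold));
  }
}
```

Expressing the value directly in events rather than in thousands keeps the check a single modulo against the configured number, which matches the reviewer's point that several thousand is still readable.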
[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862866#comment-16862866 ]

Tao Yang edited comment on YARN-8995 at 6/13/19 9:29 AM:

I did a simple test (details in TestStreamPerf.java) comparing the performance of a sequential stream and a parallel stream in a similar scenario: counting the elements of a blocking queue with 100 distinct keys and a total length of 1w/10w/100w/200w (where "w" means 10,000, i.e. 10K/100K/1M/2M elements). It seems that the parallel stream indeed leads to more overhead than the sequential stream. The results of this test are as follows (suffix "_S" refers to sequential stream and suffix "_PS" refers to parallel stream):

{noformat}
TestStreamPerf.test_100_100w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.03 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 1, GC.time: 0.01, time.total: 0.64, time.warmup: 0.31, time.bench: 0.32
TestStreamPerf.test_100_100w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.02 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.37, time.warmup: 0.15, time.bench: 0.22
TestStreamPerf.test_100_10w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.08, time.warmup: 0.05, time.bench: 0.04
TestStreamPerf.test_100_10w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.04, time.warmup: 0.01, time.bench: 0.03
TestStreamPerf.test_100_1w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 0.01
TestStreamPerf.test_100_1w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 0.00
TestStreamPerf.test_100_200w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.07 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 1.03, time.warmup: 0.37, time.bench: 0.66
TestStreamPerf.test_100_200w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.04 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.70, time.warmup: 0.25, time.bench: 0.45
{noformat}

was (Author: tao yang):

I did a simple test on performance comparison between sequential stream and parallel stream in a similar scenario: count a blocking queue with 100 distinct keys and 1w/10w/100w/200w total length, it seems that parallel stream indeed lead to more overhead than sequential stream, results of this test are as follows (suffix "_S" refers to sequential stream and suffix "_PS" refers to parallel stream):

{noformat}
TestStreamPerf.test_100_1w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
TestStreamPerf.test_100_1w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 0.01
TestStreamPerf.test_100_10w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.04, time.warmup: 0.01, time.bench: 0.03
TestStreamPerf.test_100_10w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.14, time.warmup: 0.09, time.bench: 0.05
TestStreamPerf.test_100_100w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.03 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.43, time.warmup: 0.17, time.bench: 0.26
TestStreamPerf.test_100_100w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.04 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.56, time.warmup: 0.20, time.bench: 0.36
TestStreamPerf.test_100_200w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.05 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.75, time.warmup: 0.25, time.bench: 0.50
TestStreamPerf.test_100_200w_PS: [measured 10 out of
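For readers without access to TestStreamPerf.java, the kind of comparison described in this message might look roughly like the sketch below. It is not the attached test: the class and method names are hypothetical, integer keys stand in for event types, and timing uses single un-warmed `System.nanoTime()` runs rather than a benchmark harness, so the printed numbers are only indicative.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; not the TestStreamPerf.java attached to the issue.
final class StreamCountSketch {
  // Builds a queue of `total` elements spread evenly over `distinctKeys` keys.
  static BlockingQueue<Integer> buildQueue(int distinctKeys, int total) {
    BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
    for (int i = 0; i < total; i++) {
      queue.add(i % distinctKeys);
    }
    return queue;
  }

  // Counts occurrences per key; works for sequential and parallel streams.
  static Map<Integer, Long> count(Stream<Integer> stream) {
    return stream.collect(
        Collectors.groupingBy(Function.identity(), Collectors.counting()));
  }

  // Returns the wall-clock time of one run of the task, in milliseconds.
  static long timeMillis(Runnable task) {
    long start = System.nanoTime();
    task.run();
    return (System.nanoTime() - start) / 1_000_000;
  }

  public static void main(String[] args) {
    // 100 distinct keys, 1,000,000 elements (the "100w" case).
    BlockingQueue<Integer> queue = buildQueue(100, 1_000_000);
    long sequentialMs = timeMillis(() -> count(queue.stream()));
    long parallelMs = timeMillis(() -> count(queue.parallelStream()));
    System.out.println("sequential: " + sequentialMs + " ms, parallel: "
        + parallelMs + " ms");
  }
}
```

Both variants produce the same counts; only the cost of splitting the work and merging the partial maps differs, which is the overhead the measurements above are about.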
[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862821#comment-16862821 ]

Tao Yang edited comment on YARN-8995 at 6/13/19 8:27 AM:

Thanks [~zhuqi] for updating the patch. Comments on the new patch:
* Sorry, I made a mistake in my last comment: serviceInit is a more proper place to initialize the conf, and then you can remove the initial value of the detailsInterval field.
* There's no need to separate the name with a double "\_" as in "...EVENTS__INFO..."; "...EVENTS_INFO..." is OK. The annotation "The interval thousands of queue size" can be replaced with "The interval of queue size (in thousands)".
* parallelStream involves overhead in splitting the work among several threads and in joining or merging the results. I prefer a sequential stream in this scenario, which has no I/O operations and only needs to count event types. Moreover, we can use the groupingBy API like this: {{eventQueue.stream().collect(Collectors.groupingBy(e -> e.getType(), Collectors.counting()))}}, instead of calling Collectors#toConcurrentMap or Collectors#toMap.

was (Author: tao yang):

Thanks [~zhuqi] for updating the patch. Comments for the new patch:
* Sorry to have made a mistake in my last comment, serviceInit is a more proper place to initialize conf, then you can remove the initial value for detailsInterval field.
* There's no need to separate name with double "_" for "...EVENTS__INFO...", "...EVENTS_INFO..." is ok. The annotation "The interval thousands of ..." can be replaced as "The interval of ... (in thousands)".
* For parallelStream, overhead is involved in splitting the work among several threads and joining or merging the results, I prefer using sequential stream in this scenario which has no I/O operations and only need to count for event types. Moreover, we can use groupingBy API like this: {{eventQueue.stream().collect(Collectors.groupingBy(e -> e.getType(), Collectors.counting()))}}, instead of calling Collectors#toConcurrentMap or Collectors#toMap.

> Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: metrics, nodemanager, resourcemanager
> Affects Versions: 3.2.0
> Reporter: zhuqi
> Assignee: zhuqi
> Priority: Major
> Attachments: YARN-8995.001.patch, YARN-8995.002.patch, YARN-8995.003.patch
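The groupingBy suggestion in the last review bullet can be tried out in isolation. The sketch below assumes a hypothetical `DemoEvent` with a `getType()` accessor and a hypothetical `DemoEventType` enum in place of the real YARN event classes; it puts the suggested `groupingBy`/`counting` form next to an equivalent `Collectors.toMap` form for comparison.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

// Hypothetical sketch; DemoEvent and DemoEventType are stand-ins, not YARN types.
final class EventTypeCounter {
  enum DemoEventType { NODE_UPDATE, APP_ADDED, CONTAINER_FINISHED }

  static final class DemoEvent {
    private final DemoEventType type;
    DemoEvent(DemoEventType type) { this.type = type; }
    DemoEventType getType() { return type; }
  }

  // Form suggested in the review: sequential stream, grouped and counted.
  static Map<DemoEventType, Long> countByType(BlockingQueue<DemoEvent> eventQueue) {
    return eventQueue.stream()
        .collect(Collectors.groupingBy(e -> e.getType(), Collectors.counting()));
  }

  // Equivalent toMap form: each event contributes 1, merged by summation.
  static Map<DemoEventType, Long> countByTypeWithToMap(BlockingQueue<DemoEvent> eventQueue) {
    return eventQueue.stream()
        .collect(Collectors.toMap(e -> e.getType(), e -> 1L, Long::sum));
  }

  public static void main(String[] args) {
    BlockingQueue<DemoEvent> eventQueue = new LinkedBlockingQueue<>();
    eventQueue.add(new DemoEvent(DemoEventType.NODE_UPDATE));
    eventQueue.add(new DemoEvent(DemoEventType.APP_ADDED));
    eventQueue.add(new DemoEvent(DemoEventType.NODE_UPDATE));
    System.out.println(countByType(eventQueue));
  }
}
```

Streaming over the queue does not remove elements, so the count is a snapshot of what is waiting at that moment; the groupingBy form also states the intent (count per type) more directly than a hand-written merge function.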
[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855280#comment-16855280 ]

Tao Yang edited comment on YARN-8995 at 6/4/19 7:57 AM:

Thanks [~zhuqi] for the patch.

I prefer not to maintain a global map (Map eventTypeRecord) that is updated twice (in & out) for every event; after all, it is needed only when something goes wrong, which should rarely happen. I think counting events in real time may be enough. Thoughts?

For the latest event, we can also record it only when necessary; for example, use a boolean flag to control whether to print the next event, and print only one event at a time.

{quote}now i hard code to 5000{quote}

I suppose it should be configurable; you can set 5000 as the default.

{quote}if we need print the event type size in order?{quote}

I'm not sure what you mean. For example, printing "E1:3,E2:2,E1:1,..." when the event types in the queue are "E1,E1,E1,E2,E2,E1,..."? If so, I think it's unnecessary.

was (Author: tao yang):

Thanks [~zhuqi] for the patch.

I prefer not maintain a global map (Map eventTypeRecord) which will be updated twice (in & out) for every event, after all it's necessary only when something goes wrong which could rarely happen. I think count events in realtime may be enough, Thoughts?

For the latest event, also we can record it only when necessary, for example, use a boolean flag to control whether to record the next event and should record one event at a time.

{quote} now i hard code to 5000 {quote}

I suppose it should be configurable, you can set 5000 as default.

{quote} if we need print the event type size in order? {quote}

I'm not sure what you mean, for example: "E1:3,E2:2,E1:1,..." when event types in queue are "E1,E1,E1,E2,E2,E1,..." ? I think it's unnecessary if it is.

> Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: metrics, nodemanager, resourcemanager
> Affects Versions: 3.2.0
> Reporter: zhuqi
> Assignee: zhuqi
> Priority: Major
> Attachments: YARN-8995.001.patch
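The two ideas in this comment, counting on demand instead of maintaining a per-event global map, and using a boolean flag so that only the next dispatched event is printed, can be sketched as follows. This is an illustration under assumptions, not the patch: event types are plain strings, an in-memory list stands in for the logger, all names are hypothetical, and the trigger is assumed to be "queue size is a multiple of the threshold".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch; not taken from any YARN-8995 patch.
final class OnDemandEventStats {
  private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
  private final int threshold;
  // Set when the threshold is hit; cleared after one dispatched event is printed.
  private volatile boolean printNextEvent = false;
  // Stand-in for a logger so the behavior can be inspected.
  final List<String> log = Collections.synchronizedList(new ArrayList<>());

  OnDemandEventStats(int threshold) {
    this.threshold = threshold;
  }

  // No bookkeeping per event; the queue is scanned only at the threshold.
  void enqueue(String eventType) {
    eventQueue.add(eventType);
    int size = eventQueue.size();
    if (size % threshold == 0) {
      log.add("Event type counts: " + countByType());
      printNextEvent = true;
    }
  }

  // Counts queued events per type; TreeMap keeps the output order stable.
  Map<String, Long> countByType() {
    return eventQueue.stream().collect(
        Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
  }

  // Takes one event off the queue and prints it only if the flag is set.
  String dispatchOne() {
    String eventType = eventQueue.poll();
    if (eventType != null && printNextEvent) {
      printNextEvent = false;
      log.add("Latest dispatched event: " + eventType);
    }
    return eventType;
  }

  public static void main(String[] args) {
    OnDemandEventStats stats = new OnDemandEventStats(3);
    stats.enqueue("E1");
    stats.enqueue("E1");
    stats.enqueue("E2");
    stats.dispatchOne();
    stats.log.forEach(System.out::println);
  }
}
```

In the normal case the only extra work per event is a size check; the full scan and the single-event print happen only when the queue has grown to the threshold, which is the trade-off argued for above.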
[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681561#comment-16681561 ]

Wanqiang Ji edited comment on YARN-8995 at 11/9/18 3:03 PM:

+1 I am looking forward to seeing this patch.

was (Author: jiwq):

+1 I'm looking forward to seeing this patch.

> Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: metrics, nodemanager, resourcemanager
> Affects Versions: 3.1.0
> Reporter: zhuqi
> Assignee: zhuqi
> Priority: Major
[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681105#comment-16681105 ]

zhuqi edited comment on YARN-8995 at 11/9/18 9:06 AM:

Hi [~cheersyang]

Thanks for your reply. I think that, beyond the queue size, we can also add an eventMetrics class to monitor the health of all of the cluster's event dispatchers.

was (Author: zhuqi):

Hi [~cheersyang]

Thanks for your reply, i think not only the queue size, we can also add a eventMetrics class to monitor the health of cluster's all event dispachers.

> Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: metrics, nodemanager, resourcemanager
> Affects Versions: 3.1.0
> Reporter: zhuqi
> Assignee: zhuqi
> Priority: Major
[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681105#comment-16681105 ]

zhuqi edited comment on YARN-8995 at 11/9/18 9:05 AM:

Hi [~cheersyang]

Thanks for your reply. I think that, beyond the queue size, we can also add an eventMetrics class to monitor the health of all of the cluster's event dispatchers.

was (Author: zhuqi):

Hi [~cheersyang]

Thanks for your reply, i think not only the queue size, we can also add a eventMetrics class to monitor the health of cluster's all event dispacher.

> Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: metrics, nodemanager, resourcemanager
> Affects Versions: 3.1.0
> Reporter: zhuqi
> Assignee: zhuqi
> Priority: Major