Zhijie Shen commented on YARN-3334:

bq. Do we expect some info extra is necessary for ContainerEntity to set? If 
not, I suspect some bug (NPE, etc.) could be hidden in putEntity for 

What's the error message? TestTimelineServiceRecords should have covered 
ContainerEntity. If it's some trivial bug, would you mind taking care of it?

1. For debugging purposes, can we log an info record no matter whether publishing 
timeline data is enabled or not?
211         if (!publishContainerMetricsToTimelineService) {
212           LOG.warn("NodeManager has not been configured to publish container " +
213               "metrics to Timeline Service V2");
214         }
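A minimal sketch of this suggestion: emit an info record in both cases, not only when publishing is disabled. The class, logger, and message for the enabled case are illustrative stand-ins; only the flag name and the disabled-case message come from the snippet above.

```java
// Sketch only: always log the publishing state so operators can tell which
// mode is active. The enabled-case message is an assumption for illustration.
class TimelinePublishLogging {
    static final java.util.logging.Logger LOG =
        java.util.logging.Logger.getLogger("ContainersMonitor");

    static String describe(boolean publishContainerMetricsToTimelineService) {
        String msg = publishContainerMetricsToTimelineService
            ? "Publishing container metrics to Timeline Service V2"
            : "NodeManager has not been configured to publish container "
                + "metrics to Timeline Service V2";
        LOG.info(msg);  // info-level in either case, per the suggestion
        return msg;
    }
}
```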

2. Can we group all the other code related to composing timeline data into this if 
block too? Otherwise, when the flag is false, that code is executed unnecessarily. 
We should also catch exceptions from the timeline operations, as they shouldn't 
fail the monitoring service.
608               if (publishContainerMetricsToTimelineService) {
609                 putEntityWithoutBlocking(timelineClient, entity);
610               }
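The shape of that change could look like the sketch below: skip composing entirely when disabled, and swallow timeline exceptions so they cannot take down the monitoring thread. `TimelineClient` here is a simplified stand-in interface, not the actual Hadoop class.

```java
// Sketch: guard all timeline work behind the flag, and never let a timeline
// failure propagate out of the monitoring loop.
class MonitorPublishSketch {
    interface TimelineClient {  // stand-in for the patch's client type
        void putEntitiesAsync(Object entity) throws Exception;
    }

    static boolean publish(boolean enabled, TimelineClient client, Object entity) {
        if (!enabled) {
            return false;  // composing/publishing skipped entirely when disabled
        }
        try {
            client.putEntitiesAsync(entity);
            return true;
        } catch (Exception e) {
            // Log and continue; a timeline outage must not fail container monitoring.
            System.err.println("Failed to publish container metrics: " + e);
            return false;
        }
    }
}
```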

3. In addition to removing it from the map, we need to stop the client.
437           app.context.getTimelineClients().remove(app.getAppId());
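A sketch of that cleanup, with the stop happening on the removed client so its resources are actually released. The map/interface shapes are simplified stand-ins for `context.getTimelineClients()` and the real `TimelineClient`.

```java
// Sketch of point 3: remove the client from the map AND stop it, so the
// client's threads/resources are released, not just the map entry.
class ClientCleanupSketch {
    interface TimelineClient {  // stand-in; real client is a Hadoop service
        void stop();
    }

    static TimelineClient removeAndStop(
            java.util.Map<String, TimelineClient> clients, String appId) {
        TimelineClient client = clients.remove(appId);
        if (client != null) {
            client.stop();  // stop after removal; safe no-op path when absent
        }
        return client;
    }
}
```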

4. IMHO, there's a bug here. ResourceTrackerService will send down the service 
addresses of all the active apps. Therefore, according to the getTimelineClient 
logic, clients will then be created for each app, though most of them won't be 
used on this NM. I think we should decouple timeline client creation from 
retrieval. Creation should happen in the life cycle of ApplicationImpl, and the 
client will be put into context.clients. Here we just update the service address 
if the application's client exists in context.clients.
699               TimelineClient client = context.getTimelineClient(appId);
700               client.setTimelineServiceAddress(collectorAddr);
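Under that decoupling, the address-update path reduces to a lookup-then-update, never a create. The sketch below assumes a hypothetical `updateIfPresent` helper over the context's client map; only `setTimelineServiceAddress` mirrors the quoted snippet.

```java
// Sketch of point 4: creation is tied to ApplicationImpl's life cycle elsewhere;
// here we only update the address when this NM already has a client for the app.
class AddressUpdateSketch {
    static class TimelineClient {  // simplified stand-in for the real client
        String address;
        void setTimelineServiceAddress(String addr) { this.address = addr; }
    }

    static boolean updateIfPresent(
            java.util.Map<String, TimelineClient> clients,
            String appId, String collectorAddr) {
        TimelineClient client = clients.get(appId);
        if (client == null) {
            return false;  // app not running on this NM: do NOT create a client
        }
        client.setTimelineServiceAddress(collectorAddr);
        return true;
    }
}
```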

bq. Are there any fundamental challenges to prevent us creating a full timeline 
series here? A similar question applies to memory metrics, too.

One time series point per request:
* Pros: Realtime; simple implementation on the NM side.
* Cons: Bigger overhead.
Multiple time series points per request:
* Pros: Smaller overhead.
* Cons: Longer latency, and cannot support the realtime use case (assuming a big 
number of points); more logic on the NM side to buffer time series data.
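For a sense of the extra NM-side logic the batching option needs, here is a toy buffer that accumulates points and flushes them once a threshold is reached. The class, threshold, and point representation are all illustrative assumptions, not part of the patch.

```java
// Toy sketch of "multiple points per request": buffer (timestamp, value) points
// and flush them as one batch once a size threshold is hit.
class MetricBufferSketch {
    private final java.util.List<long[]> buffer = new java.util.ArrayList<>();
    private final int flushThreshold;
    private int flushes = 0;

    MetricBufferSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // Returns the number of points flushed (0 while still buffering).
    int add(long timestamp, long value) {
        buffer.add(new long[] { timestamp, value });
        if (buffer.size() >= flushThreshold) {
            int n = buffer.size();
            buffer.clear();  // stand-in for one putEntities call carrying n points
            flushes++;
            return n;
        }
        return 0;
    }

    int getFlushes() { return flushes; }
}
```

The latency cost is visible even in the toy: a point can sit in the buffer until enough later points arrive to trigger a flush.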

Personally, I'm inclined to the simple approach at this moment.

> [Event Producers] NM TimelineClient life cycle handling and container metrics 
> posting to new timeline service.
> --------------------------------------------------------------------------------------------------------------
>                 Key: YARN-3334
>                 URL: https://issues.apache.org/jira/browse/YARN-3334
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: YARN-2928
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: YARN-3334-demo.patch, YARN-3334-v1.patch, 
> YARN-3334-v2.patch, YARN-3334-v3.patch
> After YARN-3039, we have service discovery mechanism to pass app-collector 
> service address among collectors, NMs and RM. In this JIRA, we will handle 
> service address setting for TimelineClients in NodeManager, and put container 
> metrics to the backend storage.
