[ https://issues.apache.org/jira/browse/YARN-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172671#comment-15172671 ]
Jason Lowe commented on YARN-4747:
----------------------------------
I believe this was triggered by a container finish event whose corresponding
container start event was missing. When an application runs for a long time,
there is a correspondingly long window between the container start event and
the container finish event for the AM container. The timelineserver performs
retention based on entity timestamp, so there will be a long window where the
container start event has been deleted but the container finish event is still
present. The application history code is not prepared to handle that, because
only the container start event carries the node hostname and port number. It
blindly assumes that if a container entity is present in the database, then we
know both the host and the port.
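A simplified model of the failure may help. This is a hypothetical sketch, not the real timelineserver or AHS code: the `Event` record, the `CONTAINER_START`/`CONTAINER_FINISH` type strings, the `host`/`port` info keys, and the 7-day TTL are all illustrative assumptions. It shows a timestamp-based retention pass dropping the old start event while keeping the newer finish event, after which a lookup that assumes the start event exists throws an NPE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of the failure described above; these are NOT the
// real timelineserver or application history classes.
public class RetentionWindowDemo {

    // A container event: type, timestamp (ms), and free-form info.
    record Event(String type, long timestamp, Map<String, Object> info) {}

    // Retention keyed on each event's own timestamp: anything older than
    // the cutoff is dropped, regardless of whether related events survive.
    static List<Event> applyRetention(List<Event> events, long cutoff) {
        List<Event> kept = new ArrayList<>();
        for (Event e : events) {
            if (e.timestamp() >= cutoff) {
                kept.add(e);
            }
        }
        return kept;
    }

    // Naive lookup mirroring the assumption in the comment: if the
    // container entity exists, the start event's host/port must exist too.
    static String naiveNodeId(List<Event> events) {
        Map<String, Object> startInfo = null;
        for (Event e : events) {
            if (e.type().equals("CONTAINER_START")) { // hypothetical type name
                startInfo = e.info();
            }
        }
        // startInfo is null once the start event has aged out -> NPE here.
        String host = (String) startInfo.get("host");
        int port = (Integer) startInfo.get("port");
        return host + ":" + port;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        List<Event> events = List.of(
            new Event("CONTAINER_START", 0L,
                Map.of("host", "node1.example.com", "port", 8041)),
            new Event("CONTAINER_FINISH", 10 * day, Map.of("exitStatus", 0)));

        // Long-running AM: a 7-day TTL evaluated at day 12 keeps only the finish event.
        List<Event> retained = applyRetention(events, 12 * day - 7 * day);
        System.out.println("retained events: " + retained.size());
        try {
            naiveNodeId(retained);
        } catch (NullPointerException npe) {
            System.out.println("NPE: start event (host/port) is missing");
        }
    }
}
```

Running it prints one retained event followed by the NPE message, which corresponds to the error 500 case in the issue.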
At a minimum, the application history server needs to be hardened to avoid the
NPE, but we may also want to add the host and port information to the finish
event so that the history page can continue to provide logs as long as either
a container start or a container finish event is in the database.
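The two proposed changes could look roughly like the following. This is a minimal, hypothetical sketch, not the actual AHS code: the `Event` record, event type strings, and `host`/`port` info keys are illustrative assumptions, and it presumes the finish event would carry host and port as proposed. The lookup prefers the start event, falls back to the finish event, and returns an empty result instead of throwing when neither has the information.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the proposed hardening; names are illustrative,
// not the real application history server code.
public class NodeIdFallback {

    record Event(String type, Map<String, Object> info) {}

    // Returns "host:port" from the start event if it carries both fields,
    // otherwise from the finish event, otherwise empty (no exception).
    static Optional<String> nodeId(List<Event> events) {
        for (String wanted : List.of("CONTAINER_START", "CONTAINER_FINISH")) {
            for (Event e : events) {
                if (!e.type().equals(wanted)) {
                    continue;
                }
                Object host = e.info().get("host");
                Object port = e.info().get("port");
                if (host instanceof String h && port instanceof Integer p) {
                    return Optional.of(h + ":" + p);
                }
            }
        }
        return Optional.empty();
    }

    // The page degrades gracefully instead of failing with an error 500.
    static String logsLink(List<Event> events) {
        return nodeId(events)
            .map(id -> "logs available on " + id)
            .orElse("logs unavailable: node unknown");
    }

    public static void main(String[] args) {
        // Start event aged out, but the finish event carries host/port.
        List<Event> withFinishHost = List.of(new Event("CONTAINER_FINISH",
            Map.of("host", "node1.example.com", "port", 8041)));
        // Start event aged out and the finish event has no host/port.
        List<Event> noHost = List.of(
            new Event("CONTAINER_FINISH", Map.of("exitStatus", 0)));

        System.out.println(logsLink(withFinishHost)); // logs available on node1.example.com:8041
        System.out.println(logsLink(noHost));         // logs unavailable: node unknown
    }
}
```

Returning an empty result rather than null keeps the "no node known" case explicit at the call site, so the page can omit the logs link instead of dereferencing a missing host.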
> AHS error 500 due to NPE when container start event is missing
> --------------------------------------------------------------
>
> Key: YARN-4747
> URL: https://issues.apache.org/jira/browse/YARN-4747
> Project: Hadoop YARN
> Issue Type: Bug
> Components: timelineserver
> Affects Versions: 2.7.2
> Reporter: Jason Lowe
>
> Saw an error 500 due to a NullPointerException caused by a missing host for
> an AM container. Stacktrace to follow.