[
https://issues.apache.org/jira/browse/YARN-5340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368489#comment-15368489
]
Li Lu commented on YARN-5340:
-----------------------------
Thanks for reporting this issue [[email protected]]! This is a very
interesting discovery. I did some debugging and found that the direct reason
for the missing fields is an authentication failure: the original user failed
to authenticate when fetching the app report. Looking at the message returned
by ATS, I can see something like this:
{code}
{"events":[{"timestamp":1467931672057,"eventtype":"YARN_APPLICATION_FINISHED","eventinfo":{"YARN_APPLICATION_LATEST_APP_ATTEMPT":"appattempt_1467931619679_0001_000001","YARN_APPLICATION_FINAL_STATUS":"SUCCEEDED","YARN_APPLICATION_DIAGNOSTICS_INFO":"","YARN_APPLICATION_STATE":"FINISHED"}},{"timestamp":1467931652492,"eventtype":"YARN_APPLICATION_STATE_UPDATED","eventinfo":{"YARN_APPLICATION_STATE":"RUNNING"}},{"timestamp":1467931641896,"eventtype":"YARN_APPLICATION_ACLS_UPDATED","eventinfo":{}}],"entitytype":"YARN_APPLICATION","entity":"application_1467931619679_0001","starttime":1467931641896,"domain":"DEFAULT","otherinfo":{"YARN_APPLICATION_MEM_METRIC":290014,"YARN_APPLICATION_CPU_METRIC":74,"YARN_APPLICATION_VIEW_ACLS":"hrt_5 viewtestgroup"},"primaryfilters":{},"relatedentities":{}}
{code}
Note that the application creation information is missing from the returned
data. I found that in leveldb there are two <entityType, timestamp, entityId>
tuples created for application application_1467931619679_0001, with two
different timestamps. The application creation message is associated with a
different timestamp from the one returned above.
Checking the rolling leveldb code, I can see that neither call site of
RollingLevelDBTimelineStore#getAndSetStartTime is properly synchronized,
although its comment says that it "Should only be called when a lock has
been obtained on the entity." So when two events for the same application
arrive at the timeline server concurrently, something like this may happen:
1. put1 checks existing timestamp for the application, no result.
2. put2 checks existing timestamp for the application, no result.
3. put1 sets the application entity's timestamp to its own timestamp.
4. put2 overrides the application entity's timestamp with its own timestamp.
After this, put1 writes its data to a key (<entityType, timestamp, entityId>)
that has a stale timestamp, so the data will never be read back because the
timestamp has been overridden by put2.
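The interleaving in steps 1-4 can be replayed deterministically in a few lines.
This is only a sketch, not the RollingLevelDBTimelineStore code: the class
name, the two in-memory maps, the "id@timestamp" key layout, and the first
timestamp are hypothetical stand-ins for the start-time index and the entity
keys.

```java
import java.util.HashMap;
import java.util.Map;

public class StartTimeRaceSketch {
    // Hypothetical stand-in for the start-time index: entityId -> start time.
    public static final Map<String, Long> startTimes = new HashMap<>();
    // Hypothetical stand-in for entity storage: "<entityId>@<startTime>" -> payload.
    public static final Map<String, String> entityData = new HashMap<>();

    /** Replays steps 1-4 in order and returns what a reader finds afterwards. */
    public static String replay() {
        startTimes.clear();
        entityData.clear();
        String app = "application_1467931619679_0001";
        long t1 = 1467931641000L; // assumed timestamp of the creation event
        long t2 = 1467931641896L; // start time seen in the ATS response

        Long seenByPut1 = startTimes.get(app);            // step 1: no result
        Long seenByPut2 = startTimes.get(app);            // step 2: no result
        if (seenByPut1 == null) startTimes.put(app, t1);  // step 3: put1 sets t1
        if (seenByPut2 == null) startTimes.put(app, t2);  // step 4: put2 overrides with t2

        // Each put writes its data under the start time it believes is current.
        entityData.put(app + "@" + t1, "created");
        entityData.put(app + "@" + t2, "acls-updated");

        // A reader looks up the entity through the current start time only.
        return entityData.get(app + "@" + startTimes.get(app));
    }

    public static void main(String[] args) {
        System.out.println(replay()); // the "created" record is orphaned under t1
    }
}
```

Both records exist in storage, but only the one written under the surviving
timestamp is reachable, which matches the missing creation information above.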
The original LeveldbTimelineStore does not have this problem, because it always
grabs a lock when it performs getAndSetStartTime.
As for a fix, making getAndSetStartTime synchronized will probably fix the
problem. I'm wondering whether making checkStartTimeInDb synchronized would
also do the trick (since it's the only place in the process that has
read-then-update semantics).
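A minimal sketch of the synchronized variant, assuming the start-time lookup
can be modelled as a plain in-memory map. The class, the map, and the
runConcurrentPuts helper are hypothetical; only the method name
getAndSetStartTime comes from the real code, and its signature here is
simplified.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class SynchronizedStartTimeSketch {
    // Hypothetical stand-in for the start-time index: entityId -> start time.
    private final Map<String, Long> startTimes = new HashMap<>();

    /** Check and set happen under one monitor, so the first writer wins. */
    public synchronized long getAndSetStartTime(String entityId, long candidate) {
        Long existing = startTimes.get(entityId);
        if (existing != null) {
            return existing;
        }
        startTimes.put(entityId, candidate);
        return candidate;
    }

    /** Runs concurrent puts on one entity and returns the distinct start times observed. */
    public static Set<Long> runConcurrentPuts(int threads) {
        SynchronizedStartTimeSketch store = new SynchronizedStartTimeSketch();
        Set<Long> observed = ConcurrentHashMap.newKeySet();
        CountDownLatch go = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final long candidate = 1000L + i; // each put proposes its own timestamp
            workers[i] = new Thread(() -> {
                try {
                    go.await(); // release all workers at once to maximise contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                observed.add(store.getAndSetStartTime(
                    "application_1467931619679_0001", candidate));
            });
            workers[i].start();
        }
        go.countDown();
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return observed;
    }

    public static void main(String[] args) {
        System.out.println(runConcurrentPuts(8).size());
    }
}
```

With the method synchronized, every caller sees the same start time, so all
puts for the application would write under one key. A method-level monitor
serializes all entities, though; a per-entity lock, as the quoted comment
suggests, would be finer-grained.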
[~jeagles] I know you're an expert on rolling leveldb's source code, so if you
have any free bandwidth, I'd truly appreciate your suggestions here. Thanks!
> App Name/User/RPC Port/AM Host info is missing from ATS web service or YARN
> CLI's app info
> ------------------------------------------------------------------------------------------
>
> Key: YARN-5340
> URL: https://issues.apache.org/jira/browse/YARN-5340
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Reporter: Sumana Sathish
> Assignee: Li Lu
> Priority: Critical
>
> App Name/User/RPC Port/AM Host info is missing from ATS web service or YARN
> CLI's app info
> {code}
> RUNNING: /usr/hdp/current/hadoop-yarn-client/bin/yarn --config
> /tmp/hadoopConf application -status application_1467931619679_0001
> Application Report :
> Application-Id : application_1467931619679_0001
> Application-Name : null
> Application-Type : null
> User : null
> Queue : null
> Application Priority : null
> Start-Time : 0
> Finish-Time : 1467931672057
> Progress : 100%
> State : FINISHED
> Final-State : SUCCEEDED
> Tracking-URL : N/A
> RPC Port : -1
> AM Host : N/A
> Aggregate Resource Allocation : 290014 MB-seconds, 74 vcore-seconds
> Log Aggregation Status : N/A
> {code}