[ 
https://issues.apache.org/jira/browse/YARN-5340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368489#comment-15368489
 ] 

Li Lu commented on YARN-5340:
-----------------------------

Thanks for reporting this issue [[email protected]]! This is a very 
interesting discovery. I did some debugging on this issue and found that the 
direct reason for the missing fields is an authentication failure: the 
original user failed to authenticate when getting the app report. Looking at 
the message returned by ATS, I can see something like this:
{code}
{"events":[{"timestamp":1467931672057,"eventtype":"YARN_APPLICATION_FINISHED","eventinfo":{"YARN_APPLICATION_LATEST_APP_ATTEMPT":"appattempt_1467931619679_0001_000001","YARN_APPLICATION_FINAL_STATUS":"SUCCEEDED","YARN_APPLICATION_DIAGNOSTICS_INFO":"","YARN_APPLICATION_STATE":"FINISHED"}},{"timestamp":1467931652492,"eventtype":"YARN_APPLICATION_STATE_UPDATED","eventinfo":{"YARN_APPLICATION_STATE":"RUNNING"}},{"timestamp":1467931641896,"eventtype":"YARN_APPLICATION_ACLS_UPDATED","eventinfo":{}}],"entitytype":"YARN_APPLICATION","entity":"application_1467931619679_0001","starttime":1467931641896,"domain":"DEFAULT","otherinfo":{"YARN_APPLICATION_MEM_METRIC":290014,"YARN_APPLICATION_CPU_METRIC":74,"YARN_APPLICATION_VIEW_ACLS":"hrt_5 viewtestgroup"},"primaryfilters":{},"relatedentities":{}}
{code}

Note that the application creation information is missing from the returned 
data. I found that in leveldb there are two <entityType, timestamp, 
entityId> tuples created for application 
application_1467931619679_0001, with two different timestamps. The application 
creation message is associated with a different timestamp. 

Checking the rolling leveldb code, I can see that neither call site of 
RollingLevelDBTimelineStore#getAndSetStartTime is properly synchronized, 
although the comment says that it "Should only be called when a lock has 
been obtained on the entity." So if two events for the same application 
arrive at the timeline server concurrently, something like this may happen:
1. put1 checks for an existing timestamp for the application and finds none. 
2. put2 checks for an existing timestamp for the application and finds none. 
3. put1 sets the application entity's timestamp to its own timestamp. 
4. put2 overrides the application entity's timestamp with its own timestamp. 

After this, put1 writes its data to a key (<entityType, timestamp, 
entityId>) that has a stale timestamp, so the data will never be read back, 
since the timestamp has been overridden by put2. 
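
To make the interleaving concrete, here is a minimal sketch that replays the 
four steps above in a single thread. It is only a model: the class name, the 
two in-memory maps, the string key layout and the timestamp values are all 
hypothetical stand-ins, not the real RollingLevelDBTimelineStore code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of the store, for illustration only. */
class StartTimeRaceSketch {
    // entityId -> start time currently recorded for the entity
    static final Map<String, Long> startTimes = new HashMap<>();
    // "<entityType, timestamp, entityId>" key -> stored event (simplified)
    static final Map<String, String> entities = new HashMap<>();

    static String key(String type, long timestamp, String id) {
        return type + "/" + timestamp + "/" + id;
    }

    /** Replays steps 1-4 in the problematic order; returns what a reader finds. */
    static String replayInterleaving() {
        startTimes.clear();
        entities.clear();
        String type = "YARN_APPLICATION";
        String id = "application_1467931619679_0001";
        long ts1 = 100L; // put1's timestamp (illustrative value)
        long ts2 = 200L; // put2's timestamp (illustrative value)

        Long seenByPut1 = startTimes.get(id); // step 1: no result
        Long seenByPut2 = startTimes.get(id); // step 2: no result
        if (seenByPut1 == null) {
            startTimes.put(id, ts1);          // step 3: put1 sets its timestamp
        }
        if (seenByPut2 == null) {
            startTimes.put(id, ts2);          // step 4: put2 overrides it
        }

        // Each put writes under the timestamp it believes is the start time.
        entities.put(key(type, ts1, id), "CREATED");
        entities.put(key(type, ts2, id), "FINISHED");

        // A reader builds the key from the recorded start time, which is now
        // ts2, so the entry written under ts1 is never looked up.
        long recorded = startTimes.get(id);
        return entities.get(key(type, recorded, id));
    }

    public static void main(String[] args) {
        System.out.println("reader sees: " + replayInterleaving());
        System.out.println("entries stored: " + entities.size());
    }
}
```

In this model both entries exist in the map, but only the one stored under 
put2's timestamp is reachable, which matches the two tuples I saw in leveldb. 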

The original LeveldbTimelineStore does not have this problem, because it always 
grabs a lock when it performs getAndSetStartTime. 

As for the fix, making getAndSetStartTime synchronized will probably fix 
the problem. I'm wondering whether making checkStartTimeInDb synchronized 
would also do the trick (since it's the only place in the process with 
read-then-update semantics). 
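
As a rough illustration of the idea (not a patch), the sketch below makes the 
read-then-update step atomic with a synchronized method. The class, the 
in-memory map and the runTwoPuts helper are hypothetical; only the method name 
getAndSetStartTime is borrowed from the real store, and its signature here is 
an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

/** Hypothetical model: the check and the update happen under one lock. */
class SynchronizedStartTimeSketch {
    private final Map<String, Long> startTimes = new HashMap<>();

    /** Returns the existing start time, or records and returns the candidate. */
    synchronized long getAndSetStartTime(String entityId, long candidate) {
        Long existing = startTimes.get(entityId);
        if (existing != null) {
            return existing; // a concurrent put already chose the start time
        }
        startTimes.put(entityId, candidate);
        return candidate;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Runs two concurrent puts and returns the start time each one observed. */
    static long[] runTwoPuts() {
        SynchronizedStartTimeSketch store = new SynchronizedStartTimeSketch();
        long[] observed = new long[2];
        CountDownLatch go = new CountDownLatch(1);
        Thread put1 = new Thread(() -> {
            awaitQuietly(go);
            observed[0] = store.getAndSetStartTime("app", 100L);
        });
        Thread put2 = new Thread(() -> {
            awaitQuietly(go);
            observed[1] = store.getAndSetStartTime("app", 200L);
        });
        put1.start();
        put2.start();
        go.countDown(); // release both threads at roughly the same time
        try {
            put1.join();
            put2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return observed;
    }

    public static void main(String[] args) {
        long[] observed = runTwoPuts();
        System.out.println("put1 uses " + observed[0] + ", put2 uses " + observed[1]);
    }
}
```

Whichever thread wins, both puts end up with the same start time and so write 
under the same key. A single method-level lock is the simplest option to 
reason about; a per-entity lock would allow more concurrency, but I have not 
checked how that would fit into the rolling leveldb code. 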

[~jeagles] I know you're an expert on rolling leveldb's source code, so if you 
have any free bandwidth, I would truly appreciate your suggestions here. Thanks! 

> App Name/User/RPC Port/AM Host info is missing from ATS web service or YARN 
> CLI's app info
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-5340
>                 URL: https://issues.apache.org/jira/browse/YARN-5340
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Sumana Sathish
>            Assignee: Li Lu
>            Priority: Critical
>
> App Name/User/RPC Port/AM Host info is missing from ATS web service or YARN 
> CLI's app info
> {code}
> RUNNING: /usr/hdp/current/hadoop-yarn-client/bin/yarn --config 
> /tmp/hadoopConf application -status application_1467931619679_0001
> Application Report :
> Application-Id : application_1467931619679_0001
> Application-Name : null
> Application-Type : null
> User : null
> Queue : null
> Application Priority : null
> Start-Time : 0
> Finish-Time : 1467931672057
> Progress : 100%
> State : FINISHED
> Final-State : SUCCEEDED
> Tracking-URL : N/A
> RPC Port : -1
> AM Host : N/A
> Aggregate Resource Allocation : 290014 MB-seconds, 74 vcore-seconds
> Log Aggregation Status : N/A
> {code}



