[ https://issues.apache.org/jira/browse/YARN-7175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yesha Vora updated YARN-7175:
-----------------------------
    Description: 
Scenario:
* Run a Spark application.
* As soon as the application finishes, run the "yarn application -status <appID>" CLI in a loop for 2-3 minutes to check the log aggregation status (see the sketch below).

The log aggregation status remains "RUNNING" and eventually ends up as "TIMED_OUT".

This happens when the application has acquired a container that is never launched on the NM.

This scenario should be handled better and should not delay access to the application logs.
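For reference, a minimal sketch of the polling loop described above, using the YarnClient API directly instead of the CLI. The class name, 5-second poll interval, and ~3-minute bound are illustrative assumptions, not part of this report; the application id is the one from the example below.

{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.LogAggregationStatus;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LogAggregationStatusPoller {
  public static void main(String[] args) throws Exception {
    // Application id from the example in this report.
    ApplicationId appId =
        ApplicationId.fromString("application_1502070770869_0012");

    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      // Poll every 5 seconds for up to ~3 minutes, as in the scenario above.
      for (int i = 0; i < 36; i++) {
        ApplicationReport report = client.getApplicationReport(appId);
        LogAggregationStatus status = report.getLogAggregationStatus();
        System.out.println("Log Aggregation Status : " + status);
        if (status != LogAggregationStatus.RUNNING
            && status != LogAggregationStatus.NOT_START) {
          break;  // terminal states such as SUCCEEDED, FAILED, TIME_OUT
        }
        Thread.sleep(5000L);
      }
    } finally {
      client.stop();
    }
  }
}
{code}

With the behavior described here, such a loop keeps seeing RUNNING after the application has finished until the RM eventually reports the timed-out status.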

Example: application_1502070770869_0012
application_1502070770869_0012 finished at 2017-08-07 03:06:39. Its logs were 
not available until 2017-08-07 03:08:36.
{code}
RUNNING: /usr/hdp/current/hadoop-yarn-client/bin/yarn application -status 
application_1502070770869_0012
17/08/07 03:08:37 INFO client.AHSProxy: Connecting to Application History 
server at host5/xxx.xx.xx.xx:10200
17/08/07 03:08:37 INFO client.RequestHedgingRMFailoverProxyProvider: Looking 
for the active RM in [rm1, rm2]...
17/08/07 03:08:37 INFO client.RequestHedgingRMFailoverProxyProvider: Found 
active RM [rm1]
Application Report :
Application-Id : application_1502070770869_0012
Application-Name : ml.R
Application-Type : SPARK
User : hrt_qa
Queue : default
Application Priority : null
Start-Time : 1502075166506
Finish-Time : 1502075198997
Progress : 100%
State : FINISHED
2017-08-07 03:08:37,770|INFO|MainThread|machine.py:159 - 
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Final-State : SUCCEEDED
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - 
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Tracking-URL : 
ctr-e134-1499953498516-83705-01-000005.hwx.site:18080/history/application_1502070770869_0012/1
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - 
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|RPC Port : 0
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - 
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|AM Host : 172.27.21.204
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - 
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Aggregate Resource Allocation 
: 174680 MB-seconds, 84 vcore-seconds
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - 
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Log Aggregation Status : 
RUNNING
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - 
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Diagnostics :
2017-08-07 03:08:37,772|INFO|MainThread|machine.py:159 - 
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Unmanaged Application : false
2017-08-07 03:08:37,772|INFO|MainThread|machine.py:159 - 
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Application Node Label 
Expression : <Not set>
2017-08-07 03:08:37,772|INFO|MainThread|machine.py:159 - 
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|AM container Node Label 
Expression : <DEFAULT_PARTITION>
2017-08-07 03:08:37,808|INFO|MainThread|machine.py:184 - 
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Exit Code: 0{code}

  was:
Scenario:
* Run a Spark application.
* As soon as the application finishes, run the "yarn application -status <appID>" CLI in a loop for 2-3 minutes to check the log aggregation status.

The log aggregation status remains "RUNNING" and eventually ends up as "TIMED_OUT".

This happens when the application has acquired a container that is never launched on the NM.

This scenario should be handled better and should not delay access to the application logs.


> Log collection fails when a container is acquired but not launched on NM
> ------------------------------------------------------------------------
>
>                 Key: YARN-7175
>                 URL: https://issues.apache.org/jira/browse/YARN-7175
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Yesha Vora
>         Attachments: SparkApp.log
>
>
> Scenario:
> * Run a Spark application.
> * As soon as the application finishes, run the "yarn application -status <appID>" CLI in a loop for 2-3 minutes to check the log aggregation status.
> The log aggregation status remains "RUNNING" and eventually ends up as "TIMED_OUT".
> This happens when the application has acquired a container that is never launched on the NM.
> This scenario should be handled better and should not delay access to the application logs.
> Example: application_1502070770869_0012
> application_1502070770869_0012 finished at 2017-08-07 03:06:39. Its logs were not available until 2017-08-07 03:08:36.
> {code}
> RUNNING: /usr/hdp/current/hadoop-yarn-client/bin/yarn application -status 
> application_1502070770869_0012
> 17/08/07 03:08:37 INFO client.AHSProxy: Connecting to Application History 
> server at host5/xxx.xx.xx.xx:10200
> 17/08/07 03:08:37 INFO client.RequestHedgingRMFailoverProxyProvider: Looking 
> for the active RM in [rm1, rm2]...
> 17/08/07 03:08:37 INFO client.RequestHedgingRMFailoverProxyProvider: Found 
> active RM [rm1]
> Application Report :
> Application-Id : application_1502070770869_0012
> Application-Name : ml.R
> Application-Type : SPARK
> User : hrt_qa
> Queue : default
> Application Priority : null
> Start-Time : 1502075166506
> Finish-Time : 1502075198997
> Progress : 100%
> State : FINISHED
> 2017-08-07 03:08:37,770|INFO|MainThread|machine.py:159 - 
> run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Final-State : SUCCEEDED
> 2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - 
> run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Tracking-URL : 
> ctr-e134-1499953498516-83705-01-000005.hwx.site:18080/history/application_1502070770869_0012/1
> 2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - 
> run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|RPC Port : 0
> 2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - 
> run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|AM Host : 172.27.21.204
> 2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - 
> run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Aggregate Resource 
> Allocation : 174680 MB-seconds, 84 vcore-seconds
> 2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - 
> run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Log Aggregation Status : 
> RUNNING
> 2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - 
> run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Diagnostics :
> 2017-08-07 03:08:37,772|INFO|MainThread|machine.py:159 - 
> run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Unmanaged Application : false
> 2017-08-07 03:08:37,772|INFO|MainThread|machine.py:159 - 
> run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Application Node Label 
> Expression : <Not set>
> 2017-08-07 03:08:37,772|INFO|MainThread|machine.py:159 - 
> run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|AM container Node Label 
> Expression : <DEFAULT_PARTITION>
> 2017-08-07 03:08:37,808|INFO|MainThread|machine.py:184 - 
> run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Exit Code: 0{code}


