[
https://issues.apache.org/jira/browse/YARN-7175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yesha Vora updated YARN-7175:
-----------------------------
Description:
Scenario:
* Run Spark App
* As soon as the Spark application finishes, run "yarn application -status <appID>"
in a loop for 2-3 minutes to check the log aggregation status.
The log aggregation status remains "RUNNING" and eventually ends up as
"TIMED_OUT".
This happens when an application has acquired a container that is never
launched on the NM.
This scenario should be handled better and should not delay access to the
application logs.
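For anyone trying to reproduce this, the polling loop from the scenario could be sketched roughly as follows. This is only an illustration, not the script used for the report: the helper names (`fetch_report`, `parse_log_aggregation_status`, `poll_log_aggregation`), the 5-second interval, and the set of in-progress state names are assumptions; only the `yarn application -status <appID>` command and the "Log Aggregation Status" field come from the report itself.

```python
import re
import subprocess
import sys
import time

# Assumed names of the non-terminal log aggregation states; adjust if your
# Hadoop version reports different values.
IN_PROGRESS = {"NOT_START", "RUNNING", "RUNNING_WITH_FAILURE"}

STATUS_RE = re.compile(r"^\s*Log Aggregation Status[ \t]*:[ \t]*(\S+)", re.MULTILINE)


def parse_log_aggregation_status(report):
    """Return the 'Log Aggregation Status' value from a report, or None."""
    match = STATUS_RE.search(report)
    return match.group(1) if match else None


def fetch_report(app_id):
    """Run 'yarn application -status <appID>' and return its stdout."""
    result = subprocess.run(
        ["yarn", "application", "-status", app_id],
        capture_output=True,
        text=True,
    )
    return result.stdout


def poll_log_aggregation(app_id, timeout_s=180.0, interval_s=5.0,
                         fetch=fetch_report, sleep=time.sleep):
    """Poll until the status leaves the in-progress states or the timeout expires.

    Returns (last_status, history), where history lists every status seen.
    """
    deadline = time.monotonic() + timeout_s
    history = []
    while True:
        status = parse_log_aggregation_status(fetch(app_id))
        history.append(status)
        if status is not None and status not in IN_PROGRESS:
            return status, history
        if time.monotonic() >= deadline:
            return status, history
        sleep(interval_s)


if __name__ == "__main__":
    final_status, seen = poll_log_aggregation(sys.argv[1])
    print(final_status, seen)
```

With the behaviour described here, such a loop would keep seeing "RUNNING" for the whole polling window.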
Example: application_1502070770869_0012
application_1502070770869_0012 finished at 2017-08-07 03:06:39. The logs were
not available until 2017-08-07 03:08:36.
{code}
RUNNING: /usr/hdp/current/hadoop-yarn-client/bin/yarn application -status
application_1502070770869_0012
17/08/07 03:08:37 INFO client.AHSProxy: Connecting to Application History
server at host5/xxx.xx.xx.xx:10200
17/08/07 03:08:37 INFO client.RequestHedgingRMFailoverProxyProvider: Looking
for the active RM in [rm1, rm2]...
17/08/07 03:08:37 INFO client.RequestHedgingRMFailoverProxyProvider: Found
active RM [rm1]
Application Report :
Application-Id : application_1502070770869_0012
Application-Name : ml.R
Application-Type : SPARK
User : hrt_qa
Queue : default
Application Priority : null
Start-Time : 1502075166506
Finish-Time : 1502075198997
Progress : 100%
State : FINISHED
Final-State : SUCCEEDED
Tracking-URL : host5:18080/history/application_1502070770869_0012/1
RPC Port : 0
AM Host : xxx.xx.xx.xx
Aggregate Resource Allocation : 174680 MB-seconds, 84 vcore-seconds
Log Aggregation Status : RUNNING
Diagnostics :
Unmanaged Application : false
Application Node Label Expression : <Not set>
AM container Node Label Expression : <DEFAULT_PARTITION>
{code}
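As a rough cross-check of the delay, the Finish-Time value from the report (epoch milliseconds) can be compared with the time the logs became available. The sketch below assumes the wall-clock times quoted above are in UTC, which appears consistent with the Finish-Time field; the function name `aggregation_delay` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def aggregation_delay(finish_ms, logs_available):
    """Return (finish time as a UTC datetime, delay until logs were available)."""
    # Split into whole seconds and milliseconds to avoid float rounding.
    finished = datetime.fromtimestamp(finish_ms // 1000, tz=timezone.utc)
    finished += timedelta(milliseconds=finish_ms % 1000)
    return finished, logs_available - finished


# Values taken from the report; the UTC timezone is an assumption.
finished, delay = aggregation_delay(
    1502075198997,
    datetime(2017, 8, 7, 3, 8, 36, tzinfo=timezone.utc),
)
print(finished.isoformat(), delay.total_seconds())
```

Under that assumption the application finished at about 03:06:39 and the logs took roughly 117 seconds to become available.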
was:
Scenario:
* Run Spark App
* As soon as spark application finishes, Run "yarn application -status <appID>"
cli in a loop for 2-3 mins to check Log_aggreagtion status.
I'm noticing that log_aggregation status remains in "RUNNING" and eventually
ends up with "TIMED_OUT" status.
This situation happens when an application has acquired a container but it is
not launched on NM.
This scenario should be better handled and should not cause this delay to get
the application log.
Example: application_1502070770869_0012
application_1502070770869_0012 finished at 2017-08-07 03:06:39 . The logs were
not available till 2017-08-07 03:08:36.
{code}
RUNNING: /usr/hdp/current/hadoop-yarn-client/bin/yarn application -status
application_1502070770869_0012
17/08/07 03:08:37 INFO client.AHSProxy: Connecting to Application History
server at host5/xxx.xx.xx.xx:10200
17/08/07 03:08:37 INFO client.RequestHedgingRMFailoverProxyProvider: Looking
for the active RM in [rm1, rm2]...
17/08/07 03:08:37 INFO client.RequestHedgingRMFailoverProxyProvider: Found
active RM [rm1]
Application Report :
Application-Id : application_1502070770869_0012
Application-Name : ml.R
Application-Type : SPARK
User : hrt_qa
Queue : default
Application Priority : null
Start-Time : 1502075166506
Finish-Time : 1502075198997
Progress : 100%
State : FINISHED
2017-08-07 03:08:37,770|INFO|MainThread|machine.py:159 -
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Final-State : SUCCEEDED
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 -
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Tracking-URL :
ctr-e134-1499953498516-83705-01-000005.hwx.site:18080/history/application_1502070770869_0012/1
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 -
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|RPC Port : 0
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 -
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|AM Host : 172.27.21.204
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 -
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Aggregate Resource Allocation
: 174680 MB-seconds, 84 vcore-seconds
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 -
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Log Aggregation Status :
RUNNING
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 -
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Diagnostics :
2017-08-07 03:08:37,772|INFO|MainThread|machine.py:159 -
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Unmanaged Application : false
2017-08-07 03:08:37,772|INFO|MainThread|machine.py:159 -
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Application Node Label
Expression : <Not set>
2017-08-07 03:08:37,772|INFO|MainThread|machine.py:159 -
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|AM container Node Label
Expression : <DEFAULT_PARTITION>
2017-08-07 03:08:37,808|INFO|MainThread|machine.py:184 -
run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Exit Code: 0{code}
> Log collection fails when a container is acquired but not launched on NM
> ------------------------------------------------------------------------
>
> Key: YARN-7175
> URL: https://issues.apache.org/jira/browse/YARN-7175
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Reporter: Yesha Vora
> Attachments: SparkApp.log
>
>