[jira] [Resolved] (YARN-4770) Auto-restart of containers should work across NM restarts.

2016-11-18 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong resolved YARN-4770.

Resolution: Not A Bug

> Auto-restart of containers should work across NM restarts.
> --
>
> Key: YARN-4770
> URL: https://issues.apache.org/jira/browse/YARN-4770
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
>
> See my comment 
> [here|https://issues.apache.org/jira/browse/YARN-3998?focusedCommentId=15133367=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133367]
>  on YARN-3998. We need to take care of two things:
>  - The relaunch feature needs to work across NM restarts, so we should save 
> the retry-context and policy per container into the state-store and reload it 
> to continue relaunching after the NM restarts.
>  - We should also handle restarting of any containers that may have crashed 
> during the NM reboot.






[jira] [Resolved] (YARN-5372) TestRMWebServicesAppsModification fails in trunk

2016-07-13 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong resolved YARN-5372.

Resolution: Not A Problem

> TestRMWebServicesAppsModification fails in trunk
> 
>
> Key: YARN-5372
> URL: https://issues.apache.org/jira/browse/YARN-5372
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Jun Gong
>
> Some test cases in TestRMWebServicesAppsModification fail in trunk:
> {code}
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testAppMove[0]
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testUpdateAppPriority[0]
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testAppMove[1]
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKillUnauthorized[1]
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testUpdateAppPriority[1]
>
> {code}
> The test case errors are at 
> https://builds.apache.org/job/PreCommit-YARN-Build/12310/testReport/.






[jira] [Created] (YARN-5372) TestRMWebServicesAppsModification fails in trunk

2016-07-13 Thread Jun Gong (JIRA)
Jun Gong created YARN-5372:
--

 Summary: TestRMWebServicesAppsModification fails in trunk
 Key: YARN-5372
 URL: https://issues.apache.org/jira/browse/YARN-5372
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Jun Gong


Some test cases in TestRMWebServicesAppsModification fail in trunk:

{code}
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testAppMove[0]
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testUpdateAppPriority[0]
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testAppMove[1]
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKillUnauthorized[1]
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testUpdateAppPriority[1]
 
{code}

The test case errors are at 
https://builds.apache.org/job/PreCommit-YARN-Build/12310/testReport/.






[jira] [Created] (YARN-5333) apps are rejected when RM HA

2016-07-07 Thread Jun Gong (JIRA)
Jun Gong created YARN-5333:
--

 Summary: apps are rejected when RM HA
 Key: YARN-5333
 URL: https://issues.apache.org/jira/browse/YARN-5333
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jun Gong
Assignee: Jun Gong


Enable RM HA and use FairScheduler.

Steps to reproduce:
1. Start two RMs.
2. After the RMs are running, modify {{etc/hadoop/fair-scheduler.xml}} on both 
RMs to add some queues.
3. Submit some apps to the newly added queues.
4. Stop the active RM; the standby RM will then transition to active and 
recover the apps.
However, the new active RM may reject the recovered apps because it has not yet 
loaded the new {{fair-scheduler.xml}}. We need to call {{initScheduler}} before 
starting active services, or move {{refreshAll()}} ahead of 
{{rm.transitionToActive()}}. *This seems important for other schedulers as 
well*.
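
A minimal sketch (hypothetical types, not the actual ResourceManager code) of the ordering described above, assuming the fix is simply to reload the scheduler's allocation file before the RM begins acting as active, so recovery sees the newly added queues:
{code}
// Hypothetical simplified model: reload queue definitions before recovery.
public class FailoverOrderSketch {
    interface Scheduler { void reloadAllocationFile(); }   // stands in for initScheduler()/refreshAll()
    interface ActiveServices { void start(); void recoverApplications(); }

    static void transitionToActive(Scheduler scheduler, ActiveServices services) {
        scheduler.reloadAllocationFile();   // load the latest fair-scheduler.xml first
        services.start();
        services.recoverApplications();     // recovered apps now see the new queues
    }
}
{code}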

Related logs are as following:
{quote}
2016-07-07 16:55:34,756 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recover ended
...
2016-07-07 16:55:34,824 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService:
 Loading allocation file /gaia/hadoop/etc/hadoop/fair-scheduler.xml
2016-07-07 16:55:34,826 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Application rejected by queue placement policy
2016-07-07 16:55:34,828 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Application appattempt_1467803586002_0006_01 is done. finalState=FAILED
2016-07-07 16:55:34,828 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Unknown application appattempt_1467803586002_0006_01 has completed!
2016-07-07 16:55:34,828 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Application rejected by queue placement policy
2016-07-07 16:55:34,828 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Application appattempt_1467803586002_0004_01 is done. finalState=FAILED
2016-07-07 16:55:34,828 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Unknown application appattempt_1467803586002_0004_01 has completed!
2016-07-07 16:55:34,828 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Can't handle 
this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
APP_REJECTED at ACCEPTED
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:697)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:88)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:718)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:702)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:191)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:124)
at java.lang.Thread.run(Thread.java:745)
{quote}






[jira] [Created] (YARN-5286) Add RPC port info in RM web service's response when getting app status

2016-06-21 Thread Jun Gong (JIRA)
Jun Gong created YARN-5286:
--

 Summary: Add RPC port info in RM web service's response when 
getting app status
 Key: YARN-5286
 URL: https://issues.apache.org/jira/browse/YARN-5286
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jun Gong
Assignee: Jun Gong


When getting app status via the RM web service ({{/ws/v1/cluster/apps/\{appid\}}}), 
there is no RPC port info in the response. The port info is important for 
communicating with the AM.

BTW: the RPC port info is included when running {{bin/yarn application -status appid}}.






[jira] [Created] (YARN-5168) Add port mapping handling when docker container use bridge network

2016-05-26 Thread Jun Gong (JIRA)
Jun Gong created YARN-5168:
--

 Summary: Add port mapping handling when docker container use 
bridge network
 Key: YARN-5168
 URL: https://issues.apache.org/jira/browse/YARN-5168
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jun Gong


YARN-4007 addresses different network setups when launching the docker 
container. We need to support port mapping when a docker container uses the 
bridge network.

The following are the problems we need to address (a minimal sketch of the 
port-argument handling follows this list):
1. Add "-P" to map the docker container's exposed ports automatically.
2. Add "-p" to let users specify particular ports to map.
3. Add service registry support for the bridge network case so that apps can 
find each other. This could be done outside of YARN, but it might be more 
convenient to support it natively in YARN.
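
As referenced above, a minimal sketch (hypothetical helper, not the actual container runtime code) of building the extra docker run arguments for the bridge-network case:
{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collect "-P" / "-p host:container" arguments to append
// to the docker run command for a bridge-networked container.
public class DockerPortArgsSketch {
    static List<String> portArgs(boolean publishAll, List<String> explicitMappings) {
        List<String> args = new ArrayList<>();
        if (publishAll) {
            args.add("-P");              // map all exposed ports automatically
        }
        for (String mapping : explicitMappings) {
            args.add("-p");              // user-specified "hostPort:containerPort"
            args.add(mapping);
        }
        return args;
    }
}
{code}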






[jira] [Created] (YARN-5116) Failed to execute "yarn application"

2016-05-19 Thread Jun Gong (JIRA)
Jun Gong created YARN-5116:
--

 Summary: Failed to execute "yarn application"
 Key: YARN-5116
 URL: https://issues.apache.org/jira/browse/YARN-5116
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jun Gong
Assignee: Jun Gong


Use the trunk code.
{code}
$ bin/yarn application -list
16/05/20 11:35:45 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.commons.cli.UnrecognizedOptionException: 
Unrecognized option: -list
at org.apache.commons.cli.Parser.processOption(Parser.java:363)
at org.apache.commons.cli.Parser.parse(Parser.java:199)
at org.apache.commons.cli.Parser.parse(Parser.java:85)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:172)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:90)
{code}

It is caused by the subcommand 'application' being deleted from the command 
args. The following command works:
{code}
$ bin/yarn application application -list
16/05/20 11:39:35 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Total number of applications (application-types: [] and states: [SUBMITTED, 
ACCEPTED, RUNNING]):0
Application-Id  Application-NameApplication-Type
  User   Queue   State Final-State  
   ProgressTracking-URL
{code}
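
A minimal sketch (hypothetical, not the actual ApplicationCLI fix) of the idea behind the failure above: the 'application' subcommand is stripped from the args, so the option parser no longer sees it as args[0]; keeping (or re-adding) it avoids the UnrecognizedOptionException:
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: make sure the stripped subcommand is the first argument
// handed to the CLI's option parser.
public class ArgsSketch {
    static String[] withSubcommand(String subcommand, String[] args) {
        List<String> fixed = new ArrayList<>();
        if (args.length == 0 || !args[0].equals(subcommand)) {
            fixed.add(subcommand);              // re-insert the stripped subcommand
        }
        fixed.addAll(Arrays.asList(args));
        return fixed.toArray(new String[0]);
    }

    public static void main(String[] argv) {
        // Prints [application, -list]
        System.out.println(Arrays.toString(withSubcommand("application", new String[]{"-list"})));
    }
}
{code}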






[jira] [Created] (YARN-5063) Fail to launch AM continuously on a lost NM

2016-05-09 Thread Jun Gong (JIRA)
Jun Gong created YARN-5063:
--

 Summary: Fail to launch AM continuously on a lost NM
 Key: YARN-5063
 URL: https://issues.apache.org/jira/browse/YARN-5063
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong


If an NM shuts down, the RM will not mark it as LOST until the liveness 
monitor detects the timeout. Before that, the RM might keep allocating the AM 
on that NM.

We hit this case in our cluster: the RM kept allocating the same AM on a lost 
NM before it detected the loss, and AMLauncher always failed because it could 
not connect to the lost NM. To solve the problem, we could add the NM to the 
AM blacklist when the RM fails to launch an AM on it.
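
A minimal sketch (hypothetical types, not the actual RM code) of the proposal described above: record a node in the application's AM blacklist when an AM launch on it fails, so the next attempt avoids the likely-lost host:
{code}
import java.util.HashSet;
import java.util.Set;

// Hypothetical per-application AM blacklist.
public class AmBlacklistSketch {
    private final Set<String> amBlacklist = new HashSet<>();

    void onAmLaunchFailure(String nodeId) {
        amBlacklist.add(nodeId);          // avoid this node for future AM attempts
    }

    boolean isBlacklisted(String nodeId) {
        return amBlacklist.contains(nodeId);
    }
}
{code}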






[jira] [Created] (YARN-4910) Fix incomplete log info in ResourceLocalizationService

2016-04-01 Thread Jun Gong (JIRA)
Jun Gong created YARN-4910:
--

 Summary: Fix incomplete log info in ResourceLocalizationService
 Key: YARN-4910
 URL: https://issues.apache.org/jira/browse/YARN-4910
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jun Gong
Assignee: Jun Gong
Priority: Trivial


When debugging, we found a lot of incomplete log messages from 
ResourceLocalizationService, which is a little confusing.
{quote}
2016-03-30 22:47:29,703 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Writing credentials to the nmPrivate file 
/data6/yarnenv/local/nmPrivate/container_1456839788316_4159_01_04_37.tokens.
 Credentials list:
{quote}
The content of the credentials list is only printed at the DEBUG log level, so 
the INFO message above is left incomplete.





[jira] [Created] (YARN-4735) Remove stale LogAggregationReport from NM's context

2016-02-24 Thread Jun Gong (JIRA)
Jun Gong created YARN-4735:
--

 Summary: Remove stale LogAggregationReport from NM's context
 Key: YARN-4735
 URL: https://issues.apache.org/jira/browse/YARN-4735
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jun Gong
Assignee: Jun Gong


{quote}
All LogAggregationReport(current and previous) are only added to 
*context.getLogAggregationStatusForApps*, and never removed.

So for long running service, the LogAggregationReport list NM sends to RM will 
grow over time.
{quote}
Per the discussion in YARN-4720, we need to remove stale LogAggregationReports 
from the NM's context.





[jira] [Created] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2015-12-22 Thread Jun Gong (JIRA)
Jun Gong created YARN-4497:
--

 Summary: RM might fail to restart when recovering apps whose 
attempts are missing
 Key: YARN-4497
 URL: https://issues.apache.org/jira/browse/YARN-4497
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jun Gong
Assignee: Jun Gong


We found the following problem while discussing YARN-3480.

If the RM fails to store some attempts in the RMStateStore, those attempts will 
be missing from the store. For example, when storing attempt1, attempt2 and 
attempt3, the RM successfully stored attempt1 and attempt3 but failed to store 
attempt2. When the RM restarts, *RMAppImpl#recover* recovers attempts one by 
one; in this case it recovers attempt1, then attempt2. When recovering 
attempt2, *((RMAppAttemptImpl)this.currentAttempt).recover(state)* tries to 
look up its ApplicationAttemptStateData, cannot find it, and fails at 
*assert attemptState != null* (*RMAppAttemptImpl#recover*, line 880).
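
A minimal sketch (my assumption about a safer behavior, not an actual patch): treat a missing ApplicationAttemptStateData as "nothing to recover" for that attempt instead of asserting, so one unstored attempt does not abort the whole RM restart:
{code}
import java.util.Map;

// Hypothetical simplified recovery step for a single attempt.
public class AttemptRecoverySketch {
    static void recoverAttempt(String attemptId, Map<String, byte[]> storedAttempts) {
        byte[] attemptState = storedAttempts.get(attemptId);
        if (attemptState == null) {
            // The attempt was never persisted (store failure): skip it rather
            // than trip the assert and abort recovery.
            return;
        }
        // ... restore the attempt from attemptState ...
    }
}
{code}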






[jira] [Created] (YARN-4494) Recover completed apps asynchronously

2015-12-21 Thread Jun Gong (JIRA)
Jun Gong created YARN-4494:
--

 Summary: Recover completed apps asynchronously
 Key: YARN-4494
 URL: https://issues.apache.org/jira/browse/YARN-4494
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong


With RM HA enabled, when recovering apps, recover completed apps asynchronously.





[jira] [Created] (YARN-4459) container-executor might kill process wrongly

2015-12-15 Thread Jun Gong (JIRA)
Jun Gong created YARN-4459:
--

 Summary: container-executor might kill process wrongly
 Key: YARN-4459
 URL: https://issues.apache.org/jira/browse/YARN-4459
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Jun Gong
Assignee: Jun Gong


When 'signal_container_as_user' is called in container-executor, it first 
checks whether the process group exists; if not, it kills the process itself 
(if that process exists). This is not reasonable: the process group not 
existing means the corresponding container has finished, so killing the 
process itself kills the wrong process.

We have seen this happen in our cluster many times. We used the same account 
to start the NM and to submit apps, and container-executor sometimes killed 
the NM (the wrongly killed process might just be a newly started thread that 
was the NM's child process).





[jira] [Created] (YARN-4316) Make NM's version information useful for upgrade

2015-10-29 Thread Jun Gong (JIRA)
Jun Gong created YARN-4316:
--

 Summary: Make NM's version information useful for upgrade
 Key: YARN-4316
 URL: https://issues.apache.org/jira/browse/YARN-4316
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Jun Gong
Assignee: Jun Gong
Priority: Minor


When upgrading all NMs to a new bug-fix version, we often upgrade some NMs 
first, then upgrade the rest if everything looks right. This way we avoid 
breaking the whole cluster if the new NM version does not work well. But there 
is no easy way to tell whether we have missed upgrading some NMs.

We can see all NMs' version info on the RM's web page (as attached). This 
version info is too generic, e.g. 2.4.1, 2.6.1, 2.6.2; for a small bug-fix 
release the version stays the same. If we could make the version info more 
detailed (e.g. 2.4.1.12), we could be sure whether we have upgraded all NMs to 
the new bug-fix version.

I propose adding a new config (yarn.nodemanager.version) in yarn-site.xml to 
solve this problem. When upgrading an NM, we set it to the new version at the 
same time. The NM reports this version to the RM, and we can then see it.
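
A minimal sketch (hypothetical types) of the proposal, assuming the optional property simply overrides the generic build version that the NM reports to the RM:
{code}
import java.util.Map;

// Hypothetical helper: prefer the configured detailed version (e.g. 2.4.1.12)
// over the generic build version (e.g. 2.4.1) when reporting to the RM.
public class NmVersionSketch {
    static String reportedVersion(Map<String, String> conf, String buildVersion) {
        return conf.getOrDefault("yarn.nodemanager.version", buildVersion);
    }
}
{code}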





[jira] [Resolved] (YARN-4316) Make NM's version information useful for upgrade

2015-10-29 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong resolved YARN-4316.

Resolution: Implemented

> Make NM's version information useful for upgrade
> 
>
> Key: YARN-4316
> URL: https://issues.apache.org/jira/browse/YARN-4316
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Minor
> Attachments: nodes.png
>
>
> When upgrading all NMs to a new bug-fix version, we often upgrade some NMs 
> first, then upgrade the rest if everything looks right. This way we avoid 
> breaking the whole cluster if the new NM version does not work well. But 
> there is no easy way to tell whether we have missed upgrading some NMs.
> We can see all NMs' version info on the RM's web page (as attached). This 
> version info is too generic, e.g. 2.4.1, 2.6.1, 2.6.2; for a small bug-fix 
> release the version stays the same. If we could make the version info more 
> detailed (e.g. 2.4.1.12), we could be sure whether we have upgraded all NMs 
> to the new bug-fix version.
> I propose adding a new config (yarn.nodemanager.version) in yarn-site.xml to 
> solve this problem. When upgrading an NM, we set it to the new version at the 
> same time. The NM reports this version to the RM, and we can then see it.





[jira] [Created] (YARN-4201) AMBlacklist does not work for minicluster

2015-09-23 Thread Jun Gong (JIRA)
Jun Gong created YARN-4201:
--

 Summary: AMBlacklist does not work for minicluster
 Key: YARN-4201
 URL: https://issues.apache.org/jira/browse/YARN-4201
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong


For the minicluster (scheduler.include-port-in-node-name set to TRUE), 
AMBlacklist does not work. This is because the RM puts only the host into the 
AMBlacklist regardless of whether scheduler.include-port-in-node-name is set. 
The RM should put "host + port" into the AMBlacklist when it is set.





[jira] [Created] (YARN-4122) Add support for GPU as a resource

2015-09-06 Thread Jun Gong (JIRA)
Jun Gong created YARN-4122:
--

 Summary: Add support for GPU as a resource
 Key: YARN-4122
 URL: https://issues.apache.org/jira/browse/YARN-4122
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Jun Gong
Assignee: Jun Gong


Use [cgroups 
devices|https://www.kernel.org/doc/Documentation/cgroups/devices.txt] to 
isolate GPUs for containers. For docker containers, we could use 'docker run 
--device=...'.

Reference: [SLURM Resources isolation through 
cgroups|http://slurm.schedmd.com/slurm_ug_2011/SLURM_UserGroup2011_cgroups.pdf].





[jira] [Resolved] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2015-09-02 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong resolved YARN-3998.

Resolution: Won't Fix

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
>
> I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM 
> launches containers, it could specify this value. The NM would then re-launch 
> the container up to 'retry-times' times when it fails to run (e.g. the exit 
> code is not 0). This saves a lot of time: it avoids container localization, 
> the RM does not need to re-schedule the container, and local files in the 
> container's working directory are left for re-use (if the container has 
> downloaded some big files, it does not need to re-download them when running 
> again). We find this useful in systems like Storm.





[jira] [Created] (YARN-4005) Completed container whose app is finished is not removed from NMStateStore

2015-07-31 Thread Jun Gong (JIRA)
Jun Gong created YARN-4005:
--

 Summary: Completed container whose app is finished is not removed 
from NMStateStore
 Key: YARN-4005
 URL: https://issues.apache.org/jira/browse/YARN-4005
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jun Gong
Assignee: Jun Gong


If a container is completed and its corresponding app is finished, the NM only 
removes it from its context and does not remove it from the NMStateStore.





[jira] [Created] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2015-07-30 Thread Jun Gong (JIRA)
Jun Gong created YARN-3998:
--

 Summary: Add retry-times to let NM re-launch container when it 
fails to run
 Key: YARN-3998
 URL: https://issues.apache.org/jira/browse/YARN-3998
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Jun Gong
Assignee: Jun Gong


I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM 
launches containers, it could specify this value. The NM would then re-launch 
the container up to 'retry-times' times when it fails to run (e.g. the exit 
code is not 0).

This saves a lot of time: it avoids container localization, the RM does not 
need to re-schedule the container, and local files in the container's working 
directory are left for re-use (if the container has downloaded some big files, 
it does not need to re-download them when running again).

We find this useful in systems like Storm.
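
A minimal sketch (hypothetical types, not the actual NM code) of the relaunch behavior described above: the failed container is relaunched in place, reusing its already-localized working directory, up to 'retry-times' times:
{code}
// Hypothetical simplified model of in-place relaunch.
public class RelaunchSketch {
    interface Container { int launch(); }   // returns the process exit code

    static int launchWithRetries(Container container, int retryTimes) {
        int exitCode = container.launch();
        int attempts = 0;
        while (exitCode != 0 && attempts < retryTimes) {
            attempts++;
            exitCode = container.launch(); // relaunch in the same working directory
        }
        return exitCode;
    }
}
{code}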





[jira] [Created] (YARN-3896) RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset

2015-07-08 Thread Jun Gong (JIRA)
Jun Gong created YARN-3896:
--

 Summary: RMNode transitioned from RUNNING to REBOOTED because its 
response id had not been reset
 Key: YARN-3896
 URL: https://issues.apache.org/jira/browse/YARN-3896
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong


{noformat}
2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 
10.208.132.153 to /default-rack
2015-07-03 16:49:39,075 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Reconnect 
from the node at: 10.208.132.153
2015-07-03 16:49:39,075 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
NodeManager from node 10.208.132.153(cmPort: 8041 httpPort: 8080) registered 
with capability: memory:6144, vCores:60, diskCapacity:213, assigned nodeId 
10.208.132.153:8041
2015-07-03 16:49:39,104 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Too far 
behind rm response id:2506413 nm response id:0
2015-07-03 16:49:39,137 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
Node 10.208.132.153:8041 as it is now REBOOTED
2015-07-03 16:49:39,137 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
10.208.132.153:8041 Node Transitioned from RUNNING to REBOOTED
{noformat}





[jira] [Resolved] (YARN-3831) Localization failed when a local disk turns from bad to good without NM initializes it

2015-06-23 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong resolved YARN-3831.

Resolution: Not A Problem

 Localization failed when a local disk turns from bad to good without NM 
 initializes it
 --

 Key: YARN-3831
 URL: https://issues.apache.org/jira/browse/YARN-3831
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Jun Gong
Assignee: Jun Gong

 A local disk turns from bad back to good without the NM re-initializing it 
 (creating /path-to-local-dir/usercache and /path-to-local-dir/filecache). When 
 localizing a container, container-executor tries to create directories under 
 /path-to-local-dir/usercache and fails, so the container's localization fails.
 Related log is as following:
 {noformat}
 2015-06-19 18:00:01,205 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Created localizer for container_1431957472783_38706012_01_000465
 2015-06-19 18:00:01,212 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Writing credentials to the nmPrivate file 
 /data8/yarnenv/local/nmPrivate/container_1431957472783_38706012_01_000465.tokens.
  Credentials list: 
 2015-06-19 18:00:01,216 WARN 
 org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code 
 from container container_1431957472783_38706012_01_000465 startLocalizer is : 
 20
 org.apache.hadoop.util.Shell$ExitCodeException: 
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
 at org.apache.hadoop.util.Shell.run(Shell.java:379)
 at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
 at 
 org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981)
 2015-06-19 18:00:01,216 INFO 
 org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : command 
 provided 0
 2015-06-19 18:00:01,216 INFO 
 org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is 
 tdwadmin
 2015-06-19 18:00:01,216 INFO 
 org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Failed to create 
 directory /data2/yarnenv/local/usercache/tdwadmin - No such file or directory
 2015-06-19 18:00:01,216 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Localizer failed
 java.io.IOException: Application application_1431957472783_38706012 
 initialization failed (exitCode=20) with output: main : command provided 0
 main : user is tdwadmin
 Failed to create directory /data2/yarnenv/local/usercache/tdwadmin - No such 
 file or directory
 at 
 org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:214)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981)
 Caused by: org.apache.hadoop.util.Shell$ExitCodeException: 
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
 at org.apache.hadoop.util.Shell.run(Shell.java:379)
 at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
 at 
 org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205)
 ... 1 more
 2015-06-19 18:00:01,216 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
  Container container_1431957472783_38706012_01_000465 transitioned from 
 LOCALIZING to LOCALIZATION_FAILED
 {noformat}





[jira] [Created] (YARN-3833) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk

2015-06-19 Thread Jun Gong (JIRA)
Jun Gong created YARN-3833:
--

 Summary: TestWorkPreservingRMRestart#testSchedulerRecovery fails 
in trunk
 Key: YARN-3833
 URL: https://issues.apache.org/jira/browse/YARN-3833
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jun Gong


{noformat}
Running 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 282.811 sec 
 FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
  Time elapsed: 6.445 sec   FAILURE!
java.lang.AssertionError: expected:6144 but was:8192
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
{noformat}





[jira] [Created] (YARN-3831) Localization failed when a local disk turns from bad to good without NM initializes it

2015-06-19 Thread Jun Gong (JIRA)
Jun Gong created YARN-3831:
--

 Summary: Localization failed when a local disk turns from bad to 
good without NM initializes it
 Key: YARN-3831
 URL: https://issues.apache.org/jira/browse/YARN-3831
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Jun Gong
Assignee: Jun Gong


A local disk turns from bad back to good without the NM re-initializing it 
(creating /path-to-local-dir/usercache and /path-to-local-dir/filecache). When 
localizing a container, container-executor tries to create directories under 
/path-to-local-dir/usercache and fails, so the container's localization fails.

Related log is as following:
{noformat}
2015-06-19 18:00:01,205 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Created localizer for container_1431957472783_38706012_01_000465
2015-06-19 18:00:01,212 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Writing credentials to the nmPrivate file 
/data8/yarnenv/local/nmPrivate/container_1431957472783_38706012_01_000465.tokens.
 Credentials list: 
2015-06-19 18:00:01,216 WARN 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code 
from container container_1431957472783_38706012_01_000465 startLocalizer is : 20
org.apache.hadoop.util.Shell$ExitCodeException: 
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981)
2015-06-19 18:00:01,216 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : command 
provided 0
2015-06-19 18:00:01,216 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is 
tdwadmin
2015-06-19 18:00:01,216 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Failed to create 
directory /data2/yarnenv/local/usercache/tdwadmin - No such file or directory
2015-06-19 18:00:01,216 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Localizer failed
java.io.IOException: Application application_1431957472783_38706012 
initialization failed (exitCode=20) with output: main : command provided 0
main : user is tdwadmin
Failed to create directory /data2/yarnenv/local/usercache/tdwadmin - No such 
file or directory

at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:214)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException: 
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205)
... 1 more
2015-06-19 18:00:01,216 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1431957472783_38706012_01_000465 transitioned from 
LOCALIZING to LOCALIZATION_FAILED
{noformat}





[jira] [Resolved] (YARN-3833) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk

2015-06-19 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong resolved YARN-3833.

Resolution: Duplicate

 TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk
 

 Key: YARN-3833
 URL: https://issues.apache.org/jira/browse/YARN-3833
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jun Gong

 {noformat}
 Running 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
 Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 282.811 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
 testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
   Time elapsed: 6.445 sec   FAILURE!
 java.lang.AssertionError: expected:6144 but was:8192
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at org.junit.Assert.assertEquals(Assert.java:555)
   at org.junit.Assert.assertEquals(Assert.java:542)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
 {noformat}





[jira] [Created] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-15 Thread Jun Gong (JIRA)
Jun Gong created YARN-3809:
--

 Summary: Failed to launch new attempts because 
ApplicationMasterLauncher's threads all hang
 Key: YARN-3809
 URL: https://issues.apache.org/jira/browse/YARN-3809
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong


ApplicationMasterLauncher creates a thread pool of size 10 to handle 
AMLauncherEventType events (LAUNCH and CLEANUP).

In our cluster, there were NMs with 10+ AMs running on them, and one of them 
shut down for some reason. After the RM marked the NM as LOST, it cleaned up 
the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ 
CLEANUP events. Its thread pool filled up, and all the threads hung in 
containerMgrProxy.stopContainers(stopRequest) because the NM was down and the 
default RPC timeout is 15 minutes. This means that for 15 minutes 
ApplicationMasterLauncher could not handle new events such as LAUNCH, so new 
attempts failed to launch because of the timeout.
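
A minimal sketch (not the actual ApplicationMasterLauncher code) of one way to avoid the starvation described above, assuming the idea is to keep slow CLEANUP RPCs from occupying the same threads that LAUNCH events need (a larger or configurable pool would be another option):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical split of the launcher's work: cleanup calls that may block on a
// dead NM's RPC for up to 15 minutes no longer hold up new AM launches.
public class LauncherPoolSketch {
    private final ExecutorService launchPool  = Executors.newFixedThreadPool(10);
    private final ExecutorService cleanupPool = Executors.newFixedThreadPool(10);

    void onLaunch(Runnable launchAm) { launchPool.submit(launchAm); }
    void onCleanup(Runnable stopAm)  { cleanupPool.submit(stopAm); }
}
{code}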





[jira] [Resolved] (YARN-3474) Add a way to let NM wait RM to come back, not kill running containers

2015-05-04 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong resolved YARN-3474.

Resolution: Invalid

 Add a way to let NM wait RM to come back, not kill running containers
 -

 Key: YARN-3474
 URL: https://issues.apache.org/jira/browse/YARN-3474
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Attachments: YARN-3474.01.patch


 When RM HA is enabled and the active RM shuts down, the standby RM becomes 
 active and recovers apps and attempts, so apps are not affected.
 However, some cases or bugs can prevent both RMs from starting normally (e.g. 
 [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340], or the RM not 
 being able to connect to ZK). The NM kills the containers running on it when 
 it cannot heartbeat with the RM for some time (the max retry time is 15 mins 
 by default), and then all apps are killed.
 In a production cluster we might run into such cases, and fixing these bugs 
 might take more than 15 mins. To keep apps from being killed by the NM, the 
 YARN admin could set a flag (in our solution, a znode 
 '/wait-rm-to-come-back/cluster-id') telling the NM to wait for the RM to come 
 back instead of killing running containers. After the bugs are fixed and the 
 RM starts normally, the admin clears the flag.





[jira] [Created] (YARN-3480) Make AM max attempts stored in RMStateStore to be configurable

2015-04-13 Thread Jun Gong (JIRA)
Jun Gong created YARN-3480:
--

 Summary: Make AM max attempts stored in RMStateStore to be 
configurable
 Key: YARN-3480
 URL: https://issues.apache.org/jira/browse/YARN-3480
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong


When RM HA is enabled and running containers are kept across attempts, apps 
are more likely to finish successfully with more retries (attempts), so it is 
better to set 'yarn.resourcemanager.am.max-attempts' larger. However, that 
makes the RMStateStore (FileSystem/HDFS/ZK) store more attempts and makes RM 
recovery much slower. It might be better to make the number of attempts stored 
in the RMStateStore configurable.





[jira] [Created] (YARN-3469) Do not set watch for most cases in ZKRMStateStore

2015-04-09 Thread Jun Gong (JIRA)
Jun Gong created YARN-3469:
--

 Summary: Do not set watch for most cases in ZKRMStateStore
 Key: YARN-3469
 URL: https://issues.apache.org/jira/browse/YARN-3469
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
Priority: Minor


In ZKRMStateStore, most operations (e.g. getDataWithRetries) set watches on 
znodes. Large numbers of watches can cause problems such as 
[ZOOKEEPER-706: large numbers of watches can cause session re-establishment to 
fail|https://issues.apache.org/jira/browse/ZOOKEEPER-706].

Although there is a workaround of setting jute.maxbuffer to a larger value, we 
would need to adjust that value again as more apps and attempts are stored in 
ZK. And those watches are not used now, so it might be better not to set them.
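
A minimal sketch of the change described above, assuming the standard ZooKeeper client library (the surrounding state-store logic is omitted): read znode data with watch=false so no watcher is registered for data the store never needs notifications about:
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class NoWatchReadSketch {
    static byte[] readWithoutWatch(ZooKeeper zk, String path)
            throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        // watch=false: no watcher piles up on the session as more apps and
        // attempts are stored.
        return zk.getData(path, false, stat);
    }
}
{code}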





[jira] [Created] (YARN-3389) Two attempts might operate on same data structures concurrently

2015-03-23 Thread Jun Gong (JIRA)
Jun Gong created YARN-3389:
--

 Summary: Two attempts might operate on same data structures 
concurrently
 Key: YARN-3389
 URL: https://issues.apache.org/jira/browse/YARN-3389
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong


In AttemptFailedTransition, the new attempt gets references to the failed 
attempt's state ('justFinishedContainers' and 'finishedContainersSentToAM'). 
The two attempts might then operate on these two collections concurrently, 
e.g. both updating 'justFinishedContainers' while handling CONTAINER_FINISHED 
events.
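
A minimal sketch (hypothetical types) of one way to avoid the shared state, assuming the new attempt should take its own copies of the finished-container collections rather than sharing the failed attempt's references; proper synchronization would be an alternative:
{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified attempt: defensive copies instead of shared references.
public class AttemptTransferSketch {
    static class Attempt {
        final List<String> justFinishedContainers;
        final List<String> finishedContainersSentToAM;

        Attempt(List<String> justFinished, List<String> sentToAM) {
            this.justFinishedContainers = new ArrayList<>(justFinished);
            this.finishedContainersSentToAM = new ArrayList<>(sentToAM);
        }
    }
}
{code}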





[jira] [Resolved] (YARN-3161) Containers' information are lost in some cases when RM restart

2015-02-09 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong resolved YARN-3161.

Resolution: Duplicate

 Containers' information are lost in some cases when RM restart
 --

 Key: YARN-3161
 URL: https://issues.apache.org/jira/browse/YARN-3161
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Jun Gong

 When the RM restarts, containers' information will be lost in the following 
 scenarios:
 1. The NM restarts before it sends its containers' information to the new 
 active RM.
 2. The NM stops and cannot send its containers' information to the new active 
 RM.
 Without that information, the corresponding AM will never get those 
 containers' status through the RM and would just wait for them forever.





[jira] [Created] (YARN-3161) Containers' information are lost in some cases when RM restart

2015-02-09 Thread Jun Gong (JIRA)
Jun Gong created YARN-3161:
--

 Summary: Containers' information are lost in some cases when RM 
restart
 Key: YARN-3161
 URL: https://issues.apache.org/jira/browse/YARN-3161
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Jun Gong


When the RM restarts, containers' information will be lost in the following 
scenarios:
1. The NM restarts before it sends its containers' information to the new 
active RM.
2. The NM stops and cannot send its containers' information to the new active 
RM.

Without that information, the corresponding AM will never get those 
containers' status through the RM and would just wait for them forever.





[jira] [Created] (YARN-3094) reset timer for liveness monitors after RM recovery

2015-01-23 Thread Jun Gong (JIRA)
Jun Gong created YARN-3094:
--

 Summary: reset timer for liveness monitors after RM recovery
 Key: YARN-3094
 URL: https://issues.apache.org/jira/browse/YARN-3094
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong


When the RM restarts, it recovers RMAppAttempts and registers them with the 
AMLivenessMonitor if they are not in a final state. AMs will time out in the 
RM if the recovery process takes a long time for some reason (e.g. too many 
apps).

In our system, we found the recovery process took about 3 minutes, and all 
AMs timed out.





[jira] [Created] (YARN-3057) Need update apps' runnability when reloading allocation files for FairScheduler

2015-01-13 Thread Jun Gong (JIRA)
Jun Gong created YARN-3057:
--

 Summary: Need update apps' runnability when reloading allocation 
files for FairScheduler
 Key: YARN-3057
 URL: https://issues.apache.org/jira/browse/YARN-3057
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong


If we submit an app and the number of running apps in its leaf queue has 
already reached the max limit, the app is put into 'nonRunnableApps'. Its 
runnability is only updated when an app attempt is removed (FairScheduler 
calls `updateRunnabilityOnAppRemoval` at that time).

Suppose only service apps are running; they will not finish, so the submitted 
app will never be scheduled even if we raise the leaf queue's max limit. I 
think we need to update apps' runnability when reloading allocation files for 
the FairScheduler.





[jira] [Created] (YARN-2640) TestDirectoryCollection.testCreateDirectories failed

2014-10-02 Thread Jun Gong (JIRA)
Jun Gong created YARN-2640:
--

 Summary: TestDirectoryCollection.testCreateDirectories failed
 Key: YARN-2640
 URL: https://issues.apache.org/jira/browse/YARN-2640
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Jun Gong
Assignee: Jun Gong


When running the test with mvn test -Dtest=TestDirectoryCollection, it failed:
{code}
Running org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection
Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.538 sec  
FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection
testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection)
  Time elapsed: 0.969 sec   FAILURE!
java.lang.AssertionError: local dir parent not created with proper permissions 
expected:rwxr-xr-x but was:rwxrwxr-x
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at 
org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:104)
{code}

I found this was because testDiskSpaceUtilizationLimit ran before 
testCreateDirectories, so the directory dirA had already been created in 
testDiskSpaceUtilizationLimit. When testCreateDirectories tried to create dirA 
with the specified permissions, it found dirA already existed and did nothing.
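
A minimal sketch (not the actual test fix) of the kind of cleanup that would avoid this ordering dependency, assuming each test should remove the shared directory before recreating it with the permissions it expects:
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical setup/teardown helper for the test directory.
public class TestDirCleanupSketch {
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder())   // delete children before parents
                 .forEach(p -> p.toFile().delete());
        }
    }
}
{code}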





[jira] [Resolved] (YARN-2612) Some completed containers are not reported to NM

2014-10-02 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong resolved YARN-2612.

Resolution: Duplicate

 Some completed containers are not reported to NM
 

 Key: YARN-2612
 URL: https://issues.apache.org/jira/browse/YARN-2612
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
 Fix For: 2.6.0


 We were testing RM work-preserving restart and found the following logs when 
 we ran a simple MapReduce PI job. Some completed containers that had already 
 been pulled by the AM were never acknowledged back to the NM, so the NM kept 
 reporting the completed containers even after the AM had finished.
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In YARN-1372, the NM reports completed containers to the RM until it gets an 
 ACK from the RM. If the AM does not call allocate, which means the AM does 
 not ack the RM, the RM will not ack the NM. We ([~chenchun]) have observed 
 these two cases when running the MapReduce 'pi' job:
 1) The RM sends completed containers to the AM. After receiving them, the AM 
 thinks it has done its work and does not need resources, so it does not call 
 allocate.
 2) When the AM finishes, it cannot ack the RM because the AM itself has not 
 finished yet.
 We think that when RMAppAttempt calls BaseFinalTransition, the AppAttempt is 
 finished, so the RM could then send this AppAttempt's completed containers to 
 the NM.





[jira] [Created] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-09-28 Thread Jun Gong (JIRA)
Jun Gong created YARN-2617:
--

 Summary: NM does not need to send finished container whose APP is 
not running to RM
 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
 Fix For: 2.6.0


We ([~chenchun]) were testing RM work-preserving restart and found the 
following logs when we ran a simple MapReduce PI job. The NM continuously 
reported completed containers whose application had already finished, even 
after the AM had finished.
{code}
2014-09-26 17:00:42,228 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Null container completed...
2014-09-26 17:00:42,228 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Null container completed...
2014-09-26 17:00:43,230 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Null container completed...
2014-09-26 17:00:43,230 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Null container completed...
2014-09-26 17:00:44,233 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Null container completed...
2014-09-26 17:00:44,233 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Null container completed...
{code}

In the patch for YARN-1372, ApplicationImpl on the NM is supposed to guarantee 
cleanup of already completed applications. But it only removes the appId from 
{code}app.context.getApplications(){code} when ApplicationImpl receives the 
event {code}ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED{code}; 
however, the NM might not receive this event for a long time, or might never 
receive it.
* For NonAggregatingLogHandler, it waits for 
YarnConfiguration.NM_LOG_RETAIN_SECONDS (3 * 60 * 60 seconds by default) before 
it is scheduled to delete the application logs and send the event.
* For LogAggregationService, it might fail (e.g. if the user does not have HDFS 
write permission), and then it will not send the event.





[jira] [Created] (YARN-2170) Fix components' version information in the web page 'About the Cluster'

2014-06-17 Thread Jun Gong (JIRA)
Jun Gong created YARN-2170:
--

 Summary: Fix components' version information in the web page 
'About the Cluster'
 Key: YARN-2170
 URL: https://issues.apache.org/jira/browse/YARN-2170
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jun Gong
Priority: Minor


On the 'About the Cluster' web page, the build version shown for YARN 
components (e.g. the ResourceManager) is currently the same as the Hadoop 
version. This is caused by mistakenly calling getVersion() instead of 
_getVersion() in VersionInfo.java.





[jira] [Created] (YARN-2164) Add switch 'restart' for yarn-daemon.sh

2014-06-16 Thread Jun Gong (JIRA)
Jun Gong created YARN-2164:
--

 Summary: Add switch 'restart'  for yarn-daemon.sh
 Key: YARN-2164
 URL: https://issues.apache.org/jira/browse/YARN-2164
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jun Gong
Priority: Minor


For convenience, add a 'restart' switch to yarn-daemon.sh.

e.g. We could use yarn-daemon.sh restart nodemanager instead of 
yarn-daemon.sh stop nodemanager; yarn-daemon.sh start nodemanager.


