[jira] [Assigned] (YARN-506) Move to common utils FileUtil#setReadable/Writable/Executable and FileUtil#canRead/Write/Execute

2013-03-28 Thread Ivan Mitic (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mitic reassigned YARN-506:
---

Assignee: Ivan Mitic

 Move to common utils FileUtil#setReadable/Writable/Executable and 
 FileUtil#canRead/Write/Execute
 

 Key: YARN-506
 URL: https://issues.apache.org/jira/browse/YARN-506
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Ivan Mitic
Assignee: Ivan Mitic
 Attachments: YARN-506.commonfileutils.patch


 Move to the common utils described in HADOOP-9413, which work well cross-platform.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-193) Scheduler.normalizeRequest does not account for allocation requests that exceed maximumAllocation limits

2013-03-28 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-193:
-

Attachment: YARN-193.8.patch

Had an offline discussion with Bikas and Hitesh. We agreed to simplify the 
solution and isolate it from the fix for YARN-382.

 Scheduler.normalizeRequest does not account for allocation requests that 
 exceed maximumAllocation limits 
 -

 Key: YARN-193
 URL: https://issues.apache.org/jira/browse/YARN-193
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.2-alpha, 3.0.0
Reporter: Hitesh Shah
Assignee: Zhijie Shen
 Attachments: MR-3796.1.patch, MR-3796.2.patch, MR-3796.3.patch, 
 MR-3796.wip.patch, YARN-193.4.patch, YARN-193.5.patch, YARN-193.6.patch, 
 YARN-193.7.patch, YARN-193.8.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616325#comment-13616325
 ] 

Robert Joseph Evans commented on YARN-112:
--

I agree that scale exposes races, but the underlying problem is still that we 
want to create a new unique directory.  This seems very simple.

{code}
// baseDir is the parent local dir; rand is a java.util.Random instance.
File uniqueDir = null;
do {
  // mkdir() returns false if the directory already exists, so it acts as the
  // atomic check-and-create; just retry with a new random name on collision.
  uniqueDir = new File(baseDir, String.valueOf(rand.nextLong()));
} while (!uniqueDir.mkdir());
{code}

I don't see why we are going through all of this complexity simply because a 
FileContext API is broken.  Playing games to make the race less likely is fine, 
but ultimately we still have to handle the race.

 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
 Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
 yarn-112-20130326.patch, yarn-112.20131503.patch


 On one of our 0.23 clusters, I saw a case of two containers, corresponding to 
 two map tasks of a MR job, that were launched almost simultaneously on the 
 same node.  It appears they both tried to localize job.jar and job.xml at the 
 same time.  One of the containers failed when it couldn't rename the 
 temporary job.jar directory to its final name because the target directory 
 wasn't empty.  Shortly afterwards the second container failed because job.xml 
 could not be found, presumably because the first container removed it when it 
 cleaned up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616327#comment-13616327
 ] 

Robert Joseph Evans commented on YARN-112:
--

Oh, and the latest patch using a unique number will not always work, because the 
same code is used from different processes on the same box.  We would need a 
way to guarantee uniqueness across the different processes.  
currentTimeMillis helps, but it could still result in a race.
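
For illustration only (not part of any attached patch), one way to get 
cross-process uniqueness while keeping mkdir() as the arbiter is to seed the 
name with a random UUID:

{code}
// Hypothetical sketch: UUIDs are effectively unique across processes, and
// mkdir() still resolves any residual collision atomically.
File uniqueDir;
do {
  uniqueDir = new File(baseDir, java.util.UUID.randomUUID().toString());
} while (!uniqueDir.mkdir());
{code}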

 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
 Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
 yarn-112-20130326.patch, yarn-112.20131503.patch


 On one of our 0.23 clusters, I saw a case of two containers, corresponding to 
 two map tasks of a MR job, that were launched almost simultaneously on the 
 same node.  It appears they both tried to localize job.jar and job.xml at the 
 same time.  One of the containers failed when it couldn't rename the 
 temporary job.jar directory to its final name because the target directory 
 wasn't empty.  Shortly afterwards the second container failed because job.xml 
 could not be found, presumably because the first container removed it when it 
 cleaned up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-512) Log aggregation root directory check is more expensive than it needs to be

2013-03-28 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-512:
---

 Summary: Log aggregation root directory check is more expensive 
than it needs to be
 Key: YARN-512
 URL: https://issues.apache.org/jira/browse/YARN-512
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.5-beta
Reporter: Jason Lowe
Priority: Minor


The log aggregation root directory check first does an {{exists}} call followed 
by a {{getFileStatus}} call.  That effectively stats the file twice.  It should 
just use {{getFileStatus}} and catch {{FileNotFoundException}} to handle the 
non-existent case.

In addition we may consider caching the presence of the directory rather than 
checking it each time a node aggregates logs for an application.
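
A minimal sketch of the suggested pattern (illustrative names, not the actual 
LogAggregationService code):

{code}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// One getFileStatus() call instead of exists() + getFileStatus(); a missing
// directory surfaces as FileNotFoundException rather than costing a second RPC.
static FileStatus statRootLogDir(FileSystem fs, Path rootLogDir) throws IOException {
  try {
    return fs.getFileStatus(rootLogDir);
  } catch (FileNotFoundException e) {
    return null;  // root log dir does not exist yet; caller can create it here
  }
}
{code}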

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-509) ResourceTrackerPB misses KerberosInfo annotation which renders YARN unusable on secure clusters

2013-03-28 Thread Roman Shaposhnik (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616415#comment-13616415
 ] 

Roman Shaposhnik commented on YARN-509:
---

I totally agree that it needs to be investigated. That said, if we have to rush 
2.0.4-alpha I'd say the proposed patch might be a reasonable workaround.

 ResourceTrackerPB misses KerberosInfo annotation which renders YARN unusable 
 on secure clusters
 ---

 Key: YARN-509
 URL: https://issues.apache.org/jira/browse/YARN-509
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.1-alpha
 Environment: BigTop Kerberized cluster test environment
Reporter: Konstantin Boudnik
Priority: Blocker
 Fix For: 3.0.0, 2.0.4-alpha

 Attachments: YARN-509.patch.txt


 During the BigTop 0.6.0 release test cycle, [~rvs] came across the following 
 problem:
 {noformat}
 2013-03-26 15:37:03,573 FATAL
 org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting
 NodeManager
 org.apache.hadoop.yarn.YarnException: Failed to Start
 org.apache.hadoop.yarn.server.nodemanager.NodeManager
 at 
 org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:78)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:199)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:322)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:359)
 Caused by: org.apache.avro.AvroRuntimeException:
 java.lang.reflect.UndeclaredThrowableException
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:162)
 at 
 org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68)
 ... 3 more
 Caused by: java.lang.reflect.UndeclaredThrowableException
 at 
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:128)
 at 
 org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:61)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:199)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:158)
 ... 4 more
 Caused by: 
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException):
 User yarn/ip-10-46-37-244.ec2.internal@BIGTOP (auth:KERBEROS) is not
 authorized for protocol interface
 org.apache.hadoop.yarn.server.api.ResourceTrackerPB, expected client
 Kerberos principal is yarn/ip-10-46-37-244.ec2.internal@BIGTOP
 at org.apache.hadoop.ipc.Client.call(Client.java:1235)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
 at $Proxy26.registerNodeManager(Unknown Source)
 at 
 org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:59)
 ... 6 more
 {noformat}
 The most significant part is 
 {{User yarn/ip-10-46-37-244.ec2.internal@BIGTOP (auth:KERBEROS) is not 
 authorized for protocol interface 
 org.apache.hadoop.yarn.server.api.ResourceTrackerPB}}, indicating that 
 ResourceTrackerPB hasn't been annotated with either {{@KerberosInfo}} or 
 {{@TokenInfo}}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-28 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616447#comment-13616447
 ] 

Vinod Kumar Vavilapalli commented on YARN-112:
--

bq. Playing games to make the race less likely is fine. But ultimately we still 
have to handle the race.
bq. Oh and the latest patch using a unique number will not always work, because 
the same code is used from different processes on the same box.
Bobby, the unique number generation is done in one single process and 
communicated down. ResourceTrackerService (NodeManager process) generates the 
unique path and passes it down to FSDownload (Localizer process), so we can 
avoid the race altogether.

 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
 Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
 yarn-112-20130326.patch, yarn-112.20131503.patch


 On one of our 0.23 clusters, I saw a case of two containers, corresponding to 
 two map tasks of a MR job, that were launched almost simultaneously on the 
 same node.  It appears they both tried to localize job.jar and job.xml at the 
 same time.  One of the containers failed when it couldn't rename the 
 temporary job.jar directory to its final name because the target directory 
 wasn't empty.  Shortly afterwards the second container failed because job.xml 
 could not be found, presumably because the first container removed it when it 
 cleaned up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-450) Define value for * in the scheduling protocol

2013-03-28 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-450:
-

Attachment: YARN-450_7.patch

isAnyLocation returns boolean instead of Boolean now.

 Define value for * in the scheduling protocol
 -

 Key: YARN-450
 URL: https://issues.apache.org/jira/browse/YARN-450
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Zhijie Shen
 Attachments: YARN-450_1.patch, YARN-450_2.patch, YARN-450_3.patch, 
 YARN-450_4.patch, YARN-450_5.patch, YARN-450_6.patch, YARN-450_7.patch


 The ResourceRequest has a string field to specify node/rack locations. For 
 the cross-rack/cluster-wide location (i.e. when there is no locality 
 constraint) the * string is used everywhere. However, it's not defined 
 anywhere, and each piece of code either defines a local constant or uses the 
 string literal. It would be good to define * in the protocol and remove the 
 other local references from the code base.
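
 A sketch of the idea (illustrative only; the committed constant name may differ):
 {code}
 // A single protocol-level constant for "no locality constraint", referenced
 // from all call sites instead of a local "*" literal.
 public abstract class ResourceRequest {
   public static final String ANY = "*";
   // ...
 }
 {code}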

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-248) Restore RMDelegationTokenSecretManager state on restart

2013-03-28 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated YARN-248:


Assignee: Bikas Saha

 Restore RMDelegationTokenSecretManager state on restart
 ---

 Key: YARN-248
 URL: https://issues.apache.org/jira/browse/YARN-248
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tom White
Assignee: Bikas Saha

 On restart, the RM creates a new RMDelegationTokenSecretManager with fresh 
 state. This will cause problems for Oozie jobs running on secure clusters 
 since the delegation tokens stored in the job credentials (used by the Oozie 
 launcher job to submit a job to the RM) will not be recognized by the RM, and 
 recovery will fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-513) Verify all clients will wait for RM to restart

2013-03-28 Thread Bikas Saha (JIRA)
Bikas Saha created YARN-513:
---

 Summary: Verify all clients will wait for RM to restart
 Key: YARN-513
 URL: https://issues.apache.org/jira/browse/YARN-513
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: jian he


When the RM is restarting, the NM, AM and Clients should wait for some time for 
the RM to come back up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-514) Delayed store operations should not result in RM unavailability for app submission

2013-03-28 Thread Bikas Saha (JIRA)
Bikas Saha created YARN-514:
---

 Summary: Delayed store operations should not result in RM 
unavailability for app submission
 Key: YARN-514
 URL: https://issues.apache.org/jira/browse/YARN-514
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Bikas Saha


Currently, app submission is the only store operation performed synchronously 
because the app must be stored before the request returns with success. This 
makes the RM susceptible to blocking all client threads on slow store 
operations, resulting in RM being perceived as unavailable by clients.
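
One possible direction, sketched here with made-up names (not the actual 
RMStateStore API), is to hand the blocking write to a dedicated dispatcher 
thread so RPC handler threads are never blocked on the store:

{code}
// Hypothetical sketch: accept the app in memory, then persist it asynchronously.
private final ExecutorService storeDispatcher = Executors.newSingleThreadExecutor();

void submitApplication(final ApplicationSubmissionContext context) {
  rmContext.addApplication(context);          // app is visible to the RM right away
  storeDispatcher.execute(new Runnable() {
    @Override
    public void run() {
      stateStore.storeApplication(context);   // slow store no longer blocks RPC handlers
    }
  });
}
{code}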

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-437) Update documentation of Writing Yarn Applications to match current best practices

2013-03-28 Thread Eli Reisman (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Reisman updated YARN-437:
-

Attachment: YARN-437-3.patch

Added a couple of small points I missed before and fixed (I hope) some 
formatting. If anyone notices any misuses of the document formatting, 
please let me know. Thanks!

 Update documentation of Writing Yarn Applications to match current best 
 practices
 ---

 Key: YARN-437
 URL: https://issues.apache.org/jira/browse/YARN-437
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Reporter: Hitesh Shah
Assignee: Eli Reisman
 Attachments: YARN-437-1.patch, YARN-437-2.patch, YARN-437-3.patch


 Should fix docs to point to usage of YarnClient and AMRMClient helper libs. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-467) Jobs fail during resource localization when public distributed-cache hits unix directory limits

2013-03-28 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616558#comment-13616558
 ] 

Vinod Kumar Vavilapalli commented on YARN-467:
--

Another thing I've been looking at hard is whether 
LocalResourceTracker.localizationCompleted() can be done away with completely 
in favour of the handle() method. But to do that we need to handle both 
successful and failed localizations via handle(). I can already see a couple 
of bugs related to localization failures, so let's do this separately.

 Jobs fail during resource localization when public distributed-cache hits 
 unix directory limits
 ---

 Key: YARN-467
 URL: https://issues.apache.org/jira/browse/YARN-467
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0, 2.0.0-alpha
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
 Attachments: yarn-467-20130322.1.patch, yarn-467-20130322.2.patch, 
 yarn-467-20130322.3.patch, yarn-467-20130322.patch, 
 yarn-467-20130325.1.patch, yarn-467-20130325.path


 If we have multiple jobs which use the distributed cache with small files, 
 the directory limit is reached before the cache size limit, and no more 
 directories can be created in the file cache (PUBLIC). The jobs start failing 
 with the exception below.
 java.io.IOException: mkdir of /tmp/nm-local-dir/filecache/3901886847734194975 
 failed
   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909)
   at 
 org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
   at 
 org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 We need a mechanism to create a directory hierarchy and limit the number of 
 files per directory.
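
 For illustration, one way to bound files per directory is to spread cache ids 
 over a small tree (a sketch of the idea, not the actual patch):
 {code}
 // Hypothetical layout: no directory ever holds more than FILES_PER_DIR entries.
 static final int FILES_PER_DIR = 1000;

 static String relativeCachePath(long cacheId) {
   StringBuilder path = new StringBuilder();
   long bucket = cacheId / FILES_PER_DIR;
   while (bucket > 0) {
     path.insert(0, (bucket % FILES_PER_DIR) + "/");
     bucket /= FILES_PER_DIR;
   }
   return path.append(cacheId).toString();
 }
 {code}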

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-515) Node Manager not getting the master key

2013-03-28 Thread Robert Joseph Evans (JIRA)
Robert Joseph Evans created YARN-515:


 Summary: Node Manager not getting the master key
 Key: YARN-515
 URL: https://issues.apache.org/jira/browse/YARN-515
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.4-alpha
Reporter: Robert Joseph Evans
Priority: Blocker


On the latest version of branch-2, I see the following on a secure cluster.

{noformat}
2013-03-28 19:21:06,243 [main] INFO 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security 
enabled - updating secret keys now
2013-03-28 19:21:06,243 [main] INFO 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
with ResourceManager as RM:PORT with total resource of memory:12288, vCores:16
2013-03-28 19:21:06,244 [main] INFO 
org.apache.hadoop.yarn.service.AbstractService: 
Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is 
started.
2013-03-28 19:21:06,245 [main] INFO 
org.apache.hadoop.yarn.service.AbstractService: 
Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
2013-03-28 19:21:07,257 [Node Status Updater] ERROR 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
exception in status-updater
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407)
{noformat}

The NullPointerException just keeps repeating and all of the nodes end up 
being lost.  It looks like the NM never gets the secret key when it registers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-450) Define value for * in the scheduling protocol

2013-03-28 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616584#comment-13616584
 ] 

Bikas Saha commented on YARN-450:
-

+1. Committed to trunk and branch-2. Thanks Zhijie!

 Define value for * in the scheduling protocol
 -

 Key: YARN-450
 URL: https://issues.apache.org/jira/browse/YARN-450
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Zhijie Shen
 Attachments: YARN-450_1.patch, YARN-450_2.patch, YARN-450_3.patch, 
 YARN-450_4.patch, YARN-450_5.patch, YARN-450_6.patch, YARN-450_7.patch


 The ResourceRequest has a string field to specify node/rack locations. For 
 the cross-rack/cluster-wide location (i.e. when there is no locality 
 constraint) the * string is used everywhere. However, it's not defined 
 anywhere, and each piece of code either defines a local constant or uses the 
 string literal. It would be good to define * in the protocol and remove the 
 other local references from the code base.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-515) Node Manager not getting the master key

2013-03-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616628#comment-13616628
 ] 

Robert Joseph Evans commented on YARN-515:
--

OK, it actually looks like the NM is trying to get the master key before it 
has ever been set, which is causing the NPE.

 Node Manager not getting the master key
 ---

 Key: YARN-515
 URL: https://issues.apache.org/jira/browse/YARN-515
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.4-alpha
Reporter: Robert Joseph Evans
Priority: Blocker

 On the latest version of branch-2, I see the following on a secure cluster.
 {noformat}
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security 
 enabled - updating secret keys now
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
 with ResourceManager as RM:PORT with total resource of memory:12288, vCores:16
 2013-03-28 19:21:06,244 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is 
 started.
 2013-03-28 19:21:06,245 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
 2013-03-28 19:21:07,257 [Node Status Updater] ERROR 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
 exception in status-updater
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407)
 {noformat}
 The NullPointerException just keeps repeating and all of the nodes end up 
 being lost.  It looks like the NM never gets the secret key when it registers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-509) $var shell substitution in properties are not expanded in hadoop-policy.xml

2013-03-28 Thread Roman Shaposhnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Shaposhnik updated YARN-509:
--

Summary: $var shell substitution in properties are not expanded in 
hadoop-policy.xml  (was: ResourceTrackerPB misses KerberosInfo annotation which 
renders YARN unusable on secure clusters)

 $var shell substitution in properties are not expanded in hadoop-policy.xml
 ---

 Key: YARN-509
 URL: https://issues.apache.org/jira/browse/YARN-509
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.1-alpha
 Environment: BigTop Kerberized cluster test environment
Reporter: Konstantin Boudnik
Priority: Blocker
 Fix For: 3.0.0, 2.0.4-alpha

 Attachments: YARN-509.patch.txt


 During the BigTop 0.6.0 release test cycle, [~rvs] came across the following 
 problem:
 {noformat}
 2013-03-26 15:37:03,573 FATAL
 org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting
 NodeManager
 org.apache.hadoop.yarn.YarnException: Failed to Start
 org.apache.hadoop.yarn.server.nodemanager.NodeManager
 at 
 org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:78)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:199)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:322)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:359)
 Caused by: org.apache.avro.AvroRuntimeException:
 java.lang.reflect.UndeclaredThrowableException
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:162)
 at 
 org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68)
 ... 3 more
 Caused by: java.lang.reflect.UndeclaredThrowableException
 at 
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:128)
 at 
 org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:61)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:199)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:158)
 ... 4 more
 Caused by: 
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException):
 User yarn/ip-10-46-37-244.ec2.internal@BIGTOP (auth:KERBEROS) is not
 authorized for protocol interface
 org.apache.hadoop.yarn.server.api.ResourceTrackerPB, expected client
 Kerberos principal is yarn/ip-10-46-37-244.ec2.internal@BIGTOP
 at org.apache.hadoop.ipc.Client.call(Client.java:1235)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
 at $Proxy26.registerNodeManager(Unknown Source)
 at 
 org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:59)
 ... 6 more
 {noformat}
 The most significant part is 
 {{User yarn/ip-10-46-37-244.ec2.internal@BIGTOP (auth:KERBEROS) is not 
 authorized for protocol interface 
 org.apache.hadoop.yarn.server.api.ResourceTrackerPB}}, indicating that 
 ResourceTrackerPB hasn't been annotated with either {{@KerberosInfo}} or 
 {{@TokenInfo}}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-509) $var shell substitution in properties are not expanded in hadoop-policy.xml

2013-03-28 Thread Roman Shaposhnik (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616685#comment-13616685
 ] 

Roman Shaposhnik commented on YARN-509:
---

Guys, I've updated the description of the JIRA to better reflect the 
latest findings. I'm leaving it as a blocker for now, expecting somebody else to 
chime in and propose whether we apply the patch I provided or RELNOTE this if 
there's not enough time to get to the bottom of the issue.

 $var shell substitution in properties are not expanded in hadoop-policy.xml
 ---

 Key: YARN-509
 URL: https://issues.apache.org/jira/browse/YARN-509
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.1-alpha
 Environment: BigTop Kerberized cluster test environment
Reporter: Konstantin Boudnik
Priority: Blocker
 Fix For: 3.0.0, 2.0.4-alpha

 Attachments: YARN-509.patch.txt


 During the BigTop 0.6.0 release test cycle, [~rvs] came across the following 
 problem:
 {noformat}
 2013-03-26 15:37:03,573 FATAL
 org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting
 NodeManager
 org.apache.hadoop.yarn.YarnException: Failed to Start
 org.apache.hadoop.yarn.server.nodemanager.NodeManager
 at 
 org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:78)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:199)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:322)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:359)
 Caused by: org.apache.avro.AvroRuntimeException:
 java.lang.reflect.UndeclaredThrowableException
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:162)
 at 
 org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68)
 ... 3 more
 Caused by: java.lang.reflect.UndeclaredThrowableException
 at 
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:128)
 at 
 org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:61)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:199)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:158)
 ... 4 more
 Caused by: 
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException):
 User yarn/ip-10-46-37-244.ec2.internal@BIGTOP (auth:KERBEROS) is not
 authorized for protocol interface
 org.apache.hadoop.yarn.server.api.ResourceTrackerPB, expected client
 Kerberos principal is yarn/ip-10-46-37-244.ec2.internal@BIGTOP
 at org.apache.hadoop.ipc.Client.call(Client.java:1235)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
 at $Proxy26.registerNodeManager(Unknown Source)
 at 
 org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:59)
 ... 6 more
 {noformat}
 The most significant part is 
 {{User yarn/ip-10-46-37-244.ec2.internal@BIGTOP (auth:KERBEROS) is not 
 authorized for protocol interface 
 org.apache.hadoop.yarn.server.api.ResourceTrackerPB}}, indicating that 
 ResourceTrackerPB hasn't been annotated with either {{@KerberosInfo}} or 
 {{@TokenInfo}}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-515) Node Manager not getting the master key

2013-03-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616698#comment-13616698
 ] 

Robert Joseph Evans commented on YARN-515:
--

This is really odd.  I put logging in the ResourceTrackerService and in the 
NodeStatusUpdaterImpl.  The RM sets the secret key in the 
RegisterNodeManagerResponse, but the NM only sees a null come out for it.  
Because of that, the heartbeat always fails with the NPE, trying to read 
something that was never set.

 Node Manager not getting the master key
 ---

 Key: YARN-515
 URL: https://issues.apache.org/jira/browse/YARN-515
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.4-alpha
Reporter: Robert Joseph Evans
Priority: Blocker

 On the latest version of branch-2, I see the following on a secure cluster.
 {noformat}
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security 
 enabled - updating secret keys now
 2013-03-28 19:21:06,243 [main] INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
 with ResourceManager as RM:PORT with total resource of memory:12288, vCores:16
 2013-03-28 19:21:06,244 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is 
 started.
 2013-03-28 19:21:06,245 [main] INFO 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
 2013-03-28 19:21:07,257 [Node Status Updater] ERROR 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
 exception in status-updater
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407)
 {noformat}
 The NullPointerException just keeps repeating and all of the nodes end up 
 being lost.  It looks like the NM never gets the secret key when it registers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-276) Capacity Scheduler can hang when submit many jobs concurrently

2013-03-28 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616708#comment-13616708
 ] 

Zhijie Shen commented on YARN-276:
--

IMO, the essential problem is that maxActiveApplications is a loose bound. See 
the formulas below.

1. clusterResource * maximumApplicationMasterResourcePercent = minAllocation * 
maxActiveApplications.

maxActiveApplications is computed by assuming each application only requires 
minAllocation. In fact, an AM container may require more. Therefore,

2. clusterResource * maximumApplicationMasterResourcePercent = minAllocation * 
maxActiveApplications = (minAllocation_1 + minAllocation_2 + ... + 
minAllocation_k) <= (requestedResource_1 + requestedResource_2 + ... + 
requestedResource_k), where k = maxActiveApplications.

Hence, when maxActiveApplications applications are activated and they require 
more than minAllocation resource, more than 
maximumApplicationMasterResourcePercent of clusterResource may be used by AMs, 
and even clusterResource is likely to be exceeded.

@nemon's solution looks good; it is actually a stricter bound on the max 
allowed active applications. Whenever an application is to be activated, the 
following criterion is checked:

3. clusterResource * maximumApplicationMasterResourcePercent - 
ApplicationMasterResource >= requestedResource.

The issue here is that when this criterion is met, maxActiveApplications should 
be met as well, because this one is stricter. So instead of adding the new 
criterion, how about replacing maxActiveApplications with it?
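
For concreteness, the check could look roughly like this (simplified to memory 
and with made-up names, not the actual CapacityScheduler code):

{code}
// Activate an application only while the AMs' aggregate resource, including the
// new AM's actual request, stays within the configured AM share of the cluster.
boolean canActivate(int clusterMemoryMB, float maxAMResourcePercent,
                    int amMemoryInUseMB, int requestedAMMemoryMB) {
  int amLimitMB = (int) (clusterMemoryMB * maxAMResourcePercent);
  return amMemoryInUseMB + requestedAMMemoryMB <= amLimitMB;
}
{code}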

 Capacity Scheduler can hang when submit many jobs concurrently
 --

 Key: YARN-276
 URL: https://issues.apache.org/jira/browse/YARN-276
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 3.0.0, 2.0.1-alpha
Reporter: nemon lou
 Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, 
 YARN-276.patch, YARN-276.patch, YARN-276.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 In Hadoop 2.0.1, when I submit many jobs concurrently, the Capacity 
 Scheduler can hang with most resources taken up by AMs and not enough 
 resources left for tasks. All applications then hang there.
 The cause is that yarn.scheduler.capacity.maximum-am-resource-percent is not 
 checked directly. Instead, this property is only used to compute 
 maxActiveApplications, and maxActiveApplications is computed from 
 minimumAllocation (not from what the AMs actually use).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (YARN-475) Remove ApplicationConstants.AM_APP_ATTEMPT_ID_ENV as it is no longer set in an AM's environment

2013-03-28 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah reassigned YARN-475:


Assignee: Hitesh Shah

 Remove ApplicationConstants.AM_APP_ATTEMPT_ID_ENV as it is no longer set in 
 an AM's environment
 ---

 Key: YARN-475
 URL: https://issues.apache.org/jira/browse/YARN-475
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Hitesh Shah

 AMs are expected to use ApplicationConstants.AM_CONTAINER_ID_ENV and derive 
 the application attempt id from the container id. 
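
 For example (a sketch; it assumes the 2.x ConverterUtils helper is available 
 to the AM and is not part of this patch):
 {code}
 // Recover the attempt id from the container id set in the AM's environment.
 String containerIdStr = System.getenv(ApplicationConstants.AM_CONTAINER_ID_ENV);
 ContainerId containerId = ConverterUtils.toContainerId(containerIdStr);
 ApplicationAttemptId attemptId = containerId.getApplicationAttemptId();
 {code}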

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-475) Remove ApplicationConstants.AM_APP_ATTEMPT_ID_ENV as it is no longer set in an AM's environment

2013-03-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-475:
-

Issue Type: Sub-task  (was: Bug)
Parent: YARN-386

 Remove ApplicationConstants.AM_APP_ATTEMPT_ID_ENV as it is no longer set in 
 an AM's environment
 ---

 Key: YARN-475
 URL: https://issues.apache.org/jira/browse/YARN-475
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Hitesh Shah

 AMs are expected to use ApplicationConstants.AM_CONTAINER_ID_ENV and derive 
 the application attempt id from the container id. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-435) Make it easier to access cluster topology information in an AM

2013-03-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-435:
-

Issue Type: Sub-task  (was: Bug)
Parent: YARN-386

 Make it easier to access cluster topology information in an AM
 --

 Key: YARN-435
 URL: https://issues.apache.org/jira/browse/YARN-435
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Hitesh Shah

 ClientRMProtocol exposes a getClusterNodes api that provides a report on all 
 nodes in the cluster including their rack information. 
 However, this requires the AM to open and establish a separate connection to 
 the RM in addition to one for the AMRMProtocol. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-475) Remove ApplicationConstants.AM_APP_ATTEMPT_ID_ENV as it is no longer set in an AM's environment

2013-03-28 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated YARN-475:
-

Attachment: YARN-475.1.patch

Trivial patch - no tests.

 Remove ApplicationConstants.AM_APP_ATTEMPT_ID_ENV as it is no longer set in 
 an AM's environment
 ---

 Key: YARN-475
 URL: https://issues.apache.org/jira/browse/YARN-475
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Hitesh Shah
 Attachments: YARN-475.1.patch


 AMs are expected to use ApplicationConstants.AM_CONTAINER_ID_ENV and derive 
 the application attempt id from the container id. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-209) Capacity scheduler doesn't trigger app-activation after adding nodes

2013-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616790#comment-13616790
 ] 

Hudson commented on YARN-209:
-

Integrated in Hadoop-trunk-Commit #3537 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/3537/])
YARN-209. Fix CapacityScheduler to trigger application-activation when the 
cluster capacity changes. Contributed by Zhijie Shen. (Revision 1461773)

 Result = SUCCESS
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1461773
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java


 Capacity scheduler doesn't trigger app-activation after adding nodes
 

 Key: YARN-209
 URL: https://issues.apache.org/jira/browse/YARN-209
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Reporter: Bikas Saha
Assignee: Zhijie Shen
 Fix For: 2.0.5-beta

 Attachments: YARN-209.1.patch, YARN-209.2.patch, YARN-209.3.patch, 
 YARN-209.4.patch, YARN-209-test.patch


 Say application A is submitted but at that time it does not meet the bar for 
 activation because of resource limit settings for applications. After that, if 
 more hardware is added to the system and the application becomes valid, it 
 still remains in the pending state, likely forever.
 This might be rare to hit in real life because enough NMs heartbeat to the 
 RM before applications can get submitted. But a change in settings or 
 heartbeat interval might make it easier to repro. In RM restart scenarios, 
 this will likely hit more often if it's implemented by re-playing events and 
 re-submitting applications to the scheduler before the RPC to NMs is 
 activated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-450) Define value for * in the scheduling protocol

2013-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616792#comment-13616792
 ] 

Hudson commented on YARN-450:
-

Integrated in Hadoop-trunk-Commit #3537 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/3537/])
YARN-450. Define value for * in the scheduling protocol (Zhijie Shen via 
bikas) (Revision 1462271)

 Result = SUCCESS
bikas : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1462271
Files : 
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MRAppBenchmark.java
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ResourceRequest.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/AMRMClient.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/AMRMClientImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestAMRMClient.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNode.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerNode.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AppSchedulable.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSSchedulerApp.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSSchedulerNode.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/Application.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/Task.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
* 

[jira] [Commented] (YARN-24) Nodemanager fails to start if log aggregation enabled and namenode unavailable

2013-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616793#comment-13616793
 ] 

Hudson commented on YARN-24:


Integrated in Hadoop-trunk-Commit #3537 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/3537/])
YARN-24. Nodemanager fails to start if log aggregation enabled and namenode 
unavailable. (sandyr via tucu) (Revision 1461891)

 Result = SUCCESS
tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1461891
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java


 Nodemanager fails to start if log aggregation enabled and namenode unavailable
 --

 Key: YARN-24
 URL: https://issues.apache.org/jira/browse/YARN-24
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.3, 2.0.0-alpha
Reporter: Jason Lowe
Assignee: Sandy Ryza
 Fix For: 2.0.5-beta

 Attachments: YARN-24-1.patch, YARN-24-2.patch, YARN-24-3.patch, 
 YARN-24.patch


 If log aggregation is enabled and the namenode is currently unavailable, the 
 nodemanager fails to startup.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-309) Make RM provide heartbeat interval to NM

2013-03-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616804#comment-13616804
 ] 

Hadoop QA commented on YARN-309:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12575825/YARN-309.6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/620//console

This message is automatically generated.

 Make RM provide heartbeat interval to NM
 

 Key: YARN-309
 URL: https://issues.apache.org/jira/browse/YARN-309
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-309.1.patch, YARN-309.2.patch, YARN-309.3.patch, 
 YARN-309.4.patch, YARN-309.5.patch, YARN-309.6.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-477) MiniYARNCluster: When container executor script fails to launch App Master, NM logs error, but Client doesn't get signaled to kill the job

2013-03-28 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616817#comment-13616817
 ] 

Vinod Kumar Vavilapalli commented on YARN-477:
--

Eli, please reopen the ticket if you run into this again. Tx.

 MiniYARNCluster: When container executor script fails to launch App Master, 
 NM logs error, but Client doesn't get signaled to kill the job
 --

 Key: YARN-477
 URL: https://issues.apache.org/jira/browse/YARN-477
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Eli Reisman
Assignee: Zhijie Shen

 I have been porting Giraph to YARN (GIRAPH-13 is the issue). When I launch 
 my App Master, if the container command line runs it successfully, any 
 failure in the App Master or my launched Giraph tasks promptly reports to the 
 Client and ends my job run. However, if the command line sent to the app 
 master container fails to launch it at all, the error exit code is not 
 propagated. My client hangs with the job at containersUsed == 1 and state == 
 ACCEPTED for as long as you want to sit and wait before CTRL-C'ing your way 
 out.
 Disclaimer: this could be my fault, but I wanted to throw it out there in 
 case it's not. I'm also (when this happens) not getting error logs since the 
 app master never launched, so I really have no visibility into why it failed 
 to launch. I am sure it's not launching, but the client IS sending the app 
 request, getting a container for my AM, and I see the command line run on the 
 container in my logs. That's all.
 Thanks! If this is a dup or won't fix for some reason, let me know and 
 sorry for wasting your time!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-509) $var shell substitution in properties are not expanded in hadoop-policy.xml

2013-03-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616837#comment-13616837
 ] 

Hadoop QA commented on YARN-509:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12575809/YARN-509.patch.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 eclipse:eclipse{color}.  The patch failed to build with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/618//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/618//console

This message is automatically generated.

 $var shell substitution in properties are not expanded in hadoop-policy.xml
 ---

 Key: YARN-509
 URL: https://issues.apache.org/jira/browse/YARN-509
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.1-alpha
 Environment: BigTop Kerberized cluster test environment
Reporter: Konstantin Boudnik
Priority: Blocker
 Fix For: 3.0.0, 2.0.4-alpha

 Attachments: YARN-509.patch.txt


 During the BigTop 0.6.0 release test cycle, [~rvs] came across the following 
 problem:
 {noformat}
 2013-03-26 15:37:03,573 FATAL
 org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting
 NodeManager
 org.apache.hadoop.yarn.YarnException: Failed to Start
 org.apache.hadoop.yarn.server.nodemanager.NodeManager
 at 
 org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:78)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:199)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:322)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:359)
 Caused by: org.apache.avro.AvroRuntimeException:
 java.lang.reflect.UndeclaredThrowableException
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:162)
 at 
 org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68)
 ... 3 more
 Caused by: java.lang.reflect.UndeclaredThrowableException
 at 
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:128)
 at 
 org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:61)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:199)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:158)
 ... 4 more
 Caused by: 
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException):
 User yarn/ip-10-46-37-244.ec2.internal@BIGTOP (auth:KERBEROS) is not
 authorized for protocol interface
 org.apache.hadoop.yarn.server.api.ResourceTrackerPB, expected client
 Kerberos principal is yarn/ip-10-46-37-244.ec2.internal@BIGTOP
 at org.apache.hadoop.ipc.Client.call(Client.java:1235)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
 at $Proxy26.registerNodeManager(Unknown Source)
 at 
 org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:59)
 ... 6 more
 {noformat}
 The most significant part is 
 {{User yarn/ip-10-46-37-244.ec2.internal@BIGTOP (auth:KERBEROS) is not 
 authorized for protocol interface  
 org.apache.hadoop.yarn.server.api.ResourceTrackerPB}} indicating that 
 

[jira] [Commented] (YARN-493) NodeManager job control logic flaws on Windows

2013-03-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616846#comment-13616846
 ] 

Hadoop QA commented on YARN-493:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12575800/YARN-493.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestContainerLocalizer

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/619//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/619//console

This message is automatically generated.

 NodeManager job control logic flaws on Windows
 --

 Key: YARN-493
 URL: https://issues.apache.org/jira/browse/YARN-493
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Chris Nauroth
Assignee: Chris Nauroth
 Fix For: 3.0.0

 Attachments: YARN-493.1.patch, YARN-493.2.patch


 Both product and test code contain some platform-specific assumptions, such 
 as availability of bash for executing a command in a container and signals to 
 check existence of a process and terminate it.
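
 Purely as illustration (this is not the YARN-493 patch itself), here is a minimal sketch of removing those assumptions by branching on org.apache.hadoop.util.Shell.WINDOWS for both the launch command and the process-liveness check. The class name, method names, and the Windows-side commands are assumptions made for this sketch.
{code}
// Sketch only: pick per-platform commands instead of assuming bash/kill.
// The Windows commands below are illustrative assumptions, not the patch.
import org.apache.hadoop.util.Shell;

public class PlatformCommands {

  /** Command line that launches a container script on the current platform. */
  public static String[] getRunCommand(String script) {
    return Shell.WINDOWS
        ? new String[] {"cmd", "/c", script}
        : new String[] {"bash", script};
  }

  /** Command line that checks whether a process with the given pid is alive. */
  public static String[] getCheckProcessAliveCommand(String pid) {
    return Shell.WINDOWS
        ? new String[] {"cmd", "/c", "tasklist /FI \"PID eq " + pid + "\""}
        : new String[] {"kill", "-0", pid};
  }
}
{code}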

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-101) If the heartbeat message loss, the nodestatus info of complete container will loss too.

2013-03-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616856#comment-13616856
 ] 

Hadoop QA commented on YARN-101:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12575823/YARN-101.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 eclipse:eclipse{color}.  The patch failed to build with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestContainerLocalizer
  
org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/622//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/622//console

This message is automatically generated.

 If  the heartbeat message loss, the nodestatus info of complete container 
 will loss too.
 

 Key: YARN-101
 URL: https://issues.apache.org/jira/browse/YARN-101
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: suse.
Reporter: xieguiming
Assignee: Xuan Gong
Priority: Minor
 Attachments: YARN-101.1.patch, YARN-101.2.patch, YARN-101.3.patch


 see the red color:
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.java
 protected void startStatusUpdater() {
   new Thread("Node Status Updater") {
     @Override
     @SuppressWarnings("unchecked")
     public void run() {
       int lastHeartBeatID = 0;
       while (!isStopped) {
         // Send heartbeat
         try {
           synchronized (heartbeatMonitor) {
             heartbeatMonitor.wait(heartBeatInterval);
           }
 {color:red}
           // Before we send the heartbeat, we get the NodeStatus,
           // whose method removes completed containers.
           NodeStatus nodeStatus = getNodeStatus();
 {color}
           nodeStatus.setResponseId(lastHeartBeatID);

           NodeHeartbeatRequest request = recordFactory
               .newRecordInstance(NodeHeartbeatRequest.class);
           request.setNodeStatus(nodeStatus);
 {color:red}
           // But if the nodeHeartbeat fails, we have already removed the
           // completed containers, so the RM never learns about them. The
           // nodeHeartbeat failure case is not handled here.
           HeartbeatResponse response =
               resourceTracker.nodeHeartbeat(request).getHeartbeatResponse();
 {color}
           if (response.getNodeAction() == NodeAction.SHUTDOWN) {
             LOG.info("Recieved SHUTDOWN signal from Resourcemanager as "
                 + "part of heartbeat, hence shutting down.");
             NodeStatusUpdaterImpl.this.stop();
             break;
           }
           if (response.getNodeAction() == NodeAction.REBOOT) {
             LOG.info("Node is out of sync with ResourceManager,"
                 + " hence rebooting.");
             NodeStatusUpdaterImpl.this.reboot();
             break;
           }
           lastHeartBeatID = response.getResponseId();
           List<ContainerId> containersToCleanup = response
               .getContainersToCleanupList();
           if (containersToCleanup.size() != 0) {
             dispatcher.getEventHandler().handle(
                 new CMgrCompletedContainersEvent(containersToCleanup));
           }
           List<ApplicationId> appsToCleanup =
               response.getApplicationsToCleanupList();
           // Only start tracking for keepAlive on FINISH_APP
           trackAppsForKeepAlive(appsToCleanup);
           if (appsToCleanup.size() != 0) {
             dispatcher.getEventHandler().handle(
                 new CMgrCompletedAppsEvent(appsToCleanup));
 

[jira] [Updated] (YARN-309) Make RM provide heartbeat interval to NM

2013-03-28 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-309:
---

Attachment: YARN-309.7.patch

1. Upload the new patch based on the latest trunk version.
2. Fix the compile error.

 Make RM provide heartbeat interval to NM
 

 Key: YARN-309
 URL: https://issues.apache.org/jira/browse/YARN-309
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-309.1.patch, YARN-309.2.patch, YARN-309.3.patch, 
 YARN-309.4.patch, YARN-309.5.patch, YARN-309.6.patch, YARN-309.7.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-101) If the heartbeat message loss, the nodestatus info of complete container will loss too.

2013-03-28 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-101:
---

Attachment: YARN-101.4.patch

Fix testcase failure

 If  the heartbeat message loss, the nodestatus info of complete container 
 will loss too.
 

 Key: YARN-101
 URL: https://issues.apache.org/jira/browse/YARN-101
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: suse.
Reporter: xieguiming
Assignee: Xuan Gong
Priority: Minor
 Attachments: YARN-101.1.patch, YARN-101.2.patch, YARN-101.3.patch, 
 YARN-101.4.patch


 see the red color:
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.java
 protected void startStatusUpdater() {
   new Thread("Node Status Updater") {
     @Override
     @SuppressWarnings("unchecked")
     public void run() {
       int lastHeartBeatID = 0;
       while (!isStopped) {
         // Send heartbeat
         try {
           synchronized (heartbeatMonitor) {
             heartbeatMonitor.wait(heartBeatInterval);
           }
 {color:red}
           // Before we send the heartbeat, we get the NodeStatus,
           // whose method removes completed containers.
           NodeStatus nodeStatus = getNodeStatus();
 {color}
           nodeStatus.setResponseId(lastHeartBeatID);

           NodeHeartbeatRequest request = recordFactory
               .newRecordInstance(NodeHeartbeatRequest.class);
           request.setNodeStatus(nodeStatus);
 {color:red}
           // But if the nodeHeartbeat fails, we have already removed the
           // completed containers, so the RM never learns about them. The
           // nodeHeartbeat failure case is not handled here.
           HeartbeatResponse response =
               resourceTracker.nodeHeartbeat(request).getHeartbeatResponse();
 {color}
           if (response.getNodeAction() == NodeAction.SHUTDOWN) {
             LOG.info("Recieved SHUTDOWN signal from Resourcemanager as "
                 + "part of heartbeat, hence shutting down.");
             NodeStatusUpdaterImpl.this.stop();
             break;
           }
           if (response.getNodeAction() == NodeAction.REBOOT) {
             LOG.info("Node is out of sync with ResourceManager,"
                 + " hence rebooting.");
             NodeStatusUpdaterImpl.this.reboot();
             break;
           }
           lastHeartBeatID = response.getResponseId();
           List<ContainerId> containersToCleanup = response
               .getContainersToCleanupList();
           if (containersToCleanup.size() != 0) {
             dispatcher.getEventHandler().handle(
                 new CMgrCompletedContainersEvent(containersToCleanup));
           }
           List<ApplicationId> appsToCleanup =
               response.getApplicationsToCleanupList();
           // Only start tracking for keepAlive on FINISH_APP
           trackAppsForKeepAlive(appsToCleanup);
           if (appsToCleanup.size() != 0) {
             dispatcher.getEventHandler().handle(
                 new CMgrCompletedAppsEvent(appsToCleanup));
           }
         } catch (Throwable e) {
           // TODO Better error handling. Thread can die with the rest of the
           // NM still running.
           LOG.error("Caught exception in status-updater", e);
         }
       }
     }
   }.start();
 }

 private NodeStatus getNodeStatus() {
   NodeStatus nodeStatus = recordFactory.newRecordInstance(NodeStatus.class);
   nodeStatus.setNodeId(this.nodeId);
   int numActiveContainers = 0;
   List<ContainerStatus> containersStatuses = new ArrayList<ContainerStatus>();
   for (Iterator<Entry<ContainerId, Container>> i =
       this.context.getContainers().entrySet().iterator(); i.hasNext();) {
     Entry<ContainerId, Container> e = i.next();
     ContainerId containerId = e.getKey();
     Container container = e.getValue();
     // Clone the container to send it to the RM
     org.apache.hadoop.yarn.api.records.ContainerStatus containerStatus =
         container.cloneAndGetContainerStatus();
     containersStatuses.add(containerStatus);
     ++numActiveContainers;
     LOG.info("Sending out status for container: " + containerStatus);
 {color:red}
     // Here is the part that removes the completed containers.
     if (containerStatus.getState() == ContainerState.COMPLETE) {
       // Remove
       i.remove();
 {color}
       LOG.info("Removed completed container " + containerId);
     }
   }
   nodeStatus.setContainersStatuses(containersStatuses);
   LOG.debug(this.nodeId + " sending out status 
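
 To make the problem in the quoted code concrete, here is a self-contained sketch (not the YARN-101 patch, and not the NodeManager's real types) of the direction being discussed: report completed containers in the heartbeat, but only forget them once the heartbeat is known to have succeeded. All class, field, and method names here are hypothetical.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: completed work is remembered until the heartbeat that
// reported it has been acknowledged, so a failed RPC loses nothing.
public class HeartbeatReporter {
  // containerId -> last known state ("RUNNING" or "COMPLETE")
  private final Map<String, String> containers = new HashMap<String, String>();
  // completed containers reported in the in-flight heartbeat
  private final List<String> reportedCompleted = new ArrayList<String>();

  /** Build the status payload without removing anything yet. */
  public List<String> buildStatuses() {
    List<String> statuses = new ArrayList<String>();
    reportedCompleted.clear();
    for (Map.Entry<String, String> e : containers.entrySet()) {
      statuses.add(e.getKey() + ":" + e.getValue());
      if ("COMPLETE".equals(e.getValue())) {
        reportedCompleted.add(e.getKey());  // remember, but do not remove
      }
    }
    return statuses;
  }

  /** Called only after the RM acknowledged the heartbeat. */
  public void onHeartbeatSuccess() {
    for (String id : reportedCompleted) {
      containers.remove(id);
    }
    reportedCompleted.clear();
  }

  /** Called when the heartbeat RPC failed: keep everything for a resend. */
  public void onHeartbeatFailure() {
    reportedCompleted.clear();
  }
}
{code}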

[jira] [Updated] (YARN-193) Scheduler.normalizeRequest does not account for allocation requests that exceed maximumAllocation limits

2013-03-28 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-193:
-

Attachment: YARN-193.9.patch

Merge against the latest trunk, and replace the newly introduced * with 
ResourceRequest.ANY, as YARN-450 has been committed.

 Scheduler.normalizeRequest does not account for allocation requests that 
 exceed maximumAllocation limits 
 -

 Key: YARN-193
 URL: https://issues.apache.org/jira/browse/YARN-193
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.2-alpha, 3.0.0
Reporter: Hitesh Shah
Assignee: Zhijie Shen
 Attachments: MR-3796.1.patch, MR-3796.2.patch, MR-3796.3.patch, 
 MR-3796.wip.patch, YARN-193.4.patch, YARN-193.5.patch, YARN-193.6.patch, 
 YARN-193.7.patch, YARN-193.8.patch, YARN-193.9.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-392) Make it possible to schedule to specific nodes without dropping locality

2013-03-28 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616917#comment-13616917
 ] 

Sandy Ryza commented on YARN-392:
-

Uploading a patch based on the earlier discussion here and on YARN-398.  The 
patch adds a boolean flag to each resource request which essentially means 
"don't schedule using this resource request or any above it", and adds support 
for it to the fair scheduler.  I call the flag "noAllocateAt", but we could 
definitely use a better name if anybody has suggestions.  I didn't use 
"blacklist" because it already has a meaning in the context of mapreduce, and 
to me it seems to imply that a blacklisted rack would not allow any containers to 
be scheduled anywhere on it, when the meaning here is a little different.
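
For illustration only, a rough sketch of what such a flag could look like on a request object; this is not the API in the attached patch, and the class and field names are hypothetical stand-ins for the real YARN records.
{code}
// Sketch only: hypothetical request record carrying the proposed flag.
public class LocalityRequest {
  private final String resourceName;   // a node, a rack, or "*"
  private final int numContainers;
  // When true, the scheduler must not allocate containers from this request
  // or from any less-specific (rack/"*") request above it.
  private final boolean noAllocateAt;

  public LocalityRequest(String resourceName, int numContainers,
      boolean noAllocateAt) {
    this.resourceName = resourceName;
    this.numContainers = numContainers;
    this.noAllocateAt = noAllocateAt;
  }

  public String getResourceName() { return resourceName; }
  public int getNumContainers() { return numContainers; }
  public boolean isNoAllocateAt() { return noAllocateAt; }
}
{code}
A scheduler honoring the flag would simply skip assignment at any level whose request has it set, instead of relaxing locality upward.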

 Make it possible to schedule to specific nodes without dropping locality
 

 Key: YARN-392
 URL: https://issues.apache.org/jira/browse/YARN-392
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Sandy Ryza
 Attachments: YARN-392-1.patch, YARN-392.patch


 Currently it's not possible to specify scheduling requests for specific nodes 
 and nowhere else. The RM automatically relaxes locality to rack and * and 
 assigns non-specified machines to the app.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-392) Make it possible to schedule to specific nodes without dropping locality

2013-03-28 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-392:


Attachment: YARN-392-1.patch

 Make it possible to schedule to specific nodes without dropping locality
 

 Key: YARN-392
 URL: https://issues.apache.org/jira/browse/YARN-392
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Sandy Ryza
 Attachments: YARN-392-1.patch, YARN-392.patch


 Currently it's not possible to specify scheduling requests for specific nodes 
 and nowhere else. The RM automatically relaxes locality to rack and * and 
 assigns non-specified machines to the app.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-467) Jobs fail during resource localization when public distributed-cache hits unix directory limits

2013-03-28 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-467:
---

Attachment: yarn-467-20130328.patch

Incorporating the comments.

 Jobs fail during resource localization when public distributed-cache hits 
 unix directory limits
 ---

 Key: YARN-467
 URL: https://issues.apache.org/jira/browse/YARN-467
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0, 2.0.0-alpha
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
 Attachments: yarn-467-20130322.1.patch, yarn-467-20130322.2.patch, 
 yarn-467-20130322.3.patch, yarn-467-20130322.patch, 
 yarn-467-20130325.1.patch, yarn-467-20130325.path, yarn-467-20130328.patch


 If we have multiple jobs which use the distributed cache with many small 
 files, the directory limit is reached before the cache size limit, and localization 
 fails to create any new directories in the file cache (PUBLIC). The jobs start 
 failing with the below exception.
 java.io.IOException: mkdir of /tmp/nm-local-dir/filecache/3901886847734194975 
 failed
   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909)
   at 
 org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
   at 
 org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 We need to have a mechanism wherein we can create a directory hierarchy and 
 limit the number of files per directory.
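
As a purely illustrative aside (not the attached patch), one way to build such a hierarchy is to derive a nested relative path from a monotonically increasing cache id so that no single directory accumulates more than a bounded number of entries. The class name and the per-directory cap below are assumptions.
{code}
// Sketch only: map a cache id onto a nested relative path with bounded fan-out.
public class CacheDirectoryMapper {
  private static final int PER_DIR = 36;  // assumed per-directory cap

  /** Returns a relative path such as "1/5/1500" for the given cache id. */
  public static String getRelativePath(long cacheId) {
    StringBuilder path = new StringBuilder(Long.toString(cacheId));
    long bucket = cacheId / PER_DIR;
    while (bucket > 0) {
      path.insert(0, (bucket % PER_DIR) + "/");
      bucket /= PER_DIR;
    }
    return path.toString();
  }

  public static void main(String[] args) {
    System.out.println(getRelativePath(3L));     // 3
    System.out.println(getRelativePath(40L));    // 1/40
    System.out.println(getRelativePath(1500L));  // 1/5/1500
  }
}
{code}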

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-475) Remove ApplicationConstants.AM_APP_ATTEMPT_ID_ENV as it is no longer set in an AM's environment

2013-03-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616969#comment-13616969
 ] 

Hadoop QA commented on YARN-475:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12575966/YARN-475.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 eclipse:eclipse{color}.  The patch failed to build with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/623//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/623//console

This message is automatically generated.

 Remove ApplicationConstants.AM_APP_ATTEMPT_ID_ENV as it is no longer set in 
 an AM's environment
 ---

 Key: YARN-475
 URL: https://issues.apache.org/jira/browse/YARN-475
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Hitesh Shah
 Attachments: YARN-475.1.patch


 AMs are expected to use ApplicationConstants.AM_CONTAINER_ID_ENV and derive 
 the application attempt id from the container id. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-193) Scheduler.normalizeRequest does not account for allocation requests that exceed maximumAllocation limits

2013-03-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616980#comment-13616980
 ] 

Hadoop QA commented on YARN-193:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12575991/YARN-193.9.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/625//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/625//console

This message is automatically generated.

 Scheduler.normalizeRequest does not account for allocation requests that 
 exceed maximumAllocation limits 
 -

 Key: YARN-193
 URL: https://issues.apache.org/jira/browse/YARN-193
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.2-alpha, 3.0.0
Reporter: Hitesh Shah
Assignee: Zhijie Shen
 Attachments: MR-3796.1.patch, MR-3796.2.patch, MR-3796.3.patch, 
 MR-3796.wip.patch, YARN-193.4.patch, YARN-193.5.patch, YARN-193.6.patch, 
 YARN-193.7.patch, YARN-193.8.patch, YARN-193.9.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-101) If the heartbeat message loss, the nodestatus info of complete container will loss too.

2013-03-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616982#comment-13616982
 ] 

Hadoop QA commented on YARN-101:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12575989/YARN-101.4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 eclipse:eclipse{color}.  The patch failed to build with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestContainerLocalizer

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/626//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/626//console

This message is automatically generated.

 If  the heartbeat message loss, the nodestatus info of complete container 
 will loss too.
 

 Key: YARN-101
 URL: https://issues.apache.org/jira/browse/YARN-101
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: suse.
Reporter: xieguiming
Assignee: Xuan Gong
Priority: Minor
 Attachments: YARN-101.1.patch, YARN-101.2.patch, YARN-101.3.patch, 
 YARN-101.4.patch


 see the red color:
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.java
 protected void startStatusUpdater() {
   new Thread("Node Status Updater") {
     @Override
     @SuppressWarnings("unchecked")
     public void run() {
       int lastHeartBeatID = 0;
       while (!isStopped) {
         // Send heartbeat
         try {
           synchronized (heartbeatMonitor) {
             heartbeatMonitor.wait(heartBeatInterval);
           }
 {color:red}
           // Before we send the heartbeat, we get the NodeStatus,
           // whose method removes completed containers.
           NodeStatus nodeStatus = getNodeStatus();
 {color}
           nodeStatus.setResponseId(lastHeartBeatID);

           NodeHeartbeatRequest request = recordFactory
               .newRecordInstance(NodeHeartbeatRequest.class);
           request.setNodeStatus(nodeStatus);
 {color:red}
           // But if the nodeHeartbeat fails, we have already removed the
           // completed containers, so the RM never learns about them. The
           // nodeHeartbeat failure case is not handled here.
           HeartbeatResponse response =
               resourceTracker.nodeHeartbeat(request).getHeartbeatResponse();
 {color}
           if (response.getNodeAction() == NodeAction.SHUTDOWN) {
             LOG.info("Recieved SHUTDOWN signal from Resourcemanager as "
                 + "part of heartbeat, hence shutting down.");
             NodeStatusUpdaterImpl.this.stop();
             break;
           }
           if (response.getNodeAction() == NodeAction.REBOOT) {
             LOG.info("Node is out of sync with ResourceManager,"
                 + " hence rebooting.");
             NodeStatusUpdaterImpl.this.reboot();
             break;
           }
           lastHeartBeatID = response.getResponseId();
           List<ContainerId> containersToCleanup = response
               .getContainersToCleanupList();
           if (containersToCleanup.size() != 0) {
             dispatcher.getEventHandler().handle(
                 new CMgrCompletedContainersEvent(containersToCleanup));
           }
           List<ApplicationId> appsToCleanup =
               response.getApplicationsToCleanupList();
           // Only start tracking for keepAlive on FINISH_APP
           trackAppsForKeepAlive(appsToCleanup);
           if (appsToCleanup.size() != 0) {
             dispatcher.getEventHandler().handle(
                 new CMgrCompletedAppsEvent(appsToCleanup));
           }
         } catch (Throwable e) {
 

[jira] [Commented] (YARN-309) Make RM provide heartbeat interval to NM

2013-03-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616989#comment-13616989
 ] 

Hadoop QA commented on YARN-309:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12575985/YARN-309.7.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerReboot
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestContainerLocalizer

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/624//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/624//console

This message is automatically generated.

 Make RM provide heartbeat interval to NM
 

 Key: YARN-309
 URL: https://issues.apache.org/jira/browse/YARN-309
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-309.1.patch, YARN-309.2.patch, YARN-309.3.patch, 
 YARN-309.4.patch, YARN-309.5.patch, YARN-309.6.patch, YARN-309.7.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-467) Jobs fail during resource localization when public distributed-cache hits unix directory limits

2013-03-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616995#comment-13616995
 ] 

Hadoop QA commented on YARN-467:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12576003/yarn-467-20130328.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestContainerLocalizer

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/627//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/627//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/627//console

This message is automatically generated.

 Jobs fail during resource localization when public distributed-cache hits 
 unix directory limits
 ---

 Key: YARN-467
 URL: https://issues.apache.org/jira/browse/YARN-467
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0, 2.0.0-alpha
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
 Attachments: yarn-467-20130322.1.patch, yarn-467-20130322.2.patch, 
 yarn-467-20130322.3.patch, yarn-467-20130322.patch, 
 yarn-467-20130325.1.patch, yarn-467-20130325.path, yarn-467-20130328.patch


 If we have multiple jobs which use the distributed cache with many small 
 files, the directory limit is reached before the cache size limit, and localization 
 fails to create any new directories in the file cache (PUBLIC). The jobs start 
 failing with the below exception.
 java.io.IOException: mkdir of /tmp/nm-local-dir/filecache/3901886847734194975 
 failed
   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909)
   at 
 org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
   at 
 org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 We need to have a mechanism wherein we can create a directory hierarchy and 
 limit the number of files per directory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-276) Capacity Scheduler can hang when submit many jobs concurrently

2013-03-28 Thread nemon lou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617008#comment-13617008
 ] 

nemon lou commented on YARN-276:


[~zjshen]
Yes, a dynamic maxActiveApplications will work, too, and there is no need to add 
any new criteria. I'll give it a try.
Thanks.
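
For illustration only, a small sketch of the arithmetic behind that idea; this is not the CapacityScheduler code or the eventual patch, and the names and sample numbers below are assumptions.
{code}
// Sketch only: compare the current cap (derived from the minimum allocation)
// with a dynamic cap derived from the memory an AM container actually requests.
public class AmLimit {

  /** Cap as computed today: based on the scheduler's minimum allocation. */
  public static int maxActiveAppsFromMinAlloc(long clusterMemMb,
      double maxAmPercent, long minAllocMb) {
    return (int) Math.max(1, Math.ceil(clusterMemMb * maxAmPercent / minAllocMb));
  }

  /** Dynamic cap: based on the memory each AM container actually requests. */
  public static int maxActiveAppsFromAmSize(long clusterMemMb,
      double maxAmPercent, long amMemoryPerAppMb) {
    return (int) Math.max(1, Math.ceil(clusterMemMb * maxAmPercent / amMemoryPerAppMb));
  }

  public static void main(String[] args) {
    // 40 GB cluster, 10% for AMs, 1 GB minimum allocation, 2 GB AM containers:
    // the static cap admits 4 concurrent AMs, the dynamic cap only 2.
    System.out.println(maxActiveAppsFromMinAlloc(40960, 0.1, 1024));  // 4
    System.out.println(maxActiveAppsFromAmSize(40960, 0.1, 2048));    // 2
  }
}
{code}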


 Capacity Scheduler can hang when submit many jobs concurrently
 --

 Key: YARN-276
 URL: https://issues.apache.org/jira/browse/YARN-276
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 3.0.0, 2.0.1-alpha
Reporter: nemon lou
 Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, 
 YARN-276.patch, YARN-276.patch, YARN-276.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 In hadoop 2.0.1, when I submit many jobs concurrently at the same time, the 
 Capacity Scheduler can hang with most resources taken up by AMs, leaving not 
 enough resources for tasks; all applications then hang there.
 The cause is that yarn.scheduler.capacity.maximum-am-resource-percent is not 
 checked directly. Instead, this property is only used to compute 
 maxActiveApplications, and maxActiveApplications is computed from 
 minimumAllocation (not from what the AMs actually use).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-493) NodeManager job control logic flaws on Windows

2013-03-28 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617063#comment-13617063
 ] 

Chris Nauroth commented on YARN-493:


The test failure is unrelated.  I suspect it was introduced in the patch for 
HADOOP-9357.  I've added comments on that issue to discuss.

 NodeManager job control logic flaws on Windows
 --

 Key: YARN-493
 URL: https://issues.apache.org/jira/browse/YARN-493
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Chris Nauroth
Assignee: Chris Nauroth
 Fix For: 3.0.0

 Attachments: YARN-493.1.patch, YARN-493.2.patch


 Both product and test code contain some platform-specific assumptions, such 
 as availability of bash for executing a command in a container and signals to 
 check existence of a process and terminate it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (YARN-276) Capacity Scheduler can hang when submit many jobs concurrently

2013-03-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli reassigned YARN-276:


Assignee: nemon lou

Assigning this to [~nemon].

 Capacity Scheduler can hang when submit many jobs concurrently
 --

 Key: YARN-276
 URL: https://issues.apache.org/jira/browse/YARN-276
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 3.0.0, 2.0.1-alpha
Reporter: nemon lou
Assignee: nemon lou
 Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, 
 YARN-276.patch, YARN-276.patch, YARN-276.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 In hadoop 2.0.1, when I submit many jobs concurrently at the same time, the 
 Capacity Scheduler can hang with most resources taken up by AMs, leaving not 
 enough resources for tasks; all applications then hang there.
 The cause is that yarn.scheduler.capacity.maximum-am-resource-percent is not 
 checked directly. Instead, this property is only used to compute 
 maxActiveApplications, and maxActiveApplications is computed from 
 minimumAllocation (not from what the AMs actually use).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-516) TestContainerLocalizer.testContainerLocalizerMain is failing

2013-03-28 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-516:


 Summary: TestContainerLocalizer.testContainerLocalizerMain is 
failing
 Key: YARN-516
 URL: https://issues.apache.org/jira/browse/YARN-516
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-516) TestContainerLocalizer.testContainerLocalizerMain is failing

2013-03-28 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617102#comment-13617102
 ] 

Vinod Kumar Vavilapalli commented on YARN-516:
--

It is failing with the following:
{code}Argument(s) are different! Wanted:
localFs.mkdir(

file:/home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestContainerLocalizer/0/usercache/yak/filecache,
isA(org.apache.hadoop.fs.permission.FsPermission),
false
);
- at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestContainerLocalizer.testContainerLocalizerMain(TestContainerLocalizer.java:139)
Actual invocation has different arguments:
localFs.mkdir(

/home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestContainerLocalizer/0/usercache/yak/filecache,
rwxr-xr-x,
false
);
- at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestContainerLocalizer.testContainerLocalizerMain(TestContainerLocalizer.java:132)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestContainerLocalizer.testContainerLocalizerMain(TestContainerLocalizer.java:139)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at 
org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
at 
org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
at 
org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
{code}

 TestContainerLocalizer.testContainerLocalizerMain is failing
 

 Key: YARN-516
 URL: https://issues.apache.org/jira/browse/YARN-516
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators