[jira] [Commented] (YARN-244) Application Master Retries fail due to FileNotFoundException

2012-11-26 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503832#comment-13503832
 ] 

Jason Lowe commented on YARN-244:
---------------------------------

Could you provide a bit more detail from the AM logs when this occurs?  I'm not 
able to reproduce this with a sleep job and manually killing the AM to simulate 
failure.  Normally the AM tries to determine if it is the last attempt and only 
deletes the files if it is convinced there will be no more attempts.  If you 
could provide steps to reproduce, or details from the AM logs showing why it 
decided to remove the staging directory, that would help clarify what's going 
on in this case.
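In sketch form, that guard looks something like the following; this is a minimal illustration with hypothetical names (the real AM deletes the staging directory on HDFS through FileSystem), not the actual MRAppMaster code:

{code:java}
// Minimal sketch with hypothetical names; the real AM works against HDFS via
// FileSystem, and the exact checks live in the MapReduce AM, not here.
public class StagingCleanupSketch {
  private final int attemptId;   // this AM attempt's number
  private final int maxAttempts; // configured maximum number of AM attempts

  public StagingCleanupSketch(int attemptId, int maxAttempts) {
    this.attemptId = attemptId;
    this.maxAttempts = maxAttempts;
  }

  /** Delete a staging file only when no further attempt can follow. */
  public void cleanupIfLastAttempt(java.nio.file.Path stagingFile)
      throws java.io.IOException {
    boolean isLastAttempt = attemptId >= maxAttempts;
    if (!isLastAttempt) {
      return; // a later retry still needs this file (e.g. appTokens)
    }
    java.nio.file.Files.deleteIfExists(stagingFile);
  }
}
{code}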

 Application Master Retries fail due to FileNotFoundException
 ------------------------------------------------------------

 Key: YARN-244
 URL: https://issues.apache.org/jira/browse/YARN-244
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Affects Versions: 2.0.2-alpha, 2.0.1-alpha
Reporter: Devaraj K
Assignee: Devaraj K
Priority: Blocker

 Application attempt 1 is deleting the job-related files, so they are not 
 present in HDFS for the following retries.
 {code:xml}
 Application application_1353724754961_0001 failed 4 times due to AM Container for appattempt_1353724754961_0001_04 exited with exitCode: -1000 due to: RemoteTrace:
 java.io.FileNotFoundException: File does not exist: hdfs://hacluster:8020/tmp/hadoop-yarn/staging/mapred/.staging/job_1353724754961_0001/appTokens
     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:752)
     at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:88)
     at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
     at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
     at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
     at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
     at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
     at java.lang.Thread.run(Thread.java:662)
 at LocalTrace:
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: File does not exist: hdfs://hacluster:8020/tmp/hadoop-yarn/staging/mapred/.staging/job_1353724754961_0001/appTokens
     at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
     at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:822)
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:492)
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:221)
     at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
     at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:924)
     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
 .Failing this attempt.. Failing the application. 
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (YARN-241) Node Manager fails to launch containers after NM restart in secure mode

2012-11-26 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503836#comment-13503836
 ] 

Daryn Sharp commented on YARN-241:
----------------------------------

Is this maybe caused by a race condition where the NM receives a container 
token before the RM registration completes and delivers the secret keys for 
the container tokens?
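If that is the race, the missing guard would be roughly the following sketch; every name here is illustrative, not the actual NM or ContainerTokenSecretManager code:

{code:java}
// Illustrative sketch of the suspected ordering problem: if a container token
// arrives before RM registration has delivered the shared secret, there is no
// valid key for the HMAC computation.
public class MasterKeyHolder {
  private volatile byte[] masterKey; // set from the RM registration response

  public void setMasterKey(byte[] key) {
    this.masterKey = key.clone();
  }

  public byte[] getMasterKeyOrFail() {
    byte[] key = masterKey;
    if (key == null) {
      // Fail (or queue the request for retry) instead of running the HMAC
      // with a null/empty key, which surfaces as "Invalid key to HMAC
      // computation" on the IPC server.
      throw new IllegalStateException("NM not yet registered with RM");
    }
    return key;
  }
}
{code}

The point of the sketch is only the ordering: nothing should reach Mac.init until setMasterKey has run as part of registration.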

 Node Manager fails to launch containers after NM restart in secure mode
 ------------------------------------------------------------------------

 Key: YARN-241
 URL: https://issues.apache.org/jira/browse/YARN-241
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.2-alpha, 2.0.1-alpha
Reporter: Devaraj K
Priority: Blocker

 After restarting the Node Manager it fails to launch containers with the 
 below exception.
  {code:xml}
 2012-11-24 17:21:56,141 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8048: readAndProcess threw exception java.lang.IllegalArgumentException: Invalid key to HMAC computation from client 158.1.131.10. Count of bytes read: 0
 java.lang.IllegalArgumentException: Invalid key to HMAC computation
     at org.apache.hadoop.security.token.SecretManager.createPassword(SecretManager.java:153)
     at org.apache.hadoop.yarn.server.security.ContainerTokenSecretManager.retrievePassword(ContainerTokenSecretManager.java:109)
     at org.apache.hadoop.yarn.server.security.ContainerTokenSecretManager.retrievePassword(ContainerTokenSecretManager.java:44)
     at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:194)
     at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:220)
     at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:568)
     at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:226)
     at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1199)
     at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1393)
     at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:710)
     at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:509)
     at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:484)
 Caused by: java.security.InvalidKeyException: No installed provider supports this key: javax.crypto.spec.SecretKeySpec
     at javax.crypto.Mac.a(DashoA13*..)
     at javax.crypto.Mac.init(DashoA13*..)
     at org.apache.hadoop.security.token.SecretManager.createPassword(SecretManager.java:151)
     ... 11 more
  {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-243) Job Client doesn't give progress for Application Master Retries

2012-11-26 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503863#comment-13503863
 ] 

Jason Lowe commented on YARN-243:
---------------------------------

I tried replicating this with a sleep job and manually killing the AM to force 
AM retries.  In this case the client reconnected to the new AM attempt and 
continued to show map/reduce progress for the new attempt.

 Job Client doesn't give progress for Application Master Retries
 ---------------------------------------------------------------

 Key: YARN-243
 URL: https://issues.apache.org/jira/browse/YARN-243
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client, resourcemanager
Affects Versions: 2.0.2-alpha, 2.0.1-alpha
Reporter: Devaraj K
Assignee: Devaraj K

 If we configure AM retries and the first attempt fails, the RM will create 
 the next attempt, but the Job Client doesn't show progress for the retry 
 attempts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-241) Node Manager fails to launch containers after NM restart in secure mode

2012-11-26 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503867#comment-13503867
 ] 

Devaraj K commented on YARN-241:
--------------------------------

From my observation while debugging, the secret key is present but 
mac.init(key) fails, and it fails for all subsequent invocations. If we try a 
new Mac instance with the same secret key, it succeeds.
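That behaviour can be probed in isolation with the standard javax.crypto API; the algorithm and key bytes below are arbitrary choices for illustration:

{code:java}
import java.nio.charset.StandardCharsets;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class MacReuseProbe {
  public static void main(String[] args) throws Exception {
    SecretKeySpec key = new SecretKeySpec(
        "0123456789abcdef".getBytes(StandardCharsets.UTF_8), "HmacSHA1");
    byte[] data = "container-token".getBytes(StandardCharsets.UTF_8);

    // Shared instance: per the observation above, once mac.init(key) starts
    // throwing, every subsequent invocation on this instance fails too.
    Mac shared = Mac.getInstance("HmacSHA1");
    shared.init(key);
    System.out.println("shared: " + shared.doFinal(data).length + " bytes");

    // Fresh instance with the same key: reported to succeed even after the
    // shared instance has started failing.
    Mac fresh = Mac.getInstance("HmacSHA1");
    fresh.init(key);
    System.out.println("fresh:  " + fresh.doFinal(data).length + " bytes");
  }
}
{code}

If a shared instance is the one wedged inside the NM, recreating the Mac (or using one Mac per thread) with the same secret key would match the observation that a fresh instance succeeds.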

 Node Manager fails to launch containers after NM restart in secure mode
 ------------------------------------------------------------------------

 Key: YARN-241
 URL: https://issues.apache.org/jira/browse/YARN-241
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.2-alpha, 2.0.1-alpha
Reporter: Devaraj K
Priority: Blocker

 After restarting the Node Manager it fails to launch containers with the 
 below exception.
  {code:xml}
 2012-11-24 17:21:56,141 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8048: readAndProcess threw exception java.lang.IllegalArgumentException: Invalid key to HMAC computation from client 158.1.131.10. Count of bytes read: 0
 java.lang.IllegalArgumentException: Invalid key to HMAC computation
     at org.apache.hadoop.security.token.SecretManager.createPassword(SecretManager.java:153)
     at org.apache.hadoop.yarn.server.security.ContainerTokenSecretManager.retrievePassword(ContainerTokenSecretManager.java:109)
     at org.apache.hadoop.yarn.server.security.ContainerTokenSecretManager.retrievePassword(ContainerTokenSecretManager.java:44)
     at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:194)
     at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:220)
     at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:568)
     at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:226)
     at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1199)
     at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1393)
     at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:710)
     at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:509)
     at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:484)
 Caused by: java.security.InvalidKeyException: No installed provider supports this key: javax.crypto.spec.SecretKeySpec
     at javax.crypto.Mac.a(DashoA13*..)
     at javax.crypto.Mac.init(DashoA13*..)
     at org.apache.hadoop.security.token.SecretManager.createPassword(SecretManager.java:151)
     ... 11 more
  {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-243) Job Client doesn't give progress for Application Master Retries

2012-11-26 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503878#comment-13503878
 ] 

Devaraj K commented on YARN-243:
--------------------------------

If we kill the AM, the client connects to the new AM by getting the latest app 
report from the RM, but if the AM attempt fails (i.e., the job fails), the RM 
will start a new attempt while the client still shows the previous attempt's 
status, i.e., the Job Failed status. I think we need to handle this case on 
the client side.
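The client-side flow being described is roughly this sketch; the report fields and the RM call are simplified stand-ins, not the real client protocol:

{code:java}
public class AmReconnectSketch {
  /** Simplified stand-in for the application report returned by the RM. */
  record AppReport(String amHost, int amPort, String attemptId, String state) {}

  interface RmClient {
    AppReport getApplicationReport(String appId);
  }

  /**
   * Refresh the report and detect a newer attempt. The bug described above
   * is the missing branch: on an attempt failure, the client keeps the old
   * "Job Failed" status instead of reconnecting to the RM-started retry.
   */
  static AppReport refresh(RmClient rm, String appId, String knownAttemptId) {
    AppReport report = rm.getApplicationReport(appId);
    if (!report.attemptId().equals(knownAttemptId)) {
      // A new attempt exists: reconnect to report.amHost():report.amPort()
      // rather than surfacing the dead attempt's final status.
    }
    return report;
  }
}
{code}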

 Job Client doesn't give progress for Application Master Retries
 ---------------------------------------------------------------

 Key: YARN-243
 URL: https://issues.apache.org/jira/browse/YARN-243
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client, resourcemanager
Affects Versions: 2.0.2-alpha, 2.0.1-alpha
Reporter: Devaraj K
Assignee: Devaraj K

 If we configure AM retries and the first attempt fails, the RM will create 
 the next attempt, but the Job Client doesn't show progress for the retry 
 attempts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-237) Refreshing the RM page forgets how many rows I had in my Datatables

2012-11-26 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503890#comment-13503890
 ] 

Robert Joseph Evans commented on YARN-237:
------------------------------------------

You have to be careful with cookies because the web app proxy strips out 
cookies before sending the data to the application.

 Refreshing the RM page forgets how many rows I had in my Datatables
 -------------------------------------------------------------------

 Key: YARN-237
 URL: https://issues.apache.org/jira/browse/YARN-237
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.0.2-alpha, 0.23.4, 3.0.0
Reporter: Ravi Prakash

 If I choose a 100 rows, and then refresh the page, DataTables goes back to 
 showing me 20 rows.
 This user preference should be stored in a cookie.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-243) Job Client doesn't give progress for Application Master Retries

2012-11-26 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503896#comment-13503896
 ] 

Jason Lowe commented on YARN-243:
---------------------------------

That doesn't sound like something to fix on the client side.  If the AM told 
the client that the job failed, then the job should have failed.  The attempt 
can die between the time it tells the client the final job status and the time 
it tells the RM, and IMHO we should fix things so the subsequent AM attempt 
doesn't retry the job but rather simply updates the RM with the failed status 
found from the previous attempt.  Otherwise we run into bad situations where 
we've already told the client the job failed, but the job subsequently retries 
(possibly from scratch, depending upon the output format's support for 
recovery) and could succeed.  If the job has decided to fail and has already 
told the client, an AM attempt failure while trying to report that same 
decision to the RM shouldn't allow the job to subsequently succeed, IMHO.
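As a sketch, the startup rule being proposed is something like the following; all names are hypothetical, and the real AM would read the previous attempt's final status from its recovery/job-history data:

{code:java}
// Hypothetical names throughout; this only illustrates the ordering rule.
public class AmStartupSketch {
  enum JobStatus { NONE, SUCCEEDED, FAILED }

  interface History {
    JobStatus previousFinalStatus(); // NONE if no attempt reached a decision
  }

  interface Rm {
    void reportFinalStatus(JobStatus status);
  }

  static void start(History history, Rm rm, Runnable runJob) {
    JobStatus prior = history.previousFinalStatus();
    if (prior != JobStatus.NONE) {
      // The job already decided its fate and told the client; just make sure
      // the RM hears the same decision. Never re-run the job and risk
      // flipping an already-reported failure into a success.
      rm.reportFinalStatus(prior);
      return;
    }
    runJob.run(); // normal path: no prior decision, run (or recover) the job
  }
}
{code}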

 Job Client doesn't give progress for Application Master Retries
 ---------------------------------------------------------------

 Key: YARN-243
 URL: https://issues.apache.org/jira/browse/YARN-243
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client, resourcemanager
Affects Versions: 2.0.2-alpha, 2.0.1-alpha
Reporter: Devaraj K
Assignee: Devaraj K

 If we configure AM retries and the first attempt fails, the RM will create 
 the next attempt, but the Job Client doesn't show progress for the retry 
 attempts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-243) Job Client doesn't give progress for Application Master Retries

2012-11-26 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504011#comment-13504011
 ] 

Vinod Kumar Vavilapalli commented on YARN-243:
----------------------------------------------

Agree with Jason. We shouldn't work around it on the client side.

I think we should close this as a duplicate of MAPREDUCE-4819.

 Job Client doesn't give progress for Application Master Retries
 ---------------------------------------------------------------

 Key: YARN-243
 URL: https://issues.apache.org/jira/browse/YARN-243
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client, resourcemanager
Affects Versions: 2.0.2-alpha, 2.0.1-alpha
Reporter: Devaraj K
Assignee: Devaraj K

 If we configure AM retries and the first attempt fails, the RM will create 
 the next attempt, but the Job Client doesn't show progress for the retry 
 attempts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-224) Fair scheduler logs too many nodeUpdate INFO messages

2012-11-26 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-224:
----------------------------

Attachment: YARN-224-1.patch

 Fair scheduler logs too many nodeUpdate INFO messages
 -----------------------------------------------------

 Key: YARN-224
 URL: https://issues.apache.org/jira/browse/YARN-224
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-224-1.patch, YARN-224.patch


 The RM logs are filled with an INFO message that the fair scheduler logs 
 every time it receives a nodeUpdate.  It should be taken out or demoted to 
 debug.
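 The usual shape of such a demotion, as an illustrative stand-in rather than 
 the actual FairScheduler patch, is:

{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Illustrative stand-in, not the actual FairScheduler code: the per-heartbeat
// line moves to DEBUG and is guarded so the message string is not even built
// when DEBUG is off.
public class NodeUpdateLogging {
  private static final Log LOG = LogFactory.getLog(NodeUpdateLogging.class);

  void nodeUpdate(String nodeId) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("nodeUpdate received from " + nodeId);
    }
    // ... actual per-heartbeat scheduling work ...
  }
}
{code}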

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-224) Fair scheduler logs too many nodeUpdate INFO messages

2012-11-26 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504075#comment-13504075
 ] 

Karthik Kambatla commented on YARN-224:
---------------------------------------

+1

 Fair scheduler logs too many nodeUpdate INFO messages
 -----------------------------------------------------

 Key: YARN-224
 URL: https://issues.apache.org/jira/browse/YARN-224
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-224-1.patch, YARN-224.patch


 The RM logs are filled with an INFO message that the fair scheduler logs 
 every time it receives a nodeUpdate.  It should be taken out or demoted to 
 debug.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-72) NM should handle cleaning up containers when it shuts down ( and kill containers from an earlier instance when it comes back up after an unclean shutdown )

2012-11-26 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-72?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-72:
---------------------------

Attachment: YARN-72-1.patch

 NM should handle cleaning up containers when it shuts down ( and kill 
 containers from an earlier instance when it comes back up after an unclean 
 shutdown )
 ---------------------------------------------------------------------------

 Key: YARN-72
 URL: https://issues.apache.org/jira/browse/YARN-72
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Hitesh Shah
Assignee: Sandy Ryza
 Attachments: YARN-72-1.patch, YARN-72.patch


 Ideally, when the NM gets a shutdown signal it should wait a limited amount 
 of time for existing containers to complete, and kill the containers after 
 this time interval (if we pick an aggressive approach). 
 For NMs which come up after an unclean shutdown, the NM should look through 
 its directories for existing container.pids and try to kill any existing 
 containers matching the pids found. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-72) NM should handle cleaning up containers when it shuts down ( and kill containers from an earlier instance when it comes back up after an unclean shutdown )

2012-11-26 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-72?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504234#comment-13504234
 ] 

Sandy Ryza commented on YARN-72:
--------------------------------

The newest patch contains a test and a timeout.  The timeout is 
yarn.nodemanager.sleep-delay-before-sigkill.ms + 
yarn.nodemanager.process-kill-wait.ms + 1000.  Should I make this configurable?
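For reference, deriving that overall wait from the configuration looks roughly like this; the two keys are the ones named above, while the default values shown are assumptions for illustration only:

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only; defaults here are illustrative assumptions, not the shipped
// values.
public class ShutdownWaitSketch {
  static long cleanupWaitMs(Configuration conf) {
    long sigkillDelayMs =
        conf.getLong("yarn.nodemanager.sleep-delay-before-sigkill.ms", 250);
    long processKillWaitMs =
        conf.getLong("yarn.nodemanager.process-kill-wait.ms", 2000);
    // One extra second of slack on top of the two kill-related waits,
    // matching the sum described above.
    return sigkillDelayMs + processKillWaitMs + 1000;
  }
}
{code}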

 NM should handle cleaning up containers when it shuts down ( and kill 
 containers from an earlier instance when it comes back up after an unclean 
 shutdown )
 ---------------------------------------------------------------------------

 Key: YARN-72
 URL: https://issues.apache.org/jira/browse/YARN-72
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Hitesh Shah
Assignee: Sandy Ryza
 Attachments: YARN-72-1.patch, YARN-72.patch


 Ideally, when the NM gets a shutdown signal it should wait a limited amount 
 of time for existing containers to complete, and kill the containers after 
 this time interval (if we pick an aggressive approach). 
 For NMs which come up after an unclean shutdown, the NM should look through 
 its directories for existing container.pids and try to kill any existing 
 containers matching the pids found. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-244) Application Master Retries fail due to FileNotFoundException

2012-11-26 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504386#comment-13504386
 ] 

Bikas Saha commented on YARN-244:
---------------------------------

Did you check if AM retries were enabled to be > 1? Without that, the last 
attempt will delete the files. If the AM is being retried by the RM, then this 
value should already be > 1 though. So there could be a bug.
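A quick way to check that setting (the key name is the one used on the 2.0.x line; the default shown is an assumption for illustration):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class AmRetriesCheck {
  public static void main(String[] args) {
    // Loads yarn-site.xml if it is on the classpath.
    Configuration conf = new Configuration();
    // Default of 1 here is an illustrative assumption.
    int maxRetries = conf.getInt("yarn.resourcemanager.am.max-retries", 1);
    System.out.println("AM retries enabled: " + (maxRetries > 1)
        + " (yarn.resourcemanager.am.max-retries = " + maxRetries + ")");
  }
}
{code}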

 Application Master Retries fail due to FileNotFoundException
 ------------------------------------------------------------

 Key: YARN-244
 URL: https://issues.apache.org/jira/browse/YARN-244
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Affects Versions: 2.0.2-alpha, 2.0.1-alpha
Reporter: Devaraj K
Assignee: Devaraj K
Priority: Blocker

 Application attempt 1 is deleting the job-related files, so they are not 
 present in HDFS for the following retries.
 {code:xml}
 Application application_1353724754961_0001 failed 4 times due to AM Container for appattempt_1353724754961_0001_04 exited with exitCode: -1000 due to: RemoteTrace:
 java.io.FileNotFoundException: File does not exist: hdfs://hacluster:8020/tmp/hadoop-yarn/staging/mapred/.staging/job_1353724754961_0001/appTokens
     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:752)
     at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:88)
     at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
     at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
     at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
     at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
     at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
     at java.lang.Thread.run(Thread.java:662)
 at LocalTrace:
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: File does not exist: hdfs://hacluster:8020/tmp/hadoop-yarn/staging/mapred/.staging/job_1353724754961_0001/appTokens
     at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
     at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:822)
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:492)
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:221)
     at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
     at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:924)
     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
 .Failing this attempt.. Failing the application. 
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira