Hi Omkar,

Thanks for the quick reply, and sorry for not yet being able to get the logs you asked for.
In the meantime, I just wanted to check whether you can get a clue from the information I have now. I am seeing the following kind of error message in AppMaster.stderr whenever this failure happens. I don't know why it happens; the getProgress() that I have implemented in my RMCallbackHandler could never return a negative value! Doesn't this error mean that getProgress() is returning a negative value?

Exception in thread "AMRM Heartbeater thread" java.lang.IllegalArgumentException: Progress indicator should not be negative
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
        at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:199)
        at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)

Thanks,
Kishore

On Fri, Sep 13, 2013 at 2:59 AM, Omkar Joshi <ojo...@hortonworks.com> wrote:

> Can you give more information? Complete logs around this time frame will
> help a lot. Are the containers getting assigned via the scheduler? Is it
> failing when the node manager tries to start the container? I clearly see
> that the diagnostic message is empty, but do you see anything in the NM
> logs? Also, if there were containers running on the machine before
> launching the new ones, were they killed, or are they still hanging
> around? Can you also try applying the patch
> "https://issues.apache.org/jira/browse/YARN-1053" and check if you can
> see any message?
>
> Thanks,
> Omkar Joshi
> *Hortonworks Inc.* <http://www.hortonworks.com>
>
>
> On Thu, Sep 12, 2013 at 6:15 AM, Krishna Kishore Bonagiri <
> write2kish...@gmail.com> wrote:
>
>> Hi,
>>   I am using 2.1.0-beta and have seen container allocation failing
>> randomly, even when running the same application in a loop. I know that
>> the cluster has enough resources to give, because it gave the resources
>> for the same application all the other times in the loop and ran it
>> successfully.
>>
>> I have observed a lot of the following kind of messages in the node
>> manager's log whenever such a failure happens. Any clues as to why this
>> happens?
>>
>> 2013-09-12 08:54:36,204 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
>> 2013-09-12 08:54:37,220 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
>> 2013-09-12 08:54:38,231 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
>> 2013-09-12 08:54:39,239 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
>> 2013-09-12 08:54:40,267 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
>> 2013-09-12 08:54:41,275 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
>> 2013-09-12 08:54:42,283 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
>> 2013-09-12 08:54:43,289 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
>>
>> Thanks,
>> Kishore
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified
> that any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender
> immediately and delete it from your system. Thank You.
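[Editor's note: the stack trace above shows AMRMClientImpl.allocate() rejecting the progress value via Guava's Preconditions.checkArgument, so any negative (or, depending on the check, NaN) value returned by the callback handler's getProgress() will kill the heartbeater thread. Below is a minimal, hypothetical sketch of a defensively clamped progress computation; the class, method, and counter names are illustrative and not part of the YARN API.]

```java
// Hypothetical sketch: compute an AM progress value that is always in [0, 1],
// so the AMRM heartbeat can never send a negative progress indicator.
public class ProgressClamp {

    // completed/total are assumed counters of finished vs. requested containers.
    static float safeProgress(int completed, int total) {
        if (total <= 0) {
            // Avoid 0/0, which yields NaN and would also fail a >= 0 precondition.
            return 0f;
        }
        float p = (float) completed / total;
        // Clamp into [0, 1]: a miscounted negative or overshooting ratio is
        // pinned to the nearest legal bound instead of reaching allocate().
        return Math.max(0f, Math.min(1f, p));
    }

    public static void main(String[] args) {
        System.out.println(safeProgress(-3, 10)); // negative miscount, clamped
        System.out.println(safeProgress(5, 10));  // normal case
        System.out.println(safeProgress(12, 10)); // overshoot, clamped
        System.out.println(safeProgress(0, 0));   // no work requested yet
    }
}
```

If getProgress() in the RMCallbackHandler computes its ratio from counters that are updated on other threads, a transient inconsistency (e.g. the numerator decremented before the denominator) could briefly produce a negative value; clamping like this would mask such races at the heartbeat boundary.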