Hi,
I have run into the same problem as the one described in this thread:
http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201303.mbox/%[email protected]%3E
Any idea what causes it?
It happens almost every 3 or 4 weeks in my cluster (about 150 nodes).
I checked the log: there are no warnings, no errors, and no exceptions, but the
ResourceManager hangs rather than crashing.
I found the code below, but I have no idea why this happens. Why does the event
queue keep growing?
Thanks.
  private final class EventProcessor implements Runnable {
    @Override
    public void run() {
      SchedulerEvent event;
      while (!stopped && !Thread.currentThread().isInterrupted()) {
        try {
          event = eventQueue.take();
        } catch (InterruptedException e) {
          LOG.error("Returning, interrupted : " + e);
          return; // TODO: Kill RM.
        }
        try {
          scheduler.handle(event);
        } catch (Throwable t) {
          // An error occurred, but we are shutting down anyway.
          // If it was an InterruptedException, the very act of
          // shutdown could have caused it and is probably harmless.
          if (stopped) {
            LOG.warn("Exception during shutdown: ", t);
            break;
          }
          LOG.fatal("Error in handling event type " + event.getType()
              + " to the scheduler", t);
          if (shouldExitOnError
              && !ShutdownHookManager.get().isShutdownInProgress()) {
            LOG.info("Exiting, bbye..");
            System.exit(-1);
          }
        }
      }
    }
  }

  @Override
  protected void serviceStop() throws Exception {
    this.stopped = true;
    this.eventProcessor.interrupt();
    try {
      this.eventProcessor.join();
    } catch (InterruptedException e) {
      throw new YarnRuntimeException(e);
    }
    super.serviceStop();
  }

  @Override
  public void handle(SchedulerEvent event) {
    try {
      int qSize = eventQueue.size();
      if (qSize !=0 && qSize %1000 == 0) {
        LOG.info("Size of scheduler event-queue is " + qSize);
      }
      int remCapacity = eventQueue.remainingCapacity();
      if (remCapacity < 1000) {
        LOG.info("Very low remaining capacity on scheduler event queue: "
            + remCapacity);
      }
      this.eventQueue.put(event);
    } catch (InterruptedException e) {
      throw new YarnRuntimeException(e);
    }
  }
}
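
To illustrate the pattern in the code above, here is a minimal standalone sketch (not Hadoop code) of how a queue like this can grow without any warning or exception. It assumes eventQueue is an unbounded LinkedBlockingQueue, so put() never blocks, and it uses a latch as a hypothetical stand-in for a scheduler.handle(event) call that does not return. The class and method names are made up for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

public class StalledConsumerDemo {

    /**
     * Enqueues `total` events while the single consumer thread is stuck
     * handling the first one, and returns the backlog seen at that point.
     */
    static int backlogWhileConsumerBlocked(int total) throws InterruptedException {
        // Assumption: unbounded queue, so producers are never throttled.
        BlockingQueue<Integer> eventQueue = new LinkedBlockingQueue<>();
        CountDownLatch handlerEntered = new CountDownLatch(1);
        CountDownLatch releaseHandler = new CountDownLatch(1);

        Thread eventProcessor = new Thread(() -> {
            try {
                while (true) {
                    int event = eventQueue.take();
                    if (event < 0) {
                        return; // poison pill: stop the consumer
                    }
                    handlerEntered.countDown();
                    // Stand-in for a handler call that blocks (lock, slow I/O, ...).
                    releaseHandler.await();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        eventProcessor.start();

        eventQueue.put(0);
        handlerEntered.await(); // the consumer is now stuck inside the "handler"
        for (int i = 1; i < total; i++) {
            eventQueue.put(i); // succeeds immediately; nothing is logged as an error
        }
        int backlog = eventQueue.size();

        // Unblock the consumer and shut it down cleanly.
        releaseHandler.countDown();
        eventQueue.put(-1);
        eventProcessor.join();
        return backlog;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("backlog = " + backlogWhileConsumerBlocked(5000));
    }
}
```

In this sketch the producers keep succeeding while the one consumer thread is stuck, so the only visible symptom is the growing queue size. Whether that is what happens in the ResourceManager is an open question; a thread dump of the process (for example with jstack) would show where the consumer thread is waiting.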
logs:
grep 'Size of event-queue' yarn-hadoop-resourcemanager-gdc-hm01-formal.i.nease.net.log
2015-05-29 00:54:46,985 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1000
2015-05-29 00:55:28,850 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2000
2015-05-29 00:56:10,204 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 3000
2015-05-29 00:56:51,995 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4000
2015-05-29 00:57:33,981 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 5000
2015-05-29 00:58:15,324 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 6000
2015-05-29 00:58:57,111 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 7000
2015-05-29 00:59:38,593 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 8000
2015-05-29 01:00:20,215 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 9000
2015-05-29 01:01:00,559 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 10000
2015-05-29 01:01:39,614 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 11000
2015-05-29 01:02:21,364 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 12000
2015-05-29 01:03:03,233 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13000
2015-05-29 01:03:44,701 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 14000
2015-05-29 01:04:26,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 15000
2015-05-29 01:05:08,180 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 16000
2015-05-29 01:05:50,331 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 17000
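
As a rough check on the numbers above, the net growth rate of the queue can be computed from the first and last log lines. The sketch below is only arithmetic on the timestamps shown; the class and method names are hypothetical. Note that these lines are logged by AsyncDispatcher, so the growing queue may be the dispatcher's own event queue rather than the scheduler event queue in the code above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class QueueGrowthRate {

    // Timestamp layout used in the ResourceManager log lines quoted above.
    private static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Net queue growth, in events per second, between two log samples. */
    static double eventsPerSecond(String t0, int size0, String t1, int size1) {
        LocalDateTime start = LocalDateTime.parse(t0, LOG_TIME);
        LocalDateTime end = LocalDateTime.parse(t1, LOG_TIME);
        double seconds = Duration.between(start, end).toMillis() / 1000.0;
        return (size1 - size0) / seconds;
    }

    public static void main(String[] args) {
        // First and last samples from the grep output.
        double rate = eventsPerSecond(
            "2015-05-29 00:54:46,985", 1000,
            "2015-05-29 01:05:50,331", 17000);
        System.out.println(String.format(Locale.ROOT,
            "net growth: %.1f events/s", rate));
    }
}
```

The queue gains 16000 events in about 663 seconds, roughly 24 events per second, and each 1000-event step takes about 41 seconds. The growth is steady rather than a single burst, which would be consistent with events being consumed more slowly than they arrive (or not at all) over the whole period, though the log alone cannot tell which.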