[
https://issues.apache.org/jira/browse/YARN-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ding Yuan updated YARN-1677:
----------------------------
Description:
Hi Yarn developers,
We are a group of researchers on software reliability, and recently we did a
study and found that majority of the most severe failures in hadoop are caused
by bugs in exception handling logic. Therefore we built a simple checking tool
that automatically detects some bug patterns that have caused some very severe
failures. I am reporting some of the results for Yarn here. Any feedback is
much appreciated!
==========================
Case 1:
Line: 551, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java"
{noformat}
switch (monitoringEvent.getType()) {
case START_MONITORING_CONTAINER:
.. ..
default:
// TODO: Wrong event.
}
{noformat}
The switch fall-through (handling any potential unexpected event) is empty.
Should we at least print an error message here?
==========================
==========================
Case 2:
Line: 491, File:
"org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java"
{noformat}
} catch (Throwable e) {
// TODO Better error handling. Thread can die with the rest of the
// NM still running.
LOG.error("Caught exception in status-updater", e);
}
{noformat}
The handler of this very general exception only logs the error. The TODO seems
to indicate it is not sufficient.
==========================
==========================
Case 3:
Line: 861, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
for (LocalResourceStatus stat : remoteResourceStatuses) {
LocalResource rsrc = stat.getResource();
LocalResourceRequest req = null;
try {
req = new LocalResourceRequest(rsrc);
} catch (URISyntaxException e) {
// TODO fail? Already translated several times...
}
The handler for URISyntaxException is empty, and the TODO seems to indicate it
is not sufficient.
The same code pattern can also be found at:
Line: 901, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
Line: 838, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
Line: 878, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
At line: 803, File:
org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java,
the handler of URISyntaxException also seems not sufficient:
{noformat}
try {
shellRsrc.setResource(ConverterUtils.getYarnUrlFromURI(new URI(
shellScriptPath)));
} catch (URISyntaxException e) {
LOG.error("Error when trying to use shell script path specified"
+ " in env, path=" + shellScriptPath);
e.printStackTrace();
// A failure scenario on bad input such as invalid shell script path
// We know we cannot continue launching the container
// so we should release it.
// TODO
numCompletedContainers.incrementAndGet();
numFailedContainers.incrementAndGet();
return;
}
{noformat}
==========================
==========================
Case 4:
Line: 627, File:
"org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java"
{noformat}
try {
/* keep the master in sync with the state machine */
this.stateMachine.doTransition(event.getType(), event);
} catch (InvalidStateTransitonException e) {
LOG.error("Can't handle this event at current state", e);
/* TODO fail the application on the failed transition */
}
{noformat}
The handler of this exception only logs the error. The TODO seems to indicate
it is not sufficient.
This exact same code pattern can also be found at:
Line: 573, File:
"org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java"
==========================
==========================
Case 5: empty handler for exception: java.lang.InterruptedException
Line: 123, File: "org/apache/hadoop/yarn/server/webproxy/WebAppProxy.java"
{noformat}
public void join() {
if(proxyServer != null) {
try {
proxyServer.join();
} catch (InterruptedException e) {
}
}
}
{noformat}
The InterruptedException is completely ignored. As a result, any events causing
this interrupt will be lost.
More info on why InterruptedException shouldn't be ignored:
http://stackoverflow.com/questions/1087475/when-does-javas-thread-sleep-throw-interruptedexception
This pattern of handling InterruptedException can be found in a few other
places:
Line: 434, File:
org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
{noformat}
try {
event = eventQueue.take();
} catch (InterruptedException e) {
LOG.error("Returning, interrupted : " + e);
return; // TODO: Kill RM.
}
{noformat}
Line: 722, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
Line: 191, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java"
Line: 393, File:
"org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java"
Line: 258, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java"
==========================
==========================
Case 6: potential divide by zero
line: 581, File:
org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
{noformat}
int availableContainers =
node.getAvailableResource().getMemory() / capability.getMemory(); //
TODO: A buggy
// application
// with this
// zero would
// crash the
// scheduler.
{noformat}
Should this potential divide by zero be handled?
==========================
==========================
Case 7:
Line: 260, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java"
{noformat}
} catch (YarnException e) {
// TODO cleanup
return;
}
{noformat}
The error handler simply returns without proper clean up.
==========================
was:
Hi Yarn developers,
We are a group of researchers on software reliability, and recently we did a
study and found that majority of the most severe failures in hadoop are caused
by bugs in exception handling logic. Therefore we built a simple checking tool
that automatically detects some bug patterns that have caused some very severe
failures. I am reporting some of the results for Yarn here. Any feedback is
much appreciated!
==========================
Case 1:
Line: 551, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java"
{noformat}
switch (monitoringEvent.getType()) {
case START_MONITORING_CONTAINER:
.. ..
default:
// TODO: Wrong event.
}
{noformat}
The switch fall-through (handling any potential unexpected event) is empty.
Should we at least print an error message here?
==========================
==========================
Case 2:
Line: 491, File:
"org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java"
{noformat}
} catch (Throwable e) {
// TODO Better error handling. Thread can die with the rest of the
// NM still running.
LOG.error("Caught exception in status-updater", e);
}
{noformat}
The handler of this very general exception only logs the error. The TODO seems
to indicate it is not sufficient.
==========================
==========================
Case 3:
Line: 861, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
for (LocalResourceStatus stat : remoteResourceStatuses) {
LocalResource rsrc = stat.getResource();
LocalResourceRequest req = null;
try {
req = new LocalResourceRequest(rsrc);
} catch (URISyntaxException e) {
// TODO fail? Already translated several times...
}
The handler for URISyntaxException is empty, and the TODO seems to indicate it
is not sufficient.
The same code pattern can also be found at:
Line: 901, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
Line: 838, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
Line: 878, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
At line: 803, File:
org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java,
the handler of URISyntaxException also seems not sufficient:
{noformat}
try {
shellRsrc.setResource(ConverterUtils.getYarnUrlFromURI(new URI(
shellScriptPath)));
} catch (URISyntaxException e) {
LOG.error("Error when trying to use shell script path specified"
+ " in env, path=" + shellScriptPath);
e.printStackTrace();
// A failure scenario on bad input such as invalid shell script path
// We know we cannot continue launching the container
// so we should release it.
// TODO
numCompletedContainers.incrementAndGet();
numFailedContainers.incrementAndGet();
return;
}
{noformat}
==========================
==========================
Case 4:
Line: 627, File:
"org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java"
{noformat}
try {
/* keep the master in sync with the state machine */
this.stateMachine.doTransition(event.getType(), event);
} catch (InvalidStateTransitonException e) {
LOG.error("Can't handle this event at current state", e);
/* TODO fail the application on the failed transition */
}
{noformat}
The handler of this exception only logs the error. The TODO seems to indicate
it is not sufficient.
This exact same code pattern can also be found at:
Line: 573, File:
"org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java"
==========================
==========================
Case 5: empty handler for exception: java.lang.InterruptedException
Line: 123, File: "org/apache/hadoop/yarn/server/webproxy/WebAppProxy.java"
{noformat}
public void join() {
if(proxyServer != null) {
try {
proxyServer.join();
} catch (InterruptedException e) {
}
}
}
{noformat}
The InterruptedException is completely ignored. As a result, any events causing
this interrupt will be lost.
More info on why InterruptedException shouldn't be ignored:
http://stackoverflow.com/questions/1087475/when-does-javas-thread-sleep-throw-interruptedexception
This pattern of handling InterruptedException can be found in a few other
places:
Line: 434, File:
org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
{noformat}
try {
event = eventQueue.take();
} catch (InterruptedException e) {
LOG.error("Returning, interrupted : " + e);
return; // TODO: Kill RM.
}
{noformat}
Line: 722, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
Line: 191, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java"
Line: 393, File:
"org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java"
Line: 258, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java"
==========================
==========================
Case 6: potential divide by zero
line: 581, File:
org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
{noformat}
int availableContainers =
node.getAvailableResource().getMemory() / capability.getMemory(); //
TODO: A buggy
//
application
// with
this
// zero
would
//
crash the
//
scheduler.
{noformat}
Should this potential divide by zero be handled?
==========================
==========================
Case 7:
Line: 260, File:
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java"
{noformat}
} catch (YarnException e) {
// TODO cleanup
return;
}
{noformat}
The error handler simply returns without proper clean up.
==========================
> Potential bugs in exception handlers
> ------------------------------------
>
> Key: YARN-1677
> URL: https://issues.apache.org/jira/browse/YARN-1677
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.2.0
> Reporter: Ding Yuan
>
> Hi Yarn developers,
> We are a group of researchers on software reliability, and recently we did a
> study and found that majority of the most severe failures in hadoop are
> caused by bugs in exception handling logic. Therefore we built a simple
> checking tool that automatically detects some bug patterns that have caused
> some very severe failures. I am reporting some of the results for Yarn here.
> Any feedback is much appreciated!
> ==========================
> Case 1:
> Line: 551, File:
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java"
> {noformat}
> switch (monitoringEvent.getType()) {
> case START_MONITORING_CONTAINER:
> .. ..
> default:
> // TODO: Wrong event.
> }
> {noformat}
> The switch fall-through (handling any potential unexpected event) is empty.
> Should we at least print an error message here?
> ==========================
> ==========================
> Case 2:
> Line: 491, File:
> "org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java"
> {noformat}
> } catch (Throwable e) {
> // TODO Better error handling. Thread can die with the rest of the
> // NM still running.
> LOG.error("Caught exception in status-updater", e);
> }
> {noformat}
> The handler of this very general exception only logs the error. The TODO
> seems to indicate it is not sufficient.
> ==========================
> ==========================
> Case 3:
> Line: 861, File:
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
> for (LocalResourceStatus stat : remoteResourceStatuses) {
> LocalResource rsrc = stat.getResource();
> LocalResourceRequest req = null;
> try {
> req = new LocalResourceRequest(rsrc);
> } catch (URISyntaxException e) {
> // TODO fail? Already translated several times...
> }
> The handler for URISyntaxException is empty, and the TODO seems to indicate
> it is not sufficient.
> The same code pattern can also be found at:
> Line: 901, File:
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
> Line: 838, File:
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
> Line: 878, File:
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
> At line: 803, File:
> org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java,
> the handler of URISyntaxException also seems not sufficient:
> {noformat}
> try {
> shellRsrc.setResource(ConverterUtils.getYarnUrlFromURI(new URI(
> shellScriptPath)));
> } catch (URISyntaxException e) {
> LOG.error("Error when trying to use shell script path specified"
> + " in env, path=" + shellScriptPath);
> e.printStackTrace();
> // A failure scenario on bad input such as invalid shell script path
> // We know we cannot continue launching the container
> // so we should release it.
> // TODO
> numCompletedContainers.incrementAndGet();
> numFailedContainers.incrementAndGet();
> return;
> }
> {noformat}
> ==========================
> ==========================
> Case 4:
> Line: 627, File:
> "org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java"
> {noformat}
> try {
> /* keep the master in sync with the state machine */
> this.stateMachine.doTransition(event.getType(), event);
> } catch (InvalidStateTransitonException e) {
> LOG.error("Can't handle this event at current state", e);
> /* TODO fail the application on the failed transition */
> }
> {noformat}
> The handler of this exception only logs the error. The TODO seems to indicate
> it is not sufficient.
> This exact same code pattern can also be found at:
> Line: 573, File:
> "org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java"
> ==========================
> ==========================
> Case 5: empty handler for exception: java.lang.InterruptedException
> Line: 123, File: "org/apache/hadoop/yarn/server/webproxy/WebAppProxy.java"
> {noformat}
> public void join() {
> if(proxyServer != null) {
> try {
> proxyServer.join();
> } catch (InterruptedException e) {
> }
> }
> }
> {noformat}
> The InterruptedException is completely ignored. As a result, any events
> causing this interrupt will be lost.
> More info on why InterruptedException shouldn't be ignored:
> http://stackoverflow.com/questions/1087475/when-does-javas-thread-sleep-throw-interruptedexception
> This pattern of handling InterruptedException can be found in a few other
> places:
> Line: 434, File:
> org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
> {noformat}
> try {
> event = eventQueue.take();
> } catch (InterruptedException e) {
> LOG.error("Returning, interrupted : " + e);
> return; // TODO: Kill RM.
> }
> {noformat}
> Line: 722, File:
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
> Line: 191, File:
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java"
> Line: 393, File:
> "org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java"
> Line: 258, File:
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java"
> ==========================
> ==========================
> Case 6: potential divide by zero
> line: 581, File:
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
> {noformat}
> int availableContainers =
> node.getAvailableResource().getMemory() / capability.getMemory(); //
> TODO: A buggy
> // application
> // with this
> // zero would
> // crash the
> // scheduler.
> {noformat}
> Should this potential divide by zero be handled?
> ==========================
> ==========================
> Case 7:
> Line: 260, File:
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java"
> {noformat}
> } catch (YarnException e) {
> // TODO cleanup
> return;
> }
> {noformat}
> The error handler simply returns without proper clean up.
> ==========================
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)