[ https://issues.apache.org/jira/browse/YARN-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ding Yuan updated YARN-1677: ---------------------------- Attachment: yarn-1677.patch > Potential bugs in exception handlers > ------------------------------------ > > Key: YARN-1677 > URL: https://issues.apache.org/jira/browse/YARN-1677 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager > Affects Versions: 2.2.0 > Reporter: Ding Yuan > Attachments: yarn-1677.patch > > > Hi Yarn developers, > We are a group of researchers on software reliability, and recently we did a > study and found that majority of the most severe failures in hadoop are > caused by bugs in exception handling logic. Therefore we built a simple > checking tool that automatically detects some bug patterns that have caused > some very severe failures. I am reporting some of the results for Yarn here. > Any feedback is much appreciated! > ========================== > Case 1: > Line: 551, File: > "org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java" > {noformat} > switch (monitoringEvent.getType()) { > case START_MONITORING_CONTAINER: > .. .. > default: > // TODO: Wrong event. > } > {noformat} > The switch fall-through (handling any potential unexpected event) is empty. > Should we at least print an error message here? > ========================== > ========================== > Case 2: > Line: 491, File: > "org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java" > {noformat} > } catch (Throwable e) { > // TODO Better error handling. Thread can die with the rest of the > // NM still running. > LOG.error("Caught exception in status-updater", e); > } > {noformat} > The handler of this very general exception only logs the error. The TODO > seems to indicate it is not sufficient. > ========================== > ========================== > Case 3: > Line: 861, File: > "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java" > for (LocalResourceStatus stat : remoteResourceStatuses) { > LocalResource rsrc = stat.getResource(); > LocalResourceRequest req = null; > try { > req = new LocalResourceRequest(rsrc); > } catch (URISyntaxException e) { > // TODO fail? Already translated several times... > } > The handler for URISyntaxException is empty, and the TODO seems to indicate > it is not sufficient. > The same code pattern can also be found at: > Line: 901, File: > "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java" > Line: 838, File: > "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java" > Line: 878, File: > "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java" > At line: 803, File: > org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java, > the handler of URISyntaxException also seems not sufficient: > {noformat} > try { > shellRsrc.setResource(ConverterUtils.getYarnUrlFromURI(new URI( > shellScriptPath))); > } catch (URISyntaxException e) { > LOG.error("Error when trying to use shell script path specified" > + " in env, path=" + shellScriptPath); > e.printStackTrace(); > // A failure scenario on bad input such as invalid shell script path > // We know we cannot continue launching the container > // so we should release it. > // TODO > numCompletedContainers.incrementAndGet(); > numFailedContainers.incrementAndGet(); > return; > } > {noformat} > ========================== > ========================== > Case 4: > Line: 627, File: > "org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java" > {noformat} > try { > /* keep the master in sync with the state machine */ > this.stateMachine.doTransition(event.getType(), event); > } catch (InvalidStateTransitonException e) { > LOG.error("Can't handle this event at current state", e); > /* TODO fail the application on the failed transition */ > } > {noformat} > The handler of this exception only logs the error. The TODO seems to indicate > it is not sufficient. > This exact same code pattern can also be found at: > Line: 573, File: > "org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java" > ========================== > ========================== > Case 5: empty handler for exception: java.lang.InterruptedException > Line: 123, File: "org/apache/hadoop/yarn/server/webproxy/WebAppProxy.java" > {noformat} > public void join() { > if(proxyServer != null) { > try { > proxyServer.join(); > } catch (InterruptedException e) { > } > } > } > {noformat} > The InterruptedException is completely ignored. As a result, any events > causing this interrupt will be lost. > More info on why InterruptedException shouldn't be ignored: > http://stackoverflow.com/questions/1087475/when-does-javas-thread-sleep-throw-interruptedexception > This pattern of handling InterruptedException can be found in a few other > places: > Line: 434, File: > org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java > {noformat} > try { > event = eventQueue.take(); > } catch (InterruptedException e) { > LOG.error("Returning, interrupted : " + e); > return; // TODO: Kill RM. > } > {noformat} > Line: 722, File: > "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java" > Line: 191, File: > "org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java" > Line: 393, File: > "org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java" > Line: 258, File: > "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java" > ========================== > ========================== > Case 6: potential divide by zero > line: 581, File: > org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java > {noformat} > int availableContainers = > node.getAvailableResource().getMemory() / capability.getMemory(); > // TODO: A buggy > // application > // with this > // zero would > // crash the > // scheduler. > {noformat} > Should this potential divide by zero be handled? > ========================== > ========================== > Case 7: > Line: 260, File: > "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java" > {noformat} > } catch (YarnException e) { > // TODO cleanup > return; > } > {noformat} > The error handler simply returns without proper clean up. > ========================== -- This message was sent by Atlassian JIRA (v6.1.5#6160)