[ https://issues.apache.org/jira/browse/TEZ-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517398#comment-14517398 ]
Hitesh Shah commented on TEZ-924:
---------------------------------

[~rajesh.balamohan] [~sseth] Is this still targeted for 0.7, or should we move it to 0.8?

> InputFailedEvent handling for Shuffle
> -------------------------------------
>
>                 Key: TEZ-924
>                 URL: https://issues.apache.org/jira/browse/TEZ-924
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Siddharth Seth
>            Priority: Critical
>
> Shuffle receives batches of Events to process from the AM. Because of the way
> these events are sent to the ShuffleHandlers and the way they are processed,
> Shuffle may start fetching data for an Event that is subsequently marked as
> failed (via an InputFailedEvent):
> 1) The AM sends events in batches. An InputFailedEvent for a specific Input
> may not be part of the same batch that contained the original event being
> marked bad.
> 2) The ShuffleEventHandler processes the events in each batch one event at a
> time, so even if the InputFailedEvent follows in the same batch, Shuffle can
> start fetching data from a Failed Input.
> The AM needs to change to invalidate Inputs up front, so that related events
> don't span batches. Alternatively, it needs to apply the InputFailedEvent to
> the original event before that event is sent.
> The Shuffle itself should process a batch update as a batch. That would
> prevent fetchers from starting early even though there may be additional
> events for the same host.
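
The "process a batch update as a batch" idea from the description can be illustrated with a minimal sketch. Everything in it is hypothetical and for illustration only: `BatchSketch`, `Event`, `Kind`, and `inputsToFetch` are not Tez classes or methods, and an input is identified by a bare integer rather than by a task index and attempt number. The sketch scans the whole batch for failures first and only then decides what to fetch. It covers only point 2 of the description; it does nothing for a failure that arrives in a later batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these types are Tez classes.
public class BatchSketch {
    enum Kind { DATA_AVAILABLE, INPUT_FAILED }

    static final class Event {
        final Kind kind;
        final int inputId; // assumed identifier for a source input

        Event(Kind kind, int inputId) {
            this.kind = kind;
            this.inputId = inputId;
        }
    }

    // Pass 1 collects every input marked failed anywhere in the batch.
    // Pass 2 returns, in arrival order, only the inputs that are safe to fetch.
    static List<Integer> inputsToFetch(List<Event> batch) {
        Set<Integer> failed = new HashSet<>();
        for (Event e : batch) {
            if (e.kind == Kind.INPUT_FAILED) {
                failed.add(e.inputId);
            }
        }
        List<Integer> fetchable = new ArrayList<>();
        for (Event e : batch) {
            if (e.kind == Kind.DATA_AVAILABLE && !failed.contains(e.inputId)) {
                fetchable.add(e.inputId);
            }
        }
        return fetchable;
    }

    public static void main(String[] args) {
        // Input 1 is announced and then failed within the same batch.
        List<Event> batch = Arrays.asList(
            new Event(Kind.DATA_AVAILABLE, 1),
            new Event(Kind.DATA_AVAILABLE, 2),
            new Event(Kind.INPUT_FAILED, 1));
        System.out.println(inputsToFetch(batch)); // prints [2]
    }
}
```

Handling events one at a time would have scheduled a fetch for input 1 before its failure was seen. The two-pass version never schedules it, at the cost of delaying all fetches until the full batch has been scanned.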