[ 
https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021119#comment-17021119
 ] 

Ahmed Hussein edited comment on TEZ-3391 at 1/22/20 2:38 PM:
-------------------------------------------------------------

I agree with [~rohini] that the implementation is not efficient.
The ideal fix is to read the object array {{TaskSplitMetaInfo[]}} only once, 
do all the validation in the AM, and then pass {{TaskSplitMetaInfo[index]}} to 
the task initializer. This may require significant code changes.
The existing code also has significant space overhead because each task 
creates the full array of split meta info. With {{n}} tasks each holding {{n}} 
entries, the total space is {{O(n^2)}}. The patch reduces the space overhead, 
but each task still needs to go through the entire meta file.
Finally, the code was not closing the InputStream properly, so an exception 
would leak the handle.
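
To illustrate the last two points, here is a minimal sketch, not the code from 
the patch. It assumes a toy meta file layout (an int count followed by pairs 
of longs); the class {{SplitMetaSketch}}, its {{Entry}} type, and the methods 
{{writeMeta}} and {{readEntry}} are hypothetical names, not Tez APIs. The 
reader scans the whole file but keeps only the requested entry, and 
try-with-resources closes the stream even when validation throws.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class SplitMetaSketch {

  /** Hypothetical stand-in for one split meta entry; not the Tez class. */
  public static final class Entry {
    public final long startOffset;
    public final long length;

    Entry(long startOffset, long length) {
      this.startOffset = startOffset;
      this.length = length;
    }
  }

  /** Writes the assumed toy layout: an int count, then (offset, length) pairs. */
  public static byte[] writeMeta(long[][] entries) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(entries.length);
      for (long[] e : entries) {
        out.writeLong(e[0]);
        out.writeLong(e[1]);
      }
    }
    return bytes.toByteArray();
  }

  /**
   * Scans every entry but retains only the one at {@code index}, so a task
   * holds one entry instead of an array of all of them.
   */
  public static Entry readEntry(InputStream raw, int index) throws IOException {
    // try-with-resources closes the stream on both normal and exceptional exit.
    try (DataInputStream in = new DataInputStream(raw)) {
      int count = in.readInt();
      if (index < 0 || index >= count) {
        throw new IOException(
            "split index " + index + " out of range [0, " + count + ")");
      }
      Entry wanted = null;
      for (int i = 0; i < count; i++) {
        long offset = in.readLong();
        long length = in.readLong();
        if (i == index) {
          wanted = new Entry(offset, length);
        }
      }
      return wanted;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] meta = writeMeta(new long[][] {{0, 100}, {100, 250}, {350, 75}});
    Entry e = readEntry(new ByteArrayInputStream(meta), 1);
    System.out.println(e.startOffset + " " + e.length);
  }
}
```

Each task still pays the cost of one pass over the file in this sketch. Only 
the ideal fix described above (read once in the AM and hand each task its own 
entry) would remove that scan.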

[~jeagles], can you please take a look at the patch and merge it at your 
convenience?



> MR split file validation should be done in the AM
> -------------------------------------------------
>
>                 Key: TEZ-3391
>                 URL: https://issues.apache.org/jira/browse/TEZ-3391
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Ahmed Hussein
>            Priority: Major
>         Attachments: TEZ-3391.001.patch, TEZ-3391.002.patch
>
>
>   We had a case where the split metadata size exceeded 10000000. Instead of 
> the job failing validation during initialization in the AM, as in MapReduce, 
> each of the tasks failed that validation during its own initialization.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
