[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898648#action_12898648
 ] 

Ashutosh Chauhan commented on PIG-1518:
---------------------------------------

This feature of combining multiple splits should honor the OrderedLoadFunc 
interface: if a loadfunc implements that interface, the splits it generates 
should not be combined. However, it's not clear why FileInputLoadFunc 
implements this interface. AFAIK, the split[] returned by getSplits() on 
FileInputFormat makes no guarantee that the underlying splits are returned in 
order. Ordered return is the default behavior right now, so having 
FileInputLoadFunc implement OrderedLoadFunc doesn't cause any problem in the 
current implementation. But there seems to be no real benefit in 
FileInputLoadFunc implementing it (there is one exception, which I will come 
to below). So I will argue that FileInputLoadFunc should stop implementing 
OrderedLoadFunc. The immediate benefit is that this change becomes useful to 
all of Pig's fundamental storage mechanisms, such as PigStorage, BinStorage, 
InterStorage, etc. Dropping an interface from an implementing class can be 
seen as a backward-incompatible change, but I really doubt anyone cares 
whether PigStorage reads splits in an ordered fashion. 
The only real victim of this change will be MergeJoin, which will stop working 
with PigStorage by default. But, first, we have not seen MergeJoin being used 
with PigStorage in many places. Second, it is in any case based on an 
assumption about FileInputFormat, which may choose to change its behavior in 
the future. Third, the solution to this problem is straightforward: have 
another loader that extends PigStorage and implements OrderedLoadFunc, which 
can be used to load data for merge join. 
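
To make the proposed rule concrete, here is a rough sketch of the check I have 
in mind. It is not Pig code: LoadFunc and OrderedLoadFunc below are simplified 
stand-ins (the real OrderedLoadFunc is not a bare marker interface), 
PlainLoader, OrderedLoader and SplitCombiningSketch are hypothetical names, 
and splits are reduced to their sizes in bytes. It only illustrates "combine 
unless the loader is ordered", with an ordered subclass playing the role of 
the extra loader suggested for merge join.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for Pig's loader types (assumptions, not the real API).
interface LoadFunc {}
interface OrderedLoadFunc {} // treated as a pure marker in this sketch

// Hypothetical loader that makes no ordering promise (think PigStorage after the change).
class PlainLoader implements LoadFunc {}

// Hypothetical loader that opts back in to ordering, e.g. for merge join.
class OrderedLoader extends PlainLoader implements OrderedLoadFunc {}

class SplitCombiningSketch {
    /**
     * Groups split sizes greedily into combined splits of at most maxCombinedSize bytes.
     * If the loader implements OrderedLoadFunc, every split is left on its own.
     */
    static List<List<Long>> combine(LoadFunc loader, List<Long> splitSizes, long maxCombinedSize) {
        List<List<Long>> result = new ArrayList<>();
        if (loader instanceof OrderedLoadFunc) {
            // Ordered loaders: do not combine, one output group per input split.
            for (Long size : splitSizes) {
                List<Long> single = new ArrayList<>();
                single.add(size);
                result.add(single);
            }
            return result;
        }
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (Long size : splitSizes) {
            // Close the current group when adding this split would exceed the limit.
            if (!current.isEmpty() && currentSize + size > maxCombinedSize) {
                result.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(10L, 20L, 30L, 40L);
        System.out.println(combine(new PlainLoader(), sizes, 64L));   // splits get grouped
        System.out.println(combine(new OrderedLoader(), sizes, 64L)); // splits stay separate
    }
}
```

With FileInputLoadFunc no longer implementing OrderedLoadFunc, the common 
loaders would take the first path, and only loaders that explicitly opt in 
would be excluded from combining.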

In essence, I am arguing for dropping the OrderedLoadFunc interface from 
FileInputLoadFunc so that this feature is useful for a large number of use cases.

Yan, you also need to watch out for ReadToEndLoader, which also makes 
assumptions that may break in the presence of this feature.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> We frequently run into the situation where Pig needs to deal with small files 
> in the input. In this case a separate map is created for each file, which 
> could be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing a similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
