[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898648#action_12898648 ]
Ashutosh Chauhan commented on PIG-1518:
---------------------------------------

This feature of combining multiple splits should honor the OrderedLoadFunc interface. If a loadfunc implements that interface, the splits it generates should not be combined.

However, it's not clear why FileInputLoadFunc implements this interface. AFAIK, the split[] returned by getSplits() on FileInputFormat makes no guarantee that the underlying splits are returned in order. That happens to be the default behavior right now, so having FileInputLoadFunc implement OrderedLoadFunc doesn't cause any problem in the current implementation. But there seems to be no real benefit in FileInputLoadFunc implementing it (with one exception, which I will come to below). So I will argue that FileInputLoadFunc should stop implementing OrderedLoadFunc. The immediate benefit is that this change becomes useful to all the fundamental storage mechanisms of Pig, like PigStorage, BinStorage, InterStorage etc.

Dropping an interface from an implementing class can be seen as a backward-incompatible change, but I really doubt anyone cares whether PigStorage reads splits in an ordered fashion. The only real victim of this change will be MergeJoin, which will stop working with PigStorage by default. First, we have not seen MergeJoin being used with PigStorage in many places. Second, it is in any case based on an assumption about FileInputFormat, which may choose to change its behavior in the future. Third, the solution to this problem is straightforward: have another loader that extends PigStorage and implements OrderedLoadFunc, and use that to load data for merge join.

In essence, I am arguing to drop the OrderedLoadFunc interface from FileInputLoadFunc so that this feature is useful for a large number of use cases.

Yan, you also need to watch out for ReadToEndLoader, which also makes assumptions that may break in the presence of this feature.
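To make the proposed rule concrete, here is a minimal, self-contained sketch. It does not use the real Pig or Hadoop classes: `Loader`, `OrderedLoader`, `PlainLoader`, `OrderedPlainLoader`, `plan` and the greedy size-based grouping are all hypothetical stand-ins chosen for illustration. `OrderedLoader` plays the role of OrderedLoadFunc, `PlainLoader` the role of a PigStorage-like loader that no longer implements it, and `OrderedPlainLoader` the role of the suggested subclass for merge join.

```java
import java.util.ArrayList;
import java.util.List;

class SplitCombinerSketch {
    // Hypothetical stand-ins, NOT the real Pig interfaces.
    interface Loader {}
    interface OrderedLoader extends Loader {}   // plays the role of OrderedLoadFunc

    // A basic loader whose splits may be combined (assumed PigStorage-like).
    static class PlainLoader implements Loader {}

    // The suggested workaround: extend the basic loader and opt back in to ordering.
    static class OrderedPlainLoader extends PlainLoader implements OrderedLoader {}

    // Groups split sizes into combined splits of at most maxCombinedSize,
    // unless the loader is ordered, in which case every split stays on its own.
    static List<List<Long>> plan(Loader loader, List<Long> splitSizes, long maxCombinedSize) {
        List<List<Long>> result = new ArrayList<>();
        if (loader instanceof OrderedLoader) {
            for (Long size : splitSizes) {
                result.add(List.of(size));      // honor the ordering contract: no combining
            }
            return result;
        }
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (Long size : splitSizes) {
            // Close the current group when adding this split would exceed the limit.
            if (!current.isEmpty() && currentSize + size > maxCombinedSize) {
                result.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> sizes = List.of(10L, 20L, 30L, 100L);
        System.out.println(plan(new PlainLoader(), sizes, 64L));
        System.out.println(plan(new OrderedPlainLoader(), sizes, 64L));
    }
}
```

With the sizes above and a limit of 64, the plain loader's three small splits end up in one combined split while the oversized one stays alone; the ordered loader keeps all four splits separate. How an actual implementation would choose group boundaries is outside what the comment specifies.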
> multi file input format for loaders
> -----------------------------------
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.