[jira] Commented: (PIG-1518) multi file input format for loaders

2010-09-09 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12907805#action_12907805
 ] 

Olga Natkovich commented on PIG-1518:
-

Hi Justin, thanks for the patch!

I don't think we can commit it to the 0.7 branch because we have already done 
the official 0.7 release and we can't introduce backward-incompatible changes 
to this branch.

However, I think it is great to have the patch on the JIRA so that anybody who 
is interested in this patch can apply it to their own tree and run with it. We 
have done similar things in the past (with hadoop versions) and it worked fine.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518-0.7.0.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch


 We frequently run into the situation where Pig needs to deal with small files
 in the input. In this case a separate map is created for each file, which
 could be very inefficient.
 It would be great to have an umbrella input format that can take multiple
 files and use them in a single split. We would like to see this working with
 different data formats if possible.
 There are already a couple of input formats doing a similar thing:
 MultifileInputFormat as well as CombinedInputFormat; however, neither works
 with the new Hadoop 20 API.
 We at least want to do a feasibility study for Pig 0.8.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903031#action_12903031
 ] 

Dmitriy V. Ryaboy commented on PIG-1518:


This is a great feature, thanks Yan.

Could you comment on what the final solution was as far as PigStorage and 
OrderedLoadFunc? I see two ideas (yours and Ashutosh's) in the discussion, but 
not which direction you ultimately took.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903102#action_12903102
 ] 

Yan Zhou commented on PIG-1518:
---

It is not combinable if the loader is a CollectableLoadFunc AND an 
OrderedLoadFunc. Since PigStorage is a CollectableLoadFunc but not an 
OrderedLoadFunc, it is combinable.
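
The rule stated above can be written as a small predicate. This is only a 
sketch: the two marker classes are Python stand-ins for Pig's Java interfaces, 
and is_combinable and the two example loaders are hypothetical names that do 
not come from the patch.

```python
class CollectableLoadFunc:
    """Stand-in for Pig's CollectableLoadFunc interface (assumption)."""


class OrderedLoadFunc:
    """Stand-in for Pig's OrderedLoadFunc interface (assumption)."""


def is_combinable(loader) -> bool:
    """Splits stay uncombined only when the loader is both collectable and ordered."""
    return not (isinstance(loader, CollectableLoadFunc)
                and isinstance(loader, OrderedLoadFunc))


class CollectableOnlyLoader(CollectableLoadFunc):
    """Hypothetical loader implementing just one of the two interfaces."""


class CollectableOrderedLoader(CollectableLoadFunc, OrderedLoadFunc):
    """Hypothetical loader implementing both interfaces."""
```

Under this rule a loader implementing neither interface, or only one of them, 
still gets its splits combined.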




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-23 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901600#action_12901600
 ] 

Richard Ding commented on PIG-1518:
---

+1. The patch looks good.

A few minor points:

* In PigSplit, the method add(InputSplit split) is not used and can be removed.
* In MapRedUtil, it would be better not to leave the debug verification code in 
the source code.
* In PigRecordReader, the code can be simplified if the initNextRecordReader() 
call is moved from the constructor to the initialize() method.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-18 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899888#action_12899888
 ] 

Yan Zhou commented on PIG-1518:
---

In summary, split combination is controlled through the following JVM 
properties:

pig.maxCombinedSplitSize: the maximum combined split size, in bytes. By 
default, it is the default block size of the file system being loaded from;

pig.splitCombination: takes the values false and true. The default is 
true; false disables split combination.
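
As a rough illustration of how the two properties interact, the sketch below 
resolves them from a dictionary of property strings. The function name, the 
dictionary representation, and the exact parsing (trimming and lower-casing 
the boolean) are assumptions made for this example, not the patch's code.

```python
def split_combination_settings(props, default_block_size):
    """Return (enabled, max_combined_size_in_bytes) from a dict of property strings.

    props              -- e.g. {"pig.splitCombination": "false"}
    default_block_size -- default block size of the load file system, in bytes
    """
    # Combination is on unless the property is explicitly set to "false".
    enabled = props.get("pig.splitCombination", "true").strip().lower() != "false"
    # Without an explicit limit, fall back to the file system's block size.
    raw_size = props.get("pig.maxCombinedSplitSize")
    max_size = int(raw_size) if raw_size is not None else default_block_size
    return enabled, max_size
```

With no properties set, combination is enabled and the limit equals the block 
size passed in.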




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-18 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1292#action_1292
 ] 

Mridul Muralidharan commented on PIG-1518:
--

If the optimizer is turned off, does this also get turned off 
(pig.splitCombination=false)?




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-18 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900123#action_12900123
 ] 

Yan Zhou commented on PIG-1518:
---

No. It does not work inside an optimizer, because it does not change the 
logical/physical plans the way the other optimizers do.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-17 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899445#action_12899445
 ] 

Yan Zhou commented on PIG-1518:
---

Another approach is to mark splits as uncombinable only when necessary. 
Specifically, MergeJoinIndexer and the base load in mapside cogroup need to be 
excluded from the split combination. 

Breaking backward compatibility is probably too much of a risk to take. 
Meanwhile, OrderedLoadFunc is marked as evolving, which leaves some headroom 
for future semantic refinements.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-17 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899605#action_12899605
 ] 

Yan Zhou commented on PIG-1518:
---

One experimental result on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM 
boxes is as follows:

Query:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent, query_term, ip_addr, timestamp,
estimated_revenue, page_info, page_links);
B = foreach A generate user, (double)estimated_revenue;
B1 = distinct B;
alpha = load '/user/pig/tests/data/pigmix/users' using PigStorage('\u0001') as 
(name, phone, address,
city, state, zip);
beta = foreach alpha generate name;
C = join beta by name, B1 by user parallel 300;
D = group C by $0 parallel 40;
E = foreach D generate group, SUM(C.estimated_revenue);
store E into 'spliCombo2.out';

It creates 3 map/reduce jobs.

No Split Combination:

||Mappers|Reducers|
|number|120|300|
|elapsed time|24s|2m43s|
|number|301|300|
|elapsed time|46s|3m11s|
|number|300|40|
|elapsed time|38s|53s|
|Total elapsed time|7m36s|


With Split Combination:

||mappers|Reducers|
|number|120|300|
|elapsed time|22s|2m49s|
|number|3|300|
|elapsed time|27s|2m46s|
|number|1|40|
|elapsed time|17s|24s|
|Total elapsed time|7m5s|




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-17 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899609#action_12899609
 ] 

Yan Zhou commented on PIG-1518:
---

The formatting of the table of the last comment is a bit off: both headers 
should be right-shifted by one column.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-14 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898648#action_12898648
 ] 

Ashutosh Chauhan commented on PIG-1518:
---

This feature of combining multiple splits should honor the OrderedLoadFunc 
interface. If a loadfunc implements that interface, then the splits generated 
by it should not be combined. However, it's not clear why FileInputLoadFunc 
implements this interface. AFAIK, the split[] returned by getSplits() on 
FileInputFormat makes no guarantee that the underlying splits will be returned 
in an ordered fashion. It is the default behavior right now, so having 
FileInputLoadFunc implement OrderedLoadFunc doesn't cause any problem in the 
current implementation. But there seems to be no real benefit in 
FileInputLoadFunc implementing it (there is one exception, which I will come 
to later). So I will argue that FileInputLoadFunc should stop implementing 
OrderedLoadFunc. This will have the immediate benefit of making this change 
useful to all the fundamental storage mechanisms of Pig, like PigStorage, 
BinStorage, InterStorage, etc. Dropping an interface from an implementing 
class can be seen as a backward-incompatible change, but I really doubt that 
anyone cares whether PigStorage is reading splits in an ordered fashion. 
The only real victim of this change will be MergeJoin, which will stop working 
with PigStorage by default. But, first, we have not seen MergeJoin being used 
with PigStorage in many places. Second, it is anyway based on an assumption 
about FileInputFormat, which may choose to change its behavior in the future. 
Third, the solution to this problem is straightforward: have another loader 
that extends PigStorage and implements OrderedLoadFunc, which can be used to 
load data for merge join. 

In essence, I am arguing to drop the OrderedLoadFunc interface from 
FileInputLoadFunc so that this feature is useful for a large number of use 
cases.

Yan, you also need to watch out for ReadToEndLoader, which also makes 
assumptions that may break in the presence of this feature.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-13 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898490#action_12898490
 ] 

Yan Zhou commented on PIG-1518:
---

There is a bigger question at hand. The semantics of OrderedLoadFunc is that 
the splits are totally ordered. And BinStorage, InterStorage and PigStorage all 
implement that interface through FileInputLoadFunc. Since the combination of 
splits as conceived here will definitely destroy the split ordering, if the 
combination is disabled for these storages, the feature would be virtually 
useless for a majority of use cases.

On the other hand, I'm seeing no use of the comparison capability except for 
MergeJoinIndexer's getNext() method, which makes me wonder if the 
OrderedLoadFunc can be removed from the FileInputLoadFunc.  Semantically, 
FileInputLoadFunc should not support the ordering of splits, as Hadoop's 
FileInputFormat doesn't. When a need arises like in MergeJoinIndexer, we can 
add that extension on. But the change may incur some backward compatibility 
issues.
I'm now soliciting comments in this area.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-12 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897887#action_12897887
 ] 

Yan Zhou commented on PIG-1518:
---

During the merge process, any empty splits will be skipped. Currently empty 
splits are generated for empty files, which is not necessary in the first 
place.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-11 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897368#action_12897368
 ] 

Alan Gates commented on PIG-1518:
-

bq. For mapside cogroup or mapside group by, though, the splits can be combined 
because the splits are only required to contain the all duplicate keys per 
instance and combination of splits will still preserve that invariant.

You are correct for mapside group, but not mapside cogroup.  Mapside cogroup 
does require all files being grouped to be processed in an ordered fashion.  




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-11 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897493#action_12897493
 ] 

Yan Zhou commented on PIG-1518:
---

Right, map-side cogroup needs the input to be sorted, but only the side inputs 
need that property, so that they can seek on a key; the base input only needs 
all duplicates of a key to be present in one mapper. I'll mark the side 
inputs as non-combinable.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-10 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897085#action_12897085
 ] 

Yan Zhou commented on PIG-1518:
---

The pseudo code of the combination op is as follows:

for each node of the nodes (sorted in the order of ascending sizes) {
  while the node's split list (sorted in the order of descending sizes) is not
  empty {
    find the biggest splits that can be combined with the first split of the
    list of the splits;
    if the accumulated split size is >= half of the limit {
      generate a combined split;
      remove the accumulated splits from the node's split list;
      clear the accumulated split list;
    } else {
      break;
    }
  }
}

// leftover combination
for each node of the nodes {
  for each split of the node's split list {
    add the split to a leftover list;
  }
}

for each split in the leftover list {
  if accumulated split size is >= limit {
    generate a combined split;
    remove the accumulated splits from the node's split list;
    clear the accumulated split list;
  }
  if it is the last split in the leftover list {
    try to see if it can be added with an existing combined split;
    if not, generate a combined split on the accumulated splits;
  }
}

The complexity is n*log(n) with n being the number of original splits that are 
smaller than the limit.
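
For readers who prefer something executable, here is a simplified Python 
sketch of the two phases described by the pseudo code. It is not the patch's 
Java code: the Split class, its name field, and combine_splits are 
hypothetical; nodes are ordered by their number of splits; splits at or above 
the limit are passed through untouched; and the final step of folding the last 
leftover into an existing combined split is left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Split:
    name: str                # hypothetical unique identifier, used only in this sketch
    size: int                # split length in bytes
    hosts: Tuple[str, ...]   # hosts reported for the split


def combine_splits(splits: List[Split], limit: int) -> List[List[Split]]:
    combined: List[List[Split]] = []
    used = set()

    # Empty splits are skipped; splits at or above the limit stay on their own.
    small = []
    for s in splits:
        if s.size == 0:
            continue
        if s.size >= limit:
            combined.append([s])
        else:
            small.append(s)

    by_host: Dict[str, List[Split]] = defaultdict(list)
    for s in small:
        for h in s.hosts:
            by_host[h].append(s)

    # Phase 1: node-local combination, hosts with the fewest splits first.
    for host in sorted(by_host, key=lambda h: (len(by_host[h]), h)):
        pending = sorted((s for s in by_host[host] if s.name not in used),
                         key=lambda s: -s.size)
        while pending:
            batch, total = [], 0
            for s in pending:  # greedy: take the biggest splits that still fit
                if total + s.size <= limit:
                    batch.append(s)
                    total += s.size
            if 2 * total < limit:
                break  # too little data on this host; leave it for phase 2
            combined.append(batch)
            used.update(s.name for s in batch)
            pending = [s for s in pending if s.name not in used]

    # Phase 2: leftovers are packed together regardless of locality.
    batch, total = [], 0
    for s in small:
        if s.name in used:
            continue
        if batch and total + s.size > limit:
            combined.append(batch)
            batch, total = [], 0
        batch.append(s)
        total += s.size
    if batch:
        combined.append(batch)
    return combined
```

With a limit of 100 and splits of sizes 60, 30 and 20 on one host plus sizes 
10 and 5 on two other hosts, the sketch combines the 60 and 30 locally and 
packs the remaining three splits into one leftover group.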




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-02 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894778#action_12894778
 ] 

Yan Zhou commented on PIG-1518:
---

In contrast with Hive, where CombineFileInputFormat is used to generate 
input splits on the underlying storage formats, Pig's combined splits work 
on top of the splits generated by the underlying loaders. In other words, 
Hive's input splits are CombineFileSplits that create record readers of the 
underlying storage formats, while Pig's combined input splits contain the 
underlying storage's splits.

CombineFileRecordReader would have been reusable if not for its being 
supported only in 0.18 and its need for a CombineFileSplit, instead of an 
InputSplit, as the argument to its constructor (MAPREDUCE-955).




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-07-30 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894205#action_12894205
 ] 

Yan Zhou commented on PIG-1518:
---

CombinedInputFormat, in lieu of the deprecated MultiFileInputFormat, batches 
small files on the basis of block locality. For Pig, this umbrella input format 
will have to work with generic input formats, for which the block info is 
unavailable but the data node and size info are present to let M/R make 
scheduling decisions. In other words, Pig cannot break up the original splits 
to work inside them, but can only use the original splits as building blocks 
for the combined input splits.

Consequently, this combined input format will hold multiple generic input 
splits so that each combined split's size is bounded by a configured limit of, 
say, pig.maxsplitsize, with a default value of the HDFS block size of the 
file system the load source sits in.

However, due to the constraint that the tables in a merge join be sorted, the 
split combination will not be used for any loads that will be used in merge 
join. For mapside cogroup or mapside group by, though, the splits can be 
combined because the splits are only required to contain all duplicate keys 
per instance, and combination of splits will still preserve that invariant.

During combination, the splits on the same data nodes will be merged as much as 
possible. Leftovers will be merged without regard to data locality. Of all the 
data nodes used, those with fewer splits will be merged before those with more 
splits, so as to minimize the leftovers on the data nodes with fewer splits. On 
each data node, a greedy approach is adopted so that the largest splits are 
merged before smaller ones. This is because smaller splits are easier to merge 
among themselves later. 
As a result, the implementation will maintain a list of data hosts, sorted by 
number of splits, each holding a list of its original splits sorted by split 
size, so that the above operations can be performed efficiently. The 
complexity should be linear in the number of original splits.

Note that for data locality, we just honor whatever the generic input split's 
getLocations() method produces. Any particular input split implementation may 
or may not actually hold that property. For instance, CombinedInputFormat will 
combine node-local or rack-local blocks into a split. Essentially, this Pig 
container input split works on whatever data locality perception the 
underlying loader provides.

On the implementation side, PigSplit will hold not a single wrapped InputSplit 
instance but a new CombinedInputSplit instance. Accordingly, PigRecordReader 
will hold a list of wrapped record readers rather than a single one, and 
PigRecordReader's nextKeyValue() will use the wrapped record readers in order 
to fetch the next values.
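
The idea of a record reader that walks a list of wrapped readers can be 
sketched as follows. This is an illustration only, not the PigRecordReader 
code: ChainedRecordReader and read_all are hypothetical names, and each 
wrapped reader is modeled as a plain Python iterable of records.

```python
class ChainedRecordReader:
    """Reads records from several wrapped readers, one after another."""

    def __init__(self, readers):
        self._readers = iter(readers)  # wrapped readers, in split order
        self._current = None           # iterator over the reader in use
        self._value = None             # last record fetched

    def next_key_value(self) -> bool:
        """Advance to the next record; return False once every reader is exhausted."""
        while True:
            if self._current is None:
                nxt = next(self._readers, None)
                if nxt is None:
                    return False
                self._current = iter(nxt)
            try:
                self._value = next(self._current)
                return True
            except StopIteration:
                self._current = None  # move on to the next wrapped reader

    def get_current_value(self):
        return self._value


def read_all(reader):
    """Drain a ChainedRecordReader into a list (helper for the example)."""
    out = []
    while reader.next_key_value():
        out.append(reader.get_current_value())
    return out
```

A wrapped reader that yields no records is simply skipped, so the caller sees 
one continuous stream of values.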

Risks include: 1) the test verifications may need major changes, since this 
optimization may cause major ordering changes in results; 2) since 
LoadFunc.prepareToRead() takes a PigSplit argument, there might be a backward 
compatibility issue as PigSplit changes its wrapped input split to the combined 
input split. But this should be very unlikely, as the only known use of the 
PigSplit argument is the internal index loader for the right table in merge 
join.
