
[jira] Commented: (PIG-1518) multi file input format for loaders

2010-09-09 Thread Olga Natkovich (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12907805#action_12907805
]

Olga Natkovich commented on PIG-1518:
-

Hi Justin, thanks for the patch!

I don't think we can commit it to the 0.7 branch, because we have already done the
official 0.7 release and we can't introduce non-backward-compatible changes to
this branch.

However, I think it is great to have the patch on the JIRA so that anybody who
is interested in this patch can apply it to their own tree and run with it. We
have done similar things in the past (with hadoop versions) and it worked fine.

multi file input format for loaders
---

Key: PIG-1518
URL: https://issues.apache.org/jira/browse/PIG-1518
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
Fix For: 0.8.0

Attachments: PIG-1518-0.7.0.patch, PIG-1518.patch, PIG-1518.patch,
PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch,
PIG-1518.patch, PIG-1518.patch

We frequently run into the situation where Pig needs to deal with small files
in the input. In this case a separate map is created for each file, which
can be very inefficient.
It would be great to have an umbrella input format that can take multiple
files and use them in a single split. We would like to see this working with
different data formats if possible.
There are already a couple of input formats doing a similar thing:
MultifileInputFormat as well as CombinedInputFormat; however, neither works
with the new Hadoop 0.20 API.
We at least want to do a feasibility study for Pig 0.8.0.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Dmitriy V. Ryaboy (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903031#action_12903031
]

Dmitriy V. Ryaboy commented on PIG-1518:

This is a great feature, thanks Yan.

Could you comment on what the final solution was as far as PigStorage and
OrderedLoadFunc are concerned? I see two ideas (yours and Ashutosh's) in the
discussion, but not which direction you ultimately took.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903102#action_12903102
 ] 

Yan Zhou commented on PIG-1518:
---

It is not combinable if the loader is a CollectableLoadFunc AND an
OrderedLoadFunc. Since PigStorage is a CollectableLoadFunc but not an
OrderedLoadFunc, it is combinable.
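
The rule stated above can be sketched as a small predicate. This is only an illustration in Python, not Pig's Java code: the two marker classes stand in for the Pig interfaces of the same names, and SortedTableLoader, PlainTextLoader and is_combinable are hypothetical names made up for this example.

```python
class CollectableLoadFunc:
    """Stand-in marker for Pig's CollectableLoadFunc interface."""


class OrderedLoadFunc:
    """Stand-in marker for Pig's OrderedLoadFunc interface."""


class SortedTableLoader(CollectableLoadFunc, OrderedLoadFunc):
    """Hypothetical loader that implements both interfaces."""


class PlainTextLoader(CollectableLoadFunc):
    """Hypothetical loader that implements only one of the two."""


def is_combinable(loader):
    # Splits are left uncombined only when the loader is BOTH
    # collectable and ordered; every other loader may be combined.
    both = (isinstance(loader, CollectableLoadFunc)
            and isinstance(loader, OrderedLoadFunc))
    return not both


print(is_combinable(SortedTableLoader()))  # False
print(is_combinable(PlainTextLoader()))    # True
```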


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-23 Thread Richard Ding (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901600#action_12901600
]

Richard Ding commented on PIG-1518:
---

+1. The patch looks good.

A few minor points:

* In PigSplit, the method add(InputSplit split) is not used and can be removed.
* In MapRedUtil, it would be better not to leave the debug verification code in
the source code.
* In PigRecordReader, the code can be simplified if the initNextRecordReader()
call is moved from the constructor to the initialize() method.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-18 Thread Yan Zhou (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899888#action_12899888
]

Yan Zhou commented on PIG-1518:
---

In summary, split combination is controlled through the following JVM
properties:

pig.maxCombinedSplitSize: specifies the maximum combined split size in bytes.
By default, it is the default block size of the file system being loaded from;

pig.splitCombination: takes the values false and true. The default is
true; false disables split combination.
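
As an illustration of the defaults described above, here is a minimal Python sketch that resolves the two properties from a plain dictionary. How Pig itself reads them is not shown in this thread; the function name split_combination_settings and the 64 MB block size are assumptions used only for the example.

```python
def split_combination_settings(props, fs_default_block_size):
    """Resolve the two properties from a dict of string-valued properties.

    props: e.g. {"pig.splitCombination": "false"}
    fs_default_block_size: default block size (bytes) of the load file system
    """
    # Split combination is on unless explicitly set to "false".
    enabled = props.get("pig.splitCombination", "true").strip().lower() == "true"
    # The size limit falls back to the file system's default block size.
    raw_max = props.get("pig.maxCombinedSplitSize")
    max_size = int(raw_max) if raw_max is not None else fs_default_block_size
    return enabled, max_size


BLOCK = 64 * 1024 * 1024  # assumed block size, for illustration only

print(split_combination_settings({}, BLOCK))
# (True, 67108864)
print(split_combination_settings(
    {"pig.splitCombination": "false",
     "pig.maxCombinedSplitSize": "134217728"}, BLOCK))
# (False, 134217728)
```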


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-18 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1292#action_1292
 ] 

Mridul Muralidharan commented on PIG-1518:
--

If the optimizer is turned off, does this also get turned off
(pig.splitCombination=false)?


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-18 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900123#action_12900123
 ] 

Yan Zhou commented on PIG-1518:
---

No. It does not work inside the optimizer, since it does not change the
logical/physical plans as the optimizers do.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-17 Thread Yan Zhou (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899445#action_12899445
]

Yan Zhou commented on PIG-1518:
---

Another approach is to mark splits as uncombinable only when necessary.
Specifically, MergeJoinIndexer and the base load in mapside cogroup need to be
excluded from the split combination.

Breaking backward compatibility is probably too much of a risk to take. In the
meantime, OrderedLoadFunc is designated as evolving, which leaves
some headroom for future semantic polishing.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-17 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899605#action_12899605
 ] 

Yan Zhou commented on PIG-1518:
---

One experimental result on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM 
boxes is as follows:

Query:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent, query_term, ip_addr, timestamp,
estimated_revenue, page_info, page_links);
B = foreach A generate user, (double)estimated_revenue;
B1 = distinct B;
alpha = load '/user/pig/tests/data/pigmix/users' using PigStorage('\u0001') as 
(name, phone, address,
city, state, zip);
beta = foreach alpha generate name;
C = join beta by name, B1 by user parallel 300;
D = group C by $0 parallel 40;
E = foreach D generate group, SUM(C.estimated_revenue);
store E into 'spliCombo2.out';

It creates 3 map/reduce jobs.

No Split Combination:

|| ||Mappers||Reducers||
|number|120|300|
|elapsed time|24s|2m43s|
|number|301|300|
|elapsed time|46s|3m11s|
|number|300|40|
|elapsed time|38s|53s|
|Total elapsed time|7m36s|


With Split Combination:

|| ||Mappers||Reducers||
|number|120|300|
|elapsed time|22s|2m49s|
|number|3|300|
|elapsed time|27s|2m46s|
|number|1|40|
|elapsed time|17s|24s|
|Total elapsed time|7m5s|
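
For a quick comparison of the two totals reported above, the arithmetic can be worked as follows. The helper to_seconds is a hypothetical name introduced only for this calculation; the durations are taken from the two tables.

```python
import re


def to_seconds(text):
    """Convert a duration such as '7m36s' or '24s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text)
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


without_combination = to_seconds("7m36s")  # 456 seconds
with_combination = to_seconds("7m5s")      # 425 seconds
saved = without_combination - with_combination

print(saved, round(100 * saved / without_combination, 1))  # 31 6.8
```

So the run with split combination finished 31 seconds sooner, roughly 6.8% of the total, while the mapper counts of the second and third jobs dropped from 301 and 300 to 3 and 1.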


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-17 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899609#action_12899609
 ] 

Yan Zhou commented on PIG-1518:
---

The formatting of the table of the last comment is a bit off: both headers
should be right-shifted by one column.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-14 Thread Ashutosh Chauhan (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898648#action_12898648
]

Ashutosh Chauhan commented on PIG-1518:
---

This feature of combining multiple splits should honor the OrderedLoadFunc
interface. If a loadfunc implements that interface, then the splits generated by
it should not be combined. However, it's not clear why FileInputLoadFunc
implements this interface. AFAIK, the split[] returned by getSplits() on
FileInputFormat makes no guarantee that the underlying splits will be returned
in an ordered fashion. That is the default behavior right now, though, so having
it implement OrderedLoadFunc doesn't cause any problem in the current
implementation. But there seems to be no real benefit in FileInputLoadFunc
implementing it (with one exception, which I will come to below).
So I will argue that FileInputLoadFunc should stop implementing OrderedLoadFunc.
This has the immediate benefit of making this change useful to all the
fundamental storage mechanisms of Pig, like PigStorage, BinStorage, InterStorage,
etc. Dropping an interface from an implementing class can be seen as a
backward-incompatible change, but I really doubt that anyone cares whether
PigStorage reads splits in an ordered fashion.

The only real victim of this change will be MergeJoin, which will stop working
with PigStorage by default. But, first, we have not seen MergeJoin being used
with PigStorage in many places. Second, it is in any case based on an assumption
about FileInputFormat, which may choose to change its behavior in the future.
Third, the solution to this problem is straightforward: have another loader that
extends PigStorage and implements OrderedLoadFunc, which can be used to load
data for merge join.

In essence, I am arguing for dropping the OrderedLoadFunc interface from
FileInputLoadFunc so that this feature is useful for a large number of use cases.

Yan, you also need to watch out for ReadToEndLoader, which also makes
assumptions that may break in the presence of this feature.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-13 Thread Yan Zhou (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898490#action_12898490
]

Yan Zhou commented on PIG-1518:
---

There is a bigger question at hand. The semantics of OrderedLoadFunc is that
the splits are totally ordered. And BinStorage, InterStorage and PigStorage all
implement that interface through FileInputLoadFunc. Since the combination of
splits as conceived here will definitely destroy the split ordering, if the
combination is disabled for these storages, the feature would be virtually
useless for a majority of use cases.

On the other hand, I'm seeing no use of the comparison capability except for
MergeJoinIndexer's getNext() method, which makes me wonder if the
OrderedLoadFunc can be removed from the FileInputLoadFunc. Semantically,
FileInputLoadFunc should not support the ordering of splits, as Hadoop's
FileInputFormat doesn't. When a need arises like in MergeJoinIndexer, we can
add that extension on. But the change may incur some backward compatibility
issues.
I'm now soliciting comments in this area.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-12 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897887#action_12897887
 ] 

Yan Zhou commented on PIG-1518:
---

During the merge process, any empty splits will be skipped. Currently, empty
splits are generated for empty files, which is not necessary in the first
place.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-11 Thread Alan Gates (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897368#action_12897368
]

Alan Gates commented on PIG-1518:
-

bq. For mapside cogroup or mapside group by, though, the splits can be combined
because the splits are only required to contain all the duplicate keys per
instance and combination of splits will still preserve that invariant.

You are correct for mapside group, but not mapside cogroup. Mapside cogroup
does require all files being grouped to be processed in an ordered fashion.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-11 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897493#action_12897493
 ] 

Yan Zhou commented on PIG-1518:
---

Right, map side cogroup needs the sortedness of the input, but only the side
inputs need the ability to seek on a key; the base input only needs all
duplicate keys to be present in a mapper. I'll mark the side inputs as
non-combinable.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-10 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897085#action_12897085
 ] 

Yan Zhou commented on PIG-1518:
---

The pseudo code of the combination op is as follows:

for each node of the nodes (sorted in the order of ascending sizes) {
  while the node's split list (sorted in the order of descending sizes) is not empty {
    find the biggest splits that can be combined with the first split of the
    list of the splits;
    if the accumulated split size is >= half of the limit {
      generate a combined split;
      remove the accumulated splits from the node's split list;
      clear the accumulated split list;
    } else {
      break;
    }
  }
}

// leftover combination
for each node of the nodes {
  for each split of the node's split list {
    add the split to a leftover list;
  }
}

for each split in the leftover list {
  if the accumulated split size is >= the limit {
    generate a combined split;
    remove the accumulated splits from the node's split list;
    clear the accumulated split list;
  }
  if it is the last split in the leftover list {
    try to see if it can be added to an existing combined split;
    if not, generate a combined split on the accumulated splits;
  }
}

The complexity is n*log(n) with n being the number of original splits that are 
smaller than the limit.
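
The pseudo code above can be approximated by the following runnable Python sketch. It is a simplified illustration, not the code in the patch: splits are modeled as (name, size, hosts) tuples, combine_splits is a hypothetical name, nodes are ordered by their number of splits, the leftover pass closes a group just before it would exceed the limit, and the final "add to an existing combined split" step is omitted.

```python
from collections import defaultdict


def combine_splits(splits, limit):
    """Group (name, size, hosts) splits into lists of split names."""
    by_host = defaultdict(list)
    for split in splits:
        if split[1] == 0:
            continue  # empty splits are skipped entirely
        for host in split[2]:
            by_host[host].append(split)

    used, combined = set(), []

    # Pass 1: node-local combination, hosts with the fewest splits first.
    for host in sorted(by_host, key=lambda h: (len(by_host[h]), h)):
        pending = sorted((s for s in by_host[host] if s[0] not in used),
                         key=lambda s: s[1], reverse=True)
        while pending:
            # Start from the biggest split and greedily add the biggest
            # remaining splits that still fit under the limit.
            group, total = [pending[0]], pending[0][1]
            for candidate in pending[1:]:
                if total + candidate[1] <= limit:
                    group.append(candidate)
                    total += candidate[1]
            if total < limit / 2:
                break  # too small: leave the rest for the leftover pass
            combined.append([s[0] for s in group])
            used.update(s[0] for s in group)
            pending = [s for s in pending if s[0] not in used]

    # Pass 2: combine leftovers without regard to locality.
    group, total = [], 0
    for name, size, _hosts in splits:
        if size == 0 or name in used:
            continue
        if group and total + size > limit:
            combined.append(group)
            group, total = [], 0
        group.append(name)
        total += size
    if group:
        combined.append(group)
    return combined


example_splits = [
    ("a", 40, ["n1"]), ("b", 30, ["n1"]), ("c", 50, ["n2"]),
    ("d", 10, ["n2"]), ("e", 0, ["n2"]), ("f", 20, ["n3"]),
    ("g", 45, ["n2"]),
]
print(combine_splits(example_splits, 100))
# [['a', 'b'], ['c', 'g'], ['d', 'f']]
```

In this example the empty split e is dropped, a+b and c+g are combined node-locally, and the small leftovers d and f end up together regardless of their hosts.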


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-02 Thread Yan Zhou (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894778#action_12894778
]

Yan Zhou commented on PIG-1518:
---

In contrast with Hive, where the CombineFileInputFormat is used to generate
input splits on the underlying storage formats, Pig's combined splits work
on top of the splits generated by the underlying loaders. In other words,
Hive's input splits are CombineFileSplits that create record readers of
underlying storage formats, while Pig's combined input splits contain
the underlying storage's splits.

CombineFileRecordReader would have been reusable if it were not supported only
in 0.18 and did not need a CombineFileSplit as an argument to its constructor
instead of an InputSplit (MAPREDUCE-955).


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-07-30 Thread Yan Zhou (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894205#action_12894205
]

Yan Zhou commented on PIG-1518:
---

CombinedInputFormat, in lieu of the deprecated MultiFileInputFormat, batches
small files on the basis of block locality. For Pig, this umbrella input format
will have to work with generic input formats for which the block info is
unavailable but the data node and size info are present to let M/R make
scheduling decisions. In other words, Pig cannot break up the original splits
and work inside them, but can only use the original splits as building blocks
for the combined input splits.

Consequently, this combined input format will hold multiple generic input
splits so that each combined split's size is bounded by a configured limit of,
say, pig.maxsplitsize, with a default value of the HDFS block size of the
file system the load source sits in.

However, due to the constraints of sortedness on the tables in merge join, the
split combination will not be used for any loads that are used in merge
join. For mapside cogroup or mapside group by, though, the splits can be
combined because the splits are only required to contain all the duplicate keys
per instance and combination of splits will still preserve that invariant.

During combination, the splits on the same data nodes will be merged as much as
possible. Leftovers will be merged without regard to data locality. Of
all the used data nodes, those with fewer splits will be merged before
those with more splits so as to minimize the leftovers on the data nodes with
fewer splits. On each data node, a greedy approach is adopted so that the
largest splits are merged before smaller ones. This is because smaller splits
are more easily merged later among themselves.

As a result, in the implementation, a sorted list of data hosts (on the number
of splits) of sorted lists (on the split size) of the original splits will be
maintained to efficiently perform the above operations. The complexity should
be linear in the number of the original splits.
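
The nested structure described here, hosts ordered by their number of splits with each host's splits ordered by size, might look like the following Python sketch. It only illustrates the data structure; the (name, size, hosts) tuples and the function name hosts_by_split_count are assumptions made for the example, and the per-host lists are ordered largest first to match the greedy step.

```python
from collections import defaultdict


def hosts_by_split_count(splits):
    """Return [(host, splits_on_host)] with hosts ordered by split count
    (fewest first) and each host's splits ordered by size (largest first)."""
    per_host = defaultdict(list)
    for split in splits:
        for host in split[2]:
            per_host[host].append(split)

    ordered = []
    for host in sorted(per_host, key=lambda h: (len(per_host[h]), h)):
        by_size = sorted(per_host[host], key=lambda s: s[1], reverse=True)
        ordered.append((host, by_size))
    return ordered


splits = [("s1", 10, ["n1", "n2"]), ("s2", 70, ["n2"]), ("s3", 30, ["n2"])]
for host, host_splits in hosts_by_split_count(splits):
    print(host, [name for name, _size, _hosts in host_splits])
# n1 ['s1']
# n2 ['s2', 's3', 's1']
```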

Note that for data locality, we just honor whatever the generic input split's
getLocations() method produces. Any particular input split implementation
may or may not actually hold that property. For instance, CombinedInputFormat
will combine node-local or rack-local blocks into a split. Essentially, this Pig
container input split works on whatever data locality perception the underlying
loader provides.

On the implementation side, PigSplit will not hold a single wrapped InputSplit
instance but a new CombinedInputSplit instance. Accordingly, PigRecordReader
will hold a list of wrapped record readers and not just a single one.
Correspondingly, PigRecordReader's nextKeyValue() will use the wrapped record
readers, in order, to fetch the next values.
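
The idea of one reader walking a list of wrapped readers can be sketched as below. This is a Python illustration of the chaining technique only, not Pig's PigRecordReader: ListReader and ChainedReader are hypothetical classes, and next_key_value() merely mirrors the shape of nextKeyValue().

```python
class ListReader:
    """Hypothetical stand-in for one wrapped record reader."""

    def __init__(self, records):
        self._records = iter(records)
        self.current = None

    def next_key_value(self):
        try:
            self.current = next(self._records)
            return True
        except StopIteration:
            return False


class ChainedReader:
    """Reads from a list of wrapped readers, one after another."""

    def __init__(self, readers):
        self._readers = list(readers)
        self._index = 0
        self.current = None

    def next_key_value(self):
        while self._index < len(self._readers):
            reader = self._readers[self._index]
            if reader.next_key_value():
                self.current = reader.current
                return True
            self._index += 1  # this wrapped reader is exhausted; move on
        return False


reader = ChainedReader([ListReader(["a", "b"]), ListReader([]), ListReader(["c"])])
records = []
while reader.next_key_value():
    records.append(reader.current)
print(records)  # ['a', 'b', 'c']
```

An exhausted or empty wrapped reader is simply skipped, so the caller sees a single continuous stream of records.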

Risks include 1) the test verifications may need major changes since this
optimization may cause major ordering changes in results; 2) since
LoadFunc.prepareRead() takes a PigSplit argument, there might be a backward
compatibility issue as PigSplit changes its wrapped input split to the combined
input split. But this should be very unlikely as the only known
use of the PigSplit argument is the internal index loader for the right
table in merge join.

