[jira] Created: (PIG-1061) exception is thrown when there is bincond in foreach after group by

2009-10-29 Thread Ying He (JIRA)
exception is thrown when there is bincond in foreach after group by
---

 Key: PIG-1061
 URL: https://issues.apache.org/jira/browse/PIG-1061
 Project: Pig
  Issue Type: Bug
Reporter: Ying He


The following statement throws an exception:

A = load 'a.txt' as (id, c);
B = group A by id;
C = foreach B generate group, COUNT(A)>0?'a','b';

The parser doesn't recognize the UDF in the bincond.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

2009-10-30 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771979#action_12771979
 ] 

Ying He commented on PIG-1062:
--

I would suggest adding the total number of tuples in a split to the last sample 
as an extra field. All other sample tuples can have this field set to NULL. Then 
PartitionSkewedKey.calculateReducers can add up this field across all the 
samples to get the total number of input tuples.

If we used a separate tuple with a different format to represent the total 
number of tuples, that would be a bigger change. The sampling job currently adds 
an "all" key to every sample to group them into one bag, and then sorts the 
tuples by key. If tuples had different formats, the execution plan would have to 
become more complex to deal with these special tuples.
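The suggestion above can be sketched language-independently (Python here; the field names are illustrative, not the actual Pig tuple layout):

```python
# Sketch of the proposal: every sample tuple carries an extra field that is
# None (NULL), except the last sample of each split, which holds that
# split's total row count.
def total_input_rows(samples):
    """What a calculateReducers-style method would do: sum the row-count
    field across all samples, ignoring the NULL (None) entries."""
    return sum(s["num_rows"] for s in samples if s["num_rows"] is not None)

samples = [
    {"key": "a", "num_rows": None},   # ordinary sample
    {"key": "b", "num_rows": None},
    {"key": "c", "num_rows": 1000},   # last sample of split 1
    {"key": "d", "num_rows": None},
    {"key": "e", "num_rows": 2500},   # last sample of split 2
]
print(total_input_rows(samples))  # 3500
```

Because ordinary samples just carry a NULL field, the sort-by-key plan described above does not have to special-case any tuple.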

> load-store-redesign branch: change SampleLoader and subclasses to work with 
> new LoadFunc interface 
> ---
>
> Key: PIG-1062
> URL: https://issues.apache.org/jira/browse/PIG-1062
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Thejas M Nair
>
> This is part of the effort to implement new load store interfaces as laid out 
> in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
> PigStorage and BinStorage are now working.
> SampleLoader and subclasses -RandomSampleLoader, PoissonSampleLoader need to 
> be changed to work with new LoadFunc interface.  
> Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
> PoissonSampleLoader is used by skew join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-979) Acummulator Interface for UDFs

2009-11-09 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775054#action_12775054
 ] 

Ying He commented on PIG-979:
-

Without the patch from PIG-1038, this patch won't compile, so all tests would fail.

> Acummulator Interface for UDFs
> --
>
> Key: PIG-979
> URL: https://issues.apache.org/jira/browse/PIG-979
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: PIG-979.patch
>
>
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-979) Acummulator Interface for UDFs

2009-11-09 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775184#action_12775184
 ] 

Ying He commented on PIG-979:
-

Alan, thanks for the feedback.

1. A test case that mixes an accumulator UDF with a regular UDF is already in 
testAccumBasic().

2. The optimizer can't be applied when inner is set on POPackage: when an inner 
is set, POPackage checks whether the bag for that input is NULL and, if it is, 
returns NULL. That check can only be done once all the tuples have been 
retrieved and put into a bag.

3 & 4. Will fix.

5. Needs performance testing.

6. The reducer gets results from POPackage and passes them to the root of the 
reduce plan, which is POForEach. From POForEach's perspective, it gets a tuple 
containing bags from POPackage; POForEach then retrieves tuples from the 
iterator and passes them to the UDFs over multiple cycles. Because only 
POPackage knows how to read tuples off the iterator and put them into the 
proper bags, AccumulativeTupleBuffer and AccumulativeBag were created to 
communicate between POPackage and POForEach. Every time POForEach calls 
getNextBatch() on AccumulativeTupleBuffer, it in effect calls an inner class of 
POPackage to retrieve tuples from the iterator.

POPackage cannot be the one to batch the reading of tuples, because it is only 
called once from the reducer. I also considered changing the reducer to call 
POPackage multiple times, once per batch of data, but then it becomes tricky to 
maintain correct operator state, and every operator in the reduce plan would 
have to support partial data, which is unnecessary. 
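The accumulate-then-finalize flow described above can be sketched in a few lines of Python (names mirror, but are not, the actual Pig classes; the batching stand-in replaces the real reduce-side iterator):

```python
# Minimal sketch of the accumulator contract: the UDF sees one batch of
# tuples at a time instead of the whole bag, then is asked for its value.
class CountAccumulator:
    def __init__(self):
        self.count = 0

    def accumulate(self, batch):   # called once per batch of tuples
        self.count += len(batch)

    def get_value(self):           # called after the last batch
        return self.count

def batches(tuples, batch_size):
    """Stand-in for AccumulativeTupleBuffer.getNextBatch(): pull tuples
    off the reduce-side input a batch at a time instead of materializing
    the whole bag in memory."""
    for i in range(0, len(tuples), batch_size):
        yield tuples[i:i + batch_size]

acc = CountAccumulator()
for batch in batches(list(range(10)), batch_size=3):
    acc.accumulate(batch)
print(acc.get_value())  # 10
```

The point of the design is visible here: only the buffer (POPackage's inner class, in the real code) knows how to produce batches, while the foreach operator just drives accumulate() and get_value().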

> Acummulator Interface for UDFs
> --
>
> Key: PIG-979
> URL: https://issues.apache.org/jira/browse/PIG-979
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: PIG-979.patch
>
>
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-979) Acummulator Interface for UDFs

2009-11-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-979:


Attachment: PIG-979.patch

patch to address Alan's comments. 

> Acummulator Interface for UDFs
> --
>
> Key: PIG-979
> URL: https://issues.apache.org/jira/browse/PIG-979
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: PIG-979.patch, PIG-979.patch
>
>
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-979) Acummulator Interface for UDFs

2009-11-11 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776760#action_12776760
 ] 

Ying He commented on PIG-979:
-

Performance tests don't show a noticeable difference between trunk and the 
accumulator patch when calling non-accumulator UDFs.

The performance test script:

register /homes/yinghe/pig_test/pigperf.jar;
register /homes/yinghe/pig_test/string.jar;
register /homes/yinghe/pig_test/piggybank.jar;

A = load '/user/pig/tests/data/pigmix_large/page_views' using 
org.apache.pig.test.utils.datagen.PigPerformanceLoader() as (user, action, 
timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, 
page_links);

B = foreach A generate user, 
org.apache.pig.piggybank.evaluation.string.STRINGCAT(user, ip_addr) as id;

C = group B by id parallel 10;

D = foreach C {
generate group, string.BagCount2(B)*string.ColumnLen2(B, 0);
}

store D into 'test2';

The input data has 100M rows and the output has 57M rows, so the UDFs are 
called 57M times.
The results:

 with patch:  5min 14sec
 w/o patch:   5min 17sec

> Acummulator Interface for UDFs
> --
>
> Key: PIG-979
> URL: https://issues.apache.org/jira/browse/PIG-979
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: PIG-979.patch, PIG-979.patch
>
>
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-979) Acummulator Interface for UDFs

2009-11-12 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777090#action_12777090
 ] 

Ying He commented on PIG-979:
-

The release audit warnings are all from HTML files.

> Acummulator Interface for UDFs
> --
>
> Key: PIG-979
> URL: https://issues.apache.org/jira/browse/PIG-979
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.4.0
>Reporter: Alan Gates
>Assignee: Ying He
> Fix For: 0.6.0
>
> Attachments: PIG-979.patch, PIG-979.patch
>
>
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1118) expression with aggregate functions returning null, with accumulate interface

2009-12-02 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1118:
-

Attachment: PIG_1118.patch

bug fix.

> expression with aggregate functions returning null, with accumulate interface
> -
>
> Key: PIG-1118
> URL: https://issues.apache.org/jira/browse/PIG-1118
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Ying He
> Fix For: 0.7.0
>
> Attachments: PIG_1118.patch
>
>
> The problem is in trunk. It works fine in the 0.6 branch.
> l = load '/tmp/students.txt' as (a : chararray,b : chararray,c : int);
> grunt> g = group l by 1;
> grunt> dump g;
> (1,{(asdfxc,M,23),(qwer,F,21),(uhsdf,M,34),(zxldf,M,21),(qwer,F,23),(oiue,M,54)})
> grunt> f = foreach g generate SUM(l.c), 1 + SUM(l.c) + SUM(l.c);
> grunt> dump f;
> (176L,)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1118) expression with aggregate functions returning null, with accumulate interface

2009-12-02 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785043#action_12785043
 ] 

Ying He commented on PIG-1118:
--

Olga, thanks for the review. A unit test is in the patch, in TestAccumulator. 

> expression with aggregate functions returning null, with accumulate interface
> -
>
> Key: PIG-1118
> URL: https://issues.apache.org/jira/browse/PIG-1118
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Ying He
> Fix For: 0.7.0
>
> Attachments: PIG_1118.patch
>
>
> The problem is in trunk. It works fine in the 0.6 branch.
> l = load '/tmp/students.txt' as (a : chararray,b : chararray,c : int);
> grunt> g = group l by 1;
> grunt> dump g;
> (1,{(asdfxc,M,23),(qwer,F,21),(uhsdf,M,34),(zxldf,M,21),(qwer,F,23),(oiue,M,54)})
> grunt> f = foreach g generate SUM(l.c), 1 + SUM(l.c) + SUM(l.c);
> grunt> dump f;
> (176L,)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2009-12-03 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-480:


Attachment: PIG_480.patch

patch to use an identity map. 

An IdentityMapOptimizer is applied when an MR plan contains at least 2 MR jobs. 
It evaluates each MR job: if its reducer uses a POStore to write a tmp file, 
and the mapper of the next MR job contains only a POLocalRearrange and a POLoad 
that loads that tmp file, then the POLocalRearrange of the next mapper is moved 
up into the reducer of this job, and the mapper of the next MR job is changed 
to an identity map.

In this case, the reducer of the MR job outputs (key, tuple) pairs to the tmp 
file using a different OutputFormat, PigBinaryValueOutputFormat. It uses a 
different record writer to dump data; the record format is

delimiter (3 bytes: 0x01, 0x02, 0x03)
key
length of the tuple's byte[]
byte[] for the tuple

The next MR job, which uses the identity map, uses a different InputFormat, 
PigBinaryValueInputFormat, which returns a different RecordReader to read the 
data back as (key, tuple) pairs; the tuple is kept in byte[] form. The identity 
map does nothing except pass the (key, tuple) pairs through and write them to 
disk. When the reducer picks them up, the tuple is deserialized for processing.

The reason for doing this is performance: because the tuples read into and 
written out of the identity map stay in byte[] form, we save one 
deserialization and one serialization of each tuple in the mapper.
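A minimal round-trip of that record format can be sketched in Python (the real writer is Java; the 4-byte big-endian length prefix for the key is an assumption made here only so the sketch parses, since the format above doesn't specify the key encoding):

```python
import io
import struct

DELIM = b"\x01\x02\x03"  # the 3-byte delimiter described above

def write_record(out, key, tuple_bytes):
    """delimiter | key | tuple length | tuple bytes.
    ASSUMPTION: the key is written with its own 4-byte length prefix;
    the actual PigBinaryValueOutputFormat key encoding may differ."""
    out.write(DELIM)
    out.write(struct.pack(">i", len(key)))
    out.write(key)
    out.write(struct.pack(">i", len(tuple_bytes)))
    out.write(tuple_bytes)

def read_record(inp):
    """Read one record back; the tuple stays as raw bytes, mirroring how
    the identity map passes tuples through without deserializing them."""
    assert inp.read(3) == DELIM
    (klen,) = struct.unpack(">i", inp.read(4))
    key = inp.read(klen)
    (tlen,) = struct.unpack(">i", inp.read(4))
    return key, inp.read(tlen)

buf = io.BytesIO()
write_record(buf, b"k1", b"serialized-tuple")
buf.seek(0)
print(read_record(buf))  # (b'k1', b'serialized-tuple')
```

Keeping the tuple as an opaque byte[] end to end is exactly what lets the identity map skip the deserialize/serialize round trip.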

A use case is the following:

a = load 'f' as (id, v);
b = load 's' as (id, v);
c = join a by id, b by id;
d = group c by a::id;
dump d;

This example contains 2 MR jobs. After optimization, the first job outputs 
(key, tuple) pairs, and the second job uses an identity map.


> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> ---
>
> Key: PIG-480
> URL: https://issues.apache.org/jira/browse/PIG-480
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
> Attachments: PIG_480.patch
>
>
> For jobs with two or more MR jobs, use the identity mapper wherever possible 
> in second and subsequent MR jobs. The identity mapper is about 50% faster than 
> the Pig empty map job because it doesn't parse the data. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2009-12-04 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-480:


Status: Open  (was: Patch Available)

This patch conflicts with new code that was just checked in, which results in a 
compilation error.

> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> ---
>
> Key: PIG-480
> URL: https://issues.apache.org/jira/browse/PIG-480
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: PIG_480.patch, PIG_480.patch
>
>
> For jobs with two or more MR jobs, use the identity mapper wherever possible 
> in second and subsequent MR jobs. The identity mapper is about 50% faster than 
> the Pig empty map job because it doesn't parse the data. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2009-12-04 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-480:


Attachment: PIG_480.patch

fix the compilation error.

> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> ---
>
> Key: PIG-480
> URL: https://issues.apache.org/jira/browse/PIG-480
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: PIG_480.patch, PIG_480.patch
>
>
> For jobs with two or more MR jobs, use the identity mapper wherever possible 
> in second and subsequent MR jobs. The identity mapper is about 50% faster than 
> the Pig empty map job because it doesn't parse the data. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2009-12-07 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787060#action_12787060
 ] 

Ying He commented on PIG-480:
-

The javac warnings are caused by references to deprecated Hadoop APIs. The 
release audit warning is for an HTML file.

> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> ---
>
> Key: PIG-480
> URL: https://issues.apache.org/jira/browse/PIG-480
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: PIG_480.patch, PIG_480.patch
>
>
> For jobs with two or more MR jobs, use the identity mapper wherever possible 
> in second and subsequent MR jobs. The identity mapper is about 50% faster than 
> the Pig empty map job because it doesn't parse the data. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1135) skewed join partitioner returns negative partition index

2009-12-08 Thread Ying He (JIRA)
skewed join partitioner returns negative partition index 
-

 Key: PIG-1135
 URL: https://issues.apache.org/jira/browse/PIG-1135
 Project: Pig
  Issue Type: Improvement
Reporter: Ying He
Assignee: Sriranjan Manjunath


Fragmented replicated join has a few limitations:
 - One of the tables needs to be loaded into memory
 - Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It 
computes a histogram of the key space to account for skewing in the input 
records. Further, it adjusts the number of reducers depending on the key 
distribution.

We need to implement the skewed join in pig.
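The idea of adjusting reducers from a key histogram can be illustrated with a toy sketch (plain Python; the function, its parameters, and the ceiling-division heuristic are all hypothetical, not Pig's actual skewed-join algorithm):

```python
from collections import Counter

def reducers_per_key(keys, tuples_per_reducer):
    """Toy version of 'adjust the number of reducers depending on the key
    distribution': build a histogram of the key space and give each key
    enough reducers to hold its tuples (ceiling division)."""
    hist = Counter(keys)  # histogram of the key space
    return {k: max(1, -(-n // tuples_per_reducer)) for k, n in hist.items()}

# A heavily skewed key space: 'a' gets 10 reducers, 'b' gets 1.
print(reducers_per_key(["a"] * 95 + ["b"] * 5, tuples_per_reducer=10))
```

A skewed key is thus spread across several reducers instead of overwhelming one, which is what distinguishes skewed join from a plain hash-partitioned join.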

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1135) skewed join partitioner returns negative partition index

2009-12-08 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1135:
-

Description: skewed join returns negative reducer index  (was: Fragmented 
replicated join has a few limitations:
 - One of the tables needs to be loaded into memory
 - Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It 
computes a histogram of the key space to account for skewing in the input 
records. Further, it adjusts the number of reducers depending on the key 
distribution.

We need to implement the skewed join in pig.)

> skewed join partitioner returns negative partition index 
> -
>
> Key: PIG-1135
> URL: https://issues.apache.org/jira/browse/PIG-1135
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
>Assignee: Sriranjan Manjunath
>
> skewed join returns negative reducer index

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1135) skewed join partitioner returns negative partition index

2009-12-08 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1135:
-

Status: Patch Available  (was: Open)

> skewed join partitioner returns negative partition index 
> -
>
> Key: PIG-1135
> URL: https://issues.apache.org/jira/browse/PIG-1135
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
>Assignee: Sriranjan Manjunath
> Attachments: PIG_1135.patch
>
>
> skewed join returns negative reducer index

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1135) skewed join partitioner returns negative partition index

2009-12-08 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1135:
-

Attachment: PIG_1135.patch

If the reducer index is greater than 127, the index of the streaming table 
becomes negative because a signed byte is used as the data type.
It is fixed by changing the partition index from byte to integer. 
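The overflow is easy to reproduce (Python sketch of Java's signed-byte narrowing; the helper name is illustrative):

```python
# Why a byte-sized partition index breaks past 127: Java's byte is signed,
# so casting an int index to byte wraps into negative values.
def to_signed_byte(n):
    """Interpret the low 8 bits of n the way Java's (byte) cast does."""
    b = n & 0xFF
    return b - 256 if b > 127 else b

print(to_signed_byte(127))  # 127   (still fine)
print(to_signed_byte(130))  # -126  (negative partition index)
```

Widening the field to an int removes the wraparound entirely, which is the fix taken in the patch.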

> skewed join partitioner returns negative partition index 
> -
>
> Key: PIG-1135
> URL: https://issues.apache.org/jira/browse/PIG-1135
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
>Assignee: Sriranjan Manjunath
> Attachments: PIG_1135.patch
>
>
> skewed join returns negative reducer index

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2010-01-06 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-480:


Status: Open  (was: Patch Available)

Cancelling this patch to attach a new patch that supports the combiner.

> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> ---
>
> Key: PIG-480
> URL: https://issues.apache.org/jira/browse/PIG-480
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: PIG_480.patch, PIG_480.patch, PIG_480.patch
>
>
> For jobs with two or more MR jobs, use the identity mapper wherever possible 
> in second and subsequent MR jobs. The identity mapper is about 50% faster than 
> the Pig empty map job because it doesn't parse the data. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2010-01-06 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-480:


Attachment: PIG_480.patch

add support for combiner

> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> ---
>
> Key: PIG-480
> URL: https://issues.apache.org/jira/browse/PIG-480
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: PIG_480.patch, PIG_480.patch, PIG_480.patch
>
>
> For jobs with two or more MR jobs, use identity mapper wherever possible in 
> second and subsequent MR jobs. Identity mapper is about 50% than pig empty 
> map job because it doesn't parse the data. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2010-01-06 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797412#action_12797412
 ] 

Ying He commented on PIG-480:
-

I did more performance tests. They show that performance depends on the nature 
of the data: if the data is skewed, performance is very bad in the combiner 
case; if the data is uniform, the combiner case gets the biggest performance 
gain. The test uses a join followed by a group-by statement.

For skewed data, if I use skewed join, the result is much better. I think the 
reason for the bad performance on skewed data is that the map plan of the 
second job is moved into the reducer of the first job. If the data is skewed, a 
single reducer has to execute that extra logic for all of its tuples, whereas 
without this patch that logic would be executed across multiple mappers. So we 
lose parallelism, and the more skewed the data is, the worse the performance 
gets. 

1. skewed data

combiner:
            job 1       job 2       total
   patch    7min 53sec  1min 1sec   8min 54sec
   trunk    4min 43sec  1min 37sec  6min 20sec

combiner and using skewed join:
   patch    1min 55sec  1min 1sec   2min 56sec
   trunk    1min 44sec  1min 40sec  3min 24sec

no combiner:
   patch    2min 26sec  2min 28sec  4min 54sec
   trunk    1min 25sec  3min 24sec  4min 49sec

no combiner and using skewed join:
   patch    1min 17sec  3min 5sec   4min 22sec
   trunk    59sec       3min 7sec   4min 6sec

2. uniform data

combiner:
   patch    6min 48sec  3min 43sec  10min 31sec
   trunk    7min 32sec  7min 3sec   14min 35sec

no combiner:
   patch    1min 25sec  2min 25sec  3min 50sec
   trunk    1min 24sec  2min 28sec  3min 52sec

Each group of tests may use different data, so don't make cross-group 
comparisons.


> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> ---
>
> Key: PIG-480
> URL: https://issues.apache.org/jira/browse/PIG-480
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: PIG_480.patch, PIG_480.patch, PIG_480.patch
>
>
> For jobs with two or more MR jobs, use the identity mapper wherever possible 
> in second and subsequent MR jobs. The identity mapper is about 50% faster than 
> the Pig empty map job because it doesn't parse the data. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2010-01-06 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-480:


Status: Patch Available  (was: Open)

> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> ---
>
> Key: PIG-480
> URL: https://issues.apache.org/jira/browse/PIG-480
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: PIG_480.patch, PIG_480.patch, PIG_480.patch
>
>
> For jobs with two or more MR jobs, use the identity mapper wherever possible 
> in second and subsequent MR jobs. The identity mapper is about 50% faster than 
> the Pig empty map job because it doesn't parse the data. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-01-07 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1178:
-

Attachment: lp.patch

Initial drop of the new logical plan framework.

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
> Attachments: lp.patch
>
>
> The current implementation of the logical plan and the logical optimizer in 
> Pig has proven to not be easily extensible. Developer feedback has indicated 
> that adding new rules to the optimizer is quite burdensome. In addition, the 
> logical plan has been an area of numerous bugs, many of which have been 
> difficult to fix. Developers also feel that the logical plan is difficult to 
> understand and maintain. The root cause for these issues is that a number of 
> design decisions that were made as part of the 0.2 rewrite of the front end 
> have now proven to be sub-optimal. The heart of this proposal is to revisit a 
> number of those proposals and rebuild the logical plan with a simpler design 
> that will make it much easier to maintain the logical plan as well as extend 
> the logical optimizer. 
> See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
> details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-01-08 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798273#action_12798273
 ] 

Ying He commented on PIG-1178:
--

Yes, the operator plan that Rule.match returns holds the same Java objects as 
the original plan. So, as Alan said, it's a bit confusing; the plan returned by 
Rule.match is closer to a view.

Test cases for the two patterns will be added later.

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
> Attachments: lp.patch
>
>
> The current implementation of the logical plan and the logical optimizer in 
> Pig has proven to not be easily extensible. Developer feedback has indicated 
> that adding new rules to the optimizer is quite burdensome. In addition, the 
> logical plan has been an area of numerous bugs, many of which have been 
> difficult to fix. Developers also feel that the logical plan is difficult to 
> understand and maintain. The root cause for these issues is that a number of 
> design decisions that were made as part of the 0.2 rewrite of the front end 
> have now proven to be sub-optimal. The heart of this proposal is to revisit a 
> number of those proposals and rebuild the logical plan with a simpler design 
> that will make it much easier to maintain the logical plan as well as extend 
> the logical optimizer. 
> See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
> details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-01-11 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798880#action_12798880
 ] 

Ying He commented on PIG-1178:
--

Rule.match() finds a potential match and delegates to PatternMatchOperatorPlan 
to verify that it matches the pattern. During this verification, the plan is 
filled with the operators from the original plan. PatternMatchOperatorPlan is 
not visible to rule writers; rule writers should only use OperatorPlan to 
operate on the matched sub-plans.

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: lp.patch



[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-01-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1178:
-

Attachment: PIG_1178.patch

Add test cases to TestExperimentalRule and fix FindBugs warnings.

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: lp.patch, PIG_1178.patch



[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-01-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1178:
-

Status: Patch Available  (was: Open)

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: lp.patch, PIG_1178.patch



[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2010-01-12 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799370#action_12799370
 ] 

Ying He commented on PIG-480:
-

I did some tests with a larger data set, and the results are consistent 
with what we saw before. I didn't run skewed data with no combiner, 
because it kept running out of space.

1. skewed data

combiner:
         job 1        job 2        total
patch    46min        3min 38sec   49min 38sec
trunk    24min 32sec  6min 53sec   31min 25sec

combiner and skewed join:
         job 1        job 2        total
patch    6min 40sec   3min 58sec   10min 38sec
trunk    8min 41sec   8min 32sec   17min 13sec

2. uniform data

combiner:
         job 1        job 2        total
patch    13min 18sec  7min 9sec    20min 27sec
trunk    19min 1sec   13min 25sec  32min 26sec

no combiner:
         job 1        job 2        total
patch    18min 21sec  37min 4sec   55min 25sec
trunk    16min 31sec  40min 3sec   56min 34sec

> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> ---
>
> Key: PIG-480
> URL: https://issues.apache.org/jira/browse/PIG-480
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: PIG_480.patch, PIG_480.patch, PIG_480.patch
>
>
> For jobs with two or more MR jobs, use identity mapper wherever possible in 
> second and subsequent MR jobs. Identity mapper is about 50% faster than an 
> empty Pig map job because it doesn't parse the data. 




[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2010-01-12 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799376#action_12799376
 ] 

Ying He commented on PIG-480:
-

the option to turn it off is already there. Use
-Dopt.identitymap=false 

to turn it off.
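Since the flag is passed as a plain JVM system property, it can be read as sketched below. This is illustrative only, not Pig's actual plumbing; only the opt.identitymap name comes from the comment above.

```java
public class IdentityMapFlag {
    // Illustrative sketch: read -Dopt.identitymap, defaulting to enabled.
    static boolean identityMapEnabled() {
        return Boolean.parseBoolean(System.getProperty("opt.identitymap", "true"));
    }

    public static void main(String[] args) {
        // equivalent to launching with -Dopt.identitymap=false
        System.setProperty("opt.identitymap", "false");
        System.out.println(identityMapEnabled()); // false: optimization turned off
    }
}
```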

> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> ---
>
> Key: PIG-480
> URL: https://issues.apache.org/jira/browse/PIG-480
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: PIG_480.patch, PIG_480.patch, PIG_480.patch



[jira] Created: (PIG-1185) Data bags do not close spill files after using iterator to read tuples

2010-01-12 Thread Ying He (JIRA)
Data bags do not close spill files after using iterator to read tuples
--

 Key: PIG-1185
 URL: https://issues.apache.org/jira/browse/PIG-1185
 Project: Pig
  Issue Type: Bug
Reporter: Ying He


Spill files are not closed after their tuples are read through the iterator. 
When a large number of spill files exists, this can exceed the system's maximum 
number of open files and therefore cause application failure.
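A hypothetical sketch of the fix (none of this is Pig's actual DataBag code): an iterator that drains one spill file at a time and closes each stream as soon as it is exhausted, so the number of open file handles stays bounded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

public class ClosingSpillIterator implements Iterator<String> {
    private final Deque<Path> files;
    private BufferedReader current;
    private String next;

    ClosingSpillIterator(List<Path> spillFiles) {
        files = new ArrayDeque<>(spillFiles);
        advance();
    }

    private void advance() {
        try {
            while (true) {
                if (current == null) {
                    if (files.isEmpty()) { next = null; return; }
                    current = Files.newBufferedReader(files.poll());
                }
                next = current.readLine();
                if (next != null) return;
                current.close();   // the key point: close a drained spill file
                current = null;
            }
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public boolean hasNext() { return next != null; }
    public String next() { String n = next; advance(); return n; }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("spill", ".tmp");
        Files.write(f, Arrays.asList("t1", "t2"));
        ClosingSpillIterator it = new ClosingSpillIterator(Collections.singletonList(f));
        while (it.hasNext()) System.out.println(it.next());
        Files.delete(f); // succeeds because the reader was closed when drained
    }
}
```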




[jira] Updated: (PIG-1185) Data bags do not close spill files after using iterator to read tuples

2010-01-12 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1185:
-

Attachment: PIG_1185.patch

Close files as each spill file is read out.

> Data bags do not close spill files after using iterator to read tuples
> --
>
> Key: PIG-1185
> URL: https://issues.apache.org/jira/browse/PIG-1185
> Project: Pig
>  Issue Type: Bug
>Reporter: Ying He
> Attachments: PIG_1185.patch



[jira] Commented: (PIG-1185) Data bags do not close spill files after using iterator to read tuples

2010-01-12 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799502#action_12799502
 ] 

Ying He commented on PIG-1185:
--

This patch doesn't contain any junit test, because from Java I can't verify 
whether the files are still held open by the application, and the names of the 
spill files are not available. I've manually checked that the files are not 
used by any process after the iteration is done.



> Data bags do not close spill files after using iterator to read tuples
> --
>
> Key: PIG-1185
> URL: https://issues.apache.org/jira/browse/PIG-1185
> Project: Pig
>  Issue Type: Bug
>Reporter: Ying He
> Attachments: PIG_1185.patch



[jira] Updated: (PIG-1185) Data bags do not close spill files after using iterator to read tuples

2010-01-12 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1185:
-

Status: Patch Available  (was: Open)

> Data bags do not close spill files after using iterator to read tuples
> --
>
> Key: PIG-1185
> URL: https://issues.apache.org/jira/browse/PIG-1185
> Project: Pig
>  Issue Type: Bug
>Reporter: Ying He
> Attachments: PIG_1185.patch



[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-01-13 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799859#action_12799859
 ] 

Ying He commented on PIG-1178:
--

A couple of questions on the expression operators:

1. In ProjectExpression, is it better to change the member variable "input" 
from an int to a LogicalRelationalOperator, so that it points directly to the 
operator the project expression operates on? I also don't understand why this 
operator needs the alias it references; if we change input to an operator 
object, the alias can be obtained from the operator.

2. I don't understand the purpose of ColumnExpression. Is it to capture 
operands? It doesn't seem to have any special features, so I am not sure it is 
necessary.


> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: expressions.patch, lp.patch, PIG_1178.patch



[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-01-14 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800289#action_12800289
 ] 

Ying He commented on PIG-1178:
--

+1

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: expressions.patch, lp.patch, PIG_1178.patch



[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-01-14 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800353#action_12800353
 ] 

Ying He commented on PIG-1178:
--

To answer Daniel's question:

> In Rule.match, does PatternMatchOperatorPlan contain only leaf nodes and no 
> edge information? If so, instead of saying "A list of all matched sub-plans", 
> can we put more details in the comments?

The returned lists are plans; you can call getPredecessors() or getSuccessors() 
on any node in the plan. The implementation doesn't keep edge information 
itself: it asks the base plan for this information and returns only the 
operators that are in this sub-plan. So, looking from the outside, it is a 
plan; it's just read-only, and methods that update the plan would throw an 
exception.
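The read-only-view behavior described above can be sketched as follows; the class and method names are invented stand-ins, not Pig's real API. The view keeps no edges of its own: it answers neighbor queries by consulting the base plan and filtering to its own members, and it rejects mutation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a read-only sub-plan "view"; not Pig's real classes.
public class ReadOnlySubPlan {
    private final Map<String, List<String>> baseEdges; // base plan: node -> successors
    private final Set<String> members;                  // nodes in this sub-plan

    ReadOnlySubPlan(Map<String, List<String>> baseEdges, Set<String> members) {
        this.baseEdges = baseEdges;
        this.members = members;
    }

    List<String> getSuccessors(String node) {
        List<String> out = new ArrayList<>();
        for (String s : baseEdges.getOrDefault(node, Collections.emptyList()))
            if (members.contains(s)) out.add(s); // only successors inside the sub-plan
        return out;
    }

    void add(String node) { // mutation is not allowed on the view
        throw new UnsupportedOperationException("read-only sub-plan");
    }
}
```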

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: expressions-2.patch, expressions.patch, lp.patch, 
> PIG_1178.patch



[jira] Commented: (PIG-1195) POSort should take care of sort order

2010-01-19 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802439#action_12802439
 ] 

Ying He commented on PIG-1195:
--

+1

> POSort should take care of sort order
> -
>
> Key: PIG-1195
> URL: https://issues.apache.org/jira/browse/PIG-1195
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.6.0
>
> Attachments: PIG-1195-1.patch, PIG-1195-2.patch, PIG-1195-3.patch
>
>
> POSort always use ascending order. We shall obey the sort order as specified 
> in the script.
> For example, the following script does not do the right thing if we turn off 
> secondary sort (which means, we will rely on POSort to sort):
> {code}
> A = load 'input' as (a0:int);
> B = group A ALL;
> C = foreach B {
> D = order A by a0 desc;
> generate D;
> };
> dump C;
> {code}
> If we run it using the command line "java -Xmx512m 
> -Dpig.exec.nosecondarykey=true -jar pig.jar 1.pig".
> The sort order for D is ascending.




[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-01-20 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1178:
-

Attachment: lp.patch

Patch to add relational operators, optimization rules, and the logical plan 
migration visitor.

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: expressions-2.patch, expressions.patch, lp.patch, 
> lp.patch, PIG_1178.patch



[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-01-20 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1178:
-

Status: Open  (was: Patch Available)

attached a new patch

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: expressions-2.patch, expressions.patch, lp.patch, 
> lp.patch, PIG_1178.patch



[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-01-20 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1178:
-

Status: Patch Available  (was: Open)

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: expressions-2.patch, expressions.patch, lp.patch, 
> lp.patch, PIG_1178.patch



[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-01-22 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803856#action_12803856
 ] 

Ying He commented on PIG-1178:
--

Alan, thanks for the review.

For 6), the predecessor of the LOFilter would be an LOJoin, so all projections 
would have input number 0. My algorithm is to get field names from the column 
numbers. The field names after a join look like A::id, B::id, and findCommon() 
searches for the longest prefix of these fields, in order to push the filter 
after that alias. For example, if the field names are A::id and A::value, the 
filter is pushed after A; if the field names are D::A::id and D::A::value, the 
filter can be pushed after D, then pushed further to after A.
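findCommon() itself isn't shown in the comment, but the longest-common-prefix idea can be sketched as below. This is a hypothetical implementation over "::"-qualified field names, not Pig's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CommonAliasPrefix {
    // Hypothetical sketch: the longest alias prefix shared by all
    // "::"-qualified field names (the last segment is the column name).
    static List<String> findCommon(List<String> fieldNames) {
        List<String> prefix = null;
        for (String name : fieldNames) {
            String[] parts = name.split("::");
            List<String> alias = Arrays.asList(parts).subList(0, parts.length - 1);
            if (prefix == null) { prefix = new ArrayList<>(alias); continue; }
            int i = 0;
            while (i < prefix.size() && i < alias.size()
                    && prefix.get(i).equals(alias.get(i))) i++;
            prefix = prefix.subList(0, i); // keep only the part all names share
        }
        return prefix == null ? Collections.emptyList() : prefix;
    }

    public static void main(String[] args) {
        // D::A::id and D::A::value share [D, A]: push after D, then after A.
        System.out.println(findCommon(Arrays.asList("D::A::id", "D::A::value")));
    }
}
```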

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: expressions-2.patch, expressions.patch, lp.patch, 
> lp.patch, PIG_1178.patch



[jira] Created: (PIG-1202) explain plan throws out exception

2010-01-26 Thread Ying He (JIRA)
explain plan throws out exception 
--

 Key: PIG-1202
 URL: https://issues.apache.org/jira/browse/PIG-1202
 Project: Pig
  Issue Type: Bug
Reporter: Ying He


run the following script

a = load 's/part*' as (id:int, f:chararray);
b = load 's/part*' as (id:int, f:chararray);
c = join a by id, b by id;
d = filter c by a::f == 'apple';
explain d;

got error message:
ERROR 1067: Unable to explain alias d




[jira] Created: (PIG-1222) cast ends up with NULL value

2010-02-04 Thread Ying He (JIRA)
cast ends up with NULL value


 Key: PIG-1222
 URL: https://issues.apache.org/jira/browse/PIG-1222
 Project: Pig
  Issue Type: Bug
Reporter: Ying He


I want to generate data with bags, so I did this,

take a simple text file b.txt

100  apple
200  orange
300  pear
400  apple

then run query:

a = load 'b.txt' as (id, f);
b = group a by id;
store b into 'g' using BinStorage();

then run another query to load data generated from previous step.

a = load 'g/part*' using BinStorage() as (id, d:bag{t:(v, s)});
b = foreach a generate (double)id, flatten(d);
dump b;

then I got the following result:

(,100,apple)
(,100,apple)
(,200,orange)
(,200,apple)
(,300,strawberry)
(,300,pear)
(,400,pear)

the value for id is gone.  If there is no cast, then the result is correct.





[jira] Created: (PIG-1225) It's better for POPartitionRearrange to use List instead of DataBag to hold duplicated tuples for partitioned keys

2010-02-05 Thread Ying He (JIRA)
It's better for POPartitionRearrange to use List instead of DataBag to hold 
duplicated tuples for partitioned keys
--

 Key: PIG-1225
 URL: https://issues.apache.org/jira/browse/PIG-1225
 Project: Pig
  Issue Type: Improvement
Reporter: Ying He


In POPartitionRearrange, a tuple from the streamed table is duplicated multiple 
times, to be sent to each of the reducers that the partitioned key is split 
across. It uses DefaultDataBag right now; a plain java.util.List would be 
better.
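A sketch of the suggestion, with invented types standing in for Pig's tuples: the duplicated copies are short-lived and never need to spill to disk, so a plain ArrayList sized up front is enough.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class DuplicateForReducers {
    // Hypothetical sketch: duplicate a streamed tuple once per target
    // reducer, holding the copies in a List instead of a DataBag.
    static List<Map.Entry<Integer, String>> duplicate(String tuple, List<Integer> reducers) {
        List<Map.Entry<Integer, String>> out = new ArrayList<>(reducers.size());
        for (Integer r : reducers)
            out.add(new AbstractMap.SimpleEntry<>(r, tuple)); // (reducer index, tuple)
        return out;
    }

    public static void main(String[] args) {
        // a skewed key partitioned across reducers 3, 4 and 5
        System.out.println(duplicate("(key1,val)", Arrays.asList(3, 4, 5)).size());
    }
}
```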




[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-02-10 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832164#action_12832164
 ] 

Ying He commented on PIG-1178:
--

For the annotation resetting, I think it can be implemented as a 
PlanTransformListener. The listener has access to the plan and can reset every 
node, given that the order is not important.
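A hypothetical sketch of that idea, with the interface and annotation map as invented stand-ins for Pig's API: a listener that clears every node's annotations after a transform, in no particular order.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnnotationResetListener {
    // Invented stand-in for a plan node carrying per-node annotations.
    static class Node { Map<String, Object> annotations = new HashMap<>(); }

    // Invented stand-in for Pig's PlanTransformListener callback.
    interface PlanTransformListener { void transformed(List<Node> plan); }

    // A listener that clears annotations on every node; order is irrelevant.
    static final PlanTransformListener RESET = plan -> {
        for (Node n : plan) n.annotations.clear();
    };

    public static void main(String[] args) {
        Node n = new Node();
        n.annotations.put("inputUids", "stale");
        RESET.transformed(Collections.singletonList(n));
        System.out.println(n.annotations.isEmpty()); // true: annotations reset
    }
}
```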

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: expressions-2.patch, expressions.patch, lp.patch, 
> lp.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch



[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-02-10 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832313#action_12832313
 ] 

Ying He commented on PIG-1178:
--

Here is my thoughts to use this framework to implement PruneColumns.

1. Separate prune columns and prune map keys into 2 rules. Current 
implementation mixed them in one class. It's better to separate them to make 
each rule simpler. 

2. The prune column rule can be implemented by creating a new visitor. This 
visitor is called from transform(), and it visits every 
LogicalRelationalOperator by reverse dependency order. Each 
visit(LogicalRelationalOperator) calculates the required output uids  by 
combining the input uids from it successors. If a node is the sink of the plan, 
the output uids are retrieved from its schema. The input uids are calculated 
from its output uids by looking into the expression plan(s) of this operator.  
If an output uid is derived from other uids, the source uids should be put into 
input uids. For example, a+b is from a & b. The input uids should keep the uid 
of a & b.   Each operator should consider its logical meanings when calculating 
input uids from output uids. For example, for LOCross, the input uids should 
contain at least one field from each input. 

The input uids and output uids can be added into the operator as annotations.

3. After step 2, use another visitor to traverse the plan again in dependency 
order and prune the columns. This can be done by reading the input and 
output uids annotated on each node.
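The two passes described in steps 2 and 3 can be sketched roughly as follows. 
This is a simplified model, not Pig's actual API: Op, requiredOutput, 
requiredInput and derivation are illustrative names, and a real visitor would 
walk Pig's OperatorPlan instead of a flat list.

```java
import java.util.*;

// Sketch of the uid propagation: walk operators sinks-first (reverse
// dependency order), annotate each with the output uids its successors
// need and the input uids those imply. A later forward pass can then
// prune columns by reading these annotations.
public class UidPropagation {
    static class Op {
        String name;
        List<Op> successors = new ArrayList<>();
        Set<Long> schemaUids = new HashSet<>();            // uids this operator produces
        Map<Long, Set<Long>> derivation = new HashMap<>(); // output uid -> source uids (e.g. a+b)
        Set<Long> requiredOutput = new HashSet<>();        // annotation: what successors need
        Set<Long> requiredInput = new HashSet<>();         // annotation: what this op needs from inputs
        Op(String name) { this.name = name; }
    }

    // Callers must pass operators in reverse dependency order (sinks first),
    // so each successor's requiredInput is already computed when we read it.
    static void annotate(List<Op> reverseDependencyOrder) {
        for (Op op : reverseDependencyOrder) {
            if (op.successors.isEmpty()) {
                // A sink requires everything in its schema.
                op.requiredOutput.addAll(op.schemaUids);
            } else {
                for (Op succ : op.successors) {
                    op.requiredOutput.addAll(succ.requiredInput);
                }
                op.requiredOutput.retainAll(op.schemaUids);
            }
            for (Long out : op.requiredOutput) {
                Set<Long> sources = op.derivation.get(out);
                if (sources != null) {
                    op.requiredInput.addAll(sources); // derived uid pulls in its sources
                } else {
                    op.requiredInput.add(out);        // identity column passes through
                }
            }
        }
    }
}
```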

4. I think it's OK to implement prune-column and prune-map-key as regular 
rules; they just need to override match(). 

public List match(OperatorPlan plan) {
    List ll = new ArrayList

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: expressions-2.patch, expressions.patch, lp.patch, 
> lp.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch
>
>
> The current implementation of the logical plan and the logical optimizer in 
> Pig has proven to not be easily extensible. Developer feedback has indicated 
> that adding new rules to the optimizer is quite burdensome. In addition, the 
> logical plan has been an area of numerous bugs, many of which have been 
> difficult to fix. Developers also feel that the logical plan is difficult to 
> understand and maintain. The root cause for these issues is that a number of 
> design decisions that were made as part of the 0.2 rewrite of the front end 
> have now proven to be sub-optimal. The heart of this proposal is to revisit a 
> number of those proposals and rebuild the logical plan with a simpler design 
> that will make it much easier to maintain the logical plan as well as extend 
> the logical optimizer. 
> See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
> details.
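The match() snippet in the comment above is truncated (JIRA mail tends to 
strip angle-bracket generics), so here is a hedged, self-contained sketch of 
the shape such an override might take. Operator, LOForEach and SimplePlan are 
simplified stand-ins, not Pig's real classes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative rule skeleton: match() scans the plan and returns the
// operators the rule applies to; an empty list means "no match".
interface Operator {}
class LOForEach implements Operator {}
class SimplePlan {
    private final List<Operator> ops = new ArrayList<>();
    void add(Operator op) { ops.add(op); }
    List<Operator> getOperators() { return ops; }
}

class PruneColumnsRule {
    // Collect every LOForEach in the plan for transform() to visit.
    public List<Operator> match(SimplePlan plan) {
        List<Operator> ll = new ArrayList<>();
        for (Operator op : plan.getOperators()) {
            if (op instanceof LOForEach) {
                ll.add(op);
            }
        }
        return ll;
    }
}
```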

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1241) Accumulator is turned on when a map is used with a non-accumulative UDF

2010-02-16 Thread Ying He (JIRA)
Accumulator is turned on when a map is used with a non-accumulative UDF
---

 Key: PIG-1241
 URL: https://issues.apache.org/jira/browse/PIG-1241
 Project: Pig
  Issue Type: Bug
Reporter: Ying He


Exception is thrown for a script like the following:

register /homes/yinghe/owl/string.jar;
a = load 'a.txt' as (id, url);
b = group  a by (id, url);
c = foreach b generate  COUNT(a), (CHARARRAY) string.URLPARSE(group.url)#'url';
dump c;

In this query, URLPARSE() is not accumulative, and it returns a map. 

The accumulator optimizer fails to check the UDF in this case and tries to run 
the job in accumulative mode. A ClassCastException is thrown when the UDF is 
cast to the Accumulator interface.
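A minimal sketch of the guard the optimizer is missing: only enable 
accumulative mode when every UDF involved implements Accumulator. EvalFunc 
and Accumulator below are simplified stand-ins for Pig's interfaces, and 
AccumulativeCount/UrlParse are hypothetical UDFs, not Pig's builtins.

```java
// Simplified stand-ins for Pig's EvalFunc / Accumulator interfaces.
interface Accumulator { void accumulate(Object bag); }
abstract class EvalFunc { public abstract Object exec(Object input); }

class AccumulativeCount extends EvalFunc implements Accumulator {
    public Object exec(Object input) { return 0L; }
    public void accumulate(Object bag) { /* fold one batch of tuples */ }
}

class UrlParse extends EvalFunc {  // not accumulative; returns a map
    public Object exec(Object input) { return new java.util.HashMap<String, String>(); }
}

class AccumulatorChecker {
    // Without this instanceof check, the runtime cast to Accumulator
    // throws the ClassCastException described in this issue.
    static boolean canRunAccumulative(EvalFunc... udfs) {
        for (EvalFunc f : udfs) {
            if (!(f instanceof Accumulator)) {
                return false;
            }
        }
        return true;
    }
}
```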

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1241) Accumulator is turned on when a map is used with a non-accumulative UDF

2010-02-16 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1241:
-

Attachment: accum.patch

Patch to check the UDF when it is used with a map operation.

> Accumulator is turned on when a map is used with a non-accumulative UDF
> ---
>
> Key: PIG-1241
> URL: https://issues.apache.org/jira/browse/PIG-1241
> Project: Pig
>  Issue Type: Bug
>Reporter: Ying He
> Attachments: accum.patch
>
>
> Exception is thrown for a script like the following:
> register /homes/yinghe/owl/string.jar;
> a = load 'a.txt' as (id, url);
> b = group  a by (id, url);
> c = foreach b generate  COUNT(a), (CHARARRAY) 
> string.URLPARSE(group.url)#'url';
> dump c;
> In this query, URLPARSE() is not accumulative, and it returns a map. 
> The accumulator optimizer failed to check UDF in this case, and tries to run 
> the job in accumulative mode. ClassCastException is thrown when trying to 
> cast UDF into Accumulator interface.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1241) Accumulator is turned on when a map is used with a non-accumulative UDF

2010-02-16 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1241:
-

Status: Patch Available  (was: Open)

> Accumulator is turned on when a map is used with a non-accumulative UDF
> ---
>
> Key: PIG-1241
> URL: https://issues.apache.org/jira/browse/PIG-1241
> Project: Pig
>  Issue Type: Bug
>Reporter: Ying He
> Attachments: accum.patch
>
>
> Exception is thrown for a script like the following:
> register /homes/yinghe/owl/string.jar;
> a = load 'a.txt' as (id, url);
> b = group  a by (id, url);
> c = foreach b generate  COUNT(a), (CHARARRAY) 
> string.URLPARSE(group.url)#'url';
> dump c;
> In this query, URLPARSE() is not accumulative, and it returns a map. 
> The accumulator optimizer failed to check UDF in this case, and tries to run 
> the job in accumulative mode. ClassCastException is thrown when trying to 
> cast UDF into Accumulator interface.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1241) Accumulator is turned on when a map is used with a non-accumulative UDF

2010-02-17 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834955#action_12834955
 ] 

Ying He commented on PIG-1241:
--

No, by default it is on.

boolean isAccum = 
"true".equalsIgnoreCase(pc.getProperties().getProperty("opt.accumulator","true"));

This means that if "opt.accumulator" is not present, the default value is "true".
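The default can be confirmed with java.util.Properties directly: the 
two-argument getProperty returns the supplied default when the key is absent, 
so the optimization stays on unless opt.accumulator is explicitly set to 
false.

```java
import java.util.Properties;

// Mirrors the check quoted above: absent key -> "true" -> accumulator on.
public class AccumulatorDefault {
    static boolean isAccumulatorOn(Properties props) {
        return "true".equalsIgnoreCase(props.getProperty("opt.accumulator", "true"));
    }
}
```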

> Accumulator is turned on when a map is used with a non-accumulative UDF
> ---
>
> Key: PIG-1241
> URL: https://issues.apache.org/jira/browse/PIG-1241
> Project: Pig
>  Issue Type: Bug
>Reporter: Ying He
> Attachments: accum.patch
>
>
> Exception is thrown for a script like the following:
> register /homes/yinghe/owl/string.jar;
> a = load 'a.txt' as (id, url);
> b = group  a by (id, url);
> c = foreach b generate  COUNT(a), (CHARARRAY) 
> string.URLPARSE(group.url)#'url';
> dump c;
> In this query, URLPARSE() is not accumulative, and it returns a map. 
> The accumulator optimizer failed to check UDF in this case, and tries to run 
> the job in accumulative mode. ClassCastException is thrown when trying to 
> cast UDF into Accumulator interface.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-792) Support skewed join in pig

2009-05-22 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-792:


Attachment: RandomSampleLoader.java

Add the disk size of each tuple as the last field. This can be used to 
estimate how many tuples are in the file.
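The estimate this enables is simple arithmetic; a sketch with illustrative 
names (the field layout and method are assumptions, not the attached loader's 
actual code):

```java
// Average the per-tuple disk sizes recorded in the samples, then divide
// the file size by that average to approximate the total tuple count.
public class TupleCountEstimate {
    static long estimateTuples(long fileSizeBytes, long[] sampledTupleSizes) {
        long total = 0;
        for (long s : sampledTupleSizes) {
            total += s;
        }
        double avgTupleSize = (double) total / sampledTupleSizes.length;
        return Math.round(fileSizeBytes / avgTupleSize);
    }
}
```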

> Support skewed join in pig
> --
>
> Key: PIG-792
> URL: https://issues.apache.org/jira/browse/PIG-792
> Project: Pig
>  Issue Type: Improvement
>Reporter: Sriranjan Manjunath
> Attachments: RandomSampleLoader.java
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-67) FileLocalizer doesn't work on reduce side

2009-05-22 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-67:
---

Attachment: FileLocalizer.java

Get the JobConf from the PigMapReduce class so that reducers can operate on 
files as well.

> FileLocalizer doesn't work on reduce side
> -
>
> Key: PIG-67
> URL: https://issues.apache.org/jira/browse/PIG-67
> Project: Pig
>  Issue Type: Bug
>Reporter: Utkarsh Srivastava
> Attachments: FileLocalizer.java
>
>
> FileLocalizer.openDFSFile() does not work on the reduce side. This is 
> probably because FileLocalizer uses PigRecordReader which exists only on the 
> map task.
> The correct solution will be for FileLocalizer to have a hadoop conf that is 
> initialized by the reduce task on the reduce side, and the pig record reader 
> on the map side.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-792) Support skewed join in pig

2009-05-22 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-792:


Attachment: (was: RandomSampleLoader.java)

> Support skewed join in pig
> --
>
> Key: PIG-792
> URL: https://issues.apache.org/jira/browse/PIG-792
> Project: Pig
>  Issue Type: Improvement
>Reporter: Sriranjan Manjunath
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-792) Support skewed join in pig

2009-05-22 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-792:


Comment: was deleted

(was: Add disk size of the tuple as the last field. This can be used to 
estimate how many tuples in the file.)

> Support skewed join in pig
> --
>
> Key: PIG-792
> URL: https://issues.apache.org/jira/browse/PIG-792
> Project: Pig
>  Issue Type: Improvement
>Reporter: Sriranjan Manjunath
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-67) FileLocalizer doesn't work on reduce side

2009-05-22 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-67:
---

Attachment: (was: FileLocalizer.java)

> FileLocalizer doesn't work on reduce side
> -
>
> Key: PIG-67
> URL: https://issues.apache.org/jira/browse/PIG-67
> Project: Pig
>  Issue Type: Bug
>Reporter: Utkarsh Srivastava
>
> FileLocalizer.openDFSFile() does not work on the reduce side. This is 
> probably because FileLocalizer uses PigRecordReader which exists only on the 
> map task.
> The correct solution will be for FileLocalizer to have a hadoop conf that is 
> initialized by the reduce task on the reduce side, and the pig record reader 
> on the map side.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-792) PERFORMANCE: Support skewed join in pig

2009-07-17 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732766#action_12732766
 ] 

Ying He commented on PIG-792:
-

For MRCompiler, the job parallelism is reset to handle the case where 
parallelism is not specified. In that case, the sampling process uses (0.9 * 
default reducers) as the total number of reducers when allocating reducers to 
skewed keys, so the next MR job should use that value as its parallelism. If 
parallelism is specified, the rp returned from the sampling process is equal 
to the original value of op.

The format of the sampling output file is documented in SkewedPartitioner.

POSkewedJoinFileSetter is removed; its logic is added to SampleOptimizer.

MapReduceOper keeps the file name of the sampling output, so that 
MapReduceLauncher can set this file name into the jobconf of the join job.
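The parallelism rule above can be sketched as follows (method and parameter 
names are illustrative; the 0.9 factor is the one quoted in the comment):

```java
// With no PARALLEL value specified (<= 0 here), the sampler budgets
// 0.9 * the cluster's default reducer count and the join job must then
// adopt that same number; otherwise the user's value is kept.
public class SkewedJoinParallelism {
    static int reducersForSampling(int requestedParallelism, int defaultReducers) {
        if (requestedParallelism <= 0) {
            return (int) (0.9 * defaultReducers);
        }
        return requestedParallelism;
    }
}
```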

> PERFORMANCE: Support skewed join in pig
> ---
>
> Key: PIG-792
> URL: https://issues.apache.org/jira/browse/PIG-792
> Project: Pig
>  Issue Type: Improvement
>Reporter: Sriranjan Manjunath
> Attachments: skewedjoin.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-200) Pig Performance Benchmarks

2009-08-03 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-200:


Attachment: perf.hadoop.patch

perf.hadoop.patch adds support for running DataGenerator in Hadoop mode. It 
should be applied on top of perf.patch. 

The design doc is here.
http://twiki.corp.yahoo.com/view/Tiger/DataGeneratorHadoop

> Pig Performance Benchmarks
> --
>
> Key: PIG-200
> URL: https://issues.apache.org/jira/browse/PIG-200
> Project: Pig
>  Issue Type: Task
>Reporter: Amir Youssefi
> Attachments: generate_data.pl, perf.hadoop.patch, perf.patch
>
>
> To benchmark Pig performance, we need to have a TPC-H like Large Data Set 
> plus Script Collection. This is used in comparison of different Pig releases, 
> Pig vs. other systems (e.g. Pig + Hadoop vs. Hadoop Only).
> Here is Wiki for small tests: http://wiki.apache.org/pig/PigPerformance
> I am currently running long-running Pig scripts over data-sets in the order 
> of tens of TBs. Next step is hundreds of TBs.
> We need to have an open large-data set (open source scripts which generate 
> data-set) and detailed scripts for important operations such as ORDER, 
> AGGREGATION etc.
> We can call those the Pig Workouts: Cardio (short processing), Marathon (long 
> running scripts) and Triathlon (Mix). 
> I will update this JIRA with more details of current activities soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-200) Pig Performance Benchmarks

2009-08-03 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738609#action_12738609
 ] 

Ying He commented on PIG-200:
-

doc for DataGenerator in hadoop mode is here: 
http://wiki.apache.org/pig/DataGeneratorHadoop

> Pig Performance Benchmarks
> --
>
> Key: PIG-200
> URL: https://issues.apache.org/jira/browse/PIG-200
> Project: Pig
>  Issue Type: Task
>Reporter: Amir Youssefi
> Attachments: generate_data.pl, perf.hadoop.patch, perf.patch
>
>
> To benchmark Pig performance, we need to have a TPC-H like Large Data Set 
> plus Script Collection. This is used in comparison of different Pig releases, 
> Pig vs. other systems (e.g. Pig + Hadoop vs. Hadoop Only).
> Here is Wiki for small tests: http://wiki.apache.org/pig/PigPerformance
> I am currently running long-running Pig scripts over data-sets in the order 
> of tens of TBs. Next step is hundreds of TBs.
> We need to have an open large-data set (open source scripts which generate 
> data-set) and detailed scripts for important operations such as ORDER, 
> AGGREGATION etc.
> We can call those the Pig Workouts: Cardio (short processing), Marathon (long 
> running scripts) and Triathlon (Mix). 
> I will update this JIRA with more details of current activities soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-929) Default value of memusage for skewed join is not correct

2009-08-21 Thread Ying He (JIRA)
Default value of memusage for skewed join is not correct


 Key: PIG-929
 URL: https://issues.apache.org/jira/browse/PIG-929
 Project: Pig
  Issue Type: Improvement
Reporter: Ying He


Fragmented replicated join has a few limitations:
 - One of the tables needs to be loaded into memory
 - Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It 
computes a histogram of the key space to account for skewing in the input 
records. Further, it adjusts the number of reducers depending on the key 
distribution.

We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-929) Default value of memusage for skewed join is not correct

2009-08-21 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-929:


Attachment: memusage.patch

change the default value of memusage for skewed join from 0.5 to 0.3.

> Default value of memusage for skewed join is not correct
> 
>
> Key: PIG-929
> URL: https://issues.apache.org/jira/browse/PIG-929
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: memusage.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured

2009-09-10 Thread Ying He (JIRA)
Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
---

 Key: PIG-954
 URL: https://issues.apache.org/jira/browse/PIG-954
 Project: Pig
  Issue Type: Improvement
Reporter: Ying He


Fragmented replicated join has a few limitations:
 - One of the tables needs to be loaded into memory
 - Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It 
computes a histogram of the key space to account for skewing in the input 
records. Further, it adjusts the number of reducers depending on the key 
distribution.

We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured

2009-09-10 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753888#action_12753888
 ] 

Ying He commented on PIG-954:
-

The sampling job fails when pig.skewedjoin.reduce.memusage is not configured 
in the Pig property file. 

> Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
> ---
>
> Key: PIG-954
> URL: https://issues.apache.org/jira/browse/PIG-954
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured

2009-09-10 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-954:


Attachment: PIG-954.patch

Use a default value if pig.skewedjoin.reduce.memusage is not configured in 
the Pig property file.

> Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
> ---
>
> Key: PIG-954
> URL: https://issues.apache.org/jira/browse/PIG-954
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-954.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured

2009-09-10 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-954:


Attachment: PIG-954.patch

Use a final variable to define the default value of pig.skewedjoin.reduce.memusage.

> Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
> ---
>
> Key: PIG-954
> URL: https://issues.apache.org/jira/browse/PIG-954
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-954.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured

2009-09-10 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-954:


Attachment: (was: PIG-954.patch)

> Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
> ---
>
> Key: PIG-954
> URL: https://issues.apache.org/jira/browse/PIG-954
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-954.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-929) Default value of memusage for skewed join is not correct

2009-09-10 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753904#action_12753904
 ] 

Ying He commented on PIG-929:
-

This patch is no longer required, as PIG-954 contains the fix for this issue.

> Default value of memusage for skewed join is not correct
> 
>
> Key: PIG-929
> URL: https://issues.apache.org/jira/browse/PIG-929
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: memusage.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Ying He (JIRA)
Skewed join generates  incorrect results 
-

 Key: PIG-955
 URL: https://issues.apache.org/jira/browse/PIG-955
 Project: Pig
  Issue Type: Improvement
Reporter: Ying He


Fragmented replicated join has a few limitations:
 - One of the tables needs to be loaded into memory
 - Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It 
computes a histogram of the key space to account for skewing in the input 
records. Further, it adjusts the number of reducers depending on the key 
distribution.

We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-955:


Attachment: PIG-955.patch

Use the Tuple type to look up the skewed-key map.

> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754319#action_12754319
 ] 

Ying He commented on PIG-955:
-

The sampling process generates a file that contains the skewed keys and their 
pre-allocated reducer indexes. Each (key, beginning index, ending index) is 
stored as a tuple.

During the join, this file is loaded by SkewedPartitioner as a lookup table. 
For each tuple from the partition table, the key is matched against this 
lookup table; if a match is found, the partitioner returns a value in the 
range [beginning index, ending index] in round-robin fashion. If no match is 
found, it uses hash() to calculate the index.

The problem is that SkewedPartitioner looks up the table with the 
PigNullableWritable form of the input tuple, while the lookup table uses the 
Pig type Tuple as keys. Therefore no match is ever found, and the indexes are 
calculated with hash() even for skewed keys. This sends all the data for such 
a key to the same reducer.

For the streaming table, however, each tuple with a skewed key is replicated 
to every reducer pre-allocated during the sampling process.

Because the reducer indexes are calculated incorrectly for skewed keys in the 
partition table, tuples from the first table are sent to the wrong reducers; 
when a tuple doesn't fall into its pre-calculated index range, the join with 
the second table produces an empty data set for that key. The query still 
appears to succeed, but it loses data.

The fix is to change SkewedPartitioner to use the correct object type when 
looking up the skewed-key table.
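The mismatch can be reproduced with a plain HashMap: a probe key whose 
equals()/hashCode() differ from the stored key's silently misses. WrappedKey 
below is an illustrative stand-in for PigNullableWritable, not Pig's class.

```java
import java.util.HashMap;
import java.util.Map;

// The lookup table is keyed by the raw key object, but the buggy
// partitioner probes it with a wrapper type, so every probe misses and
// falls through to the hash() path.
public class LookupMismatch {
    static class WrappedKey {
        final Object value;
        WrappedKey(Object value) { this.value = value; }
        // deliberately inherits Object.equals/hashCode, like a distinct
        // wrapper type that never compares equal to what it wraps
    }

    static final Map<Object, int[]> REDUCER_RANGES = new HashMap<>();
    static {
        REDUCER_RANGES.put("hotKey", new int[] {0, 4}); // keyed by the raw key
    }

    // Buggy probe: the wrapper never equals the raw key, so this is null.
    static boolean buggyProbeMisses() {
        return REDUCER_RANGES.get(new WrappedKey("hotKey")) == null;
    }

    // Fixed probe: use the same type the table was built with.
    static boolean fixedProbeHits() {
        return REDUCER_RANGES.get(new WrappedKey("hotKey").value) != null;
    }
}
```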



> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured

2009-09-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-954:


Attachment: PIG-954.patch2

add JUnit test

> Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
> ---
>
> Key: PIG-954
> URL: https://issues.apache.org/jira/browse/PIG-954
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-954.patch, PIG-954.patch2
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-955:


Description: SkewedPartitioner doesn't partition the skewed keys in 
partition table (first table) correctly. This can cause data loss.  (was: 
SkewedPartitioner doesn't the skewed keys in partition table correctly. This 
can cause data loss.)

> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch
>
>
> SkewedPartitioner doesn't partition the skewed keys in partition table (first 
> table) correctly. This can cause data loss.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-955:


Description: SkewedPartitioner doesn't partition the skewed keys in the 
partition table correctly. This can cause data loss.  (was: Fragmented 
replicated join has a 
few limitations:
 - One of the tables needs to be loaded into memory
 - Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It 
computes a histogram of the key space to account for skewing in the input 
records. Further, it adjusts the number of reducers depending on the key 
distribution.

We need to implement the skewed join in pig.)

> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch
>
>
> SkewedPartitioner doesn't partition the skewed keys in the partition table 
> correctly. This can cause data loss.




[jira] Commented: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754370#action_12754370
 ] 

Ying He commented on PIG-955:
-

This is not related to replicated join. The original description is 
misleading; it came from the JIRA that this one was cloned from. I've updated 
it to the correct one.

> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch
>
>
> SkewedPartitioner doesn't partition the skewed keys in partition table (first 
> table) correctly. This can cause data loss.




[jira] Updated: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured

2009-09-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-954:


Description: query fails if pig.skewedjoin.reduce.memusage is not 
configured.   (was: Fragmented replicated join has a few limitations:
 - One of the tables needs to be loaded into memory
 - Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It 
computes a histogram of the key space to account for skewing in the input 
records. Further, it adjusts the number of reducers depending on the key 
distribution.

We need to implement the skewed join in pig.)

> Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
> ---
>
> Key: PIG-954
> URL: https://issues.apache.org/jira/browse/PIG-954
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-954.patch, PIG-954.patch2
>
>
> query fails if pig.skewedjoin.reduce.memusage is not configured. 




[jira] Updated: (PIG-929) Default value of memusage for skewed join is not correct

2009-09-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-929:


Description: The default value of pig.skewedjoin.reduce.memusage, which is 
used in skewed join, should be set to 0.3  (was: Fragmented replicated join 
has a 
few limitations:
 - One of the tables needs to be loaded into memory
 - Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It 
computes a histogram of the key space to account for skewing in the input 
records. Further, it adjusts the number of reducers depending on the key 
distribution.

We need to implement the skewed join in pig.)
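A minimal sketch of how such a default could be applied (the helper below is 
hypothetical; Pig's actual code path may differ): fall back to 0.3 when the 
property is absent, which also avoids the PIG-954 failure on a missing 
configuration.

```java
import java.util.Properties;

// Hypothetical helper illustrating the default described above: return
// 0.3 when pig.skewedjoin.reduce.memusage is not configured, instead of
// failing on the missing setting.
public class MemUsageSketch {
    static double memUsage(Properties conf) {
        String v = conf.getProperty("pig.skewedjoin.reduce.memusage");
        return (v == null || v.trim().isEmpty()) ? 0.3 : Double.parseDouble(v);
    }
}
```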

> Default value of memusage for skewed join is not correct
> 
>
> Key: PIG-929
> URL: https://issues.apache.org/jira/browse/PIG-929
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: memusage.patch
>
>
> The default value of pig.skewedjoin.reduce.memusage, which is used in 
> skewed join, should be set to 0.3




[jira] Updated: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-955:


Attachment: PIG-955.patch2

add JUnit test

> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch, PIG-955.patch2
>
>
> SkewedPartitioner doesn't partition the skewed keys in partition table (first 
> table) correctly. This can cause data loss.




[jira] Updated: (PIG-961) Integration with Hadoop 21

2009-09-15 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-961:


Attachment: PIG-961.patch

patch for pig to work with hadoop 21 with the new API

> Integration with Hadoop 21
> --
>
> Key: PIG-961
> URL: https://issues.apache.org/jira/browse/PIG-961
> Project: Pig
>  Issue Type: New Feature
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: PIG-961.patch
>
>
> Hadoop 21 is not yet released but we know that switch to new MR API is coming 
> there. This JIRA is for early integration with this API




[jira] Updated: (PIG-961) Integration with Hadoop 21

2009-09-15 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-961:


Attachment: hadoop21.jar

hadoop jar file used by pig

> Integration with Hadoop 21
> --
>
> Key: PIG-961
> URL: https://issues.apache.org/jira/browse/PIG-961
> Project: Pig
>  Issue Type: New Feature
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: hadoop21.jar, PIG-961.patch
>
>
> Hadoop 21 is not yet released but we know that switch to new MR API is coming 
> there. This JIRA is for early integration with this API




[jira] Commented: (PIG-961) Integration with Hadoop 21

2009-09-15 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755748#action_12755748
 ] 

Ying He commented on PIG-961:
-

There are a few problems while porting pig to the hadoop 21 new API.

1. When running a Task (map or reduce), two contexts are created, a JobContext 
and a TaskAttemptContext. Each context has its own copy of the job config, so 
a property set through one context can't be accessed from the other. I had to 
make the following change to the JobContext class to make pig work.

The original code makes a copy of the input config:

public JobContext(Configuration conf, JobID jobId) {
    this.conf = new org.apache.hadoop.mapred.JobConf(conf);
    this.jobId = jobId;
}

I changed it to share the config object:

public JobContext(Configuration conf, JobID jobId) {
    if (conf instanceof org.apache.hadoop.mapred.JobConf) {
        this.conf = (org.apache.hadoop.mapred.JobConf) conf;
    } else {
        this.conf = new org.apache.hadoop.mapred.JobConf(conf);
    }
    this.jobId = jobId;
}

2. The "Reporter" object is not visible from JobContext, so there is no 
access to reporter.incrCounter().

3. JobConf is obsolete, so there is no access to some convenient methods such 
as getUser(); instead, I have to use config.get("user.name").
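The copy-vs-share problem in item 1 can be demonstrated with plain 
java.util.Properties standing in for Hadoop's Configuration (a sketch, not 
Hadoop code):

```java
import java.util.Properties;

// Stand-in demo for the behavior described in item 1: a context holding
// a *copy* of the config never sees properties set later through the
// original object, while a context *sharing* it does.
public class ConfSharingDemo {
    static boolean visibleAfterCopy() {
        Properties jobConf = new Properties();
        Properties copied = new Properties();
        copied.putAll(jobConf);                // copy, like new JobConf(conf)
        jobConf.setProperty("pig.prop", "x");  // set through the other context
        return copied.containsKey("pig.prop"); // stale copy does not see it
    }

    static boolean visibleAfterShare() {
        Properties jobConf = new Properties();
        Properties shared = jobConf;           // share the same object
        jobConf.setProperty("pig.prop", "x");
        return shared.containsKey("pig.prop"); // both contexts see the write
    }
}
```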

> Integration with Hadoop 21
> --
>
> Key: PIG-961
> URL: https://issues.apache.org/jira/browse/PIG-961
> Project: Pig
>  Issue Type: New Feature
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: hadoop21.jar, PIG-961.patch
>
>
> Hadoop 21 is not yet released but we know that switch to new MR API is coming 
> there. This JIRA is for early integration with this API




[jira] Created: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-24 Thread Ying He (JIRA)
Need a databag that does not register with SpillableMemoryManager and spill 
data pro-actively
-

 Key: PIG-975
 URL: https://issues.apache.org/jira/browse/PIG-975
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Ying He
Assignee: Pradeep Kamath
 Fix For: 0.2.0


Currently whenever Combiner is used in pig, in the map, the 
POPrecombinerLocalRearrange operator puts the single "value" tuple 
corresponding to a key into a DataBag and passes this to the foreach which is 
being combined. This will generate as many bags as there are input records. 
These bags all will have a single tuple and hence are small and should not need 
to be spilt to disk. However since the bags are created through the BagFactory 
mechanism, each bag creation is registered with the SpillableMemoryManager and 
a weak reference to the bag is stored in a linked list. This linked list grows 
really big over time causing unnecessary Garbage collection runs. This can be 
avoided by having a simple lightweight implementation of the DataBag interface 
to store the single tuple in a bag. Also these SingleTupleBags should be 
created without registering with the spillableMemoryManager. Likewise the bags 
created in POCombinePackage are supposed to fit in Memory and not spill. Again 
a NonSpillableDataBag implementation of DataBag interface which does not 
register with the SpillableMemoryManager would help.
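A minimal sketch of the SingleTupleBag idea described above (simplified: a 
plain Iterable over Object[] rather than Pig's real DataBag interface):

```java
import java.util.Collections;
import java.util.Iterator;

// Simplified stand-in for the proposed SingleTupleBag: holds exactly one
// "tuple" (here just an Object[]), never spills, and is never registered
// with the SpillableMemoryManager, so no weak reference accumulates in
// the manager's linked list.
public class SingleTupleBagSketch implements Iterable<Object[]> {
    private final Object[] tuple;

    public SingleTupleBagSketch(Object[] tuple) {
        this.tuple = tuple;
    }

    public long size() {
        return 1;
    }

    @Override
    public Iterator<Object[]> iterator() {
        return Collections.singletonList(tuple).iterator();
    }
}
```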





[jira] Updated: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-24 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-975:


Description: POPackage uses DefaultDataBag during the reduce process to hold 
data. It is registered with SpillableMemoryManager and prone to 
OutOfMemoryException. It's better to proactively manage the memory usage: the 
bag fills memory to a specified amount and dumps the rest to disk. The amount 
of memory used to hold tuples is configurable. This can avoid out-of-memory 
errors.  (was: Currently whenever Combiner is used in pig, in the map, 
the POPrecombinerLocalRearrange operator puts the single "value" tuple 
corresponding to a key into a DataBag and passes this to the foreach which is 
being combined. This will generate as many bags as there are input records. 
These bags all will have a single tuple and hence are small and should not need 
to be spilt to disk. However since the bags are created through the BagFactory 
mechanism, each bag creation is registered with the SpillableMemoryManager and 
a weak reference to the bag is stored in a linked list. This linked list grows 
really big over time causing unnecessary Garbage collection runs. This can be 
avoided by having a simple lightweight implementation of the DataBag interface 
to store the single tuple in a bag. Also these SingleTupleBags should be 
created without registering with the spillableMemoryManager. Likewise the bags 
created in POCombinePackage are supposed to fit in Memory and not spill. Again 
a NonSpillableDataBag implementation of DataBag interface which does not 
register with the SpillableMemoryManager would help.
)
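A sketch of the proactive-spill behavior described in the new description 
(hypothetical class, not Pig's actual InternalCachedBag): keep at most a 
fixed number of tuples in memory and append everything beyond that to a temp 
file immediately, instead of waiting for a low-memory callback.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of proactive spilling: tuples past `cacheLimit`
// go straight to a spill file rather than accumulating on the heap.
public class ProactiveBagSketch {
    private final int cacheLimit;
    private final List<String> inMemory = new ArrayList<String>();
    private PrintWriter spillOut;
    private long size = 0;

    public ProactiveBagSketch(int cacheLimit) {
        this.cacheLimit = cacheLimit;
    }

    public void add(String tuple) {
        size++;
        if (inMemory.size() < cacheLimit) {
            inMemory.add(tuple);
            return;
        }
        try {
            if (spillOut == null) {
                File f = File.createTempFile("bag", ".spill");
                f.deleteOnExit();
                spillOut = new PrintWriter(new FileWriter(f));
            }
            spillOut.println(tuple); // overflow is written out immediately
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public long size() { return size; }

    public int cachedInMemory() { return inMemory.size(); }
}
```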

> Need a databag that does not register with SpillableMemoryManager and spill 
> data pro-actively
> -
>
> Key: PIG-975
> URL: https://issues.apache.org/jira/browse/PIG-975
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Ying He
>Assignee: Pradeep Kamath
> Fix For: 0.2.0
>
>
> POPackage uses DefaultDataBag during the reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException. 
> It's better to proactively manage the memory usage: the bag fills memory to 
> a specified amount and dumps the rest to disk. The amount of memory used to 
> hold tuples is configurable. This can avoid out-of-memory errors.




[jira] Updated: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-24 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-975:


Attachment: PIG-975.patch

implement a new bag and use it in POPackage

> Need a databag that does not register with SpillableMemoryManager and spill 
> data pro-actively
> -
>
> Key: PIG-975
> URL: https://issues.apache.org/jira/browse/PIG-975
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Ying He
>Assignee: Pradeep Kamath
> Fix For: 0.2.0
>
> Attachments: PIG-975.patch
>
>
> POPackage uses DefaultDataBag during the reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException. 
> It's better to proactively manage the memory usage: the bag fills memory to 
> a specified amount and dumps the rest to disk. The amount of memory used to 
> hold tuples is configurable. This can avoid out-of-memory errors.




[jira] Updated: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-24 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-975:


Attachment: PIG-975.patch2

remove System.out.println

> Need a databag that does not register with SpillableMemoryManager and spill 
> data pro-actively
> -
>
> Key: PIG-975
> URL: https://issues.apache.org/jira/browse/PIG-975
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Ying He
>Assignee: Pradeep Kamath
> Fix For: 0.2.0
>
> Attachments: PIG-975.patch, PIG-975.patch2
>
>
> POPackage uses DefaultDataBag during the reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException. 
> It's better to proactively manage the memory usage: the bag fills memory to 
> a specified amount and dumps the rest to disk. The amount of memory used to 
> hold tuples is configurable. This can avoid out-of-memory errors.




[jira] Commented: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-24 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759299#action_12759299
 ] 

Ying He commented on PIG-975:
-

Answer to Olga's questions:

1. The synchronization can be removed.
2. The memory fraction is configurable; the property name is 
pig.cachedbag.memusage, and the default value is 0.5.
3. The first 100 tuples are used to estimate the in-memory tuple size, which 
determines how many tuples can fit into the configured memusage. It is not 
the number of tuples kept in memory.
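The scheme in point 3 can be sketched as follows (assumed arithmetic, not 
Pig's exact code; names below are hypothetical):

```java
// Hypothetical sketch of point 3: the average tuple size measured over
// the first 100 tuples is used to compute how many tuples fit into the
// configured fraction of the heap; the sample size itself does not cap
// how many tuples stay in memory.
public class CacheLimitSketch {
    static final int SAMPLE_SIZE = 100; // tuples measured for the average

    // maxHeapBytes would come from Runtime.getRuntime().maxMemory(),
    // memUsageFraction from pig.cachedbag.memusage (default 0.5).
    static long tuplesThatFit(long maxHeapBytes, double memUsageFraction,
                              long avgTupleBytes) {
        return (long) (maxHeapBytes * memUsageFraction) / avgTupleBytes;
    }
}
```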

> Need a databag that does not register with SpillableMemoryManager and spill 
> data pro-actively
> -
>
> Key: PIG-975
> URL: https://issues.apache.org/jira/browse/PIG-975
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Ying He
>Assignee: Ying He
> Fix For: 0.2.0
>
> Attachments: PIG-975.patch, PIG-975.patch2
>
>
> POPackage uses DefaultDataBag during the reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException. 
> It's better to proactively manage the memory usage: the bag fills memory to 
> a specified amount and dumps the rest to disk. The amount of memory used to 
> hold tuples is configurable. This can avoid out-of-memory errors.




[jira] Updated: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-25 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-975:


Attachment: PIG-975.patch3

remove synchronization

> Need a databag that does not register with SpillableMemoryManager and spill 
> data pro-actively
> -
>
> Key: PIG-975
> URL: https://issues.apache.org/jira/browse/PIG-975
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Ying He
>Assignee: Ying He
> Fix For: 0.2.0
>
> Attachments: PIG-975.patch, PIG-975.patch2, PIG-975.patch3
>
>
> POPackage uses DefaultDataBag during the reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException. 
> It's better to proactively manage the memory usage: the bag fills memory to 
> a specified amount and dumps the rest to disk. The amount of memory used to 
> hold tuples is configurable. This can avoid out-of-memory errors.




[jira] Updated: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-25 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-975:


Attachment: internalbag.xls

performance numbers 

> Need a databag that does not register with SpillableMemoryManager and spill 
> data pro-actively
> -
>
> Key: PIG-975
> URL: https://issues.apache.org/jira/browse/PIG-975
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Ying He
>Assignee: Ying He
> Fix For: 0.2.0
>
> Attachments: internalbag.xls, PIG-975.patch, PIG-975.patch2, 
> PIG-975.patch3
>
>
> POPackage uses DefaultDataBag during the reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException. 
> It's better to proactively manage the memory usage: the bag fills memory to 
> a specified amount and dumps the rest to disk. The amount of memory used to 
> hold tuples is configurable. This can avoid out-of-memory errors.




[jira] Commented: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-25 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759681#action_12759681
 ] 

Ying He commented on PIG-975:
-

I think this is too implementation-specific to expose to the end user. 
Frankly, I don't think users care which class we use for the data bags.

> Need a databag that does not register with SpillableMemoryManager and spill 
> data pro-actively
> -
>
> Key: PIG-975
> URL: https://issues.apache.org/jira/browse/PIG-975
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Ying He
>Assignee: Ying He
> Fix For: 0.2.0
>
> Attachments: internalbag.xls, PIG-975.patch, PIG-975.patch2, 
> PIG-975.patch3
>
>
> POPackage uses DefaultDataBag during the reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException. 
> It's better to proactively manage the memory usage: the bag fills memory to 
> a specified amount and dumps the rest to disk. The amount of memory used to 
> hold tuples is configurable. This can avoid out-of-memory errors.




[jira] Updated: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-25 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-975:


Attachment: PIG-975.patch4

Add a switch back to the old bag. Setting the property 
pig.cachedbag.type=default switches to the old default bag; if it is not 
specified, InternalCachedBag is used.
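A sketch of the switch (hypothetical helper; the real patch wires this choice 
into Pig's bag creation, and the class names below are only labels here):

```java
import java.util.Properties;

// Hypothetical illustration of the pig.cachedbag.type switch described
// above: "default" selects the old DefaultDataBag behavior, anything
// else (including unset) selects InternalCachedBag.
public class BagChoiceSketch {
    static String chooseBag(Properties conf) {
        String type = conf.getProperty("pig.cachedbag.type");
        return "default".equals(type) ? "DefaultDataBag" : "InternalCachedBag";
    }
}
```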

> Need a databag that does not register with SpillableMemoryManager and spill 
> data pro-actively
> -
>
> Key: PIG-975
> URL: https://issues.apache.org/jira/browse/PIG-975
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Ying He
>Assignee: Ying He
> Fix For: 0.2.0
>
> Attachments: internalbag.xls, PIG-975.patch, PIG-975.patch2, 
> PIG-975.patch3, PIG-975.patch4
>
>
> POPackage uses DefaultDataBag during the reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException. 
> It's better to proactively manage the memory usage: the bag fills memory to 
> a specified amount and dumps the rest to disk. The amount of memory used to 
> hold tuples is configurable. This can avoid out-of-memory errors.




[jira] Updated: (PIG-961) Integration with Hadoop 21

2009-10-05 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-961:


Attachment: PIG-961.patch2

update to latest code in trunk

> Integration with Hadoop 21
> --
>
> Key: PIG-961
> URL: https://issues.apache.org/jira/browse/PIG-961
> Project: Pig
>  Issue Type: New Feature
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: hadoop21.jar, PIG-961.patch, PIG-961.patch2
>
>
> Hadoop 21 is not yet released but we know that switch to new MR API is coming 
> there. This JIRA is for early integration with this API




[jira] Created: (PIG-1000) InternalCachedBag.java generates javac warning and findbug warning

2009-10-08 Thread Ying He (JIRA)
InternalCachedBag.java generates javac warning and findbug warning
--

 Key: PIG-1000
 URL: https://issues.apache.org/jira/browse/PIG-1000
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.4.0
Reporter: Ying He
Assignee: Ying He
 Fix For: 0.6.0


POPackage uses DefaultDataBag during the reduce process to hold data. It is 
registered with SpillableMemoryManager and prone to OutOfMemoryException. 
It's better to proactively manage the memory usage: the bag fills memory to a 
specified amount and dumps the rest to disk. The amount of memory used to 
hold tuples is configurable. This can avoid out-of-memory errors.




[jira] Updated: (PIG-1000) InternalCachedBag.java generates javac warning and findbug warning

2009-10-08 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1000:
-

Attachment: PIG-1000.patch

fix javac warning and findbug warning

> InternalCachedBag.java generates javac warning and findbug warning
> --
>
> Key: PIG-1000
> URL: https://issues.apache.org/jira/browse/PIG-1000
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.4.0
>Reporter: Ying He
>Assignee: Ying He
> Fix For: 0.6.0
>
> Attachments: PIG-1000.patch
>
>
> POPackage uses DefaultDataBag during reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException.  
> It's better to pro-actively managers the usage of the memory. The bag fills 
> in memory to a specified amount, and dump the rest the disk.  The amount of 
> memory to hold tuples is configurable. This can avoid out of memory error.




[jira] Updated: (PIG-1000) InternalCachedBag.java generates javac warning and findbug warning

2009-10-08 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1000:
-

Description: The patch submitted in PIG-975 generates a javac warning and a 
findbugs warning  (was: POPackage uses DefaultDataBag during reduce process to hold 
data. It is registered with SpillableMemoryManager and prone to 
OutOfMemoryException.  It's better to pro-actively managers the usage of the 
memory. The bag fills in memory to a specified amount, and dump the rest the 
disk.  The amount of memory to hold tuples is configurable. This can avoid out 
of memory error.)
 Patch Info: [Patch Available]

> InternalCachedBag.java generates javac warning and findbug warning
> --
>
> Key: PIG-1000
> URL: https://issues.apache.org/jira/browse/PIG-1000
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.4.0
>Reporter: Ying He
>Assignee: Ying He
> Fix For: 0.6.0
>
> Attachments: PIG-1000.patch
>
>
> The patch submitted in PIG-975 generates a javac warning and a findbugs 
> warning




[jira] Created: (PIG-1030) explain and dump not working with two UDFs inside inner plan of foreach

2009-10-20 Thread Ying He (JIRA)
explain and dump not working with two UDFs inside inner plan of foreach
---

 Key: PIG-1030
 URL: https://issues.apache.org/jira/browse/PIG-1030
 Project: Pig
  Issue Type: Bug
Reporter: Ying He


this script does not work:

register /homes/yinghe/owl/string.jar;
a = load '/user/yinghe/a.txt' as (id, color);
b = group a all;
c = foreach b {
    d = distinct a.color;
    generate group, string.BagCount2(d), string.ColumnLen2(d, 0);
}

the UDFs are regular, not algebraic.

Then if I call "dump c;" or "explain c", I get this error message:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2019: Expected to find plan with 
single leaf. Found 2 leaves.

The error occurs only the first time; after getting this error, if I call 
"dump c" or "explain c" again, it succeeds.







[jira] Updated: (PIG-1030) explain and dump not working with two UDFs inside inner plan of foreach

2009-10-20 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1030:
-

Description: 
this script does not work:

register /homes/yinghe/owl/string.jar;
a = load '/user/yinghe/a.txt' as (id, color);
b = group a all;
c = foreach b {
    d = distinct a.color;
    generate group, string.BagCount2(d), string.ColumnLen2(d, 0);
}

the UDFs are regular, not algebraic.

Then if I call "dump c;" or "explain c", I get this error message:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2019: Expected to find plan with 
single leaf. Found 2 leaves.

The error occurs only the first time; after getting this error, if I call 
"dump c" or "explain c" again, it succeeds.




  was:
this scprit does not work

register /homes/yinghe/owl/string.jar;
a = load '/user/yinghe/a.txt' as (id, color);
b = group a all;
c = foreach b {
d = distinct a.color;
generate group, string.BagCount2(d), string.ColumnLen2(d, 0);
}

the udfs are regular, not algebraic.

then if I call  "dump c;" or "explain c", I would get  this error message.
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2019: Expected to find plan with 
single leaf. Found 2 leaves.

The error only occurs forn the first time, after getting this error, if I call 
"dump c" or "explain c" again, it would succeed.





> explain and dump not working with two UDFs inside inner plan of foreach
> ---
>
> Key: PIG-1030
> URL: https://issues.apache.org/jira/browse/PIG-1030
> Project: Pig
>  Issue Type: Bug
>Reporter: Ying He
>
> this script does not work:
> register /homes/yinghe/owl/string.jar;
> a = load '/user/yinghe/a.txt' as (id, color);
> b = group a all;
> c = foreach b {
>     d = distinct a.color;
>     generate group, string.BagCount2(d), string.ColumnLen2(d, 0);
> }
> the UDFs are regular, not algebraic.
> Then if I call "dump c;" or "explain c", I get this error message:
> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2019: Expected to find plan 
> with single leaf. Found 2 leaves.
> The error occurs only the first time; after getting this error, if I call 
> "dump c" or "explain c" again, it succeeds.




[jira] Updated: (PIG-1037) better memory layout and spill for sorted and distinct bags

2009-10-21 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1037:
-

Attachment: PIG-1037.patch

First cut of the patch, for initial testing purposes. Regression tests are 
not done yet; it may contain bugs.

> better memory layout and spill for sorted and distinct bags
> ---
>
> Key: PIG-1037
> URL: https://issues.apache.org/jira/browse/PIG-1037
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: PIG-1037.patch
>
>





[jira] Updated: (PIG-1037) better memory layout and spill for sorted and distinct bags

2009-10-23 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1037:
-

Attachment: PIG-1037.patch2

fix javac and findbugs warnings

> better memory layout and spill for sorted and distinct bags
> ---
>
> Key: PIG-1037
> URL: https://issues.apache.org/jira/browse/PIG-1037
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: PIG-1037.patch, PIG-1037.patch2
>
>





[jira] Commented: (PIG-1037) better memory layout and spill for sorted and distinct bags

2009-10-26 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770246#action_12770246
 ] 

Ying He commented on PIG-1037:
--

Alan, thanks for the feedback.

For the calculation of the average size, I think the cost of computing it 
over 100 tuples should be minimal; it shouldn't have any noticeable 
performance impact, so I'd like to keep it logically correct. Very big tuples 
are possible, such as those with Map-type fields.

For the comments and synchronization, I am going to make the change.

> better memory layout and spill for sorted and distinct bags
> ---
>
> Key: PIG-1037
> URL: https://issues.apache.org/jira/browse/PIG-1037
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: PIG-1037.patch, PIG-1037.patch2
>
>





[jira] Updated: (PIG-1037) better memory layout and spill for sorted and distinct bags

2009-10-27 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-1037:
-

Attachment: PIG-1037.patch3

fix the comments and remove synchronization

> better memory layout and spill for sorted and distinct bags
> ---
>
> Key: PIG-1037
> URL: https://issues.apache.org/jira/browse/PIG-1037
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Ying He
> Attachments: PIG-1037.patch, PIG-1037.patch2, PIG-1037.patch3
>
>




