date:20100720

Full outer join fails while doing a filter on joined data
-

 Key: PIG-1507
 URL: https://issues.apache.org/jira/browse/PIG-1507
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0


The following script produces the wrong result:

test1.dat:
1
2
3

test2.dat:
1
2

pig script:
{code}
a = LOAD 'test1.dat' USING PigStorage() AS (d1:int);
b = LOAD 'test2.dat' USING PigStorage() AS (d2:int);
c = JOIN a BY d1 FULL OUTER, b BY d2;
d = FILTER c BY d2 IS NULL;
STORE d INTO 'test.out' USING PigStorage();
{code}

expected:
3

We get:
1
2
3

This happens because we erroneously push the filter ahead of the full outer 
join. A similar issue was addressed in 
[PIG-1289|https://issues.apache.org/jira/browse/PIG-1289], but that fix covered 
only left and right outer joins.
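
The effect of the misplaced filter can be reproduced outside Pig. The following is a minimal Python sketch, not Pig code: `full_outer_join` is a hypothetical helper written for this illustration, None stands in for Pig's null, and the "buggy" plan is one plausible reading of the pushdown (the filter applied to b before the join) that happens to reproduce the output reported above.

```python
def full_outer_join(left, right):
    """Full outer join of two lists of ints on equality; None marks a missing side."""
    rows = [(l, r) for l in left for r in right if l == r]
    matched_left = {l for l, _ in rows}
    matched_right = {r for _, r in rows}
    rows += [(l, None) for l in left if l not in matched_left]
    rows += [(None, r) for r in right if r not in matched_right]
    return rows


a = [1, 2, 3]  # contents of test1.dat
b = [1, 2]     # contents of test2.dat

# Intended plan: join first, then keep the rows whose d2 is null.
correct = [row for row in full_outer_join(a, b) if row[1] is None]

# Assumed buggy plan: "d2 IS NULL" is applied to b before the join.
# No row of b has a null d2, so b becomes empty and every row of a
# comes out of the join with a null d2.
b_filtered = [d2 for d2 in b if d2 is None]
buggy = full_outer_join(a, b_filtered)

print(correct)
print(buggy)
```

A filter on a null test cannot be moved below a full outer join because the join itself is what introduces the nulls.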

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (PIG-602) Pass global configurations to UDF

2010-07-20 Thread Olga Natkovich (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-602:
--

Assignee: (was: Alan Gates)

 Pass global configurations to UDF
 -

 Key: PIG-602
 URL: https://issues.apache.org/jira/browse/PIG-602
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Yiping Han
 Fix For: 0.8.0


 We are seeking an easy way to pass a large number of global configurations to 
 UDFs.
 Our application contains many Pig jobs and has a large number of 
 configurations. Passing configurations on the command line is not ideal 
 (e.g. modifying a single parameter means changing multiple command lines), 
 and putting everything into the Hadoop conf is not ideal either.
 We would like Pig to provide a facility that lets us pass a configuration 
 file in some format (XML?) and makes it available throughout all the UDFs.


[jira] Updated: (PIG-1434) Allow casting relations to scalars


 [ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-1434:


Attachment: (was: ScalarImpl1.patch)

 Allow casting relations to scalars
 --

 Key: PIG-1434
 URL: https://issues.apache.org/jira/browse/PIG-1434
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: scalarImpl.patch, ScalarImpl1.patch


 This jira is to implement a simplified version of the functionality described 
 in https://issues.apache.org/jira/browse/PIG-801.
 The proposal is to allow casting relations to scalar types in foreach.
 Example:
 A = load 'data' as (x, y, z);
 B = group A all;
 C = foreach B generate COUNT(A);
 .
 X = 
 Y = foreach X generate $1/(long) C;
 A couple of additional comments:
 (1) You can only cast a relation that contains a single value; otherwise an 
 error will be reported.
 (2) Name resolution is needed, since relation X might have a field named C, 
 in which case that field takes precedence.
 (3) Y will look for the C closest to it.
 Implementation thoughts:
 The idea is to store C into a file and then convert it into a scalar via a 
 UDF. I believe we already have a UDF that Ben Reed contributed for this 
 purpose. Most of the work would be to update the logical plan to
 (1) store C
 (2) convert the cast to the UDF
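
 The intended semantics can be modelled in a few lines of Python. This is only a sketch of the proposal, not Pig's implementation: `to_scalar` is a hypothetical helper, the sample rows are made up, X is simply set to A because the example above elides its definition, and Python's `/` is used without modelling Pig's typing rules.

```python
def to_scalar(relation):
    """Return the single value of a one-row, one-column relation, else raise."""
    if len(relation) != 1 or len(relation[0]) != 1:
        raise ValueError("relation used as a scalar must contain exactly one value")
    return relation[0][0]


# A = load 'data' as (x, y, z);   (made-up sample rows)
A = [(1, 10, 100), (2, 20, 200), (3, 30, 300), (4, 40, 400)]

# B = group A all; C = foreach B generate COUNT(A);
C = [(len(A),)]

# Stand-in only: the original example does not say how X is defined.
X = A

# Y = foreach X generate $1/(long) C;
Y = [(row[1] / to_scalar(C),) for row in X]
print(Y)
```

 Casting a relation with more than one value, such as `to_scalar(A)`, raises an error, which corresponds to comment (1).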


[jira] Updated: (PIG-1434) Allow casting relations to scalars


 [ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-1434:


Attachment: ScalarImpl1.patch


[jira] Assigned: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2010-07-20 Thread Olga Natkovich (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-480:
--

Assignee: (was: Ying He)

 PERFORMANCE: Use identity mapper in a chain of M-R jobs
 ---

 Key: PIG-480
 URL: https://issues.apache.org/jira/browse/PIG-480
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Fix For: 0.8.0

 Attachments: PIG_480.patch, PIG_480.patch, PIG_480.patch


 For scripts that compile into two or more MR jobs, use the identity mapper 
 wherever possible in the second and subsequent MR jobs. The identity mapper is 
 about 50% faster than Pig's empty map job because it doesn't parse the data. 


[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890384#action_12890384
]

Daniel Dai commented on PIG-1295:
-

Patch looks pretty good. Thanks Gianmarco! A couple of comments:
1.
PigTupleRawComparatorNew:324,332,343,357,367,377,387,399,416,474,483,501,512,etc.:
if the GeneralizedDataType values are not equal, we should throw an exception to
contain the error.
2. PigTupleRawComparatorNew:455-464: if the comparison of two items is not
equal, we should return the result without comparing additional items; that is
how we get the performance gain.
3. I am unable to run TestPigTupleRawComparator.main due to OOM. What is the
speedup after the change?
4. PigTupleRawComparatorNew:132: we should move the logic of choosing the right
comparator into Pig code, and move the comparator into BinSedesTuple and
DefaultTuple. This is part of the integration work; let's mark it as the first
thing for phase 2.

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
Fix For: 0.8.0

Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch,
PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch,
PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch,
PIG-1295_0.8.patch, PIG-1295_0.9.patch

When the Hadoop framework does the sorting, it will try to use a binary version
of the comparator if one is available. The benefit of a binary comparator is
that we do not need to instantiate the object before we compare. We see a ~30%
speedup after we switch to a binary comparator. Currently, Pig uses a binary
comparator in the following cases:
1. When the semantics of the order do not matter. For example, in distinct, we
need to sort in order to filter out duplicate values; however, we do not care
how the comparator orders the keys. Group-by shares this characteristic. In
this case, we rely on Hadoop's default binary comparator.
2. When the semantics of the order matter, but the key is of a simple type. In
this case, we have implementations for simple types, such as integer, long,
float, chararray, databytearray, string.
However, if the key is a tuple and the sort semantics matter, we do not have
a binary comparator implementation. This especially matters when we switch to
secondary sort. In secondary sort, we convert the inner sort of a nested
foreach into the secondary key and rely on Hadoop to sort on both the main key
and the secondary key. The sorting key becomes a two-item tuple. Since the
secondary key is the sorting key of the nested foreach, the sorting semantics
matter. As a result, we have no binary comparator once we use secondary
sort, and we see a significant slowdown.
A binary comparator for tuples should be doable once we understand the binary
structure of the serialized tuple. We can focus on the most common use case
first, which is a group-by followed by a nested sort. In this case, we will
use secondary sort. The semantics of the first key do not matter, but the
semantics of the secondary key do. We need to identify the boundary between the
main key and the secondary key in the binary tuple buffer without instantiating
the tuple itself. Then, if the first keys are equal, we use a binary comparator
to compare the secondary keys. The secondary key can also be a complex data
type, but as a first step we focus on simple secondary keys, which are the most
common use case.
We mark this issue as a candidate project for the Google Summer of Code 2010
program.
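
The approach described above can be illustrated with a small Python sketch. It is not the patch and does not use Pig's serialization: the byte layout (a 4-byte length, the main key bytes, then a 4-byte signed int as the secondary key) and the names `serialize` and `raw_compare` are assumptions made for this example only.

```python
import struct
from functools import cmp_to_key


def serialize(main_key: bytes, secondary: int) -> bytes:
    """Hypothetical layout: 4-byte length, main key bytes, 4-byte signed int."""
    return struct.pack(">I", len(main_key)) + main_key + struct.pack(">i", secondary)


def raw_compare(buf1: bytes, buf2: bytes) -> int:
    """Compare two serialized (main, secondary) pairs without building tuples."""
    (n1,) = struct.unpack_from(">I", buf1, 0)
    (n2,) = struct.unpack_from(">I", buf2, 0)
    main1, main2 = buf1[4:4 + n1], buf2[4:4 + n2]
    # Main key: any consistent order will do, so compare the raw bytes
    # and return as soon as the keys differ.
    if main1 != main2:
        return -1 if main1 < main2 else 1
    # Main keys are equal: decode only the secondary key, whose order matters.
    (s1,) = struct.unpack_from(">i", buf1, 4 + n1)
    (s2,) = struct.unpack_from(">i", buf2, 4 + n2)
    return (s1 > s2) - (s1 < s2)


records = [serialize(b"k2", 5), serialize(b"k1", 7), serialize(b"k1", -3), serialize(b"k1", 2)]
records.sort(key=cmp_to_key(raw_compare))
secondary_order = [struct.unpack_from(">i", r, len(r) - 4)[0] for r in records]
print(secondary_order)
```

The secondary key is decoded as a signed integer rather than compared byte by byte because the two's-complement bytes of -3 would sort after those of 2; this is the sense in which the sort semantics of the secondary key matter.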


Re: Multiple successors

2010-07-20 Thread Daniel Dai


Hi, Swati,
The only logical operator that can have multiple outputs is LOSplit. So for 
now, it is safe to assume that a logical operator has only one output, except 
for LOSplit.


Daniel

Swati Jain wrote:

I noticed a number of places in the code where the successors of a
LogicalRelationalOperator are accessed as op.successors.get(0). Is it
always the case that logical relational operators (in the new logical
optimizer framework) have only 1 successor? Why don't the rules iterate over
the successors instead of assuming there is a single successor?

An example which shows an LOFilter having multiple successors (correct me if
I am wrong):

A1 = Load(..);
A2 = Load(..);
B = LOFilter(...);
C = LOJoin(A1,B);
D = LOJoin(A2,B);

Thanks!
Swati
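
For readers following this thread, the difference between taking the first successor and iterating over all of them can be shown with a toy plan graph in Python. This is not Pig's plan API: the `Plan` class and the operator names are hypothetical, and the explicit "split" node after the filter is an assumption based on the reply that only LOSplit has multiple outputs.

```python
from collections import defaultdict


class Plan:
    """Tiny stand-in for an operator plan: a DAG keyed by operator name."""

    def __init__(self):
        self._succ = defaultdict(list)

    def connect(self, src, dst):
        self._succ[src].append(dst)

    def successors(self, op):
        return list(self._succ[op])


plan = Plan()
# The filter B feeds an explicit split, which feeds both joins C and D.
for src, dst in [("A1", "C"), ("A2", "D"), ("B", "split"), ("split", "C"), ("split", "D")]:
    plan.connect(src, dst)


def outgoing_edges(plan, op):
    """Walk every successor instead of assuming successors(op)[0] is the only one."""
    return [(op, nxt) for nxt in plan.successors(op)]


print(outgoing_edges(plan, "B"))
print(outgoing_edges(plan, "split"))
```

Under that assumption, `successors("B")[0]` is safe for the filter, while a rule that reaches the split has to iterate.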

[jira] Updated: (PIG-1507) Full outer join fails while doing a filter on joined data


 [ 
https://issues.apache.org/jira/browse/PIG-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1507:


Attachment: PIG-1507-1.patch

 Full outer join fails while doing a filter on joined data
 -

 Key: PIG-1507
 URL: https://issues.apache.org/jira/browse/PIG-1507
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1507-1.patch



[jira] Updated: (PIG-1507) Full outer join fails while doing a filter on joined data


 [ 
https://issues.apache.org/jira/browse/PIG-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1507:


Status: Patch Available  (was: Open)


Re: explicitly close a mr job

2010-07-20 Thread Daniel Dai

You can refer to MrCompiler.startNew. You need to add a store to close the 
current MapReduceOper, create a new MapReduceOper, add a load to it, and then 
add the new MapReduceOper to the MRPlan.
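
The steps above can be sketched in Python as follows. This is an illustration only, not Pig code: the classes below merely borrow the names MapReduceOper and MRPlan, `start_new` is a hypothetical function, and recording an edge between the two jobs is an assumption added for the example.

```python
class MapReduceOper:
    """Stand-in for one MR job with a map-side and a reduce-side operator list."""

    def __init__(self, name):
        self.name = name
        self.map_ops = []
        self.reduce_ops = []


class MRPlan:
    """Stand-in for the MR plan: the jobs plus the dependencies between them."""

    def __init__(self):
        self.opers = []
        self.edges = []

    def add(self, oper):
        self.opers.append(oper)

    def connect(self, src, dst):
        self.edges.append((src.name, dst.name))


def start_new(plan, current, tmp_path, name):
    """Close `current` with a store and open a new job that loads the same path."""
    current.reduce_ops.append(("store", tmp_path))
    nxt = MapReduceOper(name)
    nxt.map_ops.append(("load", tmp_path))
    plan.add(nxt)
    plan.connect(current, nxt)  # assumption: the new job depends on the old one
    return nxt


plan = MRPlan()
job1 = MapReduceOper("job1")
plan.add(job1)
job1.reduce_ops.append(("operator 1",))   # runs in the reduce phase of job 1
job2 = start_new(plan, job1, "tmp/intermediate", "job2")
job2.map_ops.append(("operator 2",))      # runs in the map phase of job 2
print(job1.reduce_ops, job2.map_ops, plan.edges)
```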


Daniel

Gang Luo wrote:

Hi all,
when compiling a physical plan into an MR plan, the current rule is to put as 
many operators as possible into the reduce phase of the current MR job. But 
sometimes we want control over this in the physical plan. Say we want to put 
operator 1 into the reduce phase of the current MR job, end it, and then put 
operator 2 into the map phase of the next MR job (both operators 1 and 2 are 
non-blocking). It seems inserting store and load operators in the physical plan 
doesn't help. Is there a better way to do this than implementing new operators 
(e.g. starter and ender)?


Thanks,
-Gang

[jira] Updated: (PIG-1434) Allow casting relations to scalars


 [ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-1434:


Attachment: ScalarImpl1.patch


[jira] Updated: (PIG-1434) Allow casting relations to scalars


 [ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-1434:


Attachment: (was: ScalarImpl1.patch)

 Allow casting relations to scalars
 --

 Key: PIG-1434
 URL: https://issues.apache.org/jira/browse/PIG-1434
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: scalarImpl.patch



Re: Announcing Howl development list

2010-07-20 Thread Jeff Hammerbacher


  A wiki page outlining Howl is at http://wiki.apache.org/pig/Howl

 A howldev mailing list has been set up on Yahoo! groups for discussions on
 Howl.  You can subscribe by sending mail to
 howldev-subscr...@yahoogroups.com.  We plan on putting the code on github
 in a read only repository.  It will be a few more days before we get there.
  It will be announced on the list when it is.


Awesome, thanks Alan!

[jira] Created: (PIG-1508) Make 'docs' target (forrest) work with Java 1.6

Make 'docs' target (forrest) work with Java 1.6
---

 Key: PIG-1508
 URL: https://issues.apache.org/jira/browse/PIG-1508
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.7.0
Reporter: Carl Steinbach


FOR-984 covers the very inconvenient fact that Forrest 0.8 does not work with 
Java 1.6. The same ticket also suggests a workaround: disabling sitemap and 
stylesheet validation by setting the forrest.validate.sitemap and 
forrest.validate.stylesheets properties to false.
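
As a rough sketch, the workaround amounts to two property settings. The file name forrest.properties is an assumption here; only the two property names and the value false come from the description above.

```properties
# forrest.properties (assumed location): disable the validation steps
# that the FOR-984 workaround says to turn off
forrest.validate.sitemap=false
forrest.validate.stylesheets=false
```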


[jira] Created: (PIG-1509) Add .gitignore file

Add .gitignore file
---

 Key: PIG-1509
 URL: https://issues.apache.org/jira/browse/PIG-1509
 Project: Pig
  Issue Type: Improvement
  Components: build
Reporter: Carl Steinbach


Add a .gitignore file (equivalent to svn:ignore) for those using git-svn.


[jira] Updated: (PIG-1508) Make 'docs' target (forrest) work with Java 1.6


 [ 
https://issues.apache.org/jira/browse/PIG-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Steinbach updated PIG-1508:


Status: Patch Available  (was: Open)

 Make 'docs' target (forrest) work with Java 1.6
 ---

 Key: PIG-1508
 URL: https://issues.apache.org/jira/browse/PIG-1508
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.7.0
Reporter: Carl Steinbach
 Attachments: PIG-1508.patch.txt



[jira] Updated: (PIG-1508) Make 'docs' target (forrest) work with Java 1.6


 [ 
https://issues.apache.org/jira/browse/PIG-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Steinbach updated PIG-1508:


Attachment: PIG-1508.patch.txt

PIG-1508.patch.txt:
* set forrest.validate.sitemap=false in forrest.properties
* Remove java5 specific settings in build.xml
* Remove java5 specific settings in test-patch.sh



[jira] Updated: (PIG-1509) Add .gitignore file


 [ 
https://issues.apache.org/jira/browse/PIG-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Steinbach updated PIG-1509:


Status: Patch Available  (was: Open)

 Add .gitignore file
 ---

 Key: PIG-1509
 URL: https://issues.apache.org/jira/browse/PIG-1509
 Project: Pig
  Issue Type: Improvement
  Components: build
Reporter: Carl Steinbach
 Attachments: PIG-1509.patch.txt



[jira] Updated: (PIG-1509) Add .gitignore file