date:20100302

[Zebra] Restrict schema definition for collection
-

 Key: PIG-1269
 URL: https://issues.apache.org/jira/browse/PIG-1269
 Project: Pig
  Issue Type: Bug
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.7.0
 Attachments: zebra.0302

Currently, the Zebra grammar for schema definitions of collection fields allows 
many kinds of definition. To reduce complexity, remove ambiguity, and, more 
importantly, make the metadata more representative of the actual data 
instances, the grammar rules need to be changed. A collection definition must 
contain a record type, and only a record type. Thus, 
fieldName:collection(record(c1:int, c2:string)) is legal, while 
fieldName:collection(c1:int, c2:string), fieldName:collection(f:record(c1:int, 
c2:string)), fieldName:collection(c1:int), and fieldName:collection(int) are 
illegal.
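
The restriction can be sketched as a small standalone check. This is only an 
illustration in Python, not Zebra's actual parser: the function name 
is_legal_collection, the regular expressions, and the set of primitive type 
names are assumptions made for the example, and nested types inside the record 
are not handled.

```python
import re

# Hypothetical, simplified check; Zebra's real grammar is not reproduced here.
# The primitive type names below are an assumption for illustration.
PRIMITIVES = {"int", "long", "float", "double", "string", "bool", "bytes"}

_COLLECTION = re.compile(r"^(\w+):collection\((.*)\)$")
_RECORD = re.compile(r"^record\((.*)\)$")


def is_legal_collection(definition: str) -> bool:
    """Return True only for fieldName:collection(record(name:type, ...))."""
    outer = _COLLECTION.match(definition.strip())
    if not outer:
        return False
    # The collection body must be exactly one anonymous record.
    record = _RECORD.match(outer.group(2).strip())
    if not record:
        return False
    for field in record.group(1).split(","):
        name, sep, type_name = field.strip().partition(":")
        if not sep or not name.strip().isidentifier():
            return False
        if type_name.strip() not in PRIMITIVES:
            return False
    return True


legal = is_legal_collection("fieldName:collection(record(c1:int, c2:string))")
illegal = [
    is_legal_collection("fieldName:collection(c1:int, c2:string)"),
    is_legal_collection("fieldName:collection(f:record(c1:int, c2:string))"),
    is_legal_collection("fieldName:collection(c1:int)"),
    is_legal_collection("fieldName:collection(int)"),
]
```

Here legal is True and every entry of illegal is False, matching the five 
examples in the description.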

This will affect existing Zebra M/R programs and Pig scripts that use Zebra: a 
schema that was acceptable in the previous release may now be illegal. This 
should be clearly documented.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1269) [Zebra] Restrict schema definition for collection


 [ 
https://issues.apache.org/jira/browse/PIG-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1269:
-

Status: Patch Available  (was: Open)


[jira] Commented: (PIG-1262) Additional findbugs and javac warnings

2010-03-02 Thread Olga Natkovich (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840263#action_12840263
 ] 

Olga Natkovich commented on PIG-1262:
-

+1

 Additional findbugs and javac warnings
 --

 Key: PIG-1262
 URL: https://issues.apache.org/jira/browse/PIG-1262
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.7.0

 Attachments: PIG-1262-1.patch


 Over time, we have introduced some new findbugs and javac warnings. This Jira 
 will fix them.


[jira] Updated: (PIG-1269) [Zebra] Restrict schema definition for collection


 [ 
https://issues.apache.org/jira/browse/PIG-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1269:
-

Status: Open  (was: Patch Available)


[jira] Assigned: (PIG-1238) Dump does not respect the schema


 [ 
https://issues.apache.org/jira/browse/PIG-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai reassigned PIG-1238:
---

Assignee: Daniel Dai

 Dump does not respect the schema
 

 Key: PIG-1238
 URL: https://issues.apache.org/jira/browse/PIG-1238
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Ankur
Assignee: Daniel Dai

 For complex data types and certain sequences of operations, dump produces 
 results containing a field that does not exist in the relation.


[jira] Commented: (PIG-1248) [piggybank] useful String functions

2010-03-02 Thread Bill Graham (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840279#action_12840279
 ] 

Bill Graham commented on PIG-1248:
--

How exactly would split differ from the TOKENIZE function if split returned a 
bag? TOKENIZE returns an unordered bag of words. Having a function that returns 
an ordered tuple of words is very useful IMO. I had to write my own version of 
a tokenize UDF to do this. 
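
To make the distinction concrete, here is a rough Python model of the two 
return shapes. It is not Pig code: tokenize_as_bag and split_as_tuple are 
hypothetical names, and a Counter only approximates a Pig bag as an unordered 
multiset.

```python
from collections import Counter


def tokenize_as_bag(text):
    # A bag carries no ordering; a Counter (multiset) models that loosely.
    return Counter(text.split())


def split_as_tuple(text, delimiter=" "):
    # A tuple keeps each word at the position where it appeared.
    return tuple(text.split(delimiter))


same_bag = tokenize_as_bag("b a b") == tokenize_as_bag("b b a")
same_tuple = split_as_tuple("b a b") == split_as_tuple("b b a")
```

The two inputs give equal bags but different tuples, which is why an ordered 
result cannot be recovered from a bag.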

 [piggybank] useful String functions
 ---

 Key: PIG-1248
 URL: https://issues.apache.org/jira/browse/PIG-1248
 Project: Pig
  Issue Type: New Feature
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.7.0

 Attachments: PIG_1248.diff, PIG_1248.diff, PIG_1248.diff


 Pig ships with very few evalFuncs for working with strings. This jira is for 
 adding a few more.


[jira] Created: (PIG-1270) Push limit into loader

Push limit into loader
--

 Key: PIG-1270
 URL: https://issues.apache.org/jira/browse/PIG-1270
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai


We can optimize the limit operation by stopping early in PigRecordReader. In 
general, we need a way for PigRecordReader and the execution pipeline to 
communicate: POLimit could tell PigRecordReader that it already has enough 
records, and PigRecordReader would then stop feeding more data.
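
A minimal sketch of that signalling, written in Python rather than against 
Pig's classes. RecordReader, stop(), and limit() are hypothetical stand-ins 
for PigRecordReader and POLimit; the real interfaces may look quite different.

```python
class RecordReader:
    """Hypothetical reader that can be told to stop feeding records."""

    def __init__(self, records):
        self._records = iter(records)
        self._stopped = False
        self.records_read = 0

    def stop(self):
        # Called by a downstream operator once it has enough records.
        self._stopped = True

    def __iter__(self):
        return self

    def __next__(self):
        if self._stopped:
            raise StopIteration
        record = next(self._records)
        self.records_read += 1
        return record


def limit(reader, n):
    """Hypothetical limit operator: take n records, then signal the reader."""
    taken = []
    if n <= 0:
        reader.stop()
        return taken
    for record in reader:
        taken.append(record)
        if len(taken) >= n:
            reader.stop()  # the reader ends the loop on its next call
    return taken


reader = RecordReader(range(1000))
first_three = limit(reader, 3)
```

After the call, first_three holds three records and reader.records_read is 3, 
so the remaining 997 inputs are never read.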


[jira] Assigned: (PIG-1263) Script producing varying number of records when COGROUPing value of map data type with and without types

2010-03-02 Thread Olga Natkovich (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1263:
---

Assignee: Daniel Dai

 Script producing varying number of records when COGROUPing value of map data 
 type with and without types
 

 Key: PIG-1263
 URL: https://issues.apache.org/jira/browse/PIG-1263
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.7.0


 I am experimenting with a Pig script (it is not optimized and could be written 
 in a variety of ways). I get different record counts depending on where I 
 place load/store pairs in the script.
 Case 1: Returns 424329 records
 Case 2: Returns 5859 records
 Case 3: Returns 5859 records
 Case 4: Returns 5578 records
 Which result is correct?
 Here are the scripts.
 Case 1: 
 {code}
 register udf.jar
 A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);
 B = FOREACH A GENERATE
 s#'key1' as key1,
 s#'key2' as key2;
 C = FOREACH B generate key2;
 D = filter C by (key2 IS NOT null);
 E = distinct D;
 store E into 'unique_key_list' using PigStorage('\u0001');
 F = Foreach E generate key2, MapGenerate(key2) as m;
 G = FILTER F by (m IS NOT null);
 H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, 
 m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as 
 id8, m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12;
 I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
 id12);
 J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, 
 group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, 
 group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, 
 group.id12 as id12;
 --load previous days data
 K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, 
 id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);
 L = COGROUP  K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
 id12) OUTER,
  J by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
 id12) OUTER;
 M = filter L by IsEmpty(K);
 store M into 'cogroupNoTypes' using PigStorage();
 {code}
 Case 2:  Storing and loading intermediate results in J 
 {code}
 register udf.jar
 A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);
 B = FOREACH A GENERATE
 s#'key1' as key1,
 s#'key2' as key2;
 C = FOREACH B generate key2;
 D = filter C by (key2 IS NOT null);
 E = distinct D;
 store E into 'unique_key_list' using PigStorage('\u0001');
 F = Foreach E generate key2, MapGenerate(key2) as m;
 G = FILTER F by (m IS NOT null);
 H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, 
 m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as 
 id8, m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12;
 I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
 id12);
 J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, 
 group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, 
 group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, 
 group.id12 as id12;
 --store intermediate data to HDFS and re-read
 store J into 'output/20100203/J' using PigStorage('\u0001');
 --load previous days data
 K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, 
 id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);
 --read J into K1
 K1 = LOAD 'output/20100203/J' using PigStorage('\u0001') as (id1, id2, id3, 
 id4, id5, id6, id7, id8, id9, id10, id11, id12);
 L = COGROUP  K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
 id12) OUTER,
  K1 by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
 id12) OUTER;
 M = filter L by IsEmpty(K);
 store M into 'cogroupNoTypesIntStore' using PigStorage();
 {code}
 Case 3: Types information specified but no intermediate store of J
 {code}
 register udf.jar
 A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);
 B = FOREACH A GENERATE
 s#'key1' as key1,
 s#'key2' as key2;
 C = FOREACH B generate key2;
 D = filter C by (key2 IS NOT null);
 E = distinct D;
 store E into 'unique_key_list' using PigStorage('\u0001');
 F = Foreach E generate key2, MapGenerate(key2) as m;
 G = FILTER F by (m IS NOT null);
 H = foreach G generate key2, (long)m#'id1' as id1, (long)m#'id2' as id2, 
 (long)m#'id3' as id3, (long)m#'id4' as id4, (long)m#'id5' as id5, 
 (long)m#'id6' as id6, (long)m#'id7' as id7, (chararray)m#'id8' as id8, 
 (chararray)m#'id9' as id9,

[jira] Created: (PIG-1271) Provide a more flexible data format to load complex field (bag/tuple/map) in PigStorage

Provide a more flexible data format to load complex field (bag/tuple/map) in 
PigStorage
---

 Key: PIG-1271
 URL: https://issues.apache.org/jira/browse/PIG-1271
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai


With [PIG-613|https://issues.apache.org/jira/browse/PIG-613], we are able to 
load text files containing complex data types (map/bag/tuple) according to a 
schema. However, the format of a complex data field is very strict: users have 
to use predetermined special characters to mark the beginning and end of each 
field, and those special characters cannot be used in the content. The goals 
of this issue are:

1. Provide a way for users to escape special characters
2. Make it easy for users to customize Utf8StorageConverter when they have 
their own data format
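
One possible shape for the first goal is backslash escaping, sketched below in 
Python. This is an assumption for illustration only: the issue does not say 
which escape character or which set of special characters would be used, and 
escape/unescape are hypothetical helper names, not part of PigStorage or 
Utf8StorageConverter.

```python
# Assumed delimiter set for tuples (), bags {}, maps [] with #, and commas.
SPECIAL = set("(){}[],#")
ESCAPE = "\\"


def escape(value):
    """Prefix every special character (and the escape itself) with a backslash."""
    out = []
    for ch in value:
        if ch in SPECIAL or ch == ESCAPE:
            out.append(ESCAPE)
        out.append(ch)
    return "".join(out)


def unescape(value):
    """Reverse escape(): drop each backslash and keep the character after it."""
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == ESCAPE:
            # A trailing lone backslash is kept as-is.
            ch = next(chars, ESCAPE)
        out.append(ch)
    return "".join(out)


original = "key#1, (x) {y} [z] \\"
round_trip = unescape(escape(original))
```

With this scheme round_trip equals original, so content containing the 
delimiter characters survives a store and load.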




[jira] Updated: (PIG-1269) [Zebra] Restrict schema definition for collection


 [ 
https://issues.apache.org/jira/browse/PIG-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1269:
-

Attachment: zebra.0302

 [Zebra] Restrict schema definition for collection
 -

 Key: PIG-1269
 URL: https://issues.apache.org/jira/browse/PIG-1269
 Project: Pig
  Issue Type: Bug
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.7.0

 Attachments: zebra.0302, zebra.0302



[jira] Updated: (PIG-1269) [Zebra] Restrict schema definition for collection


 [ 
https://issues.apache.org/jira/browse/PIG-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1269:
-

Attachment: (was: zebra.0302)


[jira] Updated: (PIG-1269) [Zebra] Restrict schema definition for collection


 [ 
https://issues.apache.org/jira/browse/PIG-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1269:
-

Status: Patch Available  (was: Open)


[jira] Commented: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

2010-03-02 Thread Viraj Bhat (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840339#action_12840339
 ] 

Viraj Bhat commented on PIG-1252:
-

A modified version of the script works. Does this have to do with the nested 
foreach? 

{code}
loadData = load '/user/viraj/zebradata' using 
org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
(chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : 
IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 
'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
 
dump finalData;
{code}

 Diamond splitter does not generate correct results when using Multi-query 
 optimization
 --

 Key: PIG-1252
 URL: https://issues.apache.org/jira/browse/PIG-1252
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.7.0


 I have a script that uses split but somehow does not use one of the split 
 branches. The skeleton of the script is as follows:
 {code}
 loadData = load '/user/viraj/zebradata' using 
 org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
 col7');
 prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
 (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
 ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 
 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
 SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
 falseDataTmp IF (validRec == '1' AND splitcond == '');
 grpData = GROUP trueDataTmp BY splitcond;
 finalData = FOREACH grpData {
orderedData = ORDER trueDataTmp BY col1,col2;
GENERATE FLATTEN ( MYUDF (orderedData, 60, 
 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
   }
 dump finalData;
 {code}
 You can see that falseDataTmp is untouched.
 When I run this script with the no-multiquery (-M) option, I get the right 
 result. This could be caused by the complex BinConds in the POLoad. We can 
 also avoid the error by using FILTER instead of SPLIT.
 Viraj


[jira] Created: (PIG-1272) Column pruner causes wrong results

2010-03-02 Thread Viraj Bhat (JIRA)

Column pruner causes wrong results
--

 Key: PIG-1272
 URL: https://issues.apache.org/jira/browse/PIG-1272
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


For a simple script, the column pruner optimization removes certain columns 
from the original relation, which produces wrong results.

The input file kv contains the following columns (tab separated):
{code}
a   1
a   2
a   3
b   4
c   5
c   6
b   7
d   8
{code}

Running the following script in Pig 0.6 produces the output listed after it:

{code}
kv = load 'kv' as (k,v);
keys= foreach kv generate k;
keys = distinct keys; 
keys = limit keys 2;
rejoin = join keys by k, kv by k;
dump rejoin;
{code}

(a,a)
(a,a)
(a,a)
(b,b)
(b,b)


Running this in Pig 0.5, without the column pruner, results in:
(a,a,1)
(a,a,2)
(a,a,3)
(b,b,4)
(b,b,7)

When we disable the ColumnPruner optimization, the script gives the right 
results.
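
For reference, the expected join can be reproduced outside Pig with a few 
lines of Python over the same eight rows. This only models the script's 
semantics. One assumption is labelled in the code: LIMIT after DISTINCT does 
not promise which two keys survive, so the sketch sorts the keys to pick a and 
b as in the report.

```python
# Rows of the input file kv, as (k, v) pairs.
kv = [("a", 1), ("a", 2), ("a", 3), ("b", 4),
      ("c", 5), ("c", 6), ("b", 7), ("d", 8)]


def rejoin_first_keys(rows, how_many):
    # distinct + limit: sorting is an assumption to make the choice repeatable.
    keys = sorted({k for k, _ in rows})[:how_many]
    # join keys by k, kv by k: every output row keeps k from both sides and v.
    return [(key, k, v) for key in keys for (k, v) in rows if key == k]


rejoin = rejoin_first_keys(kv, 2)
for row in rejoin:
    print(row)
```

Each output row has three fields, including v, which is the shape of the 
Pig 0.5 output above rather than the two-field Pig 0.6 output.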

Viraj


[jira] Commented: (PIG-1272) Column pruner causes wrong results

2010-03-02 Thread Viraj Bhat (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840389#action_12840389
 ] 

Viraj Bhat commented on PIG-1272:
-

Now with Pig 0.7 or trunk we have the following error:

2010-03-02 23:35:09,349 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoSuchFieldError: sJobConf
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POJoinPackage.getNext(POJoinPackage.java:110)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:380)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:363)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:240)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:409)
    at org.apache.hadoop.mapred.Child.main(Child.java:159)

Viraj

 Column pruner causes wrong results
 --

 Key: PIG-1272
 URL: https://issues.apache.org/jira/browse/PIG-1272
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.7.0



[jira] Commented: (PIG-1269) [Zebra] Restrict schema definition for collection

2010-03-02 Thread Hadoop QA (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840407#action_12840407
]

Hadoop QA commented on PIG-1269:

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12437638/zebra.0302
against trunk revision 917827.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 63 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac
compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of
release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/219/testReport/
Findbugs warnings:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/219/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/219/console

This message is automatically generated.


[jira] Updated: (PIG-1272) Column pruner causes wrong results


 [ 
https://issues.apache.org/jira/browse/PIG-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1272:


Attachment: PIG-1272-1.patch

 Column pruner causes wrong results
 --

 Key: PIG-1272
 URL: https://issues.apache.org/jira/browse/PIG-1272
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.7.0

 Attachments: PIG-1272-1.patch



[jira] Created: (PIG-1273) Skewed join throws error

Skewed join throws error 
-

 Key: PIG-1273
 URL: https://issues.apache.org/jira/browse/PIG-1273
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Ankur


Skewed join fails when the sampled relation is too small or empty.


[jira] Commented: (PIG-1273) Skewed join throws error


[ 
https://issues.apache.org/jira/browse/PIG-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840482#action_12840482
 ] 

Ankur commented on PIG-1273:


Here is a simple script to reproduce it

a = load 'test.dat' using PigStorage() as (nums:chararray);
b = load 'join.dat' using PigStorage('\u0001') as 
(number:chararray,text:chararray);
c = filter a by nums == '7';
d = join c by nums LEFT OUTER, b by number USING skewed;
dump d;

= test.dat =
1
2
3
4
5

= join.dat =
1^Aone
2^Atwo
3^Athree

where ^A means the Control-A character used as a separator.
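
The following Python sketch models the same data to show why the sampled 
relation ends up empty. It is not Pig code: left_outer_join is a hypothetical 
helper, and the empty result is what ordinary left outer join semantics give, 
not something stated in this report.

```python
def left_outer_join(left, right):
    """Join single-field left rows to (number, text) right rows on the key."""
    joined = []
    for num in left:
        matches = [row for row in right if row[0] == num]
        if matches:
            joined.extend((num,) + row for row in matches)
        else:
            # No match on the right: pad with nulls, as LEFT OUTER does.
            joined.append((num, None, None))
    return joined


test_dat = ["1", "2", "3", "4", "5"]
join_dat = [("1", "one"), ("2", "two"), ("3", "three")]

c = [num for num in test_dat if num == "7"]  # filter a by nums == '7'
d = left_outer_join(c, join_dat)
```

No row in test.dat equals '7', so c is empty and d is an empty list. The left 
side of the skewed join therefore has nothing to sample.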


[jira] Commented: (PIG-1273) Skewed join throws error


[ 
https://issues.apache.org/jira/browse/PIG-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840483#action_12840483
 ] 

Ankur commented on PIG-1273:


Complete stack trace of the error thrown by the 3rd M/R job in the pipeline:

java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask$OldOutputCollector.<init>(MapTask.java:448)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 6 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Empty samples file
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.SkewedPartitioner.configure(SkewedPartitioner.java:128)
    ... 11 more
Caused by: java.lang.RuntimeException: Empty samples file
    at org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil.loadPartitionFile(MapRedUtil.java:128)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.SkewedPartitioner.configure(SkewedPartitioner.java:125)
    ... 11 more



[jira] Commented: (PIG-1272) Column pruner causes wrong results

2010-03-02 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840496#action_12840496
 ] 

Hadoop QA commented on PIG-1272:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12437666/PIG-1272-1.patch
  against trunk revision 917827.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/220/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/220/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/220/console

This message is automatically generated.


[jira] Created: (PIG-1274) Column pruning throws Null pointer exception