[jira] Created: (PIG-1269) [Zebra] Restrict schema definition for collection
[Zebra] Restrict schema definition for collection

                Key: PIG-1269
                URL: https://issues.apache.org/jira/browse/PIG-1269
            Project: Pig
         Issue Type: Bug
           Reporter: Xuefu Zhang
           Assignee: Xuefu Zhang
            Fix For: 0.7.0
        Attachments: zebra.0302

Currently the Zebra grammar for the schema definition of a collection field allows many types of definition. To reduce complexity, remove ambiguity, and, more importantly, make the metadata more representative of the actual data instances, the grammar rules need to be changed: a record type is the only type allowed, and is required, inside a collection definition. Thus fieldName:collection(record(c1:int, c2:string)) is legal, while fieldName:collection(c1:int, c2:string), fieldName:collection(f:record(c1:int, c2:string)), fieldName:collection(c1:int), and fieldName:collection(int) are illegal.

This will have some impact on existing Zebra M/R programs and Pig scripts that use Zebra: schemas acceptable in previous releases may become illegal because of this change. This should be clearly documented.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
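The restriction can be illustrated with a toy validator. This is a hypothetical sketch, not Zebra's actual grammar or parser; it only captures the rule stated above, that a collection body must be a single record(...) definition:

```python
import re

# Hypothetical illustration of the restricted rule: a collection field is
# legal only when its body is exactly one record(...) definition.
# This is NOT Zebra's real schema parser.
_COLLECTION_RULE = re.compile(r"[A-Za-z_]\w*:collection\(record\(.+\)\)")

def is_legal_collection(field_def: str) -> bool:
    """Return True only for fieldName:collection(record(...)) shapes."""
    return _COLLECTION_RULE.fullmatch(field_def) is not None
```

Under this sketch, `fieldName:collection(record(c1:int, c2:string))` passes while the other three forms from the description are rejected.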
[jira] Updated: (PIG-1269) [Zebra] Restrict schema definition for collection
[ https://issues.apache.org/jira/browse/PIG-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1269:

    Status: Patch Available (was: Open)
[jira] Commented: (PIG-1262) Additional findbugs and javac warnings
[ https://issues.apache.org/jira/browse/PIG-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840263#action_12840263 ]

Olga Natkovich commented on PIG-1262:

    +1

Additional findbugs and javac warnings

                Key: PIG-1262
                URL: https://issues.apache.org/jira/browse/PIG-1262
            Project: Pig
         Issue Type: Bug
   Affects Versions: 0.7.0
           Reporter: Daniel Dai
           Assignee: Daniel Dai
            Fix For: 0.7.0
        Attachments: PIG-1262-1.patch

After a while, we have introduced some new findbugs and javac warnings. We will fix them in this jira.
[jira] Updated: (PIG-1269) [Zebra] Restrict schema definition for collection
[ https://issues.apache.org/jira/browse/PIG-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1269:

    Status: Open (was: Patch Available)
[jira] Assigned: (PIG-1238) Dump does not respect the schema
[ https://issues.apache.org/jira/browse/PIG-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai reassigned PIG-1238:

    Assignee: Daniel Dai

Dump does not respect the schema

                Key: PIG-1238
                URL: https://issues.apache.org/jira/browse/PIG-1238
            Project: Pig
         Issue Type: Bug
   Affects Versions: 0.6.0
           Reporter: Ankur
           Assignee: Daniel Dai

For complex data types and certain sequences of operations, dump produces results with a non-existent field in the relation.
[jira] Commented: (PIG-1248) [piggybank] useful String functions
[ https://issues.apache.org/jira/browse/PIG-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840279#action_12840279 ]

Bill Graham commented on PIG-1248:

    How exactly would split differ from the TOKENIZE function if split returned a bag? TOKENIZE returns an unordered bag of words. Having a function that returns an ordered tuple of words is very useful IMO; I had to write my own version of a tokenize UDF to do this.

[piggybank] useful String functions

                Key: PIG-1248
                URL: https://issues.apache.org/jira/browse/PIG-1248
            Project: Pig
         Issue Type: New Feature
           Reporter: Dmitriy V. Ryaboy
           Assignee: Dmitriy V. Ryaboy
            Fix For: 0.7.0
        Attachments: PIG_1248.diff, PIG_1248.diff, PIG_1248.diff

Pig ships with very few EvalFuncs for working with strings. This jira is for adding a few more.
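The distinction in the comment (unordered bag vs. ordered tuple) can be sketched in plain Python; the function names here are hypothetical stand-ins, not piggybank UDFs:

```python
from collections import Counter

def tokenize_bag(text: str) -> Counter:
    # TOKENIZE-like behaviour: a bag is an unordered multiset,
    # so word order is not preserved.
    return Counter(text.split())

def split_ordered(text: str, delim: str = " ") -> tuple:
    # The split the commenter asks for: an ordered tuple of words.
    return tuple(text.split(delim))
```

Two texts with the same words in different order produce equal bags but different ordered tuples, which is exactly why a bag-returning split would not subsume TOKENIZE's ordered counterpart.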
[jira] Created: (PIG-1270) Push limit into loader
Push limit into loader

                Key: PIG-1270
                URL: https://issues.apache.org/jira/browse/PIG-1270
            Project: Pig
         Issue Type: Bug
         Components: impl
   Affects Versions: 0.7.0
           Reporter: Daniel Dai
           Assignee: Daniel Dai

We can optimize the limit operation by stopping early in PigRecordReader. In general, we need a way to communicate between PigRecordReader and the execution pipeline: POLimit could instruct PigRecordReader that it already has enough records and should stop feeding more data.
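A minimal sketch of the proposed communication channel, using hypothetical names (LimitSignal, record_reader, limit_op) rather than Pig's actual classes: the limit operator flips a shared flag once it has enough records, and the reader checks the flag before feeding more data.

```python
class LimitSignal:
    """Hypothetical shared flag between the reader and the limit operator."""
    def __init__(self):
        self.done = False

def record_reader(source, signal):
    # Stand-in for PigRecordReader: stop feeding once the pipeline says so.
    for rec in source:
        if signal.done:
            break
        yield rec

def limit_op(records, n, signal):
    # Stand-in for POLimit: take n records, then tell the reader to stop.
    out = []
    for rec in records:
        out.append(rec)
        if len(out) == n:
            signal.done = True
            break
    return out
```

With this wiring, the reader never scans past the point where the limit was satisfied, which is the early-stop behaviour the issue describes.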
[jira] Assigned: (PIG-1263) Script producing varying number of records when COGROUPing value of map data type with and without types
[ https://issues.apache.org/jira/browse/PIG-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1263:

    Assignee: Daniel Dai

Script producing varying number of records when COGROUPing value of map data type with and without types

                Key: PIG-1263
                URL: https://issues.apache.org/jira/browse/PIG-1263
            Project: Pig
         Issue Type: Bug
         Components: impl
   Affects Versions: 0.6.0
           Reporter: Viraj Bhat
           Assignee: Daniel Dai
            Fix For: 0.7.0

I have a Pig script which I am experimenting upon. [[Albeit this is not optimized and can be done in a variety of ways.]] I get different record counts by placing load/store pairs in the script:

Case 1: returns 424329 records
Case 2: returns 5859 records
Case 3: returns 5859 records
Case 4: returns 5578 records

I am wondering what the correct result is. Here are the scripts.

Case 1:
{code}
register udf.jar
A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);
B = FOREACH A GENERATE s#'key1' as key1, s#'key2' as key2;
C = FOREACH B generate key2;
D = filter C by (key2 IS NOT null);
E = distinct D;
store E into 'unique_key_list' using PigStorage('\u0001');
F = Foreach E generate key2, MapGenerate(key2) as m;
G = FILTER F by (m IS NOT null);
H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as id8, m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12;
I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);
J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, group.id4 as id4, group.id5 as id5, group.id6 as id6, group.id7 as id7, group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, group.id12 as id12;
--load previous days data
K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);
L = COGROUP K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER, J by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER;
M = filter L by IsEmpty(K);
store M into 'cogroupNoTypes' using PigStorage();
{code}

Case 2: storing and loading the intermediate result J
{code}
register udf.jar
A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);
B = FOREACH A GENERATE s#'key1' as key1, s#'key2' as key2;
C = FOREACH B generate key2;
D = filter C by (key2 IS NOT null);
E = distinct D;
store E into 'unique_key_list' using PigStorage('\u0001');
F = Foreach E generate key2, MapGenerate(key2) as m;
G = FILTER F by (m IS NOT null);
H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as id8, m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12;
I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);
J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, group.id4 as id4, group.id5 as id5, group.id6 as id6, group.id7 as id7, group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, group.id12 as id12;
--store intermediate data to HDFS and re-read
store J into 'output/20100203/J' using PigStorage('\u0001');
--load previous days data
K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);
--read J into K1
K1 = LOAD 'output/20100203/J' using PigStorage('\u0001') as (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);
L = COGROUP K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER, K1 by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER;
M = filter L by IsEmpty(K);
store M into 'cogroupNoTypesIntStore' using PigStorage();
{code}

Case 3: type information specified but no intermediate store of J
{code}
register udf.jar
A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);
B = FOREACH A GENERATE s#'key1' as key1, s#'key2' as key2;
C = FOREACH B generate key2;
D = filter C by (key2 IS NOT null);
E = distinct D;
store E into 'unique_key_list' using PigStorage('\u0001');
F = Foreach E generate key2, MapGenerate(key2) as m;
G = FILTER F by (m IS NOT null);
H = foreach G generate key2, (long)m#'id1' as id1, (long)m#'id2' as id2, (long)m#'id3' as id3, (long)m#'id4' as id4, (long)m#'id5' as id5, (long)m#'id6' as id6, (long)m#'id7' as id7, (chararray)m#'id8' as id8, (chararray)m#'id9' as id9,
[jira] Created: (PIG-1271) Provide a more flexible data format to load complex field (bag/tuple/map) in PigStorage
Provide a more flexible data format to load complex field (bag/tuple/map) in PigStorage

                Key: PIG-1271
                URL: https://issues.apache.org/jira/browse/PIG-1271
            Project: Pig
         Issue Type: Bug
   Affects Versions: 0.7.0
           Reporter: Daniel Dai
           Assignee: Daniel Dai

With [PIG-613|https://issues.apache.org/jira/browse/PIG-613], we are able to load text files containing complex data types (map/bag/tuple) according to a schema. However, the format of a complex data field is very strict: users have to use pre-determined special characters to mark the beginning and end of each field, and those special characters cannot appear in the content. The goals of this issue are:

1. Provide a way for users to escape special characters.
2. Make it easy for users to customize Utf8StorageConverter when they have their own data format.
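Goal 1 (escaping) could look like this sketch: a backslash-aware parser for a tuple field, so the delimiter character can also appear inside field content. The parse_tuple helper is hypothetical and is not part of Utf8StorageConverter:

```python
def parse_tuple(field: str) -> tuple:
    """Parse '(a,b\\,c)'-style tuple text, treating backslash as an escape
    so delimiter characters can appear in the content.
    Hypothetical sketch, not Pig's actual converter logic."""
    if not (field.startswith("(") and field.endswith(")")):
        raise ValueError("not a tuple literal")
    parts, cur, i = [], [], 1
    while i < len(field) - 1:
        ch = field[i]
        if ch == "\\":       # escape: keep the next character literally
            i += 1
            cur.append(field[i])
        elif ch == ",":      # unescaped delimiter ends the current field
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
        i += 1
    parts.append("".join(cur))
    return tuple(parts)
```

The same escape convention would extend to the bag and map delimiters; the key point is that an escaped delimiter is data, an unescaped one is structure.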
[jira] Updated: (PIG-1269) [Zebra] Restrict schema definition for collection
[ https://issues.apache.org/jira/browse/PIG-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1269:

    Attachment: zebra.0302
[jira] Updated: (PIG-1269) [Zebra] Restrict schema definition for collection
[ https://issues.apache.org/jira/browse/PIG-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1269:

    Attachment: (was: zebra.0302)
[jira] Updated: (PIG-1269) [Zebra] Restrict schema definition for collection
[ https://issues.apache.org/jira/browse/PIG-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1269:

    Status: Patch Available (was: Open)
[jira] Commented: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840339#action_12840339 ]

Viraj Bhat commented on PIG-1252:

    A modified version of the script works; does this have to do with the nested foreach?

{code}
loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7');
prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == '');
grpData = GROUP trueDataTmp BY splitcond;
finalData = FOREACH grpData GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
dump finalData;
{code}

Diamond splitter does not generate correct results when using Multi-query optimization

                Key: PIG-1252
                URL: https://issues.apache.org/jira/browse/PIG-1252
            Project: Pig
         Issue Type: Bug
   Affects Versions: 0.6.0
           Reporter: Viraj Bhat
           Assignee: Richard Ding
            Fix For: 0.7.0

I have a script which uses split but somehow does not use one of the split branches. The skeleton of the script is as follows:

{code}
loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7');
prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == '');
grpData = GROUP trueDataTmp BY splitcond;
finalData = FOREACH grpData {
    orderedData = ORDER trueDataTmp BY col1, col2;
    GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
}
dump finalData;
{code}

You can see that falseDataTmp is untouched. When I run this script with the no-multiquery (-M) option I get the right result. This could be the result of complex BinConds in the POLoad. We can get rid of this error by using FILTER instead of SPLIT.

Viraj
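The FILTER workaround mentioned above relies on SPLIT being semantically equivalent to independent filters over the same input. A sketch of that equivalence, with a hypothetical Python split helper standing in for Pig's operator:

```python
def split(rows, *preds):
    # SPLIT r INTO a IF p1, b IF p2 behaves like two independent filters
    # over the same relation; rows matching neither predicate are dropped.
    return tuple([r for r in rows if p(r)] for p in preds)
```

Because each output branch is just a filter, replacing an (unused-branch) SPLIT with a single FILTER preserves the results while sidestepping the multi-query diamond plan.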
[jira] Created: (PIG-1272) Column pruner causes wrong results
Column pruner causes wrong results

                Key: PIG-1272
                URL: https://issues.apache.org/jira/browse/PIG-1272
            Project: Pig
         Issue Type: Bug
         Components: impl
   Affects Versions: 0.6.0
           Reporter: Viraj Bhat
            Fix For: 0.7.0

For a simple script, the column pruner optimization removes certain columns from the original relation, which leads to wrong results. Input file kv contains the following tab-separated columns:

{code}
a 1
a 2
a 3
b 4
c 5
c 6
b 7
d 8
{code}

Now running this script in Pig 0.6:

{code}
kv = load 'kv' as (k,v);
keys = foreach kv generate k;
keys = distinct keys;
keys = limit keys 2;
rejoin = join keys by k, kv by k;
dump rejoin;
{code}

produces:

(a,a)
(a,a)
(a,a)
(b,b)
(b,b)

Running this in Pig 0.5, without the column pruner, results in:

(a,a,1)
(a,a,2)
(a,a,3)
(b,b,4)
(b,b,7)

When we disable the ColumnPruner optimization it gives the right results.

Viraj
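The expected (pre-pruner) behaviour can be checked with a plain-Python rendering of the script. One assumption: distinct and limit carry no ordering guarantee in Pig, so taking keys a and b simply matches the run reported above.

```python
kv = [("a", 1), ("a", 2), ("a", 3), ("b", 4),
      ("c", 5), ("c", 6), ("b", 7), ("d", 8)]

# distinct + limit 2, assuming the run happened to keep 'a' and 'b'
keys = sorted({k for k, _ in kv})[:2]

# join keys by k, kv by k: the value column from kv must survive the join,
# which is exactly the column the pruner incorrectly drops in Pig 0.6.
rejoin = [(k, k2, v) for k in keys for (k2, v) in kv if k2 == k]
```

The three-column tuples here match the Pig 0.5 output in the report; the pruned two-column output loses the v field entirely.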
[jira] Commented: (PIG-1272) Column pruner causes wrong results
[ https://issues.apache.org/jira/browse/PIG-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840389#action_12840389 ]

Viraj Bhat commented on PIG-1272:

    With Pig 0.7 or trunk we now have the following error:

2010-03-02 23:35:09,349 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoSuchFieldError: sJobConf
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POJoinPackage.getNext(POJoinPackage.java:110)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:380)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:363)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:240)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:409)
    at org.apache.hadoop.mapred.Child.main(Child.java:159)

Viraj
[jira] Commented: (PIG-1269) [Zebra] Restrict schema definition for collection
[ https://issues.apache.org/jira/browse/PIG-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840407#action_12840407 ]

Hadoop QA commented on PIG-1269:

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12437638/zebra.0302
    against trunk revision 917827.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 63 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/219/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/219/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/219/console

    This message is automatically generated.
[jira] Updated: (PIG-1272) Column pruner causes wrong results
[ https://issues.apache.org/jira/browse/PIG-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1272:

    Attachment: PIG-1272-1.patch
[jira] Created: (PIG-1273) Skewed join throws error
Skewed join throws error

                Key: PIG-1273
                URL: https://issues.apache.org/jira/browse/PIG-1273
            Project: Pig
         Issue Type: Bug
   Affects Versions: 0.6.0
           Reporter: Ankur

When the sampled relation is too small or empty, the skewed join fails.
[jira] Commented: (PIG-1273) Skewed join throws error
[ https://issues.apache.org/jira/browse/PIG-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840482#action_12840482 ]

Ankur commented on PIG-1273:

    Here is a simple script to reproduce it:

    a = load 'test.dat' using PigStorage() as (nums:chararray);
    b = load 'join.dat' using PigStorage('\u0001') as (number:chararray,text:chararray);
    c = filter a by nums == '7';
    d = join c by nums LEFT OUTER, b by number USING skewed;
    dump d;

    test.dat:
    1
    2
    3
    4
    5

    join.dat:
    1^Aone
    2^Atwo
    3^Athree

    where ^A is the Control-A character used as a separator.
[jira] Commented: (PIG-1273) Skewed join throws error
[ https://issues.apache.org/jira/browse/PIG-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840483#action_12840483 ]

Ankur commented on PIG-1273:

    Complete stack trace of the error thrown by the 3rd M/R job in the pipeline:

java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask$OldOutputCollector.<init>(MapTask.java:448)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 6 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Empty samples file
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.SkewedPartitioner.configure(SkewedPartitioner.java:128)
    ... 11 more
Caused by: java.lang.RuntimeException: Empty samples file
    at org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil.loadPartitionFile(MapRedUtil.java:128)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.SkewedPartitioner.configure(SkewedPartitioner.java:125)
    ... 11 more
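The root cause is the "Empty samples file" raised while loading the partition file. One defensive direction (an assumption sketched here, not the committed Pig fix) is for the partition loader to tolerate an empty samples file by falling back to a single partition instead of raising:

```python
def load_partition_boundaries(samples, num_reducers):
    # Sketch of what a loadPartitionFile-style routine could do instead of
    # raising "Empty samples file": with no samples, return no boundaries,
    # i.e. route every key to a single reducer. Hypothetical behaviour.
    if not samples:
        return []
    # Otherwise pick evenly spaced boundary keys from the sorted samples.
    step = max(1, len(samples) // num_reducers)
    return sorted(samples)[step - 1::step][:num_reducers - 1]
```

An empty boundary list degrades gracefully to the non-skewed single-partition case, which is exactly the scenario of a tiny or empty sampled relation described in this issue.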
[jira] Commented: (PIG-1272) Column pruner causes wrong results
[ https://issues.apache.org/jira/browse/PIG-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840496#action_12840496 ]

Hadoop QA commented on PIG-1272:

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12437666/PIG-1272-1.patch
    against trunk revision 917827.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 3 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/220/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/220/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/220/console

    This message is automatically generated.
[jira] Created: (PIG-1274) Column pruning throws Null pointer exception
Column pruning throws Null pointer exception

                Key: PIG-1274
                URL: https://issues.apache.org/jira/browse/PIG-1274
            Project: Pig
         Issue Type: Bug
           Reporter: Ankur

If the data has missing values for certain columns in a relation participating in a join, column pruning throws a null pointer exception.