[jira] Commented: (PIG-1145) [zebra] merge join on large table (100,000,000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789188#action_12789188 ]

Jing Huang commented on PIG-1145:
---------------------------------

Found another failure on merge join. This merge join script failed:

{noformat}
register $zebraJar;
--fs -rmr $outputDir
--a1 = LOAD '$inputDir/unsorted1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
--a2 = LOAD '$inputDir/unsorted2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
--sort1 = order a1 by byte2;
--sort2 = order a2 by byte2;
--store sort1 into '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2];[byte2]');
--store sort2 into '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2];[byte2]');
rec1 = load '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');
rec2 = load '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');
joina = join rec1 by byte2, rec2 by byte2 using merge ;
E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2, $4 as byte2;
store E into '$outputDir/bad1' using org.apache.hadoop.zebra.pig.TableStorer('');
{noformat}

Instead, this similar script works with the previous patch:

{noformat}
register $zebraJar;
--fs -rmr $outputDir
a1 = LOAD '$inputDir/unsorted1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
a2 = LOAD '$inputDir/unsorted2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
sort1 = order a1 by byte2;
sort2 = order a2 by byte2;
store sort1 into '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2,byte2]');
store sort2 into '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2,byte2]');
rec1 = load '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');
rec2 = load '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');
joina = join rec1 by byte2, rec2 by byte2 using merge ;
E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2, $4 as byte2;
store E into '$outputDir/join3' using org.apache.hadoop.zebra.pig.TableStorer('');
{noformat}

Here is the stack trace:

{noformat}
Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error processing right input during merge join
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.io.EOFException: No key-value to read
        at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590)
        at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611)
        at org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
        at org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
        at org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1083)
        at org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
        at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
        ... 9 more
{noformat}

This is how I run it (I disabled pruning to simplify the possible problem):

{noformat}
java -cp /grid/0/dev/hadoopqa/jing1234/conf:/grid/0/dev/hadoopqa/jars/pig.jar:/grid/0/dev/hadoopqa/jars/tfile.jar:/grid/0/dev/hadoopqa/jars/zebra.jar org.apache.pig.Main -m config -M -t PruneColumns bad_join.pig
{noformat}
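For context, the `using merge` join in the scripts above relies on both inputs being sorted on the join key: the left side is streamed, and the right side is seeked forward to each left key (which is what `seekNear` and `getNextRightInp` are doing in the stack traces). A minimal sketch of that algorithm follows, using illustrative names and plain int arrays rather than Pig's actual classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of a sort-merge join over two inputs that are already
// sorted on the join key. The left input is streamed; the right input is
// only ever advanced forward ("seeked near" the current left key), which
// is why merge join breaks if either input is not truly sorted.
public class MergeJoinSketch {
    static List<int[]> mergeJoin(int[] left, int[] right) {
        List<int[]> out = new ArrayList<>();
        int r = 0;
        for (int l = 0; l < left.length; l++) {
            // Advance the right cursor to the first key >= the left key.
            while (r < right.length && right[r] < left[l]) r++;
            // Emit one output pair per matching right record.
            int rr = r;
            while (rr < right.length && right[rr] == left[l]) {
                out.add(new int[]{left[l], right[rr]});
                rr++;
            }
        }
        return out;
    }
}
```

Because the right cursor never moves backward past `r`, each right record is scanned at most a bounded number of times, giving the single-pass behavior the merge join optimization is after.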
[jira] Commented: (PIG-1145) [zebra] merge join on large table (100,000,000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789196#action_12789196 ]

Yan Zhou commented on PIG-1145:
-------------------------------

Actually, with pruning enabled the exception stack is:

{noformat}
Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error processing right input during merge join
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:186)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.io.IOException: seekTo() failed: Column Groups are not evenly positioned.
        at org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.seekTo(BasicTable.java:1148)
        at org.apache.hadoop.zebra.mapred.TableRecordReader.seekTo(TableRecordReader.java:120)
        at org.apache.hadoop.zebra.pig.TableLoader.seekNear(TableLoader.java:190)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.seekInRightStream(POMergeJoin.java:406)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:184)
        ... 9 more
{noformat}

[zebra] merge join on large table (100,000,000 rows zebra table) failed
-----------------------------------------------------------------------

                 Key: PIG-1145
                 URL: https://issues.apache.org/jira/browse/PIG-1145
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.6.0, 0.7.0
            Reporter: Jing Huang
            Assignee: Yan Zhou
             Fix For: 0.6.0, 0.7.0
         Attachments: PIG-1145.patch, PIG-1145.patch

Pig script:

{noformat}
register $zebraJar;
--fs -rmr $outputDir
a1 = LOAD '$inputDir/unsorted1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
a2 = LOAD '$inputDir/unsorted2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
sort1 = order a1 by str2;
sort2 = order a2 by str2;
--store sort1 into '$outputDir/sorted11' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
--store sort2 into '$outputDir/sorted21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
rec1 = load '$outputDir/sorted11' using org.apache.hadoop.zebra.pig.TableLoader();
rec2 = load '$outputDir/sorted21' using org.apache.hadoop.zebra.pig.TableLoader();
joina = join rec1 by str2, rec2 by str2 using merge ;
--E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2;
store joina into '$outputDir/join1' using org.apache.hadoop.zebra.pig.TableStorer('');
{noformat}

Stacktrace:

{noformat}
org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error processing right input during merge join
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
        at
{noformat}
[jira] Updated: (PIG-1145) [zebra] merge join on large table (100,000,000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1145:
--------------------------

    Status: Open  (was: Patch Available)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1145) [zebra] merge join on large table (100,000,000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1145:
--------------------------

    Attachment: PIG-1145.patch
[jira] Updated: (PIG-1145) [zebra] merge join on large table (100,000,000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1145:
--------------------------

    Status: Patch Available  (was: Open)
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-965:
---------------------------

    Attachment:     (was: poregex2.patch)

PERFORMANCE: optimize common case in matches (PORegex)
------------------------------------------------------

                 Key: PIG-965
                 URL: https://issues.apache.org/jira/browse/PIG-965
             Project: Pig
          Issue Type: Improvement
          Components: impl
            Reporter: Thejas M Nair
            Assignee: Ankit Modi

Some frequently seen use cases of the 'matches' comparison operator have the following properties:

1. The rhs is a constant string, e.g. c1 matches 'abc%'.
2. Regexes that match a prefix, suffix, or substring are very common, e.g. 'abc%', '%abc', '%abc%'.

To optimize for these common cases, PORegex.java can be changed to:

1. Compile the pattern (the rhs of matches) and re-use it if the pattern string has not changed.
2. Use string comparisons for the simple common regexes in 2 above.

The implementation of Hive's LIKE clause uses similar optimizations.
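The two proposed optimizations can be sketched like this. This is illustrative code, not the actual PORegex.java change; note that Pig's matches operator takes Java regex syntax, so the 'abc%' shapes above correspond to patterns like "abc.*":

```java
import java.util.regex.Pattern;

// Hedged sketch of the PORegex optimizations: (1) cache the compiled
// Pattern and reuse it while the rhs string is unchanged, and (2) answer
// simple prefix/suffix/substring patterns with plain string operations
// instead of invoking the regex engine at all.
public class CachedMatcher {
    private String lastPattern;   // last rhs seen
    private Pattern compiled;     // reused while the rhs is unchanged

    public boolean matches(String value, String rhs) {
        // Fast path for ".*abc.*": a plain substring test.
        if (rhs.length() >= 4 && rhs.startsWith(".*") && rhs.endsWith(".*")
                && isLiteral(rhs.substring(2, rhs.length() - 2)))
            return value.contains(rhs.substring(2, rhs.length() - 2));
        // Fast path for "abc.*": a prefix test.
        if (rhs.endsWith(".*") && isLiteral(rhs.substring(0, rhs.length() - 2)))
            return value.startsWith(rhs.substring(0, rhs.length() - 2));
        // Fast path for ".*abc": a suffix test.
        if (rhs.startsWith(".*") && isLiteral(rhs.substring(2)))
            return value.endsWith(rhs.substring(2));
        // General case: recompile only when the pattern string changes.
        if (!rhs.equals(lastPattern)) {
            compiled = Pattern.compile(rhs);
            lastPattern = rhs;
        }
        return compiled.matcher(value).matches();
    }

    private static boolean isLiteral(String s) {
        // Only take a fast path when the inner text has no regex metacharacters.
        return s.chars().allMatch(Character::isLetterOrDigit);
    }
}
```

Caching the compiled pattern pays off because the rhs is a constant in the common case, so every tuple after the first reuses the same Pattern object.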
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-965:
---------------------------

    Status: Patch Available  (was: Open)
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-965:
---------------------------

    Attachment: automaton.jar
                poregex2.patch

New patch with comments removed, and automaton.jar added from http://www.brics.dk/~amoeller/automaton/automaton.jar. The patch fails findBugs due to missing symbols; I ran findBugs after adding the jar to the build, and it did not report any warnings in the modified and added files.
[jira] Commented: (PIG-1142) Got NullPointerException merge join with pruning
[ https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789215#action_12789215 ]

Hadoop QA commented on PIG-1142:
--------------------------------

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12427671/PIG-1142-2.patch
against trunk revision 889346.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 3 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/115/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/115/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/115/console

This message is automatically generated.

Got NullPointerException merge join with pruning
------------------------------------------------

                 Key: PIG-1142
                 URL: https://issues.apache.org/jira/browse/PIG-1142
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.6.0
            Reporter: Jing Huang
            Assignee: Daniel Dai
             Fix For: 0.6.0
         Attachments: PIG-1142-1.patch, PIG-1142-2.patch

Here is my pig script:

{noformat}
register $zebraJar;
--fs -rmr $outputDir
a1 = LOAD '$inputDir/small1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
a2 = LOAD '$inputDir/small2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
sort1 = order a1 by str2;
sort2 = order a2 by str2;
--store sort1 into '$outputDir/smallsorted11' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
--store sort2 into '$outputDir/smallsorted21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
rec1 = load '$outputDir/smallsorted11' using org.apache.hadoop.zebra.pig.TableLoader();
rec2 = load '$outputDir/smallsorted21' using org.apache.hadoop.zebra.pig.TableLoader();
joina = join rec1 by str2, rec2 by str2 using merge ;
E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2;
--limitedVals = LIMIT E 5;
--dump limitedVals;
store E into '$outputDir/smalljoin2' using org.apache.hadoop.zebra.pig.TableStorer('');
{noformat}

Here is the stacktrace:

{noformat}
java.lang.NullPointerException
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:312)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.extractKeysFromTuple(POMergeJoin.java:464)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:341)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:159)
{noformat}
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-965:
---------------------------

    Status: Open  (was: Patch Available)

One small change to JarManager.java is missing. Will add a new patch with it.
[jira] Updated: (PIG-1106) FR join should not spill
[ https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-1106:
----------------------------

    Status: Patch Available  (was: Open)

This patch does not have any unit tests.

FR join should not spill
------------------------

                 Key: PIG-1106
                 URL: https://issues.apache.org/jira/browse/PIG-1106
             Project: Pig
          Issue Type: Bug
            Reporter: Olga Natkovich
            Assignee: Ankit Modi
             Fix For: 0.7.0
         Attachments: frjoin-nonspill.patch

Currently, the values for the replicated side of the data are placed in a spillable bag (POFRJoin, near line 275). This does not make sense, because the whole point of the optimization is that the data on one side fits into memory. We already have a non-spillable bag implemented (NonSpillableDataBag.java), and we need to change the FRJoin code to use it. And of course we need to do lots of testing to make sure that we don't spill but die instead when we run out of memory.
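The design choice here can be sketched as follows. This is a hypothetical, simplified shape of what a non-spillable bag provides (Pig's real NonSpillableDataBag implements the DataBag interface over tuples); the point is that the replicated side lives in a plain in-memory list, so an oversized input fails fast with OutOfMemoryError instead of quietly spilling to disk and defeating the fragment-replicate optimization:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hedged sketch of a non-spillable bag (illustrative names, not Pig's
// actual class): records are kept in an in-memory ArrayList with no
// spill-to-disk fallback, matching the FR join assumption that the
// replicated side always fits in memory.
public class NonSpillableBag<T> implements Iterable<T> {
    private final List<T> tuples = new ArrayList<>();

    public void add(T t) { tuples.add(t); }   // may OOM; never spills

    public long size() { return tuples.size(); }

    @Override
    public Iterator<T> iterator() { return tuples.iterator(); }
}
```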
[jira] Commented: (PIG-1106) FR join should not spill
[ https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789294#action_12789294 ] Ankit Modi commented on PIG-1106:
---
The tests I ran used two files with the format: f1: random chararray(100), f2: random int. The left-side file contained 100 tuples and the right-side file contained 3 million tuples.

Code
{noformat}
A = load 'leftsidefrjoin.txt' as (key, value);
B = load 'rightsidefrjoin.txt' as (key, value);
C = join A by key left, B by key using repl; -- fragmented input and replicated input
store C into 'output';
{noformat}

This generated the following error:
{noformat}
FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
 at java.util.ArrayList.<init>(ArrayList.java:112)
 at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
 at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:369)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:288)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.setUpHashMap(POFRJoin.java:351)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.getNext(POFRJoin.java:211)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:250)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:241)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
{noformat}

I ran the same job with the same records on the left-hand side and 100K records on the right-hand side. The job completed successfully.

FR join should not spill
Key: PIG-1106
URL: https://issues.apache.org/jira/browse/PIG-1106
Project: Pig
Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Ankit Modi
Fix For: 0.7.0
Attachments: frjoin-nonspill.patch

Currently, the values for the replicated side of the data are placed in a spillable bag (POFRJoin, near line 275). This does not make sense, because the whole point of the optimization is that the data on one side fits into memory. We already have a non-spillable bag implemented (NonSpillableDataBag.java), and we need to change the FRJoin code to use it. And of course we need to do lots of testing to make sure that we don't spill but instead die when we run out of memory.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
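The OOM above hits during the build phase of the join. For context, here is a minimal sketch of what a fragment-replicate join does (illustrative Python with invented names, not POFRJoin's actual code): the whole replicated side is loaded into an in-memory table, which is exactly why a spillable bag makes no sense there.

```python
# Illustrative fragment-replicate (broadcast) join sketch. The build
# phase materializes the entire replicated input in memory -- the map
# that PIG-1106 argues should be a non-spillable structure -- and the
# probe phase streams the large, fragmented input against it.
from collections import defaultdict

def fr_join(fragment, replicated):
    """Left outer join on field 0; 'replicated' must fit in memory."""
    # Build phase: hash the small, replicated side by its join key.
    table = defaultdict(list)
    for tup in replicated:
        table[tup[0]].append(tup)
    # Probe phase: stream the large side, pad with nulls on no match
    # (left outer semantics, assuming 2-field right tuples here).
    out = []
    for tup in fragment:
        matches = table.get(tup[0])
        if matches:
            for m in matches:
                out.append(tup + m)
        else:
            out.append(tup + (None, None))
    return out
```

With the large side on the probe path, only the replicated side's size matters for memory, which is why the 100K-row right side succeeded where 3 million rows did not.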
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Modi updated PIG-965:
---
Attachment: (was: automaton.jar)

PERFORMANCE: optimize common case in matches (PORegex)
Key: PIG-965
URL: https://issues.apache.org/jira/browse/PIG-965
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi

Some frequently seen use cases of the 'matches' comparison operator have the following properties:
1. The rhs is a constant string, e.g. c1 matches 'abc%'.
2. Regexes that look for a matching prefix, suffix, etc. are very common, e.g. 'abc%', '%abc', '%abc%'.

To optimize for these common cases, PORegex.java can be changed to:
1. Compile the pattern (rhs of matches) and re-use it if the pattern string has not changed.
2. Use string comparisons for the simple common regexes (in 2 above).

The implementation of Hive's LIKE clause uses similar optimizations.
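The two proposed optimizations can be sketched as follows (a Python illustration of the idea only, not the PORegex.java change; the %-wildcard notation follows the examples in the issue, and the cache here is deliberately simplistic):

```python
# Sketch of PIG-965's two fast paths: (1) cache the compiled pattern and
# re-use it while the pattern string is unchanged; (2) replace the common
# prefix/suffix/substring shapes with plain string comparisons.
import re

_cache = {"pattern": None, "compiled": None}

def matches(value, pattern):
    body = pattern.strip("%")
    # Fast path: only simple literal bodies qualify for string ops.
    if re.fullmatch(r"\w*", body):
        if pattern.startswith("%") and pattern.endswith("%"):
            return body in value            # '%abc%' -> substring test
        if pattern.endswith("%"):
            return value.startswith(body)   # 'abc%'  -> prefix test
        if pattern.startswith("%"):
            return value.endswith(body)     # '%abc'  -> suffix test
    # General case: compile once, re-use until the pattern changes.
    if _cache["pattern"] != pattern:
        _cache["pattern"] = pattern
        _cache["compiled"] = re.compile(pattern.replace("%", ".*"))
    return _cache["compiled"].fullmatch(value) is not None
```

The string-comparison branches avoid any regex machinery for the shapes the issue calls out, which is where most of the win comes from.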
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Modi updated PIG-965:
---
Attachment: (was: poregex2.patch)
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Modi updated PIG-965:
---
Status: Patch Available (was: Open)
Attachments: automaton.jar, poregex2.patch
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Modi updated PIG-965:
---
Attachment: automaton.jar, poregex2.patch
[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789376#action_12789376 ] Hadoop QA commented on PIG-1145:
---
-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427696/PIG-1145.patch against trunk revision 889346.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/116/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/116/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/116/console
This message is automatically generated.
[zebra] merge join on large table ( 100,000.000 rows zebra table) failed
Key: PIG-1145
URL: https://issues.apache.org/jira/browse/PIG-1145
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Jing Huang
Assignee: Yan Zhou
Fix For: 0.6.0, 0.7.0
Attachments: PIG-1145.patch, PIG-1145.patch, PIG-1145.patch

Pig script:
{noformat}
register $zebraJar;
--fs -rmr $outputDir
a1 = LOAD '$inputDir/unsorted1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
a2 = LOAD '$inputDir/unsorted2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
sort1 = order a1 by str2;
sort2 = order a2 by str2;
--store sort1 into '$outputDir/sorted11' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
--store sort2 into '$outputDir/sorted21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
rec1 = load '$outputDir/sorted11' using org.apache.hadoop.zebra.pig.TableLoader();
rec2 = load '$outputDir/sorted21' using org.apache.hadoop.zebra.pig.TableLoader();
joina = join rec1 by str2, rec2 by str2 using merge;
--E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2;
store joina into '$outputDir/join1' using org.apache.hadoop.zebra.pig.TableStorer('');
{noformat}

Stacktrace:
{noformat}
org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error processing right input during merge join
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.io.EOFException: No key-value to read
 at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590)
 at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611)
 at org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
 at org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
 at org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1082)
 at org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
 at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
 ... 7 more
{noformat}
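For context, a minimal sketch of the sort-merge join the 'using merge' clause performs (illustrative Python, not POMergeJoin's code): both inputs must already be sorted on the join key, and the right side is scanned forward to keep pace with the left. The EOFException in the trace fires when the right-side scanner is driven past the end of the table.

```python
# Illustrative sort-merge join over two key-sorted inputs. The right
# cursor only ever moves forward, which is what makes the join a single
# streaming pass -- and what makes an out-of-bounds advance on the right
# side (the "No key-value to read" failure above) fatal if unguarded.
def merge_join(left, right, key=lambda t: t[0]):
    out = []
    i = 0
    for l in left:
        # Advance the right cursor past keys smaller than the left key,
        # stopping at the end of the input instead of reading past it.
        while i < len(right) and key(right[i]) < key(l):
            i += 1
        # Emit all right tuples whose key equals the left key.
        j = i
        while j < len(right) and key(right[j]) == key(l):
            out.append(l + right[j])
            j += 1
    return out
```

The bounds checks on the right cursor are the sketch's stand-in for the end-of-table handling that the patch addresses.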
[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789380#action_12789380 ] Hadoop QA commented on PIG-965:
---
-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427730/automaton.jar against trunk revision 889346.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch.
-1 patch. The patch command could not apply the patch.
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/117/console
This message is automatically generated.
[jira] Commented: (PIG-1142) Got NullPointerException merge join with pruning
[ https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789388#action_12789388 ] Olga Natkovich commented on PIG-1142:
---
+1. Daniel, the code changes look good, but do we need to add more unit tests to cover them?

Got NullPointerException merge join with pruning
Key: PIG-1142
URL: https://issues.apache.org/jira/browse/PIG-1142
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Daniel Dai
Fix For: 0.6.0
Attachments: PIG-1142-1.patch, PIG-1142-2.patch

Here is my pig script:
{noformat}
register $zebraJar;
--fs -rmr $outputDir
a1 = LOAD '$inputDir/small1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
a2 = LOAD '$inputDir/small2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
sort1 = order a1 by str2;
sort2 = order a2 by str2;
--store sort1 into '$outputDir/smallsorted11' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
--store sort2 into '$outputDir/smallsorted21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
rec1 = load '$outputDir/smallsorted11' using org.apache.hadoop.zebra.pig.TableLoader();
rec2 = load '$outputDir/smallsorted21' using org.apache.hadoop.zebra.pig.TableLoader();
joina = join rec1 by str2, rec2 by str2 using merge;
E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2;
--limitedVals = LIMIT E 5;
--dump limitedVals;
store E into '$outputDir/smalljoin2' using org.apache.hadoop.zebra.pig.TableStorer('');
{noformat}

Here is the stacktrace:
{noformat}
java.lang.NullPointerException
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:312)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.extractKeysFromTuple(POMergeJoin.java:464)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:341)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:159)
{noformat}
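One way to read the trace (a hypothetical sketch with invented names, not the actual POLocalRearrange code or the fix in the attached patches): the merge join extracts its key through a local-rearrange step that evaluates one plan per key column, and if column pruning has removed the plan behind a key column, the extractor dereferences null. A guard turns the bare NPE into a diagnosable error.

```python
# Hypothetical key-extraction sketch. A "plan" here is just a column
# index standing in for Pig's per-column expression plan; a pruned
# column is modeled as None.
def extract_key(tup, key_plans):
    keys = []
    for plan in key_plans:
        if plan is None:
            # Pruning removed this column's plan; fail loudly instead
            # of raising a bare NullPointerException as in the trace.
            raise ValueError("join key column was pruned from the plan")
        keys.append(tup[plan])
    # Single-column keys are unwrapped, multi-column keys stay a tuple.
    return tuple(keys) if len(keys) > 1 else keys[0]
```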
[jira] Commented: (PIG-1110) Handle compressed file formats -- Gz, BZip with the new proposal
[ https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789408#action_12789408 ] Jeff Zhang commented on PIG-1110:
---
Hi Richard, I checked your code and found that Pig uses the output file's extension to determine whether the output should be compressed or uncompressed (the code in trunk also does this). I do not think this method is good enough, because it forces users to add .bz2 as the extension of the output file. My suggestion is as follows: add a new constructor to PigStorage, e.g. PigStorage(String delimiter, String extension), where the extension indicates what file format the user wants to store.

Handle compressed file formats -- Gz, BZip with the new proposal
Key: PIG-1110
URL: https://issues.apache.org/jira/browse/PIG-1110
Project: Pig
Issue Type: Sub-task
Reporter: Richard Ding
Assignee: Richard Ding
Attachments: PIG-1110.patch
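Jeff's point can be illustrated with a tiny sketch (Python; a hypothetical helper, not PigStorage's actual logic): the trunk behavior infers the codec purely from the output path's extension, while the proposed constructor argument would let the caller state the desired format explicitly.

```python
# Hypothetical codec-selection sketch. With no explicit format, the
# codec is inferred from the path's extension (trunk behavior); the
# 'explicit' parameter models the proposed
# PigStorage(String delimiter, String extension) constructor.
def codec_for(path, explicit=None):
    if explicit is not None:
        ext = explicit
    elif "." in path:
        ext = path.rsplit(".", 1)[-1]
    else:
        ext = ""
    # Map known extensions to codecs; anything else is uncompressed.
    return {"bz2": "bzip2", "gz": "gzip"}.get(ext, "none")
```

With the explicit argument, a user could write bzip2 output without renaming every output path to end in .bz2, which is the ergonomic problem the comment raises.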
[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789411#action_12789411 ] Yan Zhou commented on PIG-1145:
---
All the failed test cases are Pig tests, and the failures look environmental. I reran the first failed test, TestJoin, on my local cluster, and it passes cleanly.
[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789425#action_12789425 ] Chao Wang commented on PIG-1145:
---
The patch looks good. +1.
[jira] Updated: (PIG-1142) Got NullPointerException merge join with pruning
[ https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1142:
---
Attachment: PIG-1142-3.patch
No code change in PIG-1142-3.patch, only additional test cases.
[jira] Updated: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1145:
---
Resolution: Fixed
Status: Resolved (was: Patch Available)
Committed to Apache trunk and the 0.6 branch.
[jira] Commented: (PIG-1142) Got NullPointerException merge join with pruning
[ https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789440#action_12789440 ] Alan Gates commented on PIG-1142:
---
The additional test cases look good. +1
[jira] Updated: (PIG-1142) Got NullPointerException merge join with pruning
[ https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1142: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) New patch committed to both trunk and 0.6 branch. Got NullPointerException merge join with pruning Key: PIG-1142 URL: https://issues.apache.org/jira/browse/PIG-1142 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Jing Huang Assignee: Daniel Dai Fix For: 0.6.0 Attachments: PIG-1142-1.patch, PIG-1142-2.patch, PIG-1142-3.patch Here is my pig script: register $zebraJar; --fs -rmr $outputDir a1 = LOAD '$inputDir/small1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2'); a2 = LOAD '$inputDir/small2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2'); sort1 = order a1 by str2; sort2 = order a2 by str2; --store sort1 into '$outputDir/smallsorted11' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]'); --store sort2 into '$outputDir/smallsorted21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]'); rec1 = load '$outputDir/smallsorted11' using org.apache.hadoop.zebra.pig.TableLoader(); rec2 = load '$outputDir/smallsorted21' using org.apache.hadoop.zebra.pig.TableLoader(); joina = join rec1 by str2, rec2 by str2 using merge ; E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2; --limitedVals = LIMIT E 5; --dump limitedVals; store E into '$outputDir/smalljoin2' using org.apache.hadoop.zebra.pig.TableStorer(''); Here is the stacktrace: java.lang.NullPointerException at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:312) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.extractKeysFromTuple(POMergeJoin.java:464) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:341) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1147) Zebra Docs for Pig 0.6.0
Zebra Docs for Pig 0.6.0 Key: PIG-1147 URL: https://issues.apache.org/jira/browse/PIG-1147 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Zebra docs for Pig 0.6.0 (1) XML files UPDATE: site.xml - updated to include Zebra items in Pig menu NEW: zebra_mapreduce.xml zebra_overview.xml zebra_pig.xml zebra_reference.xml zebra_stream.xml zebra_users.xml. (2) IMAGE file zebra.jpg -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1147) Zebra Docs for Pig 0.6.0
[ https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corinne Chandel updated PIG-1147: - Attachment: zebra.jpg Zebra image file. (1) Add this file to TRUNK C:\__Pig\Trunk\src\docs\src\documentation\content\xdocs\images (2) Add this file to branch-0.6 http://svn.apache.org/repos/asf/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/images/ Zebra Docs for Pig 0.6.0 Key: PIG-1147 URL: https://issues.apache.org/jira/browse/PIG-1147 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Attachments: zebra.jpg Zebra docs for Pig 0.6.0 (1) XML files UPDATE: site.xml - updated to include Zebra items in Pig menu NEW: zebra_mapreduce.xml zebra_overview.xml zebra_pig.xml zebra_reference.xml zebra_stream.xml zebra_users.xml. (2) IMAGE file zebra.jpg -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1147) Zebra Docs for Pig 0.6.0
[ https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corinne Chandel updated PIG-1147: - Attachment: Zebra.patch Patch file (1) Apply this patch to the TRUNK C:\__Pig\Trunk\src\docs\src\documentation\content\xdocs (2) Apply this patch to the branch-0.6 http://svn.apache.org/repos/asf/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/ NOTE: No new test code required; changes to documentation only. Zebra Docs for Pig 0.6.0 Key: PIG-1147 URL: https://issues.apache.org/jira/browse/PIG-1147 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Attachments: zebra.jpg, Zebra.patch Zebra docs for Pig 0.6.0 (1) XML files UPDATE: site.xml - updated to include Zebra items in Pig menu NEW: zebra_mapreduce.xml zebra_overview.xml zebra_pig.xml zebra_reference.xml zebra_stream.xml zebra_users.xml. (2) IMAGE file zebra.jpg -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1147) Zebra Docs for Pig 0.6.0
[ https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corinne Chandel updated PIG-1147: - Status: Patch Available (was: Open) (1) Add Zebra image file to Pig TRUNK and branch-0.6 (2) Apply Zebra patch to Pig TRUNK and branch-0.6 Note: No new test code required; changes to documentation only. Zebra Docs for Pig 0.6.0 Key: PIG-1147 URL: https://issues.apache.org/jira/browse/PIG-1147 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Attachments: zebra.jpg, Zebra.patch Zebra docs for Pig 0.6.0 (1) XML files UPDATE: site.xml - updated to include Zebra items in Pig menu NEW: zebra_mapreduce.xml zebra_overview.xml zebra_pig.xml zebra_reference.xml zebra_stream.xml zebra_users.xml. (2) IMAGE file zebra.jpg -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1144: Attachment: PIG-1144-3.patch Changed the patch to take mapred.reduce.tasks into account. The hierarchy for determining the parallelism is:
1. PARALLEL keyword
2. default_parallel
3. mapred.reduce.tasks system property
4. default value: 1
set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.7.0 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch Hi all, I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100 . I modified MRPrinter.java to print out the parallelism:
{code}
...
public void visitMROp(MapReduceOper mr) {
    mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism());
}
...
{code}
When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single reducer job. This can be corrected by adding the PARALLEL keyword in front of the ORDER BY. Attaching the script and the explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
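The four-level precedence described in the comment above can be sketched as a small resolver. This is a hypothetical helper, not the actual PIG-1144-3.patch code; it assumes a non-positive value means "not set" at that level:

```java
// Hedged sketch of the reducer-parallelism precedence from the comment above.
final class ParallelismResolver {
    static int resolve(int requestedParallel,   // 1. PARALLEL keyword on the operator
                       int defaultParallel,     // 2. set default_parallel
                       int mapredReduceTasks) { // 3. mapred.reduce.tasks property
        if (requestedParallel > 0) return requestedParallel;
        if (defaultParallel > 0) return defaultParallel;
        if (mapredReduceTasks > 0) return mapredReduceTasks;
        return 1;                               // 4. fall back to a single reducer
    }
}
```

Under this scheme an explicit PARALLEL always wins, and mapred.reduce.tasks only takes effect when neither Pig-level setting is present.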
[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1144: Status: Open (was: Patch Available) set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.7.0 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch Hi all, I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100 . I modified MRPrinter.java to print out the parallelism:
{code}
...
public void visitMROp(MapReduceOper mr) {
    mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism());
}
...
{code}
When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single reducer job. This can be corrected by adding the PARALLEL keyword in front of the ORDER BY. Attaching the script and the explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1144: Status: Patch Available (was: Open) set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.7.0 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch Hi all, I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100 . I modified MRPrinter.java to print out the parallelism:
{code}
...
public void visitMROp(MapReduceOper mr) {
    mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism());
}
...
{code}
When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single reducer job. This can be corrected by adding the PARALLEL keyword in front of the ORDER BY. Attaching the script and the explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1086) Nested sort by * throw exception
[ https://issues.apache.org/jira/browse/PIG-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1086: Resolution: Fixed Fix Version/s: 0.7.0 Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Patch committed. Thanks Richard! Nested sort by * throw exception Key: PIG-1086 URL: https://issues.apache.org/jira/browse/PIG-1086 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Daniel Dai Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1086.patch The following script fail: A = load '1.txt' as (a0, a1, a2); B = group A by a0; C = foreach B { D = order A by *; generate group, D;}; explain C; Here is the stack: Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 at java.util.ArrayList.get(ArrayList.java:324) at org.apache.pig.impl.logicalLayer.schema.Schema.getField(Schema.java:752) at org.apache.pig.impl.logicalLayer.LOSort.getSortInfo(LOSort.java:332) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1365) at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:176) at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:43) at org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:69) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1274) at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:130) at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:45) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:234) at org.apache.pig.PigServer.compilePp(PigServer.java:864) at org.apache.pig.PigServer.explain(PigServer.java:583) ... 
8 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1147) Zebra Docs for Pig 0.6.0
[ https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1147: Resolution: Fixed Status: Resolved (was: Patch Available) Patch committed to both trunk and 0.6 branch. Thanks Corinne! Zebra Docs for Pig 0.6.0 Key: PIG-1147 URL: https://issues.apache.org/jira/browse/PIG-1147 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Attachments: zebra.jpg, Zebra.patch Zebra docs for Pig 0.6.0 (1) XML files UPDATE: site.xml - updated to include Zebra items in Pig menu NEW: zebra_mapreduce.xml zebra_overview.xml zebra_pig.xml zebra_reference.xml zebra_stream.xml zebra_users.xml. (2) IMAGE file zebra.jpg -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1106) FR join should not spill
[ https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789623#action_12789623 ] Hadoop QA commented on PIG-1106: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427716/frjoin-nonspill.patch against trunk revision 889346. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/118/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/118/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/118/console This message is automatically generated. FR join should not spill Key: PIG-1106 URL: https://issues.apache.org/jira/browse/PIG-1106 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Ankit Modi Fix For: 0.7.0 Attachments: frjoin-nonspill.patch Currently, the values for the replicated side of the data are placed in a spillable bag (POFRJoin near line 275). This does not make sense because the whole point of the optimization is that the data on one side fits into memory. We already have a non-spillable bag implemented (NonSpillableDataBag.java) and we need to change FRJoin code to use it. 
And of course we need to do lots of testing to make sure that we don't spill but die instead when we run out of memory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
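The design point above can be made concrete with a minimal sketch. This uses a hypothetical class, not Pig's actual NonSpillableDataBag, to show why a plain heap-backed bag is the right container for the replicated side of an FR join: the optimization only applies when that side fits in memory, so spill machinery is pure overhead, and running out of memory is the intended failure mode when the assumption is violated.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hedged sketch of a non-spillable, purely in-memory bag (hypothetical type).
final class InMemoryBag<T> implements Iterable<T> {
    private final List<T> tuples = new ArrayList<>(); // held entirely on the heap

    // Never spills to disk; if the replicated input does not fit,
    // an OutOfMemoryError is the expected outcome.
    void add(T tuple) { tuples.add(tuple); }
    long size() { return tuples.size(); }
    @Override public Iterator<T> iterator() { return tuples.iterator(); }
}
```

A spillable bag, by contrast, pays memory-accounting and possible disk-I/O costs on every add, which defeats the purpose of the replicated-join optimization.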
[jira] Commented: (PIG-1106) FR join should not spill
[ https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789624#action_12789624 ] Olga Natkovich commented on PIG-1106: - Test failures are not due to this patch. Also, I don't believe it is easy to test with an automatic test but I believe Ankit tested it manually. I will review the code and run test-commit + FRJoin tests before committing the patch. FR join should not spill Key: PIG-1106 URL: https://issues.apache.org/jira/browse/PIG-1106 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Ankit Modi Fix For: 0.7.0 Attachments: frjoin-nonspill.patch Currently, the values for the replicated side of the data are placed in a spillable bag (POFRJoin near line 275). This does not make sense because the whole point of the optimization is that the data on one side fits into memory. We already have a non-spillable bag implemented (NonSpillableDataBag.java) and we need to change FRJoin code to use it. And of course need to do lots of testing to make sure that we don't spill but die instead when we run out of memory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1085) Pass JobConf and UDF specific configuration information to UDFs
[ https://issues.apache.org/jira/browse/PIG-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1085: Fix Version/s: 0.6.0 Pass JobConf and UDF specific configuration information to UDFs --- Key: PIG-1085 URL: https://issues.apache.org/jira/browse/PIG-1085 Project: Pig Issue Type: New Feature Components: impl Reporter: Alan Gates Assignee: Alan Gates Fix For: 0.6.0 Attachments: udfconf-2.patch, udfconf.patch Users have long asked for a way to get the JobConf structure in their UDFs. It would also be nice to have a way to pass properties between the front end and back end so that UDFs can store state during parse time and use it at runtime. This patch does part of what is proposed in PIG-602, but not all of it. It does not provide a way to give user specified configuration files to UDFs. So I will mark 602 as depending on this bug, but it isn't a duplicate. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
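The front-end/back-end handoff the feature describes can be sketched as follows. This is a hypothetical helper, not the API added by udfconf-2.patch: a UDF records state into a per-UDF Properties bag at parse time on the front end, the framework ships that bag to the backend (e.g. via the job configuration), and the UDF reads it back at runtime.

```java
import java.util.Properties;

// Hedged sketch (hypothetical singleton) of passing UDF state from the
// front end to the back end via a Properties bag.
final class UdfPropertyStore {
    private static final Properties PROPS = new Properties();

    // Front end, parse time: stash state for later.
    static void put(String key, String value) { PROPS.setProperty(key, value); }

    // Back end, runtime: read it back. In the real system the bag would have
    // been serialized into the job configuration in between.
    static String get(String key) { return PROPS.getProperty(key); }
}
```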
[jira] Commented: (PIG-1147) Zebra Docs for Pig 0.6.0
[ https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789641#action_12789641 ] Hadoop QA commented on PIG-1147: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427764/Zebra.patch against trunk revision 889870. +1 @author. The patch does not contain any @author tags. +0 tests included. The patch appears to be a documentation patch that doesn't require tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/119/console This message is automatically generated. Zebra Docs for Pig 0.6.0 Key: PIG-1147 URL: https://issues.apache.org/jira/browse/PIG-1147 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Attachments: zebra.jpg, Zebra.patch Zebra docs for Pig 0.6.0 (1) XML files UPDATE: site.xml - updated to include Zebra items in Pig menu NEW: zebra_mapreduce.xml zebra_overview.xml zebra_pig.xml zebra_reference.xml zebra_stream.xml zebra_users.xm. (2) IMAGE file zebra.jpg -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Build failed in Hudson: Pig-trunk #645
See http://hudson.zones.apache.org/hudson/job/Pig-trunk/645/changes
Changes:
[daijy] PIG-1142: Got NullPointerException merge join with pruning
[yanz] PIG-1145: Merge Join on Large Table throws an EOF exception (yanz)
--
[...truncated 228593 lines...]
[junit] (remaining output: repetitive test-cluster log lines, HDFS block allocation, DataNode block transfers, and PacketResponder messages, omitted)
[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789668#action_12789668 ] Thejas M Nair commented on PIG-965: --- Review comments:
* The regex will always be on the rhs, so we don't need the code/classes that try to determine which side has the regular expression based on which side has the constant.
* In determineBestRegexMethod, we need to add (? to the list of regex strings not supported in dk.brics (in javaRegexOnly). It has special meanings in java regex, which is not honored by dk.brics.
* In determineBestRegexMethod, we are dealing with cases like \d (choose java regex) and \\d (choose dk.brics), but not \\\d (which should choose java regex); i.e. we need to go back until we find a non-'\' char.
* In RegexInit.compile(..), the following messages are more appropriate at debug level, not at info. At info level, they might also confuse the user.
+log.info("Got an IllegalArgumentException for Pattern: " + pattern);
+log.info(e.getMessage());
+log.info("Switching to java.util.regex");
* The following comment in PORegex.java seems to be out of place: // This is a BinaryComparisonOperator hence there can only be two inputs
PERFORMANCE: optimize common case in matches (PORegex) -- Key: PIG-965 URL: https://issues.apache.org/jira/browse/PIG-965 Project: Pig Issue Type: Improvement Components: impl Reporter: Thejas M Nair Assignee: Ankit Modi Attachments: automaton.jar, poregex2.patch Some frequently seen use cases of the 'matches' comparison operator have the following properties:
1. The rhs is a constant string, e.g. c1 matches 'abc%'.
2. Regexes that look for a matching prefix, suffix etc. are very common, e.g. 'abc%', '%abc', '%abc%'.
To optimize for these common cases, PORegex.java can be changed to:
1. Compile the pattern (rhs of matches) and re-use it if the pattern string has not changed.
2. Use string comparisons for the simple common regexes (in 2 above).
The implementation of Hive's like clause uses similar optimizations. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
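The two optimizations proposed above can be sketched together. This is a hypothetical helper, not the actual poregex2.patch code: constant patterns of the form 'abc%', '%abc', '%abc%' are answered with plain String comparisons, and anything else falls back to a Pattern compiled once and reused for every input tuple.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hedged sketch of the constant-pattern fast path described in PIG-965.
final class LikeMatcher {
    static Predicate<String> compile(String pattern) {
        boolean lead = pattern.startsWith("%");
        boolean trail = pattern.endsWith("%") && pattern.length() > 1;
        String core = pattern.substring(lead ? 1 : 0,
                                        pattern.length() - (trail ? 1 : 0));
        if (!core.contains("%")) {                             // no inner wildcards
            if (lead && trail) return s -> s.contains(core);   // '%abc%'
            if (trail)         return s -> s.startsWith(core); // 'abc%'
            if (lead)          return s -> s.endsWith(core);   // '%abc'
        }
        // General case: compile once, reuse across all input strings.
        Pattern p = Pattern.compile(pattern.replace("%", ".*"));
        return s -> p.matcher(s).matches();
    }
}
```

The fast paths avoid regex-engine overhead entirely for the common prefix/suffix/substring cases, while the cached Pattern removes the per-tuple recompilation cost for everything else.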