[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed

2009-12-11 Thread Jing Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789188#action_12789188
 ] 

Jing Huang commented on PIG-1145:
-

Found another failure on merge join. This merge join script failed:
register $zebraJar;
--fs -rmr $outputDir


--a1 = LOAD '$inputDir/unsorted1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
--a2 = LOAD '$inputDir/unsorted2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');

--sort1 = order a1 by byte2;
--sort2 = order a2 by byte2;

--store sort1 into '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2];[byte2]');
--store sort2 into '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2];[byte2]');

rec1 = load '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');
rec2 = load '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');

joina = join rec1 by byte2, rec2 by byte2 using merge;

E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2, $4 as byte2;

store E into '$outputDir/bad1' using org.apache.hadoop.zebra.pig.TableStorer('');
=
Instead, this similar script, which stores all columns in a single column group rather than putting the sort key byte2 in its own column group, works with the previous patch:
register $zebraJar;
--fs -rmr $outputDir


a1 = LOAD '$inputDir/unsorted1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
a2 = LOAD '$inputDir/unsorted2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');

sort1 = order a1 by byte2;
sort2 = order a2 by byte2;

store sort1 into '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2,byte2]');
store sort2 into '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2,byte2]');

rec1 = load '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');
rec2 = load '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');

joina = join rec1 by byte2, rec2 by byte2 using merge;

E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2, $4 as byte2;

store E into '$outputDir/join3' using org.apache.hadoop.zebra.pig.TableStorer('');

Here is the stack trace:
Backend error message
-
org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error 
processing right input during merge join
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.io.EOFException: No key-value to read
at 
org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590)
at 
org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611)
at 
org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
at 
org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
at 
org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1083)
at 
org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
... 9 more
=
This is how I run it (I disabled column pruning via -t PruneColumns to simplify the possible problem):
java -cp 
/grid/0/dev/hadoopqa/jing1234/conf:/grid/0/dev/hadoopqa/jars/pig.jar:/grid/0/dev/hadoopqa/jars/tfile.jar:/grid/0/dev/hadoopqa/jars/zebra.jar
 org.apache.pig.Main -m config -M -t PruneColumns bad_join.pig 



 [zebra] 

[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed

2009-12-11 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789196#action_12789196
 ] 

Yan Zhou commented on PIG-1145:
---

Actually, with pruning enabled, the exception stack is:

Backend error message
-
org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error 
processing right input during merge join
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:186)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.io.IOException: seekTo() failed: Column Groups are not evenly 
positioned.
at 
org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.seekTo(BasicTable.java:1148)
at 
org.apache.hadoop.zebra.mapred.TableRecordReader.seekTo(TableRecordReader.java:120)
at 
org.apache.hadoop.zebra.pig.TableLoader.seekNear(TableLoader.java:190)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.seekInRightStream(POMergeJoin.java:406)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:184)
... 9 more


 [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
 

 Key: PIG-1145
 URL: https://issues.apache.org/jira/browse/PIG-1145
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Jing Huang
Assignee: Yan Zhou
 Fix For: 0.6.0, 0.7.0

 Attachments: PIG-1145.patch, PIG-1145.patch


 Pig script :
 register $zebraJar;
 --fs -rmr $outputDir
 a1 = LOAD '$inputDir/unsorted1' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 a2 = LOAD '$inputDir/unsorted2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 sort1 = order a1 by str2;
 sort2 = order a2 by str2;
 --store sort1 into '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 --store sort2 into '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 rec1 = load '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 rec2 = load '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 joina = join rec1 by str2, rec2 by str2 using merge ;
 --E = foreach joina  generate $0 as count,  $1 as seed,  $2 as int1,  $3 as 
 str2;
 store joina into '$outputDir/join1' using 
 org.apache.hadoop.zebra.pig.TableStorer('');
 ==
 stacktrace:
 org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error 
 processing right input during merge join at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at 
 

[jira] Updated: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed

2009-12-11 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1145:
--

Status: Open  (was: Patch Available)

 [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
 

 Key: PIG-1145
 URL: https://issues.apache.org/jira/browse/PIG-1145
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Jing Huang
Assignee: Yan Zhou
 Fix For: 0.6.0, 0.7.0

 Attachments: PIG-1145.patch, PIG-1145.patch


 Pig script :
 register $zebraJar;
 --fs -rmr $outputDir
 a1 = LOAD '$inputDir/unsorted1' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 a2 = LOAD '$inputDir/unsorted2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 sort1 = order a1 by str2;
 sort2 = order a2 by str2;
 --store sort1 into '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 --store sort2 into '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 rec1 = load '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 rec2 = load '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 joina = join rec1 by str2, rec2 by str2 using merge ;
 --E = foreach joina  generate $0 as count,  $1 as seed,  $2 as int1,  $3 as 
 str2;
 store joina into '$outputDir/join1' using 
 org.apache.hadoop.zebra.pig.TableStorer('');
 ==
 stacktrace:
 org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error 
 processing right input during merge join at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at 
 org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at 
 org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at 
 org.apache.hadoop.mapred.Child.main(Child.java:159) Caused by: 
 java.io.EOFException: No key-value to read at 
 org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590) 
 at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611) 
 at 
 org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
  at 
 org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
  at 
 org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1082)
  at 
 org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
  at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414) at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
  ... 7 more 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed

2009-12-11 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1145:
--

Attachment: PIG-1145.patch

 [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
 

 Key: PIG-1145
 URL: https://issues.apache.org/jira/browse/PIG-1145
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Jing Huang
Assignee: Yan Zhou
 Fix For: 0.6.0, 0.7.0

 Attachments: PIG-1145.patch, PIG-1145.patch, PIG-1145.patch


 Pig script :
 register $zebraJar;
 --fs -rmr $outputDir
 a1 = LOAD '$inputDir/unsorted1' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 a2 = LOAD '$inputDir/unsorted2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 sort1 = order a1 by str2;
 sort2 = order a2 by str2;
 --store sort1 into '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 --store sort2 into '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 rec1 = load '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 rec2 = load '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 joina = join rec1 by str2, rec2 by str2 using merge ;
 --E = foreach joina  generate $0 as count,  $1 as seed,  $2 as int1,  $3 as 
 str2;
 store joina into '$outputDir/join1' using 
 org.apache.hadoop.zebra.pig.TableStorer('');
 ==
 stacktrace:
 org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error 
 processing right input during merge join at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at 
 org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at 
 org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at 
 org.apache.hadoop.mapred.Child.main(Child.java:159) Caused by: 
 java.io.EOFException: No key-value to read at 
 org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590) 
 at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611) 
 at 
 org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
  at 
 org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
  at 
 org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1082)
  at 
 org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
  at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414) at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
  ... 7 more 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed

2009-12-11 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1145:
--

Status: Patch Available  (was: Open)

 [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
 

 Key: PIG-1145
 URL: https://issues.apache.org/jira/browse/PIG-1145
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Jing Huang
Assignee: Yan Zhou
 Fix For: 0.6.0, 0.7.0

 Attachments: PIG-1145.patch, PIG-1145.patch, PIG-1145.patch


 Pig script :
 register $zebraJar;
 --fs -rmr $outputDir
 a1 = LOAD '$inputDir/unsorted1' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 a2 = LOAD '$inputDir/unsorted2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 sort1 = order a1 by str2;
 sort2 = order a2 by str2;
 --store sort1 into '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 --store sort2 into '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 rec1 = load '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 rec2 = load '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 joina = join rec1 by str2, rec2 by str2 using merge ;
 --E = foreach joina  generate $0 as count,  $1 as seed,  $2 as int1,  $3 as 
 str2;
 store joina into '$outputDir/join1' using 
 org.apache.hadoop.zebra.pig.TableStorer('');
 ==
 stacktrace:
 org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error 
 processing right input during merge join at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at 
 org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at 
 org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at 
 org.apache.hadoop.mapred.Child.main(Child.java:159) Caused by: 
 java.io.EOFException: No key-value to read at 
 org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590) 
 at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611) 
 at 
 org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
  at 
 org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
  at 
 org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1082)
  at 
 org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
  at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414) at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
  ... 7 more 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-11 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Attachment: (was: poregex2.patch)

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi

 Some frequently seen use cases of the 'matches' comparison operator have the
 following properties:
 1. The rhs is a constant string, e.g. c1 matches 'abc%'.
 2. Regexes that look for a matching prefix, suffix, etc. are very common,
 e.g. 'abc%', '%abc', '%abc%'.
 To optimize for these common cases, PORegex.java can be changed to:
 1. Compile the pattern (rhs of matches) and re-use it if the pattern string has
 not changed.
 2. Use string comparisons for simple common regexes (as in 2 above).
 The implementation of the Hive LIKE clause uses similar optimizations.
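 Below is a minimal sketch of the two optimizations described above, assuming plain java.util.regex; the class and method names are illustrative only and do not reflect the actual PORegex.java or the attached poregex2.patch.
{noformat}
import java.util.regex.Pattern;

// Illustrative sketch only -- not the actual PORegex.java or the attached patch.
class CachedRegexMatcher {
    private String lastRegex;     // rhs of the previous 'matches'
    private Pattern lastPattern;  // compiled form, reused while the rhs is unchanged

    boolean matches(String value, String regex) {
        // 2. Plain string comparison for the common prefix/suffix shapes, e.g. 'abc.*'.
        if (regex.endsWith(".*") && isLiteral(regex.substring(0, regex.length() - 2))) {
            return value.startsWith(regex.substring(0, regex.length() - 2));
        }
        if (regex.startsWith(".*") && isLiteral(regex.substring(2))) {
            return value.endsWith(regex.substring(2));
        }
        // 1. Compile the pattern once and reuse it while the pattern string is unchanged.
        if (!regex.equals(lastRegex)) {
            lastRegex = regex;
            lastPattern = Pattern.compile(regex);
        }
        return lastPattern.matcher(value).matches();
    }

    private static boolean isLiteral(String s) {
        // crude check: no regex metacharacters remain in the candidate literal
        for (char c : s.toCharArray()) {
            if (".[]{}()*+-?^$|\\".indexOf(c) >= 0) return false;
        }
        return true;
    }
}
{noformat}
 The attached patch also bundles automaton.jar, presumably to handle the general regex case more efficiently than this sketch does.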

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-11 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Attachment: (was: poregex2.patch)

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi

 Some frequently seen use cases of the 'matches' comparison operator have the
 following properties:
 1. The rhs is a constant string, e.g. c1 matches 'abc%'.
 2. Regexes that look for a matching prefix, suffix, etc. are very common,
 e.g. 'abc%', '%abc', '%abc%'.
 To optimize for these common cases, PORegex.java can be changed to:
 1. Compile the pattern (rhs of matches) and re-use it if the pattern string has
 not changed.
 2. Use string comparisons for simple common regexes (as in 2 above).
 The implementation of the Hive LIKE clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-11 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Status: Patch Available  (was: Open)

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of the 'matches' comparison operator have the
 following properties:
 1. The rhs is a constant string, e.g. c1 matches 'abc%'.
 2. Regexes that look for a matching prefix, suffix, etc. are very common,
 e.g. 'abc%', '%abc', '%abc%'.
 To optimize for these common cases, PORegex.java can be changed to:
 1. Compile the pattern (rhs of matches) and re-use it if the pattern string has
 not changed.
 2. Use string comparisons for simple common regexes (as in 2 above).
 The implementation of the Hive LIKE clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-11 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Attachment: automaton.jar
poregex2.patch

New patch with the comments removed and automaton.jar added from 
http://www.brics.dk/~amoeller/automaton/automaton.jar.

It fails findBugs due to missing symbols. I ran findBugs again after adding the 
jar to the build, and it did not report any findBugs warnings in the modified 
and added files.

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of the 'matches' comparison operator have the
 following properties:
 1. The rhs is a constant string, e.g. c1 matches 'abc%'.
 2. Regexes that look for a matching prefix, suffix, etc. are very common,
 e.g. 'abc%', '%abc', '%abc%'.
 To optimize for these common cases, PORegex.java can be changed to:
 1. Compile the pattern (rhs of matches) and re-use it if the pattern string has
 not changed.
 2. Use string comparisons for simple common regexes (as in 2 above).
 The implementation of the Hive LIKE clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1142) Got NullPointerException merge join with pruning

2009-12-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789215#action_12789215
 ] 

Hadoop QA commented on PIG-1142:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427671/PIG-1142-2.patch
  against trunk revision 889346.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/115/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/115/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/115/console

This message is automatically generated.

 Got NullPointerException merge join with pruning
 

 Key: PIG-1142
 URL: https://issues.apache.org/jira/browse/PIG-1142
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1142-1.patch, PIG-1142-2.patch


 Here is my pig script:
 register $zebraJar;
 --fs -rmr $outputDir
 a1 = LOAD '$inputDir/small1' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 a2 = LOAD '$inputDir/small2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 sort1 = order a1 by str2;
 sort2 = order a2 by str2;
 --store sort1 into '$outputDir/smallsorted11' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 --store sort2 into '$outputDir/smallsorted21' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 rec1 = load '$outputDir/smallsorted11' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 rec2 = load '$outputDir/smallsorted21' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 joina = join rec1 by str2, rec2 by str2 using merge ;
 E = foreach joina  generate $0 as count,  $1 as seed,  $2 as int1,  $3 as 
 str2;
 --limitedVals = LIMIT E 5;
 --dump limitedVals;
 store E into '$outputDir/smalljoin2' using 
 org.apache.hadoop.zebra.pig.TableStorer('');
 
 Here is the stacktrace:
 java.lang.NullPointerException at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:312)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.extractKeysFromTuple(POMergeJoin.java:464)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:341)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at 
 org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at 
 org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at 
 org.apache.hadoop.mapred.Child.main(Child.java:159) 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-11 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Status: Open  (was: Patch Available)

One small change to JarManager.java is missing. I will add a new patch with it.

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of the 'matches' comparison operator have the
 following properties:
 1. The rhs is a constant string, e.g. c1 matches 'abc%'.
 2. Regexes that look for a matching prefix, suffix, etc. are very common,
 e.g. 'abc%', '%abc', '%abc%'.
 To optimize for these common cases, PORegex.java can be changed to:
 1. Compile the pattern (rhs of matches) and re-use it if the pattern string has
 not changed.
 2. Use string comparisons for simple common regexes (as in 2 above).
 The implementation of the Hive LIKE clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1106) FR join should not spill

2009-12-11 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-1106:


Status: Patch Available  (was: Open)

This patch does not have any unit tests.

 FR join should not spill
 

 Key: PIG-1106
 URL: https://issues.apache.org/jira/browse/PIG-1106
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Ankit Modi
 Fix For: 0.7.0

 Attachments: frjoin-nonspill.patch


 Currently, the values for the replicated side of the data are placed in a 
 spillable bag (POFRJoin near line 275). This does not make sense, because the 
 whole point of the optimization is that the data on one side fits into 
 memory. We already have a non-spillable bag implemented 
 (NonSpillableDataBag.java), and we need to change the FRJoin code to use it. And 
 of course we need to do lots of testing to make sure that we don't spill but die 
 instead when we run out of memory.
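 As a minimal illustration of the change the description asks for (not the actual POFRJoin code or the attached frjoin-nonspill.patch), the replicated side would be collected into Pig's NonSpillableDataBag instead of a default spillable bag:
{noformat}
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.NonSpillableDataBag;
import org.apache.pig.data.Tuple;

// Illustrative sketch only -- not the actual POFRJoin diff.
class ReplicatedSide {
    // Before: a default bag from the factory, which is spillable and may write
    // the replicated input to disk under memory pressure.
    DataBag spillableBag = BagFactory.getInstance().newDefaultBag();

    // After: a non-spillable bag; if the replicated side does not fit in memory,
    // the task fails (OutOfMemoryError) instead of silently spilling.
    DataBag replicatedBag = new NonSpillableDataBag();

    void addReplicatedTuple(Tuple t) {
        replicatedBag.add(t);
    }
}
{noformat}
 The design intent is to fail fast: an OutOfMemoryError makes it obvious that the replicated relation is too large for FR join, instead of quietly degrading by spilling.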

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1106) FR join should not spill

2009-12-11 Thread Ankit Modi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789294#action_12789294
 ] 

Ankit Modi commented on PIG-1106:
-

The tests I ran used two files with the following format:

f1: random chararray(100)
f2: random int

The left-side file contained 100 tuples and the right-side file contained 3 million tuples.

Code
{noformat}
A = load 'leftsidefrjoin.txt' as ( key, value);
B = load 'rightsidefrjoin.txt' as (key, value);
C = join A by key left, B by key using repl;
--- Fragmented input and replicated input
store C into 'output';
{noformat}

This generated the following error:
{noformat}
FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : 
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.ArrayList.init(ArrayList.java:112)
at org.apache.pig.data.DefaultTuple.init(DefaultTuple.java:63)
at 
org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:369)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:288)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.setUpHashMap(POFRJoin.java:351)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.getNext(POFRJoin.java:211)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:250)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:241)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
{noformat}

I ran the same job with the same records on the left-hand side and 100K records on 
the right-hand side. The job completed successfully.

 FR join should not spill
 

 Key: PIG-1106
 URL: https://issues.apache.org/jira/browse/PIG-1106
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Ankit Modi
 Fix For: 0.7.0

 Attachments: frjoin-nonspill.patch


 Currently, the values for the replicated side of the data are placed in a 
 spillable bag (POFRJoin near line 275). This does not make sense, because the 
 whole point of the optimization is that the data on one side fits into 
 memory. We already have a non-spillable bag implemented 
 (NonSpillableDataBag.java), and we need to change the FRJoin code to use it. And 
 of course we need to do lots of testing to make sure that we don't spill but die 
 instead when we run out of memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-11 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Attachment: (was: automaton.jar)

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi

 Some frequently seen use cases of the 'matches' comparison operator have the
 following properties:
 1. The rhs is a constant string, e.g. c1 matches 'abc%'.
 2. Regexes that look for a matching prefix, suffix, etc. are very common,
 e.g. 'abc%', '%abc', '%abc%'.
 To optimize for these common cases, PORegex.java can be changed to:
 1. Compile the pattern (rhs of matches) and re-use it if the pattern string has
 not changed.
 2. Use string comparisons for simple common regexes (as in 2 above).
 The implementation of the Hive LIKE clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-11 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Attachment: (was: poregex2.patch)

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi

 Some frequently seen use cases of the 'matches' comparison operator have the
 following properties:
 1. The rhs is a constant string, e.g. c1 matches 'abc%'.
 2. Regexes that look for a matching prefix, suffix, etc. are very common,
 e.g. 'abc%', '%abc', '%abc%'.
 To optimize for these common cases, PORegex.java can be changed to:
 1. Compile the pattern (rhs of matches) and re-use it if the pattern string has
 not changed.
 2. Use string comparisons for simple common regexes (as in 2 above).
 The implementation of the Hive LIKE clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-11 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Status: Patch Available  (was: Open)

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of the 'matches' comparison operator have the
 following properties:
 1. The rhs is a constant string, e.g. c1 matches 'abc%'.
 2. Regexes that look for a matching prefix, suffix, etc. are very common,
 e.g. 'abc%', '%abc', '%abc%'.
 To optimize for these common cases, PORegex.java can be changed to:
 1. Compile the pattern (rhs of matches) and re-use it if the pattern string has
 not changed.
 2. Use string comparisons for simple common regexes (as in 2 above).
 The implementation of the Hive LIKE clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-11 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Attachment: automaton.jar
poregex2.patch

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of the 'matches' comparison operator have the
 following properties:
 1. The rhs is a constant string, e.g. c1 matches 'abc%'.
 2. Regexes that look for a matching prefix, suffix, etc. are very common,
 e.g. 'abc%', '%abc', '%abc%'.
 To optimize for these common cases, PORegex.java can be changed to:
 1. Compile the pattern (rhs of matches) and re-use it if the pattern string has
 not changed.
 2. Use string comparisons for simple common regexes (as in 2 above).
 The implementation of the Hive LIKE clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed

2009-12-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789376#action_12789376
 ] 

Hadoop QA commented on PIG-1145:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427696/PIG-1145.patch
  against trunk revision 889346.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 2 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/116/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/116/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/116/console

This message is automatically generated.

 [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
 

 Key: PIG-1145
 URL: https://issues.apache.org/jira/browse/PIG-1145
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Jing Huang
Assignee: Yan Zhou
 Fix For: 0.6.0, 0.7.0

 Attachments: PIG-1145.patch, PIG-1145.patch, PIG-1145.patch


 Pig script :
 register $zebraJar;
 --fs -rmr $outputDir
 a1 = LOAD '$inputDir/unsorted1' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 a2 = LOAD '$inputDir/unsorted2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 sort1 = order a1 by str2;
 sort2 = order a2 by str2;
 --store sort1 into '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 --store sort2 into '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 rec1 = load '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 rec2 = load '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 joina = join rec1 by str2, rec2 by str2 using merge ;
 --E = foreach joina  generate $0 as count,  $1 as seed,  $2 as int1,  $3 as 
 str2;
 store joina into '$outputDir/join1' using 
 org.apache.hadoop.zebra.pig.TableStorer('');
 ==
 stacktrace:
 org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error 
 processing right input during merge join at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at 
 org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at 
 org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at 
 org.apache.hadoop.mapred.Child.main(Child.java:159) Caused by: 
 java.io.EOFException: No key-value to read at 
 org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590) 
 at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611) 
 at 
 org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
  at 
 org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
  at 
 org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1082)
  at 
 org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
  at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414) at 
 

[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789380#action_12789380
 ] 

Hadoop QA commented on PIG-965:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427730/automaton.jar
  against trunk revision 889346.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/117/console

This message is automatically generated.

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of the 'matches' comparison operator have the
 following properties:
 1. The rhs is a constant string, e.g. c1 matches 'abc%'.
 2. Regexes that look for a matching prefix, suffix, etc. are very common,
 e.g. 'abc%', '%abc', '%abc%'.
 To optimize for these common cases, PORegex.java can be changed to:
 1. Compile the pattern (rhs of matches) and re-use it if the pattern string has
 not changed.
 2. Use string comparisons for simple common regexes (as in 2 above).
 The implementation of the Hive LIKE clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1142) Got NullPointerException merge join with pruning

2009-12-11 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789388#action_12789388
 ] 

Olga Natkovich commented on PIG-1142:
-

+1, Daniel, code changes look good but do we need to add more unit tests to 
cover them?

 Got NullPointerException merge join with pruning
 

 Key: PIG-1142
 URL: https://issues.apache.org/jira/browse/PIG-1142
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1142-1.patch, PIG-1142-2.patch


 Here is my pig script:
 register $zebraJar;
 --fs -rmr $outputDir
 a1 = LOAD '$inputDir/small1' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 a2 = LOAD '$inputDir/small2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 sort1 = order a1 by str2;
 sort2 = order a2 by str2;
 --store sort1 into '$outputDir/smallsorted11' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 --store sort2 into '$outputDir/smallsorted21' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 rec1 = load '$outputDir/smallsorted11' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 rec2 = load '$outputDir/smallsorted21' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 joina = join rec1 by str2, rec2 by str2 using merge ;
 E = foreach joina  generate $0 as count,  $1 as seed,  $2 as int1,  $3 as 
 str2;
 --limitedVals = LIMIT E 5;
 --dump limitedVals;
 store E into '$outputDir/smalljoin2' using 
 org.apache.hadoop.zebra.pig.TableStorer('');
 
 Here is the stacktrace:
 java.lang.NullPointerException at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:312)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.extractKeysFromTuple(POMergeJoin.java:464)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:341)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at 
 org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at 
 org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at 
 org.apache.hadoop.mapred.Child.main(Child.java:159) 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1110) Handle compressed file formats -- Gz, BZip with the new proposal

2009-12-11 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789408#action_12789408
 ] 

Jeff Zhang commented on PIG-1110:
-

Hi Richard, I checked your code and found that Pig uses the output file extension 
to determine whether the output should be compressed or uncompressed (the code in 
trunk also does this). I do not think this approach is good enough, because it 
forces the user to add .bz2 as the extension of the output file.

My suggestion is as follows:

Add a new constructor to PigStorage, e.g. PigStorage(String delimiter, String 
extension), where the extension indicates which file format the user wants the 
output stored in. A sketch of the idea follows.
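A minimal sketch of the proposal, with a hypothetical class name (DelimitedStorage) so it is not confused with the real PigStorage; only the two-argument constructor shape is the actual suggestion:
{noformat}
// Hypothetical sketch of the proposal; not the real org.apache.pig.builtin.PigStorage.
public class DelimitedStorage {
    private final String delimiter;
    private final String compressionExt;  // e.g. "bz2", "gz", or null for plain text

    // Existing style: compression is inferred from the output path suffix.
    public DelimitedStorage(String delimiter) {
        this(delimiter, null);
    }

    // Proposed style: the caller states the desired output format explicitly.
    public DelimitedStorage(String delimiter, String compressionExt) {
        this.delimiter = delimiter;
        this.compressionExt = compressionExt;
    }

    // With the explicit argument, the output path no longer has to end in ".bz2".
    boolean compressWithBzip2(String outputPath) {
        if (compressionExt != null) {
            return compressionExt.equalsIgnoreCase("bz2");
        }
        return outputPath.endsWith(".bz2");  // current trunk behaviour per the comment
    }
}
{noformat}
With such a constructor the user could write something like store C into 'output' using PigStorage(',', 'bz2'); and the output path name would no longer matter.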



 Handle compressed file formats -- Gz, BZip with the new proposal
 

 Key: PIG-1110
 URL: https://issues.apache.org/jira/browse/PIG-1110
 Project: Pig
  Issue Type: Sub-task
Reporter: Richard Ding
Assignee: Richard Ding
 Attachments: PIG-1110.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed

2009-12-11 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789411#action_12789411
 ] 

Yan Zhou commented on PIG-1145:
---

All failed test cases are Pig tests and the failures look environmental. I reran 
the first failed test, TestJoin, on my local cluster, and it passes cleanly.

 [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
 

 Key: PIG-1145
 URL: https://issues.apache.org/jira/browse/PIG-1145
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Jing Huang
Assignee: Yan Zhou
 Fix For: 0.6.0, 0.7.0

 Attachments: PIG-1145.patch, PIG-1145.patch, PIG-1145.patch


 Pig script :
 register $zebraJar;
 --fs -rmr $outputDir
 a1 = LOAD '$inputDir/unsorted1' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 a2 = LOAD '$inputDir/unsorted2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 sort1 = order a1 by str2;
 sort2 = order a2 by str2;
 --store sort1 into '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 --store sort2 into '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 rec1 = load '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 rec2 = load '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 joina = join rec1 by str2, rec2 by str2 using merge ;
 --E = foreach joina  generate $0 as count,  $1 as seed,  $2 as int1,  $3 as 
 str2;
 store joina into '$outputDir/join1' using 
 org.apache.hadoop.zebra.pig.TableStorer('');
 ==
 stacktrace:
 org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error 
 processing right input during merge join at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
  at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at 
 org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at 
 org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at 
 org.apache.hadoop.mapred.Child.main(Child.java:159) Caused by: 
 java.io.EOFException: No key-value to read at 
 org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590) 
 at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611) 
 at 
 org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
  at 
 org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
  at 
 org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1082)
  at 
 org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
  at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414) at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
  ... 7 more 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed

2009-12-11 Thread Chao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789425#action_12789425
 ] 

Chao Wang commented on PIG-1145:


The patch looks good +1.

 [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
 

 Key: PIG-1145
 URL: https://issues.apache.org/jira/browse/PIG-1145
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Jing Huang
Assignee: Yan Zhou
 Fix For: 0.6.0, 0.7.0

 Attachments: PIG-1145.patch, PIG-1145.patch, PIG-1145.patch


 Pig script :
 register $zebraJar;
 --fs -rmr $outputDir
 a1 = LOAD '$inputDir/unsorted1' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 a2 = LOAD '$inputDir/unsorted2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 sort1 = order a1 by str2;
 sort2 = order a2 by str2;
 --store sort1 into '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 --store sort2 into '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 rec1 = load '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 rec2 = load '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 joina = join rec1 by str2, rec2 by str2 using merge ;
 --E = foreach joina  generate $0 as count,  $1 as seed,  $2 as int1,  $3 as 
 str2;
 store joina into '$outputDir/join1' using 
 org.apache.hadoop.zebra.pig.TableStorer('');
 ==
 stacktrace:
 org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error processing right input during merge join
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:159)
 Caused by: java.io.EOFException: No key-value to read
 at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590)
 at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611)
 at org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
 at org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
 at org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1082)
 at org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
 at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
 ... 7 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1142) Got NullPointerException merge join with pruning

2009-12-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1142:


Attachment: PIG-1142-3.patch

No code change in PIG-1142-3.patch, except for more test cases.

 Got NullPointerException merge join with pruning
 

 Key: PIG-1142
 URL: https://issues.apache.org/jira/browse/PIG-1142
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1142-1.patch, PIG-1142-2.patch, PIG-1142-3.patch


 Here is my pig script:
 register $zebraJar;
 --fs -rmr $outputDir
 a1 = LOAD '$inputDir/small1' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 a2 = LOAD '$inputDir/small2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 sort1 = order a1 by str2;
 sort2 = order a2 by str2;
 --store sort1 into '$outputDir/smallsorted11' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 --store sort2 into '$outputDir/smallsorted21' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 rec1 = load '$outputDir/smallsorted11' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 rec2 = load '$outputDir/smallsorted21' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 joina = join rec1 by str2, rec2 by str2 using "merge";
 E = foreach joina  generate $0 as count,  $1 as seed,  $2 as int1,  $3 as 
 str2;
 --limitedVals = LIMIT E 5;
 --dump limitedVals;
 store E into '$outputDir/smalljoin2' using 
 org.apache.hadoop.zebra.pig.TableStorer('');
 
 Here is the stacktrace:
 java.lang.NullPointerException
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:312)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.extractKeysFromTuple(POMergeJoin.java:464)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:341)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:159)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed

2009-12-11 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1145:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Apache trunk and 0.6 branch.

 [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
 

 Key: PIG-1145
 URL: https://issues.apache.org/jira/browse/PIG-1145
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Jing Huang
Assignee: Yan Zhou
 Fix For: 0.6.0, 0.7.0

 Attachments: PIG-1145.patch, PIG-1145.patch, PIG-1145.patch


 Pig script :
 register $zebraJar;
 --fs -rmr $outputDir
 a1 = LOAD '$inputDir/unsorted1' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 a2 = LOAD '$inputDir/unsorted2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 sort1 = order a1 by str2;
 sort2 = order a2 by str2;
 --store sort1 into '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 --store sort2 into '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 rec1 = load '$outputDir/sorted11' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 rec2 = load '$outputDir/sorted21' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 joina = join rec1 by str2, rec2 by str2 using "merge";
 --E = foreach joina  generate $0 as count,  $1 as seed,  $2 as int1,  $3 as 
 str2;
 store joina into '$outputDir/join1' using 
 org.apache.hadoop.zebra.pig.TableStorer('');
 ==
 stacktrace:
 org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error processing right input during merge join
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:159)
 Caused by: java.io.EOFException: No key-value to read
 at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590)
 at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611)
 at org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
 at org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
 at org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1082)
 at org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
 at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
 ... 7 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1142) Got NullPointerException merge join with pruning

2009-12-11 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789440#action_12789440
 ] 

Alan Gates commented on PIG-1142:
-

Additional test cases look good.  +1

 Got NullPointerException merge join with pruning
 

 Key: PIG-1142
 URL: https://issues.apache.org/jira/browse/PIG-1142
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1142-1.patch, PIG-1142-2.patch, PIG-1142-3.patch


 Here is my pig script:
 register $zebraJar;
 --fs -rmr $outputDir
 a1 = LOAD '$inputDir/small1' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 a2 = LOAD '$inputDir/small2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 sort1 = order a1 by str2;
 sort2 = order a2 by str2;
 --store sort1 into '$outputDir/smallsorted11' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 --store sort2 into '$outputDir/smallsorted21' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 rec1 = load '$outputDir/smallsorted11' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 rec2 = load '$outputDir/smallsorted21' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 joina = join rec1 by str2, rec2 by str2 using "merge";
 E = foreach joina  generate $0 as count,  $1 as seed,  $2 as int1,  $3 as 
 str2;
 --limitedVals = LIMIT E 5;
 --dump limitedVals;
 store E into '$outputDir/smalljoin2' using 
 org.apache.hadoop.zebra.pig.TableStorer('');
 
 Here is the stacktrace:
 java.lang.NullPointerException
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:312)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.extractKeysFromTuple(POMergeJoin.java:464)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:341)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:159)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1142) Got NullPointerException merge join with pruning

2009-12-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1142:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

New patch committed to both trunk and 0.6 branch.

 Got NullPointerException merge join with pruning
 

 Key: PIG-1142
 URL: https://issues.apache.org/jira/browse/PIG-1142
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1142-1.patch, PIG-1142-2.patch, PIG-1142-3.patch


 Here is my pig script:
 register $zebraJar;
 --fs -rmr $outputDir
 a1 = LOAD '$inputDir/small1' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 a2 = LOAD '$inputDir/small2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
 sort1 = order a1 by str2;
 sort2 = order a2 by str2;
 --store sort1 into '$outputDir/smallsorted11' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 --store sort2 into '$outputDir/smallsorted21' using 
 org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
 rec1 = load '$outputDir/smallsorted11' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 rec2 = load '$outputDir/smallsorted21' using 
 org.apache.hadoop.zebra.pig.TableLoader();
 joina = join rec1 by str2, rec2 by str2 using "merge";
 E = foreach joina  generate $0 as count,  $1 as seed,  $2 as int1,  $3 as 
 str2;
 --limitedVals = LIMIT E 5;
 --dump limitedVals;
 store E into '$outputDir/smalljoin2' using 
 org.apache.hadoop.zebra.pig.TableStorer('');
 
 Here is the stacktrace:
 java.lang.NullPointerException
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:312)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.extractKeysFromTuple(POMergeJoin.java:464)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:341)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:159)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1147) Zebra Docs for Pig 0.6.0

2009-12-11 Thread Corinne Chandel (JIRA)
Zebra Docs for Pig 0.6.0


 Key: PIG-1147
 URL: https://issues.apache.org/jira/browse/PIG-1147
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
 Fix For: 0.6.0


Zebra docs for Pig 0.6.0

(1) XML files

UPDATE:
site.xml - updated to include Zebra items in Pig menu

NEW:
zebra_mapreduce.xml
zebra_overview.xml
zebra_pig.xml
zebra_reference.xml
zebra_stream.xml
zebra_users.xml

(2) IMAGE file 
zebra.jpg

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1147) Zebra Docs for Pig 0.6.0

2009-12-11 Thread Corinne Chandel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corinne Chandel updated PIG-1147:
-

Attachment: zebra.jpg

Zebra image file.

(1) Add this file TRUNK

C:\__Pig\Trunk\src\docs\src\documentation\content\xdocs\images


(2) Add this file to branch-0.6

http://svn.apache.org/repos/asf/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/images/


 Zebra Docs for Pig 0.6.0
 

 Key: PIG-1147
 URL: https://issues.apache.org/jira/browse/PIG-1147
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
 Fix For: 0.6.0

 Attachments: zebra.jpg


 Zebra docs for Pig 0.6.0
 (1) XML files
 UPDATE:
 site.xml - updated to include Zebra items in Pig menu
 NEW:
 zebra_mapreduce.xml
 zebra_overview.xml
 zebra_pig.xml
 zebra_reference.xml
 zebra_stream.xml
 zebra_users.xml
 (2) IMAGE file 
 zebra.jpg

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1147) Zebra Docs for Pig 0.6.0

2009-12-11 Thread Corinne Chandel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corinne Chandel updated PIG-1147:
-

Attachment: Zebra.patch

Patch file

(1) Apply this patch to the TRUNK

C:\__Pig\Trunk\src\docs\src\documentation\content\xdocs


(2) Apply this patch to the branch-0.6

http://svn.apache.org/repos/asf/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/


NOTE: No new test code required; changes to documentation only.

 Zebra Docs for Pig 0.6.0
 

 Key: PIG-1147
 URL: https://issues.apache.org/jira/browse/PIG-1147
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
 Fix For: 0.6.0

 Attachments: zebra.jpg, Zebra.patch


 Zebra docs for Pig 0.6.0
 (1) XML files
 UPDATE:
 site.xml - updated to include Zebra items in Pig menu
 NEW:
 zebra_mapreduce.xml
 zebra_overview.xml
 zebra_pig.xml
 zebra_reference.xml
 zebra_stream.xml
 zebra_users.xml
 (2) IMAGE file 
 zebra.jpg

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1147) Zebra Docs for Pig 0.6.0

2009-12-11 Thread Corinne Chandel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corinne Chandel updated PIG-1147:
-

Status: Patch Available  (was: Open)

(1) Add Zebra image file to Pig TRUNK and branch-0.6

(2) Apply Zebra patch to Pig TRUNK and branch-0.6

Note: No new test code required; changes to documentation only.


 Zebra Docs for Pig 0.6.0
 

 Key: PIG-1147
 URL: https://issues.apache.org/jira/browse/PIG-1147
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
 Fix For: 0.6.0

 Attachments: zebra.jpg, Zebra.patch


 Zebra docs for Pig 0.6.0
 (1) XML files
 UPDATE:
 site.xml - updated to include Zebra items in Pig menu
 NEW:
 zebra_mapreduce.xml
 zebra_overview.xml
 zebra_pig.xml
 zebra_reference.xml
 zebra_stream.xml
 zebra_users.xml
 (2) IMAGE file 
 zebra.jpg

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1144:


Attachment: PIG-1144-3.patch

Changed the patch to take mapred.reduce.tasks into account. The hierarchy for 
determining the parallelism (illustrated in the sketch below) is:
1. PARALLEL keyword
2. default_parallel
3. mapred.reduce.tasks system property
4. default value: 1
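
A minimal Pig Latin sketch of that precedence, with hypothetical relation and 
path names (not taken from the attached script):

{code}
set default_parallel 100;            -- (2) script-wide reducer default
A = LOAD 'input' AS (k, v);
B = GROUP A BY k;                    -- no PARALLEL clause: should use default_parallel
C = ORDER B BY group PARALLEL 10;    -- (1) explicit PARALLEL overrides default_parallel
STORE C INTO 'output';
{code}

If neither is given, the reducer count would fall back to mapred.reduce.tasks 
and finally to a single reducer, per the list above.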

 set default_parallelism construct does not set the number of reducers 
 correctly
 ---

 Key: PIG-1144
 URL: https://issues.apache.org/jira/browse/PIG-1144
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
 Environment: Hadoop 20 cluster with multi-node installation
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.7.0

 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, 
 PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch


 Hi all,
  I have a Pig script where I set the parallelism using the following set 
 construct: set default_parallel 100. I modified MRPrinter.java to 
 print out the parallelism:
 {code}
 ...
 public void visitMROp(MapReduceOper mr)
 mStream.println("MapReduce node " + mr.getOperatorKey().toString() + 
 " Parallelism " + mr.getRequestedParallelism());
 ...
 {code}
 When I run an explain on the script, I see that the last job, which does the 
 actual sort, runs as a single-reducer job. This can be corrected by adding 
 the PARALLEL keyword to the ORDER BY statement.
 Attaching the script and the explain output.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1144:


Status: Open  (was: Patch Available)

 set default_parallelism construct does not set the number of reducers 
 correctly
 ---

 Key: PIG-1144
 URL: https://issues.apache.org/jira/browse/PIG-1144
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
 Environment: Hadoop 20 cluster with multi-node installation
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.7.0

 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, 
 PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch


 Hi all,
  I have a Pig script where I set the parallelism using the following set 
 construct: set default_parallel 100. I modified MRPrinter.java to 
 print out the parallelism:
 {code}
 ...
 public void visitMROp(MapReduceOper mr)
 mStream.println("MapReduce node " + mr.getOperatorKey().toString() + 
 " Parallelism " + mr.getRequestedParallelism());
 ...
 {code}
 When I run an explain on the script, I see that the last job, which does the 
 actual sort, runs as a single-reducer job. This can be corrected by adding 
 the PARALLEL keyword to the ORDER BY statement.
 Attaching the script and the explain output.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1144:


Status: Patch Available  (was: Open)

 set default_parallelism construct does not set the number of reducers 
 correctly
 ---

 Key: PIG-1144
 URL: https://issues.apache.org/jira/browse/PIG-1144
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
 Environment: Hadoop 20 cluster with multi-node installation
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.7.0

 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, 
 PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch


 Hi all,
  I have a Pig script where I set the parallelism using the following set 
 construct: set default_parallel 100. I modified MRPrinter.java to 
 print out the parallelism:
 {code}
 ...
 public void visitMROp(MapReduceOper mr)
 mStream.println("MapReduce node " + mr.getOperatorKey().toString() + 
 " Parallelism " + mr.getRequestedParallelism());
 ...
 {code}
 When I run an explain on the script, I see that the last job, which does the 
 actual sort, runs as a single-reducer job. This can be corrected by adding 
 the PARALLEL keyword to the ORDER BY statement.
 Attaching the script and the explain output.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1086) Nested sort by * throw exception

2009-12-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1086:


   Resolution: Fixed
Fix Version/s: 0.7.0
 Hadoop Flags: [Reviewed]
   Status: Resolved  (was: Patch Available)

Patch committed. Thanks Richard!

 Nested sort by * throw exception
 

 Key: PIG-1086
 URL: https://issues.apache.org/jira/browse/PIG-1086
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Daniel Dai
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-1086.patch


 The following script fails:
 A = load '1.txt' as (a0, a1, a2);
 B = group A by a0;
 C = foreach B { D = order A by *; generate group, D;};
 explain C;
 Here is the stack:
 Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
 at java.util.ArrayList.get(ArrayList.java:324)
 at org.apache.pig.impl.logicalLayer.schema.Schema.getField(Schema.java:752)
 at org.apache.pig.impl.logicalLayer.LOSort.getSortInfo(LOSort.java:332)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1365)
 at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:176)
 at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:43)
 at org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:69)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1274)
 at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:130)
 at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:45)
 at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
 at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
 at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:234)
 at org.apache.pig.PigServer.compilePp(PigServer.java:864)
 at org.apache.pig.PigServer.explain(PigServer.java:583)
 ... 8 more
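
Since the failure above is specific to the nested order-by-'*', a hedged 
workaround sketch is to order by the named columns instead. It reuses the 
script's own schema; whether it avoids the exception is an assumption, not 
something confirmed in this issue:

{code}
-- assumed workaround: spell out the columns instead of '*'
A = load '1.txt' as (a0, a1, a2);
B = group A by a0;
C = foreach B { D = order A by a0, a1, a2; generate group, D; };
explain C;
{code}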

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1147) Zebra Docs for Pig 0.6.0

2009-12-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1147:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

patch committed to both trunk and 0.6 branch. Thanks Corinne!

 Zebra Docs for Pig 0.6.0
 

 Key: PIG-1147
 URL: https://issues.apache.org/jira/browse/PIG-1147
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
 Fix For: 0.6.0

 Attachments: zebra.jpg, Zebra.patch


 Zebra docs for Pig 0.6.0
 (1) XML files
 UPDATE:
 site.xml - updated to include Zebra items in Pig menu
 NEW:
 zebra_mapreduce.xml
 zebra_overview.xml
 zebra_pig.xml
 zebra_reference.xml
 zebra_stream.xml
 zebra_users.xml
 (2) IMAGE file 
 zebra.jpg

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1106) FR join should not spill

2009-12-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789623#action_12789623
 ] 

Hadoop QA commented on PIG-1106:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427716/frjoin-nonspill.patch
  against trunk revision 889346.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/118/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/118/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/118/console

This message is automatically generated.

 FR join should not spill
 

 Key: PIG-1106
 URL: https://issues.apache.org/jira/browse/PIG-1106
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Ankit Modi
 Fix For: 0.7.0

 Attachments: frjoin-nonspill.patch


 Currently, the values for the replicated side of the data are placed in a 
 spillable bag (POFRJoin, near line 275). This does not make sense because the 
 whole point of the optimization is that the data on one side fits into 
 memory. We already have a non-spillable bag implemented 
 (NonSpillableDataBag.java), and we need to change the FRJoin code to use it. We 
 also, of course, need to do lots of testing to make sure that we don't spill 
 but die instead when we run out of memory.
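
For context, a minimal Pig Latin sketch of a fragment-replicated join is 
below; the paths and relation names are hypothetical, the comments simply 
restate the in-memory assumption described above, and the quoted USING syntax 
matches the 0.6-era scripts elsewhere in this thread:

{code}
-- hypothetical inputs; every relation after the first is replicated to each
-- map task, so it must fit in memory
big   = LOAD 'big_input' AS (k, v1);
small = LOAD 'small_input' AS (k, v2);
J = JOIN big BY k, small BY k USING "replicated";
STORE J INTO 'frjoin_out';
{code}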

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1106) FR join should not spill

2009-12-11 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789624#action_12789624
 ] 

Olga Natkovich commented on PIG-1106:
-

Test failures are not due to this patch. Also, I don't believe it is easy to 
test with an automatic test but I believe Ankit tested it manually.

I will review the code and run test-commit + FRJoin tests before committing the 
patch.

 FR join should not spill
 

 Key: PIG-1106
 URL: https://issues.apache.org/jira/browse/PIG-1106
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Ankit Modi
 Fix For: 0.7.0

 Attachments: frjoin-nonspill.patch


 Currently, the values for the replicated side of the data are placed in a 
 spillable bag (POFRJoin, near line 275). This does not make sense because the 
 whole point of the optimization is that the data on one side fits into 
 memory. We already have a non-spillable bag implemented 
 (NonSpillableDataBag.java), and we need to change the FRJoin code to use it. We 
 also, of course, need to do lots of testing to make sure that we don't spill 
 but die instead when we run out of memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1085) Pass JobConf and UDF specific configuration information to UDFs

2009-12-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1085:


Fix Version/s: 0.6.0

 Pass JobConf and UDF specific configuration information to UDFs
 ---

 Key: PIG-1085
 URL: https://issues.apache.org/jira/browse/PIG-1085
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: 0.6.0

 Attachments: udfconf-2.patch, udfconf.patch


 Users have long asked for a way to get the JobConf structure in their UDFs.  
 It would also be nice to have a way to pass properties between the front end 
 and back end so that UDFs can store state during parse time and use it at 
 runtime.
 This patch does part of what is proposed in PIG-602, but not all of it.  It 
 does not provide a way to give user specified configuration files to UDFs.  
 So I will mark 602 as depending on this bug, but it isn't a duplicate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1147) Zebra Docs for Pig 0.6.0

2009-12-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789641#action_12789641
 ] 

Hadoop QA commented on PIG-1147:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427764/Zebra.patch
  against trunk revision 889870.

+1 @author.  The patch does not contain any @author tags.

+0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/119/console

This message is automatically generated.

 Zebra Docs for Pig 0.6.0
 

 Key: PIG-1147
 URL: https://issues.apache.org/jira/browse/PIG-1147
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
 Fix For: 0.6.0

 Attachments: zebra.jpg, Zebra.patch


 Zebra docs for Pig 0.6.0
 (1) XML files
 UPDATE:
 site.xml - updated to include Zebra items in Pig menu
 NEW:
 zebra_mapreduce.xml
 zebra_overview.xml
 zebra_pig.xml
 zebra_reference.xml
 zebra_stream.xml
 zebra_users.xm.
 (2) IMAGE file 
 zebra.jpg

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Build failed in Hudson: Pig-trunk #645

2009-12-11 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Pig-trunk/645/changes

Changes:

[daijy] PIG-1142: Got NullPointerException merge join with pruning

[yanz] PIG-1145: Merge Join on Large Table throws an EOF exception (yanz)

--
[...truncated 228593 lines...]
[junit] 09/12/12 02:12:08 INFO hdfs.StateChange: BLOCK* 
NameSystem.allocateBlock: 
/tmp/hadoop-hudson/mapred/system/job_20091212021137281_0002/job.split. 
blk_8744047869084643086_1014
[junit] 09/12/12 02:12:08 INFO datanode.DataNode: Receiving block 
blk_8744047869084643086_1014 src: /127.0.0.1:34013 dest: /127.0.0.1:57637
[junit] 09/12/12 02:12:08 INFO datanode.DataNode: Receiving block 
blk_8744047869084643086_1014 src: /127.0.0.1:38898 dest: /127.0.0.1:46559
[junit] 09/12/12 02:12:08 INFO datanode.DataNode: Receiving block 
blk_8744047869084643086_1014 src: /127.0.0.1:53824 dest: /127.0.0.1:56621
[junit] 09/12/12 02:12:08 INFO DataNode.clienttrace: src: /127.0.0.1:53824, 
dest: /127.0.0.1:56621, bytes: 1589, op: HDFS_WRITE, cliID: 
DFSClient_-1419330198, srvID: DS-2034772238-127.0.1.1-56621-1260583896758, 
blockid: blk_8744047869084643086_1014
[junit] 09/12/12 02:12:08 INFO datanode.DataNode: PacketResponder 0 for 
block blk_8744047869084643086_1014 terminating
[junit] 09/12/12 02:12:08 INFO DataNode.clienttrace: src: /127.0.0.1:38898, 
dest: /127.0.0.1:46559, bytes: 1589, op: HDFS_WRITE, cliID: 
DFSClient_-1419330198, srvID: DS-815252464-127.0.1.1-46559-1260583896308, 
blockid: blk_8744047869084643086_1014
[junit] 09/12/12 02:12:08 INFO hdfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:56621 is added to 
blk_8744047869084643086_1014 size 1589
[junit] 09/12/12 02:12:08 INFO datanode.DataNode: PacketResponder 1 for 
block blk_8744047869084643086_1014 terminating
[junit] 09/12/12 02:12:08 INFO hdfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:46559 is added to 
blk_8744047869084643086_1014 size 1589
[junit] 09/12/12 02:12:08 INFO DataNode.clienttrace: src: /127.0.0.1:34013, 
dest: /127.0.0.1:57637, bytes: 1589, op: HDFS_WRITE, cliID: 
DFSClient_-1419330198, srvID: DS-1844402489-127.0.1.1-57637-1260583897211, 
blockid: blk_8744047869084643086_1014
[junit] 09/12/12 02:12:08 INFO datanode.DataNode: PacketResponder 2 for 
block blk_8744047869084643086_1014 terminating
[junit] 09/12/12 02:12:08 INFO hdfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:57637 is added to 
blk_8744047869084643086_1014 size 1589
[junit] 09/12/12 02:12:08 INFO hdfs.StateChange: DIR* 
NameSystem.completeFile: file 
/tmp/hadoop-hudson/mapred/system/job_20091212021137281_0002/job.split is closed 
by DFSClient_-1419330198
[junit] 09/12/12 02:12:08 INFO FSNamesystem.audit: ugi=hudson,hudson
ip=/127.0.0.1   cmd=create  
src=/tmp/hadoop-hudson/mapred/system/job_20091212021137281_0002/job.xml 
dst=null perm=hudson:supergroup:rw-r--r--
[junit] 09/12/12 02:12:08 INFO FSNamesystem.audit: ugi=hudson,hudson
ip=/127.0.0.1   cmd=setPermission   
src=/tmp/hadoop-hudson/mapred/system/job_20091212021137281_0002/job.xml 
dst=null perm=hudson:supergroup:rw-r--r--
[junit] 09/12/12 02:12:08 INFO hdfs.StateChange: BLOCK* 
NameSystem.allocateBlock: 
/tmp/hadoop-hudson/mapred/system/job_20091212021137281_0002/job.xml. 
blk_2666416874129524588_1015
[junit] 09/12/12 02:12:08 INFO datanode.DataNode: Receiving block 
blk_2666416874129524588_1015 src: /127.0.0.1:53825 dest: /127.0.0.1:56621
[junit] 09/12/12 02:12:08 INFO datanode.DataNode: Receiving block 
blk_2666416874129524588_1015 src: /127.0.0.1:34017 dest: /127.0.0.1:57637
[junit] 09/12/12 02:12:08 INFO datanode.DataNode: Receiving block 
blk_2666416874129524588_1015 src: /127.0.0.1:59529 dest: /127.0.0.1:41270
[junit] 09/12/12 02:12:08 INFO DataNode.clienttrace: src: /127.0.0.1:59529, 
dest: /127.0.0.1:41270, bytes: 48759, op: HDFS_WRITE, cliID: 
DFSClient_-1419330198, srvID: DS-486837220-127.0.1.1-41270-1260583895816, 
blockid: blk_2666416874129524588_1015
[junit] 09/12/12 02:12:08 INFO datanode.DataNode: PacketResponder 0 for 
block blk_2666416874129524588_1015 terminating
[junit] 09/12/12 02:12:08 INFO hdfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:41270 is added to 
blk_2666416874129524588_1015 size 48759
[junit] 09/12/12 02:12:08 INFO DataNode.clienttrace: src: /127.0.0.1:34017, 
dest: /127.0.0.1:57637, bytes: 48759, op: HDFS_WRITE, cliID: 
DFSClient_-1419330198, srvID: DS-1844402489-127.0.1.1-57637-1260583897211, 
blockid: blk_2666416874129524588_1015
[junit] 09/12/12 02:12:08 INFO hdfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:57637 is added to 
blk_2666416874129524588_1015 size 48759
[junit] 09/12/12 02:12:08 INFO datanode.DataNode: PacketResponder 1 for 
block blk_2666416874129524588_1015 terminating

[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-11 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789668#action_12789668
 ] 

Thejas M Nair commented on PIG-965:
---

Review comments: 

* The regex will always be on the rhs, so we don't need the code/classes that 
try to determine which side has the regular expression based on which side 
has the constant.

* In determineBestRegexMethod, we need to add "(?" to the list of regex strings 
not supported by dk.brics (in javaRegexOnly). It has special meaning in Java 
regex, which is not honored by dk.brics.

* In determineBestRegexMethod, we are dealing with cases like "\d" (choose 
Java regex) and "\\d" (choose dk.brics), but not with "\\\d" (which should 
choose Java regex); i.e. we need to go back until we find a non-'\' char.

* In RegexInit.compile(..), the following messages are more appropriate at debug 
level, not at info. At the info level, they might also confuse the user.
+log.info("Got an IllegalArgumentException for Pattern: " + 
pattern );
+log.info(e.getMessage());
+log.info("Switching to java.util.regex ");

* The following comment in PORegex.java seems to be out of place: 
 // This is a BinaryComparisonOperator hence there can only be two inputs



 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of the 'matches' comparison operator have the 
 following properties:
 1. The rhs is a constant string, e.g. c1 matches 'abc%'
 2. Regexes that look for a matching prefix, suffix, etc. are very common, 
 e.g. 'abc%', '%abc', '%abc%'
 To optimize for these common cases, PORegex.java can be changed to:
 1. Compile the pattern (the rhs of matches) and re-use it if the pattern 
 string has not changed.
 2. Use string comparisons for the simple common regexes (in 2 above).
 The implementation of the Hive LIKE clause uses similar optimizations.
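
For reference, a minimal Pig Latin sketch of the constant-pattern matches 
expressions the description targets; the data and relation names are 
hypothetical, and the patterns are written as Java regexes, since matches 
takes a regex rather than a SQL LIKE pattern:

{code}
A = LOAD 'strings' AS (c1:chararray);
p = FILTER A BY c1 matches 'abc.*';    -- prefix match; could become a startsWith check
s = FILTER A BY c1 matches '.*abc';    -- suffix match; could become an endsWith check
c = FILTER A BY c1 matches '.*abc.*';  -- substring match; could become a contains check
STORE c INTO 'matches_out';
{code}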

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.