[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789425#action_12789425 ]

Chao Wang commented on PIG-1145:

The patch looks good +1.

> [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
>
> Key: PIG-1145
> URL: https://issues.apache.org/jira/browse/PIG-1145
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.6.0, 0.7.0
> Reporter: Jing Huang
> Assignee: Yan Zhou
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-1145.patch, PIG-1145.patch, PIG-1145.patch
>
> Pig script :
> register $zebraJar;
> --fs -rmr $outputDir
> a1 = LOAD '$inputDir/unsorted1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
> a2 = LOAD '$inputDir/unsorted2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
> sort1 = order a1 by str2;
> sort2 = order a2 by str2;
> --store sort1 into '$outputDir/sorted11' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
> --store sort2 into '$outputDir/sorted21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
> rec1 = load '$outputDir/sorted11' using org.apache.hadoop.zebra.pig.TableLoader();
> rec2 = load '$outputDir/sorted21' using org.apache.hadoop.zebra.pig.TableLoader();
> joina = join rec1 by str2, rec2 by str2 using "merge" ;
> --E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2;
> store joina into '$outputDir/join1' using org.apache.hadoop.zebra.pig.TableStorer('');
>
> ==
> stacktrace:
> org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error processing right input during merge join
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>   at org.apache.hadoop.mapred.Child.main(Child.java:159)
> Caused by: java.io.EOFException: No key-value to read
>   at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590)
>   at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611)
>   at org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
>   at org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
>   at org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1082)
>   at org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
>   at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
>   ... 7 more

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789411#action_12789411 ]

Yan Zhou commented on PIG-1145:

All failed test cases are PIG tests and look environmental. I reran the first failed test, TestJoin, on my local cluster, and it passes cleanly.
[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789376#action_12789376 ]

Hadoop QA commented on PIG-1145:

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12427696/PIG-1145.patch
  against trunk revision 889346.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 2 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    -1 core tests. The patch failed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/116/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/116/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/116/console

This message is automatically generated.
[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789196#action_12789196 ]

Yan Zhou commented on PIG-1145:

Actually with pruning enabled the exception stack is:

Backend error message
-
org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error processing right input during merge join
  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:186)
  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
  at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.io.IOException: seekTo() failed: Column Groups are not evenly positioned.
  at org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.seekTo(BasicTable.java:1148)
  at org.apache.hadoop.zebra.mapred.TableRecordReader.seekTo(TableRecordReader.java:120)
  at org.apache.hadoop.zebra.pig.TableLoader.seekNear(TableLoader.java:190)
  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.seekInRightStream(POMergeJoin.java:406)
  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:184)
  ... 9 more
[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789188#action_12789188 ]

Jing Huang commented on PIG-1145:

Found another failure on merge join.

This merge join script failed:

register $zebraJar;
--fs -rmr $outputDir
--a1 = LOAD '$inputDir/unsorted1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
--a2 = LOAD '$inputDir/unsorted2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
--sort1 = order a1 by byte2;
--sort2 = order a2 by byte2;
--store sort1 into '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2];[byte2]');
--store sort2 into '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2];[byte2]');
rec1 = load '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');
rec2 = load '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');
joina = join rec1 by byte2, rec2 by byte2 using "merge" ;
E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2, $4 as byte2;
store E into '$outputDir/bad1' using org.apache.hadoop.zebra.pig.TableStorer('');

=
Instead, this similar script works with the previous patch:

register $zebraJar;
--fs -rmr $outputDir
a1 = LOAD '$inputDir/unsorted1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
a2 = LOAD '$inputDir/unsorted2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
sort1 = order a1 by byte2;
sort2 = order a2 by byte2;
store sort1 into '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2,byte2]');
store sort2 into '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2,byte2]');
rec1 = load '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');
rec2 = load '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');
joina = join rec1 by byte2, rec2 by byte2 using "merge" ;
E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2, $4 as byte2;
store E into '$outputDir/join3' using org.apache.hadoop.zebra.pig.TableStorer('');

Here is the stack trace:

Backend error message
-
org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error processing right input during merge join
  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
  at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.io.EOFException: No key-value to read
  at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590)
  at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611)
  at org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
  at org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
  at org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1083)
  at org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
  at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414)
  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
  ... 9 more

=
This is how I ran it (I disabled pruning to simplify the possible problem):

java -cp /grid/0/dev/hadoopqa/jing1234/conf:/grid/0/dev/hadoopqa/jars/pig.jar:/grid/0/dev/hadoopqa/jars/tfile.jar:/grid/0/dev/hadoopqa/jars/zebra.jar org.apache.pig.Main -m config -M -t PruneColumns bad_join.pig
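The only difference in how the two tables above were written is the TableStorer argument: the failing tables used '[count,seed,int1,str2];[byte2]' and the working ones used '[count,seed,int1,str2,byte2]'. The sketch below is only an illustration of that difference, assuming that a semicolon separates bracketed column groups in the spec. The class StorageSpecSketch and its method columnGroups are hypothetical names, not part of Zebra, and this is not Zebra's own parser.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits a storage spec string into column groups. Not Zebra's parser. */
final class StorageSpecSketch {

    /** Assumes groups look like "[a,b]" and are separated by ';'. */
    static List<List<String>> columnGroups(String spec) {
        List<List<String>> groups = new ArrayList<>();
        for (String part : spec.split(";")) {
            String body = part.trim();
            // Strip the surrounding brackets of one group, if present.
            if (body.startsWith("[") && body.endsWith("]")) {
                body = body.substring(1, body.length() - 1);
            }
            List<String> columns = new ArrayList<>();
            for (String column : body.split(",")) {
                String name = column.trim();
                if (!name.isEmpty()) {
                    columns.add(name);
                }
            }
            if (!columns.isEmpty()) {
                groups.add(columns);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Spec used by the failing script: the sort key byte2 sits in its own group.
        System.out.println(columnGroups("[count,seed,int1,str2];[byte2]"));
        // Spec used by the working script: all columns share one group.
        System.out.println(columnGroups("[count,seed,int1,str2,byte2]"));
    }
}
```

Under that assumption the failing case stores two column groups per table and the working case stores one, which is consistent with the failure showing up only when more than one column group has to be positioned by a seek.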
[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788952#action_12788952 ]

Jing Huang commented on PIG-1145:

I verified the fix. It works.
[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788607#action_12788607 ]

Hadoop QA commented on PIG-1145:

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12427556/PIG-1145.patch
  against trunk revision 52.

    +1 @author. The patch does not contain any @author tags.
    -1 tests included. The patch doesn't appear to include any new or modified tests.
       Please justify why no tests are needed for this patch.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/111/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/111/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/111/console

This message is automatically generated.
[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788524#action_12788524 ]

Yan Zhou commented on PIG-1145:
---

The problem is that if the seek call on the index table is for a key that is past the last key of a data file, the scanner is positioned past the EOF of that data file. Instead, it should be positioned at the beginning of the next data file. The CGScanner.atEnd method only checks whether the current file index is within the valid range; it leaves the responsibility for setting the proper file index to the position movers, such as the scanner's advance and seekTo methods. As a result, positioning the scanner past the EOF of any data file causes an EOFException to be thrown.

The fix is to add a check in the scanner's seekTo method so that if, after the seek, the position is past the end of a data file, the scanner is repositioned to the start of the next data file, just as the advance method already does.

> [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
>
>                 Key: PIG-1145
>                 URL: https://issues.apache.org/jira/browse/PIG-1145
>             Project: Pig
>          Issue Type: Bug
>   Affects Versions: 0.6.0, 0.7.0
>            Reporter: Jing Huang
>            Assignee: Yan Zhou
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: PIG-1145.patch
>
>
> Pig script :
> register $zebraJar;
> --fs -rmr $outputDir
> a1 = LOAD '$inputDir/unsorted1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
> a2 = LOAD '$inputDir/unsorted2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
> sort1 = order a1 by str2;
> sort2 = order a2 by str2;
> --store sort1 into '$outputDir/sorted11' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
> --store sort2 into '$outputDir/sorted21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
> rec1 = load '$outputDir/sorted11' using org.apache.hadoop.zebra.pig.TableLoader();
> rec2 = load '$outputDir/sorted21' using org.apache.hadoop.zebra.pig.TableLoader();
> joina = join rec1 by str2, rec2 by str2 using "merge" ;
> --E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2;
> store joina into '$outputDir/join1' using org.apache.hadoop.zebra.pig.TableStorer('');
> ~
> ~
> ~
> ==
> stacktrace:
> org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error processing right input during merge join
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>   at org.apache.hadoop.mapred.Child.main(Child.java:159)
> Caused by: java.io.EOFException: No key-value to read
>   at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590)
>   at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611)
>   at org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
>   at org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
>   at org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1082)
>   at org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
>   at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
>   ... 7 more

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
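
For readers following the thread, the seekTo behaviour Yan Zhou describes in the comment above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the contents of PIG-1145.patch and not Zebra's actual CGScanner: the class ToyMultiFileScanner, the helper skipToNextFileIfAtEof, and the way the index lookup picks a file (the last file whose first key is not greater than the sought key) are all hypothetical. Each "data file" is assumed to be a non-empty sorted list of string keys.

```java
import java.util.List;

/** Toy model of a scanner over several sorted data files; NOT Zebra's CGScanner. */
class ToyMultiFileScanner {
    private final List<List<String>> files; // each inner list is one sorted, non-empty "data file"
    private int fileIdx = 0;                // index of the current file
    private int pos = 0;                    // position inside the current file

    ToyMultiFileScanner(List<List<String>> files) {
        this.files = files;
    }

    /** Mirrors the atEnd described in the comment: only checks the file index range. */
    boolean atEnd() {
        return fileIdx >= files.size();
    }

    /** Returns the current key; fails if positioned past the end of the current file. */
    String getKey() {
        List<String> file = files.get(fileIdx);
        if (pos >= file.size()) {
            throw new IllegalStateException("No key-value to read");
        }
        return file.get(pos);
    }

    /** Moves to the next key, rolling over to the next file at end of file. */
    void advance() {
        pos++;
        skipToNextFileIfAtEof();
    }

    /** Positions the scanner at the first key that is >= the given key. */
    void seekTo(String key) {
        // Hypothetical index lookup: pick the last file whose first key is <= key.
        fileIdx = 0;
        while (fileIdx + 1 < files.size()
                && files.get(fileIdx + 1).get(0).compareTo(key) <= 0) {
            fileIdx++;
        }
        // Seek inside that file; pos can end up equal to the file size.
        List<String> file = files.get(fileIdx);
        pos = 0;
        while (pos < file.size() && file.get(pos).compareTo(key) < 0) {
            pos++;
        }
        // The check described in the comment: never stay past the EOF of a file.
        skipToNextFileIfAtEof();
    }

    /** Hypothetical helper shared by advance and seekTo. */
    private void skipToNextFileIfAtEof() {
        if (!atEnd() && pos >= files.get(fileIdx).size()) {
            fileIdx++;
            pos = 0;
        }
    }
}
```

With two files holding the keys a, b, c and e, f, a call to seekTo("d") selects the first file, runs off its end, and is then moved to the start of the second file, so getKey() returns "e". Without the final check in seekTo, the scanner would stay past the end of the first file while atEnd() still reports false, and the next getKey() would fail, which is the shape of the failure in the stack trace.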
[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788496#action_12788496 ]

Chao Wang commented on PIG-1145:

Patch reviewed +1.