[jira] Commented: (PIG-1145) [zebra] merge join on large table (100,000,000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789188#action_12789188 ]

Jing Huang commented on PIG-1145:
---------------------------------

Found another failure on merge join. This merge join script failed:

{noformat}
register $zebraJar;
--fs -rmr $outputDir
--a1 = LOAD '$inputDir/unsorted1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
--a2 = LOAD '$inputDir/unsorted2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
--sort1 = order a1 by byte2;
--sort2 = order a2 by byte2;
--store sort1 into '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2];[byte2]');
--store sort2 into '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2];[byte2]');
rec1 = load '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');
rec2 = load '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');
joina = join rec1 by byte2, rec2 by byte2 using merge ;
E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2, $4 as byte2;
store E into '$outputDir/bad1' using org.apache.hadoop.zebra.pig.TableStorer('');
{noformat}

Instead, this similar script works with the previous patch:

{noformat}
register $zebraJar;
--fs -rmr $outputDir
a1 = LOAD '$inputDir/unsorted1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
a2 = LOAD '$inputDir/unsorted2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
sort1 = order a1 by byte2;
sort2 = order a2 by byte2;
store sort1 into '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2,byte2]');
store sort2 into '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2,byte2]');
rec1 = load '$outputDir/100Msortedbyte21' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');
rec2 = load '$outputDir/100Msortedbyte22' using org.apache.hadoop.zebra.pig.TableLoader('','sorted');
joina = join rec1 by byte2, rec2 by byte2 using merge ;
E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2, $4 as byte2;
store E into '$outputDir/join3' using org.apache.hadoop.zebra.pig.TableStorer('');
{noformat}

Here is the stack trace:

{noformat}
Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error processing right input during merge join
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.io.EOFException: No key-value to read
        at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590)
        at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611)
        at org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
        at org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
        at org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1083)
        at org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
        at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
        ... 9 more
{noformat}

This is how I run it (I disabled pruning to simplify the possible problem):

{noformat}
java -cp /grid/0/dev/hadoopqa/jing1234/conf:/grid/0/dev/hadoopqa/jars/pig.jar:/grid/0/dev/hadoopqa/jars/tfile.jar:/grid/0/dev/hadoopqa/jars/zebra.jar org.apache.pig.Main -m config -M -t PruneColumns bad_join.pig
{noformat}
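For context, the `using merge` join in the scripts above relies on both inputs being sorted on the join key: the left side is streamed, and the right side is seeked forward to each left key (which is what `seekNear` and `getNextRightInp` are doing in the stack traces). A minimal sketch of that algorithm follows, using illustrative names and plain int arrays rather than Pig's actual classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of a sort-merge join over two inputs that are already
// sorted on the join key. The left input is streamed; the right input is
// only ever advanced forward ("seeked near" the current left key), which
// is why merge join breaks if either input is not truly sorted.
public class MergeJoinSketch {
    static List<int[]> mergeJoin(int[] left, int[] right) {
        List<int[]> out = new ArrayList<>();
        int r = 0;
        for (int l = 0; l < left.length; l++) {
            // Advance the right cursor to the first key >= the left key.
            while (r < right.length && right[r] < left[l]) r++;
            // Emit one output pair per matching right record.
            int rr = r;
            while (rr < right.length && right[rr] == left[l]) {
                out.add(new int[]{left[l], right[rr]});
                rr++;
            }
        }
        return out;
    }
}
```

Because the right cursor never moves backward past `r`, each right record is scanned at most a bounded number of times, giving the single-pass behavior the merge join optimization is after.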
[jira] Commented: (PIG-1145) [zebra] merge join on large table (100,000,000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789196#action_12789196 ]

Yan Zhou commented on PIG-1145:
-------------------------------

Actually, with pruning enabled the exception stack is:

{noformat}
Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error processing right input during merge join
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:186)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.io.IOException: seekTo() failed: Column Groups are not evenly positioned.
        at org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.seekTo(BasicTable.java:1148)
        at org.apache.hadoop.zebra.mapred.TableRecordReader.seekTo(TableRecordReader.java:120)
        at org.apache.hadoop.zebra.pig.TableLoader.seekNear(TableLoader.java:190)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.seekInRightStream(POMergeJoin.java:406)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:184)
        ... 9 more
{noformat}

[zebra] merge join on large table (100,000,000 rows zebra table) failed
-----------------------------------------------------------------------

                 Key: PIG-1145
                 URL: https://issues.apache.org/jira/browse/PIG-1145
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.6.0, 0.7.0
            Reporter: Jing Huang
            Assignee: Yan Zhou
             Fix For: 0.6.0, 0.7.0
         Attachments: PIG-1145.patch, PIG-1145.patch

Pig script:

{noformat}
register $zebraJar;
--fs -rmr $outputDir
a1 = LOAD '$inputDir/unsorted1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
a2 = LOAD '$inputDir/unsorted2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
sort1 = order a1 by str2;
sort2 = order a2 by str2;
--store sort1 into '$outputDir/sorted11' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
--store sort2 into '$outputDir/sorted21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
rec1 = load '$outputDir/sorted11' using org.apache.hadoop.zebra.pig.TableLoader();
rec2 = load '$outputDir/sorted21' using org.apache.hadoop.zebra.pig.TableLoader();
joina = join rec1 by str2, rec2 by str2 using merge ;
--E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2;
store joina into '$outputDir/join1' using org.apache.hadoop.zebra.pig.TableStorer('');
{noformat}

Stacktrace:

{noformat}
org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error processing right input during merge join
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
        at
{noformat}
[jira] Updated: (PIG-1145) [zebra] merge join on large table (100,000,000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1145:
--------------------------

    Status: Open  (was: Patch Available)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1145) [zebra] merge join on large table (100,000,000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1145:
--------------------------

    Attachment: PIG-1145.patch
[jira] Updated: (PIG-1145) [zebra] merge join on large table (100,000,000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1145:
--------------------------

    Status: Patch Available  (was: Open)
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-965:
---------------------------

    Attachment:     (was: poregex2.patch)

PERFORMANCE: optimize common case in matches (PORegex)
------------------------------------------------------

                 Key: PIG-965
                 URL: https://issues.apache.org/jira/browse/PIG-965
             Project: Pig
          Issue Type: Improvement
          Components: impl
            Reporter: Thejas M Nair
            Assignee: Ankit Modi

Some frequently seen use cases of the 'matches' comparison operator have the following properties:

1. The rhs is a constant string, e.g. c1 matches 'abc%'.
2. Regexes that match a prefix, suffix, or substring are very common, e.g. 'abc%', '%abc', '%abc%'.

To optimize for these common cases, PORegex.java can be changed to:

1. Compile the pattern (the rhs of matches) and re-use it if the pattern string has not changed.
2. Use string comparisons for the simple common regexes in 2 above.

The implementation of Hive's LIKE clause uses similar optimizations.
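The two proposed optimizations can be sketched like this. This is illustrative code, not the actual PORegex.java change; note that Pig's matches operator takes Java regex syntax, so the 'abc%' shapes above correspond to patterns like "abc.*":

```java
import java.util.regex.Pattern;

// Hedged sketch of the PORegex optimizations: (1) cache the compiled
// Pattern and reuse it while the rhs string is unchanged, and (2) answer
// simple prefix/suffix/substring patterns with plain string operations
// instead of invoking the regex engine at all.
public class CachedMatcher {
    private String lastPattern;   // last rhs seen
    private Pattern compiled;     // reused while the rhs is unchanged

    public boolean matches(String value, String rhs) {
        // Fast path for ".*abc.*": a plain substring test.
        if (rhs.length() >= 4 && rhs.startsWith(".*") && rhs.endsWith(".*")
                && isLiteral(rhs.substring(2, rhs.length() - 2)))
            return value.contains(rhs.substring(2, rhs.length() - 2));
        // Fast path for "abc.*": a prefix test.
        if (rhs.endsWith(".*") && isLiteral(rhs.substring(0, rhs.length() - 2)))
            return value.startsWith(rhs.substring(0, rhs.length() - 2));
        // Fast path for ".*abc": a suffix test.
        if (rhs.startsWith(".*") && isLiteral(rhs.substring(2)))
            return value.endsWith(rhs.substring(2));
        // General case: recompile only when the pattern string changes.
        if (!rhs.equals(lastPattern)) {
            compiled = Pattern.compile(rhs);
            lastPattern = rhs;
        }
        return compiled.matcher(value).matches();
    }

    private static boolean isLiteral(String s) {
        // Only take a fast path when the inner text has no regex metacharacters.
        return s.chars().allMatch(Character::isLetterOrDigit);
    }
}
```

Caching the compiled pattern pays off because the rhs is a constant in the common case, so every tuple after the first reuses the same Pattern object.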
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-965:
---------------------------

    Status: Patch Available  (was: Open)
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-965:
---------------------------

    Attachment: automaton.jar
                poregex2.patch

New patch with comments removed, and automaton.jar added from http://www.brics.dk/~amoeller/automaton/automaton.jar. The patch fails findBugs due to missing symbols; I ran findBugs after adding the jar to the build, and it did not report any warnings in the modified and added files.
[jira] Commented: (PIG-1142) Got NullPointerException merge join with pruning
[ https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789215#action_12789215 ]

Hadoop QA commented on PIG-1142:
--------------------------------

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12427671/PIG-1142-2.patch
against trunk revision 889346.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 3 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/115/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/115/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/115/console

This message is automatically generated.

Got NullPointerException merge join with pruning
------------------------------------------------

                 Key: PIG-1142
                 URL: https://issues.apache.org/jira/browse/PIG-1142
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.6.0
            Reporter: Jing Huang
            Assignee: Daniel Dai
             Fix For: 0.6.0
         Attachments: PIG-1142-1.patch, PIG-1142-2.patch

Here is my pig script:

{noformat}
register $zebraJar;
--fs -rmr $outputDir
a1 = LOAD '$inputDir/small1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
a2 = LOAD '$inputDir/small2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
sort1 = order a1 by str2;
sort2 = order a2 by str2;
--store sort1 into '$outputDir/smallsorted11' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
--store sort2 into '$outputDir/smallsorted21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
rec1 = load '$outputDir/smallsorted11' using org.apache.hadoop.zebra.pig.TableLoader();
rec2 = load '$outputDir/smallsorted21' using org.apache.hadoop.zebra.pig.TableLoader();
joina = join rec1 by str2, rec2 by str2 using merge ;
E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2;
--limitedVals = LIMIT E 5;
--dump limitedVals;
store E into '$outputDir/smalljoin2' using org.apache.hadoop.zebra.pig.TableStorer('');
{noformat}

Here is the stacktrace:

{noformat}
java.lang.NullPointerException
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:312)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.extractKeysFromTuple(POMergeJoin.java:464)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:341)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:159)
{noformat}
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-965:
---------------------------

    Status: Open  (was: Patch Available)

One small change to JarManager.java is missing. Will add a new patch with it.
[jira] Updated: (PIG-1106) FR join should not spill
[ https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-1106:
----------------------------

    Status: Patch Available  (was: Open)

This patch does not have any unit tests.

FR join should not spill
------------------------

                 Key: PIG-1106
                 URL: https://issues.apache.org/jira/browse/PIG-1106
             Project: Pig
          Issue Type: Bug
            Reporter: Olga Natkovich
            Assignee: Ankit Modi
             Fix For: 0.7.0
         Attachments: frjoin-nonspill.patch

Currently, the values for the replicated side of the data are placed in a spillable bag (POFRJoin, near line 275). This does not make sense, because the whole point of the optimization is that the data on one side fits into memory. We already have a non-spillable bag implemented (NonSpillableDataBag.java), and we need to change the FRJoin code to use it. And of course we need to do lots of testing to make sure that we don't spill but die instead when we run out of memory.
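The design choice here can be sketched as follows. This is a hypothetical, simplified shape of what a non-spillable bag provides (Pig's real NonSpillableDataBag implements the DataBag interface over tuples); the point is that the replicated side lives in a plain in-memory list, so an oversized input fails fast with OutOfMemoryError instead of quietly spilling to disk and defeating the fragment-replicate optimization:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hedged sketch of a non-spillable bag (illustrative names, not Pig's
// actual class): records are kept in an in-memory ArrayList with no
// spill-to-disk fallback, matching the FR join assumption that the
// replicated side always fits in memory.
public class NonSpillableBag<T> implements Iterable<T> {
    private final List<T> tuples = new ArrayList<>();

    public void add(T t) { tuples.add(t); }   // may OOM; never spills

    public long size() { return tuples.size(); }

    @Override
    public Iterator<T> iterator() { return tuples.iterator(); }
}
```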
[jira] Commented: (PIG-1106) FR join should not spill
[ https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789294#action_12789294 ] Ankit Modi commented on PIG-1106:
---
The tests I ran used two files with the format: f1: random chararray(100), f2: random int. The left-side file contained 100 tuples and the right-side file contained 3 million tuples.

Code
{noformat}
A = load 'leftsidefrjoin.txt' as (key, value);
B = load 'rightsidefrjoin.txt' as (key, value);
C = join A by key left, B by key using repl; -- fragmented input and replicated input
store C into 'output';
{noformat}

This generated the following error:
{noformat}
FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
 at java.util.ArrayList.<init>(ArrayList.java:112)
 at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
 at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:369)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:288)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.setUpHashMap(POFRJoin.java:351)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.getNext(POFRJoin.java:211)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:250)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:241)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
{noformat}

I ran the same job with the same records on the left-hand side and 100K records on the right-hand side. The job completed successfully.

FR join should not spill
Key: PIG-1106
URL: https://issues.apache.org/jira/browse/PIG-1106
Project: Pig
Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Ankit Modi
Fix For: 0.7.0
Attachments: frjoin-nonspill.patch

Currently, the values for the replicated side of the data are placed in a spillable bag (POFRJoin, near line 275). This does not make sense, because the whole point of the optimization is that the data on one side fits into memory. We already have a non-spillable bag implemented (NonSpillableDataBag.java), and we need to change the FRJoin code to use it. And of course we need to do lots of testing to make sure that we don't spill but instead die when we run out of memory.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
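The OOM above hits during the build phase of the join. For context, here is a minimal sketch of what a fragment-replicate join does (illustrative Python with invented names, not POFRJoin's actual code): the whole replicated side is loaded into an in-memory table, which is exactly why a spillable bag makes no sense there.

```python
# Illustrative fragment-replicate (broadcast) join sketch. The build
# phase materializes the entire replicated input in memory -- the map
# that PIG-1106 argues should be a non-spillable structure -- and the
# probe phase streams the large, fragmented input against it.
from collections import defaultdict

def fr_join(fragment, replicated):
    """Left outer join on field 0; 'replicated' must fit in memory."""
    # Build phase: hash the small, replicated side by its join key.
    table = defaultdict(list)
    for tup in replicated:
        table[tup[0]].append(tup)
    # Probe phase: stream the large side, pad with nulls on no match
    # (left outer semantics, assuming 2-field right tuples here).
    out = []
    for tup in fragment:
        matches = table.get(tup[0])
        if matches:
            for m in matches:
                out.append(tup + m)
        else:
            out.append(tup + (None, None))
    return out
```

With the large side on the probe path, only the replicated side's size matters for memory, which is why the 100K-row right side succeeded where 3 million rows did not.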
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Modi updated PIG-965:
---
Attachment: (was: automaton.jar)

PERFORMANCE: optimize common case in matches (PORegex)
Key: PIG-965
URL: https://issues.apache.org/jira/browse/PIG-965
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi

Some frequently seen use cases of the 'matches' comparison operator have the following properties:
1. The rhs is a constant string, e.g. c1 matches 'abc%'.
2. Regexes that look for a matching prefix, suffix, etc. are very common, e.g. 'abc%', '%abc', '%abc%'.

To optimize for these common cases, PORegex.java can be changed to:
1. Compile the pattern (rhs of matches) and re-use it if the pattern string has not changed.
2. Use string comparisons for the simple common regexes (in 2 above).

The implementation of Hive's LIKE clause uses similar optimizations.
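The two proposed optimizations can be sketched as follows (a Python illustration of the idea only, not the PORegex.java change; the %-wildcard notation follows the examples in the issue, and the cache here is deliberately simplistic):

```python
# Sketch of PIG-965's two fast paths: (1) cache the compiled pattern and
# re-use it while the pattern string is unchanged; (2) replace the common
# prefix/suffix/substring shapes with plain string comparisons.
import re

_cache = {"pattern": None, "compiled": None}

def matches(value, pattern):
    body = pattern.strip("%")
    # Fast path: only simple literal bodies qualify for string ops.
    if re.fullmatch(r"\w*", body):
        if pattern.startswith("%") and pattern.endswith("%"):
            return body in value            # '%abc%' -> substring test
        if pattern.endswith("%"):
            return value.startswith(body)   # 'abc%'  -> prefix test
        if pattern.startswith("%"):
            return value.endswith(body)     # '%abc'  -> suffix test
    # General case: compile once, re-use until the pattern changes.
    if _cache["pattern"] != pattern:
        _cache["pattern"] = pattern
        _cache["compiled"] = re.compile(pattern.replace("%", ".*"))
    return _cache["compiled"].fullmatch(value) is not None
```

The string-comparison branches avoid any regex machinery for the shapes the issue calls out, which is where most of the win comes from.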
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Modi updated PIG-965:
---
Attachment: (was: poregex2.patch)
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Modi updated PIG-965:
---
Status: Patch Available (was: Open)
Attachments: automaton.jar, poregex2.patch
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Modi updated PIG-965:
---
Attachment: automaton.jar, poregex2.patch
[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789376#action_12789376 ] Hadoop QA commented on PIG-1145:
---
-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427696/PIG-1145.patch against trunk revision 889346.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/116/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/116/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/116/console
This message is automatically generated.
[zebra] merge join on large table ( 100,000.000 rows zebra table) failed
Key: PIG-1145
URL: https://issues.apache.org/jira/browse/PIG-1145
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Jing Huang
Assignee: Yan Zhou
Fix For: 0.6.0, 0.7.0
Attachments: PIG-1145.patch, PIG-1145.patch, PIG-1145.patch

Pig script:
{noformat}
register $zebraJar;
--fs -rmr $outputDir
a1 = LOAD '$inputDir/unsorted1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
a2 = LOAD '$inputDir/unsorted2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
sort1 = order a1 by str2;
sort2 = order a2 by str2;
--store sort1 into '$outputDir/sorted11' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
--store sort2 into '$outputDir/sorted21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
rec1 = load '$outputDir/sorted11' using org.apache.hadoop.zebra.pig.TableLoader();
rec2 = load '$outputDir/sorted21' using org.apache.hadoop.zebra.pig.TableLoader();
joina = join rec1 by str2, rec2 by str2 using merge;
--E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2;
store joina into '$outputDir/join1' using org.apache.hadoop.zebra.pig.TableStorer('');
{noformat}

Stacktrace:
{noformat}
org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error processing right input during merge join
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.io.EOFException: No key-value to read
 at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590)
 at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611)
 at org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
 at org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
 at org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1082)
 at org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
 at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
 ... 7 more
{noformat}
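For context, a minimal sketch of the sort-merge join the 'using merge' clause performs (illustrative Python, not POMergeJoin's code): both inputs must already be sorted on the join key, and the right side is scanned forward to keep pace with the left. The EOFException in the trace fires when the right-side scanner is driven past the end of the table.

```python
# Illustrative sort-merge join over two key-sorted inputs. The right
# cursor only ever moves forward, which is what makes the join a single
# streaming pass -- and what makes an out-of-bounds advance on the right
# side (the "No key-value to read" failure above) fatal if unguarded.
def merge_join(left, right, key=lambda t: t[0]):
    out = []
    i = 0
    for l in left:
        # Advance the right cursor past keys smaller than the left key,
        # stopping at the end of the input instead of reading past it.
        while i < len(right) and key(right[i]) < key(l):
            i += 1
        # Emit all right tuples whose key equals the left key.
        j = i
        while j < len(right) and key(right[j]) == key(l):
            out.append(l + right[j])
            j += 1
    return out
```

The bounds checks on the right cursor are the sketch's stand-in for the end-of-table handling that the patch addresses.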
[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789380#action_12789380 ] Hadoop QA commented on PIG-965:
---
-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427730/automaton.jar against trunk revision 889346.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch.
-1 patch. The patch command could not apply the patch.
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/117/console
This message is automatically generated.
[jira] Commented: (PIG-1142) Got NullPointerException merge join with pruning
[ https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789388#action_12789388 ] Olga Natkovich commented on PIG-1142:
---
+1. Daniel, the code changes look good, but do we need to add more unit tests to cover them?

Got NullPointerException merge join with pruning
Key: PIG-1142
URL: https://issues.apache.org/jira/browse/PIG-1142
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Daniel Dai
Fix For: 0.6.0
Attachments: PIG-1142-1.patch, PIG-1142-2.patch

Here is my pig script:
{noformat}
register $zebraJar;
--fs -rmr $outputDir
a1 = LOAD '$inputDir/small1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
a2 = LOAD '$inputDir/small2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
sort1 = order a1 by str2;
sort2 = order a2 by str2;
--store sort1 into '$outputDir/smallsorted11' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
--store sort2 into '$outputDir/smallsorted21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
rec1 = load '$outputDir/smallsorted11' using org.apache.hadoop.zebra.pig.TableLoader();
rec2 = load '$outputDir/smallsorted21' using org.apache.hadoop.zebra.pig.TableLoader();
joina = join rec1 by str2, rec2 by str2 using merge;
E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2;
--limitedVals = LIMIT E 5;
--dump limitedVals;
store E into '$outputDir/smalljoin2' using org.apache.hadoop.zebra.pig.TableStorer('');
{noformat}

Here is the stacktrace:
{noformat}
java.lang.NullPointerException
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:312)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.extractKeysFromTuple(POMergeJoin.java:464)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:341)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:159)
{noformat}
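One way to read the trace (a hypothetical sketch with invented names, not the actual POLocalRearrange code or the fix in the attached patches): the merge join extracts its key through a local-rearrange step that evaluates one plan per key column, and if column pruning has removed the plan behind a key column, the extractor dereferences null. A guard turns the bare NPE into a diagnosable error.

```python
# Hypothetical key-extraction sketch. A "plan" here is just a column
# index standing in for Pig's per-column expression plan; a pruned
# column is modeled as None.
def extract_key(tup, key_plans):
    keys = []
    for plan in key_plans:
        if plan is None:
            # Pruning removed this column's plan; fail loudly instead
            # of raising a bare NullPointerException as in the trace.
            raise ValueError("join key column was pruned from the plan")
        keys.append(tup[plan])
    # Single-column keys are unwrapped, multi-column keys stay a tuple.
    return tuple(keys) if len(keys) > 1 else keys[0]
```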
[jira] Commented: (PIG-1110) Handle compressed file formats -- Gz, BZip with the new proposal
[ https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789408#action_12789408 ] Jeff Zhang commented on PIG-1110:
---
Hi Richard, I checked your code and found that Pig uses the output file's extension to determine whether the output should be compressed or uncompressed (the code in trunk also does this). I do not think this method is good enough, because it forces users to add .bz2 as the extension of the output file. My suggestion is as follows: add a new constructor to PigStorage, e.g. PigStorage(String delimiter, String extension), where the extension indicates what file format the user wants to store.

Handle compressed file formats -- Gz, BZip with the new proposal
Key: PIG-1110
URL: https://issues.apache.org/jira/browse/PIG-1110
Project: Pig
Issue Type: Sub-task
Reporter: Richard Ding
Assignee: Richard Ding
Attachments: PIG-1110.patch
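Jeff's point can be illustrated with a tiny sketch (Python; a hypothetical helper, not PigStorage's actual logic): the trunk behavior infers the codec purely from the output path's extension, while the proposed constructor argument would let the caller state the desired format explicitly.

```python
# Hypothetical codec-selection sketch. With no explicit format, the
# codec is inferred from the path's extension (trunk behavior); the
# 'explicit' parameter models the proposed
# PigStorage(String delimiter, String extension) constructor.
def codec_for(path, explicit=None):
    if explicit is not None:
        ext = explicit
    elif "." in path:
        ext = path.rsplit(".", 1)[-1]
    else:
        ext = ""
    # Map known extensions to codecs; anything else is uncompressed.
    return {"bz2": "bzip2", "gz": "gzip"}.get(ext, "none")
```

With the explicit argument, a user could write bzip2 output without renaming every output path to end in .bz2, which is the ergonomic problem the comment raises.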
[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789411#action_12789411 ] Yan Zhou commented on PIG-1145:
---
All the failed test cases are Pig tests, and the failures look environmental. I reran the first failed test, TestJoin, on my local cluster, and it passes cleanly.
[jira] Commented: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789425#action_12789425 ] Chao Wang commented on PIG-1145:
---
The patch looks good. +1.
[jira] Updated: (PIG-1142) Got NullPointerException merge join with pruning
[ https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1142:
---
Attachment: PIG-1142-3.patch
No code change in PIG-1142-3.patch, only additional test cases.
[jira] Updated: (PIG-1145) [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
[ https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1145:
---
Resolution: Fixed
Status: Resolved (was: Patch Available)
Committed to Apache trunk and the 0.6 branch.
[jira] Commented: (PIG-1142) Got NullPointerException merge join with pruning
[ https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789440#action_12789440 ] Alan Gates commented on PIG-1142:
---
The additional test cases look good. +1
[jira] Updated: (PIG-1142) Got NullPointerException merge join with pruning
[ https://issues.apache.org/jira/browse/PIG-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1142: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) New patch committed to both trunk and 0.6 branch. Got NullPointerException merge join with pruning Key: PIG-1142 URL: https://issues.apache.org/jira/browse/PIG-1142 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Jing Huang Assignee: Daniel Dai Fix For: 0.6.0 Attachments: PIG-1142-1.patch, PIG-1142-2.patch, PIG-1142-3.patch Here is my pig script: register $zebraJar; --fs -rmr $outputDir a1 = LOAD '$inputDir/small1' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2'); a2 = LOAD '$inputDir/small2' USING org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2'); sort1 = order a1 by str2; sort2 = order a2 by str2; --store sort1 into '$outputDir/smallsorted11' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]'); --store sort2 into '$outputDir/smallsorted21' using org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]'); rec1 = load '$outputDir/smallsorted11' using org.apache.hadoop.zebra.pig.TableLoader(); rec2 = load '$outputDir/smallsorted21' using org.apache.hadoop.zebra.pig.TableLoader(); joina = join rec1 by str2, rec2 by str2 using merge ; E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2; --limitedVals = LIMIT E 5; --dump limitedVals; store E into '$outputDir/smalljoin2' using org.apache.hadoop.zebra.pig.TableStorer(''); Here is the stacktrace: java.lang.NullPointerException at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:312) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.extractKeysFromTuple(POMergeJoin.java:464) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:341) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1147) Zebra Docs for Pig 0.6.0
Zebra Docs for Pig 0.6.0 Key: PIG-1147 URL: https://issues.apache.org/jira/browse/PIG-1147 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Zebra docs for Pig 0.6.0 (1) XML files UPDATE: site.xml - updated to include Zebra items in Pig menu NEW: zebra_mapreduce.xml zebra_overview.xml zebra_pig.xml zebra_reference.xml zebra_stream.xml zebra_users.xml. (2) IMAGE file zebra.jpg -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1147) Zebra Docs for Pig 0.6.0
[ https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corinne Chandel updated PIG-1147: - Attachment: zebra.jpg Zebra image file. (1) Add this file to TRUNK C:\__Pig\Trunk\src\docs\src\documentation\content\xdocs\images (2) Add this file to branch-0.6 http://svn.apache.org/repos/asf/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/images/ Zebra Docs for Pig 0.6.0 Key: PIG-1147 URL: https://issues.apache.org/jira/browse/PIG-1147 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Attachments: zebra.jpg Zebra docs for Pig 0.6.0 (1) XML files UPDATE: site.xml - updated to include Zebra items in Pig menu NEW: zebra_mapreduce.xml zebra_overview.xml zebra_pig.xml zebra_reference.xml zebra_stream.xml zebra_users.xml. (2) IMAGE file zebra.jpg -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1147) Zebra Docs for Pig 0.6.0
[ https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corinne Chandel updated PIG-1147: - Attachment: Zebra.patch Patch file (1) Apply this patch to the TRUNK C:\__Pig\Trunk\src\docs\src\documentation\content\xdocs (2) Apply this patch to the branch-0.6 http://svn.apache.org/repos/asf/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/ NOTE: No new test code required; changes to documentation only. Zebra Docs for Pig 0.6.0 Key: PIG-1147 URL: https://issues.apache.org/jira/browse/PIG-1147 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Attachments: zebra.jpg, Zebra.patch Zebra docs for Pig 0.6.0 (1) XML files UPDATE: site.xml - updated to include Zebra items in Pig menu NEW: zebra_mapreduce.xml zebra_overview.xml zebra_pig.xml zebra_reference.xml zebra_stream.xml zebra_users.xml. (2) IMAGE file zebra.jpg -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1147) Zebra Docs for Pig 0.6.0
[ https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corinne Chandel updated PIG-1147: - Status: Patch Available (was: Open) (1) Add Zebra image file to Pig TRUNK and branch-0.6 (2) Apply Zebra patch to Pig TRUNK and branch-0.6 Note: No new test code required; changes to documentation only. Zebra Docs for Pig 0.6.0 Key: PIG-1147 URL: https://issues.apache.org/jira/browse/PIG-1147 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Attachments: zebra.jpg, Zebra.patch Zebra docs for Pig 0.6.0 (1) XML files UPDATE: site.xml - updated to include Zebra items in Pig menu NEW: zebra_mapreduce.xml zebra_overview.xml zebra_pig.xml zebra_reference.xml zebra_stream.xml zebra_users.xml. (2) IMAGE file zebra.jpg -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1144: Attachment: PIG-1144-3.patch Changed the patch to take mapred.reduce.tasks into account. The hierarchy for determining the parallelism is:
1. PARALLEL keyword
2. default_parallel
3. mapred.reduce.tasks system property
4. default value: 1
set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.7.0 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch Hi all, I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100 . I modified MRPrinter.java to print out the parallelism:
{code}
...
public void visitMROp(MapReduceOper mr) {
    mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism());
}
...
{code}
When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single reducer job. This can be corrected by adding the PARALLEL keyword in front of the ORDER BY. Attaching the script and the explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
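The four-level precedence described in the comment above can be sketched as a small resolver. This is a hypothetical helper, not the actual PIG-1144-3.patch code; it assumes a non-positive value means "not set" at that level:

```java
// Hedged sketch of the reducer-parallelism precedence from the comment above.
final class ParallelismResolver {
    static int resolve(int requestedParallel,   // 1. PARALLEL keyword on the operator
                       int defaultParallel,     // 2. set default_parallel
                       int mapredReduceTasks) { // 3. mapred.reduce.tasks property
        if (requestedParallel > 0) return requestedParallel;
        if (defaultParallel > 0) return defaultParallel;
        if (mapredReduceTasks > 0) return mapredReduceTasks;
        return 1;                               // 4. fall back to a single reducer
    }
}
```

Under this scheme an explicit PARALLEL always wins, and mapred.reduce.tasks only takes effect when neither Pig-level setting is present.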
[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1144: Status: Open (was: Patch Available) set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.7.0 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch Hi all, I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100 . I modified MRPrinter.java to print out the parallelism:
{code}
...
public void visitMROp(MapReduceOper mr) {
    mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism());
}
...
{code}
When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single reducer job. This can be corrected by adding the PARALLEL keyword in front of the ORDER BY. Attaching the script and the explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1144: Status: Patch Available (was: Open) set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.7.0 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch Hi all, I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100 . I modified MRPrinter.java to print out the parallelism:
{code}
...
public void visitMROp(MapReduceOper mr) {
    mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism());
}
...
{code}
When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single reducer job. This can be corrected by adding the PARALLEL keyword in front of the ORDER BY. Attaching the script and the explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1086) Nested sort by * throw exception
[ https://issues.apache.org/jira/browse/PIG-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1086: Resolution: Fixed Fix Version/s: 0.7.0 Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Patch committed. Thanks Richard! Nested sort by * throw exception Key: PIG-1086 URL: https://issues.apache.org/jira/browse/PIG-1086 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Daniel Dai Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1086.patch The following script fail: A = load '1.txt' as (a0, a1, a2); B = group A by a0; C = foreach B { D = order A by *; generate group, D;}; explain C; Here is the stack: Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 at java.util.ArrayList.get(ArrayList.java:324) at org.apache.pig.impl.logicalLayer.schema.Schema.getField(Schema.java:752) at org.apache.pig.impl.logicalLayer.LOSort.getSortInfo(LOSort.java:332) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1365) at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:176) at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:43) at org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:69) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1274) at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:130) at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:45) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:234) at org.apache.pig.PigServer.compilePp(PigServer.java:864) at org.apache.pig.PigServer.explain(PigServer.java:583) ... 
8 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1147) Zebra Docs for Pig 0.6.0
[ https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1147: Resolution: Fixed Status: Resolved (was: Patch Available) Patch committed to both trunk and 0.6 branch. Thanks Corinne! Zebra Docs for Pig 0.6.0 Key: PIG-1147 URL: https://issues.apache.org/jira/browse/PIG-1147 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Attachments: zebra.jpg, Zebra.patch Zebra docs for Pig 0.6.0 (1) XML files UPDATE: site.xml - updated to include Zebra items in Pig menu NEW: zebra_mapreduce.xml zebra_overview.xml zebra_pig.xml zebra_reference.xml zebra_stream.xml zebra_users.xml. (2) IMAGE file zebra.jpg -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1106) FR join should not spill
[ https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789623#action_12789623 ] Hadoop QA commented on PIG-1106: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427716/frjoin-nonspill.patch against trunk revision 889346. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/118/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/118/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/118/console This message is automatically generated. FR join should not spill Key: PIG-1106 URL: https://issues.apache.org/jira/browse/PIG-1106 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Ankit Modi Fix For: 0.7.0 Attachments: frjoin-nonspill.patch Currently, the values for the replicated side of the data are placed in a spillable bag (POFRJoin near line 275). This does not make sense because the whole point of the optimization is that the data on one side fits into memory. We already have a non-spillable bag implemented (NonSpillableDataBag.java) and we need to change FRJoin code to use it. 
And of course we need to do lots of testing to make sure that we don't spill but die instead when we run out of memory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
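The design point above can be made concrete with a minimal sketch. This uses a hypothetical class, not Pig's actual NonSpillableDataBag, to show why a plain heap-backed bag is the right container for the replicated side of an FR join: the optimization only applies when that side fits in memory, so spill machinery is pure overhead, and running out of memory is the intended failure mode when the assumption is violated.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hedged sketch of a non-spillable, purely in-memory bag (hypothetical type).
final class InMemoryBag<T> implements Iterable<T> {
    private final List<T> tuples = new ArrayList<>(); // held entirely on the heap

    // Never spills to disk; if the replicated input does not fit,
    // an OutOfMemoryError is the expected outcome.
    void add(T tuple) { tuples.add(tuple); }
    long size() { return tuples.size(); }
    @Override public Iterator<T> iterator() { return tuples.iterator(); }
}
```

A spillable bag, by contrast, pays memory-accounting and possible disk-I/O costs on every add, which defeats the purpose of the replicated-join optimization.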
[jira] Commented: (PIG-1106) FR join should not spill
[ https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789624#action_12789624 ] Olga Natkovich commented on PIG-1106: - Test failures are not due to this patch. Also, I don't believe it is easy to test with an automatic test but I believe Ankit tested it manually. I will review the code and run test-commit + FRJoin tests before committing the patch. FR join should not spill Key: PIG-1106 URL: https://issues.apache.org/jira/browse/PIG-1106 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Ankit Modi Fix For: 0.7.0 Attachments: frjoin-nonspill.patch Currently, the values for the replicated side of the data are placed in a spillable bag (POFRJoin near line 275). This does not make sense because the whole point of the optimization is that the data on one side fits into memory. We already have a non-spillable bag implemented (NonSpillableDataBag.java) and we need to change FRJoin code to use it. And of course need to do lots of testing to make sure that we don't spill but die instead when we run out of memory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1085) Pass JobConf and UDF specific configuration information to UDFs
[ https://issues.apache.org/jira/browse/PIG-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1085: Fix Version/s: 0.6.0 Pass JobConf and UDF specific configuration information to UDFs --- Key: PIG-1085 URL: https://issues.apache.org/jira/browse/PIG-1085 Project: Pig Issue Type: New Feature Components: impl Reporter: Alan Gates Assignee: Alan Gates Fix For: 0.6.0 Attachments: udfconf-2.patch, udfconf.patch Users have long asked for a way to get the JobConf structure in their UDFs. It would also be nice to have a way to pass properties between the front end and back end so that UDFs can store state during parse time and use it at runtime. This patch does part of what is proposed in PIG-602, but not all of it. It does not provide a way to give user specified configuration files to UDFs. So I will mark 602 as depending on this bug, but it isn't a duplicate. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
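The front-end/back-end handoff the feature describes can be sketched as follows. This is a hypothetical helper, not the API added by udfconf-2.patch: a UDF records state into a per-UDF Properties bag at parse time on the front end, the framework ships that bag to the backend (e.g. via the job configuration), and the UDF reads it back at runtime.

```java
import java.util.Properties;

// Hedged sketch (hypothetical singleton) of passing UDF state from the
// front end to the back end via a Properties bag.
final class UdfPropertyStore {
    private static final Properties PROPS = new Properties();

    // Front end, parse time: stash state for later.
    static void put(String key, String value) { PROPS.setProperty(key, value); }

    // Back end, runtime: read it back. In the real system the bag would have
    // been serialized into the job configuration in between.
    static String get(String key) { return PROPS.getProperty(key); }
}
```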
[jira] Commented: (PIG-1147) Zebra Docs for Pig 0.6.0
[ https://issues.apache.org/jira/browse/PIG-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789641#action_12789641 ] Hadoop QA commented on PIG-1147: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427764/Zebra.patch against trunk revision 889870. +1 @author. The patch does not contain any @author tags. +0 tests included. The patch appears to be a documentation patch that doesn't require tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/119/console This message is automatically generated. Zebra Docs for Pig 0.6.0 Key: PIG-1147 URL: https://issues.apache.org/jira/browse/PIG-1147 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Attachments: zebra.jpg, Zebra.patch Zebra docs for Pig 0.6.0 (1) XML files UPDATE: site.xml - updated to include Zebra items in Pig menu NEW: zebra_mapreduce.xml zebra_overview.xml zebra_pig.xml zebra_reference.xml zebra_stream.xml zebra_users.xm. (2) IMAGE file zebra.jpg -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Build failed in Hudson: Pig-trunk #645
See http://hudson.zones.apache.org/hudson/job/Pig-trunk/645/changes
Changes:
[daijy] PIG-1142: Got NullPointerException merge join with pruning
[yanz] PIG-1145: Merge Join on Large Table throws an EOF exception (yanz)
--
[...truncated 228593 lines...]
[junit] (remaining output: repetitive test-cluster log lines, HDFS block allocation, DataNode block transfers, and PacketResponder messages, omitted)
[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789668#action_12789668 ] Thejas M Nair commented on PIG-965: --- Review comments:
* The regex will always be on the rhs, so we don't need the code/classes that try to determine which side has the regular expression based on which side has the constant.
* In determineBestRegexMethod, we need to add (? to the list of regex strings not supported in dk.brics (in javaRegexOnly). It has special meanings in java regex, which is not honored by dk.brics.
* In determineBestRegexMethod, we are dealing with cases like \d (choose java regex) and \\d (choose dk.brics), but not \\\d (which should choose java regex); i.e. we need to go back until we find a non-'\' char.
* In RegexInit.compile(..), the following messages are more appropriate at debug level, not at info. At info level, they might also confuse the user.
+log.info("Got an IllegalArgumentException for Pattern: " + pattern);
+log.info(e.getMessage());
+log.info("Switching to java.util.regex");
* The following comment in PORegex.java seems to be out of place: // This is a BinaryComparisonOperator hence there can only be two inputs
PERFORMANCE: optimize common case in matches (PORegex) -- Key: PIG-965 URL: https://issues.apache.org/jira/browse/PIG-965 Project: Pig Issue Type: Improvement Components: impl Reporter: Thejas M Nair Assignee: Ankit Modi Attachments: automaton.jar, poregex2.patch Some frequently seen use cases of the 'matches' comparison operator have the following properties:
1. The rhs is a constant string, e.g. c1 matches 'abc%'.
2. Regexes that look for a matching prefix, suffix etc. are very common, e.g. 'abc%', '%abc', '%abc%'.
To optimize for these common cases, PORegex.java can be changed to:
1. Compile the pattern (rhs of matches) and re-use it if the pattern string has not changed.
2. Use string comparisons for the simple common regexes (in 2 above).
The implementation of Hive's like clause uses similar optimizations. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
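The two optimizations proposed above can be sketched together. This is a hypothetical helper, not the actual poregex2.patch code: constant patterns of the form 'abc%', '%abc', '%abc%' are answered with plain String comparisons, and anything else falls back to a Pattern compiled once and reused for every input tuple.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hedged sketch of the constant-pattern fast path described in PIG-965.
final class LikeMatcher {
    static Predicate<String> compile(String pattern) {
        boolean lead = pattern.startsWith("%");
        boolean trail = pattern.endsWith("%") && pattern.length() > 1;
        String core = pattern.substring(lead ? 1 : 0,
                                        pattern.length() - (trail ? 1 : 0));
        if (!core.contains("%")) {                             // no inner wildcards
            if (lead && trail) return s -> s.contains(core);   // '%abc%'
            if (trail)         return s -> s.startsWith(core); // 'abc%'
            if (lead)          return s -> s.endsWith(core);   // '%abc'
        }
        // General case: compile once, reuse across all input strings.
        Pattern p = Pattern.compile(pattern.replace("%", ".*"));
        return s -> p.matcher(s).matches();
    }
}
```

The fast paths avoid regex-engine overhead entirely for the common prefix/suffix/substring cases, while the cached Pattern removes the per-tuple recompilation cost for everything else.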