[jira] Commented: (PIG-1166) A bit change of the interface of Tuple DataBag ( make the set and append method return this)
[ https://issues.apache.org/jira/browse/PIG-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793118#action_12793118 ]

Jeff Zhang commented on PIG-1166:

A better example to illustrate my idea is building a DataBag. The current method:

{code}
BagFactory BAGFACTORY = BagFactory.getInstance();
TupleFactory TUPLEFACTORY = TupleFactory.getInstance();
DataBag bag = BAGFACTORY.newDefaultBag();
Tuple tuple_1 = TUPLEFACTORY.newTuple(1);
tuple_1.set(0, item_1);
bag.add(tuple_1);
Tuple tuple_2 = TUPLEFACTORY.newTuple(1);
tuple_2.set(0, item_2);
bag.add(tuple_2);
{code}

If we change the interface, we can write the code as follows:

{code}
BagFactory BAGFACTORY = BagFactory.getInstance();
TupleFactory TUPLEFACTORY = TupleFactory.getInstance();
DataBag bag = BAGFACTORY.newDefaultBag();
bag.add(TUPLEFACTORY.newTuple(1).set(0,item_1)).add(TUPLEFACTORY.newTuple(1).set(0,item_2));
{code}

The second snippet is more readable and concise in my opinion.

A bit change of the interface of Tuple DataBag ( make the set and append method return this)
Key: PIG-1166
URL: https://issues.apache.org/jira/browse/PIG-1166
Project: Pig
Issue Type: Improvement
Reporter: Jeff Zhang
Priority: Minor

When people write unit tests for UDFs, they always need to build a tuple or bag. If we change the interface of Tuple and DataBag so that the set and append methods return this, it can decrease the code size. For example, people currently have to write the following code to build a Tuple:

{code}
Tuple tuple=TupleFactory.getInstance().newTuple(3);
tuple.set(0,item_0);
tuple.set(1,item_1);
tuple.set(2,item_2);
{code}

If we change the interface so that the set and append methods return this, we can rewrite the above code like this:

{code}
Tuple tuple=TupleFactory.getInstance().newTuple(3);
tuple.set(0,item_0).set(1,item_1).set(2,item_2);
{code}

This interface change won't cause a backward-compatibility problem, and I think there is no performance problem either.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
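The chaining idea proposed in PIG-1166 can be sketched in plain Java as follows. This is only an illustration: FluentTuple and FluentBag are hypothetical stand-ins invented for this sketch, not Pig's Tuple or DataBag interfaces, and the real classes may behave differently.

```java
import java.util.ArrayList;
import java.util.List;

public class FluentTupleSketch {

    /** Hypothetical stand-in for a tuple; NOT Pig's Tuple interface. */
    public static final class FluentTuple {
        private final Object[] fields;

        public FluentTuple(int size) {
            this.fields = new Object[size];
        }

        /** Stores the value and returns this, so that calls can be chained. */
        public FluentTuple set(int index, Object value) {
            fields[index] = value;
            return this;
        }

        public Object get(int index) {
            return fields[index];
        }
    }

    /** Hypothetical stand-in for a bag; NOT Pig's DataBag interface. */
    public static final class FluentBag {
        private final List<FluentTuple> tuples = new ArrayList<>();

        /** Adds the tuple and returns this, so that calls can be chained. */
        public FluentBag add(FluentTuple tuple) {
            tuples.add(tuple);
            return this;
        }

        public int size() {
            return tuples.size();
        }

        public FluentTuple get(int index) {
            return tuples.get(index);
        }
    }

    public static void main(String[] args) {
        // One expression builds the tuple instead of one statement per field.
        FluentTuple tuple = new FluentTuple(3).set(0, "a").set(1, "b").set(2, "c");

        // The same pattern lets a bag be filled in a single expression.
        FluentBag bag = new FluentBag()
                .add(new FluentTuple(1).set(0, "x"))
                .add(new FluentTuple(1).set(0, "y"));

        System.out.println(tuple.get(2) + " " + bag.size());
    }
}
```

Returning this costs nothing at the call site for code that ignores the return value, which is presumably what the proposal relies on for existing callers.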
[jira] Commented: (PIG-919) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group
[ https://issues.apache.org/jira/browse/PIG-919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793121#action_12793121 ] George Mavromatis commented on PIG-919: --- This was closed Fixed when it should have been closed Won't Fix or Later. Can we then resolve it with the correct resolution (Won't Fix or Later)? Where are you seeing this error? I am seeing it in product code that I cannot refer to here. It has occurred twice, one instance of which was referred to this ticket and closed. I will send you more information offline. Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group -- Key: PIG-919 URL: https://issues.apache.org/jira/browse/PIG-919 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Viraj Bhat Assignee: Viraj Bhat Fix For: 0.3.0 Attachments: GenHashList.java, mapscript.pig, mymapudf.jar I have a Pig script, which takes in a student file and generates a bag of maps. I later want to group on the value of the key name0 which corresponds to the first name of the student. 
{code}
register mymapudf.jar;
data = LOAD '/user/viraj/studenttab10k' AS (somename:chararray,age:long,marks:float);
genmap = foreach data generate flatten(mymapudf.GenHashList(somename,' ')) as bp:map[], age, marks;
getfirstnames = foreach genmap generate bp#'name0' as firstname, age, marks;
filternonnullfirstnames = filter getfirstnames by firstname is not null;
groupgenmap = group filternonnullfirstnames by firstname;
dump groupgenmap;
{code}

When I execute this code, I get an error in the Map Phase:
===
java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:242)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
===
[jira] Commented: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs
[ https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793269#action_12793269 ]

Thejas M Nair commented on PIG-1149:

+1 to the lsr branch version. However, the FIXME comment in the test case is not correct: there does not have to be one sample per map if the number of rows is very small. Although this behavior differs from the earlier trunk version of the Poisson sampler, it satisfies the requirements in http://wiki.apache.org/pig/PigSampler and PIG-1062. I can remove the FIXME comment as part of the patch I am going to submit to fix the other test case.

Allow instantiation of SampleLoaders with parametrized LoadFuncs
Key: PIG-1149
URL: https://issues.apache.org/jira/browse/PIG-1149
Project: Pig
Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
Fix For: 0.7.0
Attachments: pig_1149.patch, pig_1149_lsr-branch.patch

Currently, it is not possible to instantiate a SampleLoader with something like PigStorage(':'). We should allow passing parameters to the loaders being sampled.
[jira] Resolved: (PIG-1110) Handle compressed file formats -- Gz, BZip with the new proposal
[ https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath resolved PIG-1110. - Resolution: Fixed Hadoop Flags: [Incompatible change, Reviewed] (was: [Incompatible change]) +1, Patch committed to load-store-redesign branch - thanks Richard! Handle compressed file formats -- Gz, BZip with the new proposal Key: PIG-1110 URL: https://issues.apache.org/jira/browse/PIG-1110 Project: Pig Issue Type: Sub-task Reporter: Richard Ding Assignee: Richard Ding Attachments: PIG-1110.patch, PIG-1110.patch, PIG_1110_Jeff.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1094) Fix unit tests corresponding to source changes so far
[ https://issues.apache.org/jira/browse/PIG-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793286#action_12793286 ]

Pradeep Kamath commented on PIG-1094:

+1, PIG-1094_4.patch checked in, thanks Richard!

Fix unit tests corresponding to source changes so far
Key: PIG-1094
URL: https://issues.apache.org/jira/browse/PIG-1094
Project: Pig
Issue Type: Sub-task
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Attachments: PIG-1094.patch, PIG-1094_2.patch, PIG-1094_3.patch, PIG-1094_4.patch

The check-ins so far on the load-store-redesign branch have not addressed unit test failures due to interface changes. This jira is to track the task of making the common-case unit tests work with the new interfaces. Some aspects of the new proposal, like using the LoadCaster interface for casting and making local mode work, have not been completed yet. Tests that are failing for those reasons will not be fixed in this jira; they will be addressed in the jiras corresponding to those tasks.
[jira] Updated: (PIG-1165) Signature of loader does not set correctly for order by
[ https://issues.apache.org/jira/browse/PIG-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1165:
Attachment: PIG-1165-1.patch

Signature of loader does not set correctly for order by
Key: PIG-1165
URL: https://issues.apache.org/jira/browse/PIG-1165
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.6.0
Attachments: PIG-1165-1.patch

In Pig, we need to set a signature for each LoadFunc. Currently, we use the alias of the LOAD statement in the Pig script as the signature of the LoadFunc. One use case is that, in LoadFunc, we use the signature to retrieve the pruned columns of each specific loader. However, in an order by statement, we do not set the signature for the loader correctly, so we do not prune the loader correctly. For example, the following script produces a wrong result:

{code}
a = load '1.txt' as (a0, a1);
b = order a by a1;
c = order b by a1;
d = foreach c generate a1;
dump d;
{code}

1.txt:
{code}
1 a
2 b
3 c
6 d
5 e
{code}

expected result:
a
b
c
d
e

current result:
1
2
3
5
6
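The expected result of the PIG-1165 script can be reproduced with a small plain-Java sketch that sorts the sample rows on one column and then projects one column. The class and method names are hypothetical and nothing here uses Pig itself. The sketch also shows that the reported wrong output equals what one gets from sorting and projecting a0 instead of a1; that is only an observation about the data, not a statement about where the bug sits in Pig.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class OrderByProjectionSketch {

    /** The five rows of 1.txt from the report, as (a0, a1) pairs. */
    public static List<String[]> sampleRows() {
        return Arrays.asList(
                new String[] {"1", "a"},
                new String[] {"2", "b"},
                new String[] {"3", "c"},
                new String[] {"6", "d"},
                new String[] {"5", "e"});
    }

    /** Sorts the rows on sortColumn, then keeps only projectColumn. */
    public static List<String> orderThenProject(List<String[]> rows, int sortColumn, int projectColumn) {
        List<String[]> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing((String[] row) -> row[sortColumn]));
        List<String> projected = new ArrayList<>();
        for (String[] row : sorted) {
            projected.add(row[projectColumn]);
        }
        return projected;
    }

    public static void main(String[] args) {
        // What the script asks for: order by a1, generate a1.
        System.out.println(orderThenProject(sampleRows(), 1, 1));
        // What the report shows: values of a0 in sorted order.
        System.out.println(orderThenProject(sampleRows(), 0, 0));
    }
}
```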
[jira] Updated: (PIG-1165) Signature of loader does not set correctly for order by
[ https://issues.apache.org/jira/browse/PIG-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1165: Status: Patch Available (was: Open)
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793296#action_12793296 ]

Olga Natkovich commented on PIG-1143:

+1 on the code changes. There is an extra debug trace in the code that I will remove as part of the commit.

Poisson Sample Loader should compute the number of samples required only once
Key: PIG-1143
URL: https://issues.apache.org/jira/browse/PIG-1143
Project: Pig
Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
Attachments: PIG_1143.patch, PIG_1143.patch.1

The current Poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once, and this information should be shared with the remaining mappers.
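The "compute once, then share" idea from the PIG-1143 description can be sketched as below. This is a minimal illustration under stated assumptions: a plain Map stands in for whatever shared state the actual patch uses, and the key name and method are hypothetical. It does not reflect how the committed patch passes the value to the mappers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

public class SampleCountSketch {

    /** Hypothetical key under which the computed count is shared. */
    public static final String SAMPLE_COUNT_KEY = "sketch.sample.count";

    /**
     * Returns the sample count, running the expensive computation only when
     * no earlier caller has already stored a value in the shared state.
     */
    public static long sampleCount(Map<String, String> sharedConf, LongSupplier expensiveComputation) {
        String cached = sharedConf.get(SAMPLE_COUNT_KEY);
        if (cached != null) {
            return Long.parseLong(cached);
        }
        long computed = expensiveComputation.getAsLong();
        sharedConf.put(SAMPLE_COUNT_KEY, Long.toString(computed));
        return computed;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        long first = sampleCount(conf, () -> 42L);   // computes and stores
        long second = sampleCount(conf, () -> 99L);  // reuses the stored value
        System.out.println(first + " " + second);
    }
}
```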
Locking trunk for commits to merge on load-store-redesign branch
Hi, PIG-1143 and PIG-1149 need special handling on the load-store-redesign branch. PIG-1143 should not be applied to the branch since the code is not applicable and for PIG-1149 there is a separate patch. I am beginning a merge of load-store-redesign branch with current head of trunk. These two patches will be committed to trunk once I complete the merge and in the svn commit message for the merge on load-store-redesign branch, I will record the revision after these two patches are committed. This is so that the next time we merge from trunk to load-store-redesign branch we merge from the point after these patches. This mail is to let all committers know that they should hold off commits till this process is done at which point I will send an all clear email. Thanks, Pradeep
[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1143: Fix Version/s: 0.6.0
[jira] Commented: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs
[ https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793321#action_12793321 ] Olga Natkovich commented on PIG-1149: patch pig_1149.patch is committed to the trunk.
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793322#action_12793322 ] Olga Natkovich commented on PIG-1143: patch committed to the trunk. Will commit to 0.6 branch tomorrow.
DONE - trunk open for commits RE: Locking trunk for commits to merge on load-store-redesign branch
Hi, The process outlined below is now completed and the trunk is open for commits which do not conflict with load-store-redesign branch. Thanks, Pradeep -Original Message- From: Pradeep Kamath [mailto:prade...@yahoo-inc.com] Sent: Monday, December 21, 2009 10:50 AM To: pig-dev@hadoop.apache.org Subject: Locking trunk for commits to merge on load-store-redesign branch Hi, PIG-1143 and PIG-1149 need special handling on the load-store-redesign branch. PIG-1143 should not be applied to the branch since the code is not applicable and for PIG-1149 there is a separate patch. I am beginning a merge of load-store-redesign branch with current head of trunk. These two patches will be committed to trunk once I complete the merge and in the svn commit message for the merge on load-store-redesign branch, I will record the revision after these two patches are committed. This is so that the next time we merge from trunk to load-store-redesign branch we merge from the point after these patches. This mail is to let all committers know that they should hold off commits till this process is done at which point I will send an all clear email. Thanks, Pradeep
[jira] Updated: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs
[ https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-1149: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Patch for lsr branch also committed, thanks Dmitriy!
[jira] Updated: (PIG-1158) pig command line -M option doesn't support table union correctly (comma seperated paths)
[ https://issues.apache.org/jira/browse/PIG-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1158:
Resolution: Fixed
Status: Resolved (was: Patch Available)

patch committed to the trunk. Thanks, Richard

pig command line -M option doesn't support table union correctly (comma seperated paths)
Key: PIG-1158
URL: https://issues.apache.org/jira/browse/PIG-1158
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Richard Ding
Fix For: 0.7.0
Attachments: PIG-1158.patch

For example, with load (1.txt,2.txt) USING org.apache.hadoop.zebra.pig.TableLoader() I see this error on standard output:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: hdfs://gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/1.txt,2.txt does not exist.
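The error above shows the whole string "1.txt,2.txt" being resolved as a single path. A minimal sketch of the alternative, splitting the location on commas and resolving each piece separately, is given below. The class, method, and host name are hypothetical, and the sketch does not claim to match the committed patch.

```java
import java.util.ArrayList;
import java.util.List;

public class CommaPathSketch {

    /**
     * Splits a comma-separated location string and resolves every relative
     * entry against the given base directory. Entries that already carry a
     * scheme or start with '/' are left untouched.
     */
    public static List<String> qualifyEach(String baseDir, String locations) {
        List<String> qualified = new ArrayList<>();
        for (String part : locations.split(",")) {
            String path = part.trim();
            if (path.isEmpty()) {
                continue; // ignore empty entries such as a trailing comma
            }
            boolean absolute = path.startsWith("/") || path.contains("://");
            qualified.add(absolute ? path : baseDir + "/" + path);
        }
        return qualified;
    }

    public static void main(String[] args) {
        // "namenode.example" is a placeholder host, not one from the report.
        System.out.println(qualifyEach("hdfs://namenode.example/user/hadoopqa", "1.txt,2.txt"));
    }
}
```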
[jira] Updated: (PIG-1141) Make streaming work with the new load-store interfaces
[ https://issues.apache.org/jira/browse/PIG-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1141: -- Attachment: PIG-1141.patch This patch made changes following Alan's comments. Make streaming work with the new load-store interfaces --- Key: PIG-1141 URL: https://issues.apache.org/jira/browse/PIG-1141 Project: Pig Issue Type: Sub-task Reporter: Richard Ding Assignee: Richard Ding Attachments: PIG-1141.patch, PIG-1141.patch, PIG-1141.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1166) A bit change of the interface of Tuple DataBag ( make the set and append method return this)
[ https://issues.apache.org/jira/browse/PIG-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793350#action_12793350 ] Dmitriy V. Ryaboy commented on PIG-1166: +1 to the idea, and we don't have to stop at Tuple and Bag factories. There are plenty of other places that this can be useful in (like all of the Logical and Physical operators).
[jira] Updated: (PIG-1159) merge join right side table does not support comma seperated paths
[ https://issues.apache.org/jira/browse/PIG-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1159:
Resolution: Fixed
Status: Resolved (was: Patch Available)

patch committed. Thanks, Richard

merge join right side table does not support comma seperated paths
Key: PIG-1159
URL: https://issues.apache.org/jira/browse/PIG-1159
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Richard Ding
Fix For: 0.7.0
Attachments: PIG-1159.patch

For example this is my script (join_jira1.pig):

register /grid/0/dev/hadoopqa/jars/zebra.jar;
--a1 = load '1.txt' as (a:int, b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
--a2 = load '2.txt' as (a:int, b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
--sort1 = order a1 by a parallel 6;
--sort2 = order a2 by a parallel 5;
--store sort1 into 'asort1' using org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
--store sort2 into 'asort2' using org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
--store sort1 into 'asort3' using org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
--store sort2 into 'asort4' using org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
joinl = LOAD 'asort1,asort2' USING org.apache.hadoop.zebra.pig.TableLoader('a,b,c,d', 'sorted');
joinr = LOAD 'asort3,asort4' USING org.apache.hadoop.zebra.pig.TableLoader('a,b,c,d', 'sorted');
joina = join joinl by a, joinr by a using merge ;
dump joina;

==
here is the log:

Backend error message
-
java.lang.IllegalArgumentException: Pathname /user/hadoopqa/asort3,hdfs:/gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/asort4 from hdfs://gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/asort3,hdfs:/gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/asort4 is not a valid DFS filename.
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:158)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:648)
	at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:203)
	at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:131)
	at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:147)
	at org.apache.pig.impl.io.FileLocalizer.fullPath(FileLocalizer.java:534)
	at org.apache.pig.impl.io.FileLocalizer.open(FileLocalizer.java:338)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.seekInRightStream(POMergeJoin.java:398)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:184)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
	at org.apache.hadoop.mapred.Child.main(Child.java:159)

Pig Stack Trace
---
ERROR 6015: During execution, encountered a Hadoop error.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias joina
	at org.apache.pig.PigServer.openIterator(PigServer.java:482)
	at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:539)
	at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
	at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
	at org.apache.pig.Main.main(Main.java:386)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6015: During execution, encountered a Hadoop error.
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:158)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:648)
	at
[jira] Commented: (PIG-1141) Make streaming work with the new load-store interfaces
[ https://issues.apache.org/jira/browse/PIG-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793354#action_12793354 ] Alan Gates commented on PIG-1141: +1, changes look good.
[jira] Commented: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793369#action_12793369 ]

Olga Natkovich commented on PIG-1102:

A few questions/comments on the patch:
(1) I think the count should default to 0, not -1.
(2) Does the increment of the count have to be combined with the warn statement? Does this mean that users will see that many warnings? If so, should we combine this with the spill message we already print?
(3) I thought we discussed having the increment per buffer, not per record, and approximating that based on the buffer size. I did not see the code that does this.
(4) I don't think you correctly separated bags that proactively spill from the bags that are spilled by the memory manager. All the bags created by DefaultBagFactory get registered with SpillableMemoryManager and belong to the second category.

Collect number of spills per job
Key: PIG-1102
URL: https://issues.apache.org/jira/browse/PIG-1102
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
Fix For: 0.7.0
Attachments: PIG_1102.patch

Memory shortage is one of the main performance issues in Pig. Knowing when we spill to disk is useful for understanding query performance and for seeing how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max memory usage, but I am not sure whether this information is easily retrievable. Using Hadoop counters for this would make sense.
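Points (1) and (3) of the review can be read as "start at zero, and count buffers rather than records". One possible interpretation is sketched below: accumulate spilled bytes and divide by a buffer size, rounding up. The class name, the byte-based accounting, and the rounding are all assumptions made for this sketch. It is not the code in PIG_1102.patch and does not use Hadoop counters.

```java
public class SpillCounterSketch {

    private final long bufferSizeBytes;
    private long bytesSpilled = 0; // nothing spilled yet, so the count starts at 0

    public SpillCounterSketch(long bufferSizeBytes) {
        if (bufferSizeBytes <= 0) {
            throw new IllegalArgumentException("bufferSizeBytes must be positive");
        }
        this.bufferSizeBytes = bufferSizeBytes;
    }

    /** Called once per spilled record; only accumulates the record's size. */
    public void recordSpilled(long recordBytes) {
        bytesSpilled += recordBytes;
    }

    /** Approximate number of buffers spilled: spilled bytes / buffer size, rounded up. */
    public long approximateBufferSpills() {
        return (bytesSpilled + bufferSizeBytes - 1) / bufferSizeBytes;
    }

    public static void main(String[] args) {
        SpillCounterSketch counter = new SpillCounterSketch(100);
        for (int i = 0; i < 5; i++) {
            counter.recordSpilled(30); // five 30-byte records, 150 bytes in total
        }
        System.out.println(counter.approximateBufferSpills());
    }
}
```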
[jira] Resolved: (PIG-1141) Make streaming work with the new load-store interfaces
[ https://issues.apache.org/jira/browse/PIG-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath resolved PIG-1141. Resolution: Fixed Hadoop Flags: [Incompatible change, Reviewed] Patch committed to load-store-redesign branch, thanks Richard!
[jira] Commented: (PIG-1167) [zebra] Zebra does not support Hadoop Globs
[ https://issues.apache.org/jira/browse/PIG-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793419#action_12793419 ]

Yan Zhou commented on PIG-1167:

Zebra's TableLoader, the implementation of Pig's LoadFunc, supports globs by design, but the Map/Reduce interface does not. TableLoader just expands the glob and passes the resulting paths to the Map/Reduce interface, so that a union of the underlying Zebra tables is loaded. It looks like test coverage in this area is insufficient.

[zebra] Zebra does not support Hadoop Globs
Key: PIG-1167
URL: https://issues.apache.org/jira/browse/PIG-1167
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Jay Tang

Passing the following path to Zebra causes an error, but it works with Hadoop directly: /projects/FETL/sample/ABF1/{2009120204}
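The "expand the glob, then pass the paths on" step described in the comment can be illustrated for the brace part of a glob. The sketch below is a simplified, hypothetical helper in plain Java: it only handles non-nested {a,b} groups and does not consult a file system, so it is not Zebra's or Hadoop's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BraceGlobSketch {

    /**
     * Expands non-nested brace groups, e.g. "/data/{a,b}/part" becomes
     * "/data/a/part" and "/data/b/part". Patterns without a complete
     * brace group are returned unchanged.
     */
    public static List<String> expandBraces(String pattern) {
        int open = pattern.indexOf('{');
        int close = open < 0 ? -1 : pattern.indexOf('}', open + 1);
        if (open < 0 || close < 0) {
            return Collections.singletonList(pattern);
        }
        String prefix = pattern.substring(0, open);
        String suffix = pattern.substring(close + 1);
        List<String> expanded = new ArrayList<>();
        for (String alternative : pattern.substring(open + 1, close).split(",")) {
            // Recurse so that later brace groups in the suffix are expanded too.
            expanded.addAll(expandBraces(prefix + alternative + suffix));
        }
        return expanded;
    }

    public static void main(String[] args) {
        // The path from the report has a single alternative inside the braces.
        System.out.println(expandBraces("/projects/FETL/sample/ABF1/{2009120204}"));
        System.out.println(expandBraces("/data/{a,b}/part"));
    }
}
```

Each expanded path could then be handed to an interface that accepts only literal paths, which is the division of labor the comment describes.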
[jira] Commented: (PIG-1167) [zebra] Zebra does not support Hadoop Globs
[ https://issues.apache.org/jira/browse/PIG-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793420#action_12793420 ] Olga Natkovich commented on PIG-1167: Just to clarify: Pig does support globs for all loaders that use the default Pig slicer. Zebra, however, uses its own slicer, which is why it does not get this functionality for free.
[jira] Commented: (PIG-1165) Signature of loader does not set correctly for order by
[ https://issues.apache.org/jira/browse/PIG-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793428#action_12793428 ]

Hadoop QA commented on PIG-1165:

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12428644/PIG-1165-1.patch
against trunk revision 892939.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/149/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/149/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/149/console

This message is automatically generated.

Signature of loader does not set correctly for order by

Key: PIG-1165
URL: https://issues.apache.org/jira/browse/PIG-1165
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.6.0
Attachments: PIG-1165-1.patch

In Pig, we need to set a signature for each LoadFunc. Currently, we use the alias of the LOAD statement in the Pig script as the signature of the LoadFunc. One use case is that, inside a LoadFunc, we use the signature to retrieve the pruned columns for that specific loader. However, for an order by statement, we do not set the signature for the loader correctly, so we do not prune the loader correctly. For example, the following script produces the wrong result:

{code}
a = load '1.txt' as (a0, a1);
b = order a by a1;
c = order b by a1;
d = foreach c generate a1;
dump d;
{code}

1.txt:
{code}
1 a
2 b
3 c
6 d
5 e
{code}

expected result:
a
b
c
d
e

current result:
1
2
3
5
6

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
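The failure mode described above can be sketched in plain Java. This is only an illustration under stated assumptions, not Pig's actual implementation: the class `SignatureSketch`, the `KEPT_COLUMNS` registry, and the `load` and `readA1AfterPruning` helpers are all hypothetical names. The sketch assumes that pruning information is stored per signature and that downstream code reads columns by position as if pruning had happened.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SignatureSketch {
    // Hypothetical registry: loader signature -> indexes of the columns to keep.
    static final Map<String, List<Integer>> KEPT_COLUMNS = new HashMap<>();

    // Hypothetical loader: prunes a row using whatever is registered under its signature.
    static List<String> load(String signature, List<String> row) {
        List<Integer> keep = KEPT_COLUMNS.get(signature);
        if (keep == null) {
            return row; // unknown signature: the row comes back unpruned
        }
        List<String> pruned = new ArrayList<>();
        for (int index : keep) {
            pruned.add(row.get(index));
        }
        return pruned;
    }

    // Downstream code assumes pruning happened, so it expects a1 at position 0.
    static String readA1AfterPruning(List<String> loadedRow) {
        return loadedRow.get(0);
    }

    public static void main(String[] args) {
        KEPT_COLUMNS.put("a", Arrays.asList(1)); // only a1 is needed downstream
        List<String> row = Arrays.asList("1", "a");

        // Signature set correctly: the loader keeps a1, so position 0 holds "a".
        System.out.println(readA1AfterPruning(load("a", row)));
        // Signature not set (null here): no pruning, so position 0 holds "1".
        System.out.println(readA1AfterPruning(load(null, row)));
    }
}
```

With the signature missing, the lookup finds no pruning information and the positional read returns a0 instead of a1, which matches the shape of the wrong output reported in the issue (values of a0 where a1 was expected).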
[jira] Commented: (PIG-1165) Signature of loader does not set correctly for order by
[ https://issues.apache.org/jira/browse/PIG-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793434#action_12793434 ]

Olga Natkovich commented on PIG-1165:

+1; changes look good!
[jira] Updated: (PIG-1165) Signature of loader does not set correctly for order by
[ https://issues.apache.org/jira/browse/PIG-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1165:

Resolution: Fixed
Hadoop Flags: [Reviewed]
Status: Resolved (was: Patch Available)

Patch committed to both trunk and 0.6 branch.
[jira] Updated: (PIG-1153) [zebra] spliting columns at different levels in a complex record column into different column groups throws exception
[ https://issues.apache.org/jira/browse/PIG-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1153:

Status: Open (was: Patch Available)

[zebra] spliting columns at different levels in a complex record column into different column groups throws exception

Key: PIG-1153
URL: https://issues.apache.org/jira/browse/PIG-1153
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Xuefu Zhang
Assignee: Yan Zhou
Attachments: PIG-1153.patch, PIG-1153.patch

The following code sample:

String strSch = "r1:record(f1:int, f2:int), r2:record(f5:int, r3:record(f3:float, f4))";
String strStorage = "[r1.f1, r2.r3.f3, r2.f5]; [r1.f2, r2.r3.f4]";
Partition p = new Partition(schema.toString(), strStorage, null);

gives the following exception:

org.apache.hadoop.zebra.parser.ParseException: Different Split Types Set on the same field: r2.f5
[jira] Updated: (PIG-1153) [zebra] spliting columns at different levels in a complex record column into different column groups throws exception
[ https://issues.apache.org/jira/browse/PIG-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1153:

Status: Patch Available (was: Open)
[jira] Updated: (PIG-1164) [zebra]smoke test
[ https://issues.apache.org/jira/browse/PIG-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Huang updated PIG-1164:

Attachment: (was: smoke.patch)

[zebra]smoke test

Key: PIG-1164
URL: https://issues.apache.org/jira/browse/PIG-1164
Project: Pig
Issue Type: Test
Affects Versions: 0.6.0
Reporter: Jing Huang
Fix For: 0.7.0

Change the zebra build.xml file to add a smoke target, and add env.sh and a run script under the zebra/src/test/smoke directory.
[jira] Created: (PIG-1168) Dump produces wrong results
Dump produces wrong results

Key: PIG-1168
URL: https://issues.apache.org/jira/browse/PIG-1168
Project: Pig
Issue Type: Bug
Reporter: Ankur

For a map-only job, dump just re-executes every Pig Latin statement from the beginning, assuming that each statement will produce the same result. That assumption is not valid if UDFs are invoked. Consider the following script:

raw = LOAD '$input' USING PigStorage() AS (text_string:chararray);
DUMP raw;
ccm = FOREACH raw GENERATE MyUDF(text_string);
DUMP ccm;
bug = FOREACH ccm GENERATE ccmObj;
DUMP bug;

The UDF MyUDF generates a tuple in which one of the fields is a randomly generated UUID. So even though one would expect relations 'ccm' and 'bug' to contain identical data, they differ because of the re-execution from the beginning. This breaks the application logic.
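The effect can be reproduced outside Pig with a minimal Java sketch. It assumes nothing about Pig's execution engine: `DumpReexecutionSketch`, `myUdf`, and `computeCcm` are hypothetical stand-ins that only model "recompute from the raw input" versus "reuse an already computed relation" when the function involved returns a random UUID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

public class DumpReexecutionSketch {
    // Stand-in for MyUDF: tags each input with a freshly generated random UUID.
    static String myUdf(String text) {
        return text + ":" + UUID.randomUUID();
    }

    // Recomputes the relation from the raw input, as a re-executed plan would.
    static List<String> computeCcm(List<String> raw) {
        List<String> out = new ArrayList<>();
        for (String text : raw) {
            out.add(myUdf(text));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("x", "y");

        // Re-execution: the second pass calls the UDF again and gets new UUIDs.
        List<String> ccm = computeCcm(raw);
        List<String> bugReexecuted = computeCcm(raw);
        System.out.println("re-executed matches: " + ccm.equals(bugReexecuted));

        // Reuse: deriving from the already computed relation keeps the UUIDs.
        List<String> bugReused = new ArrayList<>(ccm);
        System.out.println("reused matches: " + ccm.equals(bugReused));
    }
}
```

The first comparison is expected to report a mismatch, since two independently generated random UUIDs are practically never equal, while the second reports a match. Any non-deterministic UDF would behave the same way under re-execution.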
[jira] Commented: (PIG-1153) [zebra] spliting columns at different levels in a complex record column into different column groups throws exception
[ https://issues.apache.org/jira/browse/PIG-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793522#action_12793522 ]

Hadoop QA commented on PIG-1153:

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12428683/PIG-1153.patch
against trunk revision 893053.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 5 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/150/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/150/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/150/console

This message is automatically generated.