[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906871#action_12906871 ] Doug Cutting commented on PIG-794: -- Jeff, what version of Avro are you using? Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1601) Make scalar work for secure hadoop
[ https://issues.apache.org/jira/browse/PIG-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906874#action_12906874 ] Thejas M Nair commented on PIG-1601: +1 Make scalar work for secure hadoop -- Key: PIG-1601 URL: https://issues.apache.org/jira/browse/PIG-1601 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1601-1.patch Error message: open file 'hdfs://gsbl90890.blue.ygrid.yahoo.com/tmp/temp851711738/tmp727366271'; error = java.io.IOException: Delegation Token can be issued only with kerberos or web authentication at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:4975) at org.apache.hadoop.hdfs.server.namenode.NameNode.getDelegationToken(NameNode.java:432) at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1301) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1297) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295) at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:66) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:313) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:448) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:441) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide.getNext(Divide.java:72) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:358) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) at org.apache.hadoop.mapred.Child.main(Child.java:211) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1601) Make scalar work for secure hadoop
[ https://issues.apache.org/jira/browse/PIG-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1601. - Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. Make scalar work for secure hadoop -- Key: PIG-1601 URL: https://issues.apache.org/jira/browse/PIG-1601 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1601-1.patch Error message: open file 'hdfs://gsbl90890.blue.ygrid.yahoo.com/tmp/temp851711738/tmp727366271'; error = java.io.IOException: Delegation Token can be issued only with kerberos or web authentication at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:4975) at org.apache.hadoop.hdfs.server.namenode.NameNode.getDelegationToken(NameNode.java:432) at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1301) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1297) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295) at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:66) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:313) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:448) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:441) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide.getNext(Divide.java:72) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:358) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) at org.apache.hadoop.mapred.Child.main(Child.java:211) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1595) casting relation to scalar- problem with handling of data from non PigStorage loaders
[ https://issues.apache.org/jira/browse/PIG-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1595: --- Attachment: PIG-1595.2.patch With changes in PIG-1595.1.patch, the column name gets propagated in the schema , so I have updated the test case to use different column names in the relation used as scalar so that it does not conflict with other column being projected. All testScalarAlias unit tests pass. test-patch results - [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. casting relation to scalar- problem with handling of data from non PigStorage loaders - Key: PIG-1595 URL: https://issues.apache.org/jira/browse/PIG-1595 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1595.1.patch, PIG-1595.2.patch If load functions that don't follow the same bytearray format as PigStorage for other supported datatypes, or those that don't implement the LoadCaster interface are used in 'casting relation to scalar' (PIG-1434), it can cause the query to fail or create incorrect results. The root cause of the problem is that there is a real dependency between the ReadScalars udf that returns the scalar value and the LogicalOperator that acts as its input. But the logicalplan does not capture this dependency. So in SchemaResetter visitor used by the optimizer, the order in which schema is reset and evaluated does not take this into consideration. If the schema of the input LogicalOperator does not get evaluated before the ReadScalar udf, the resutltype of ReadScalar udf becomes bytearray. POUserFunc will convert the input to bytearray using ' new DataByteArray(inp.toString().getBytes())'. But this bytearray encoding of other supported types might not be same for the LoadFunction associated with the column, and that can result in problems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1595) casting relation to scalar- problem with handling of data from non PigStorage loaders
[ https://issues.apache.org/jira/browse/PIG-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906932#action_12906932 ] Daniel Dai commented on PIG-1595: - +1 for the test failure fix. casting relation to scalar- problem with handling of data from non PigStorage loaders - Key: PIG-1595 URL: https://issues.apache.org/jira/browse/PIG-1595 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1595.1.patch, PIG-1595.2.patch If load functions that don't follow the same bytearray format as PigStorage for other supported datatypes, or those that don't implement the LoadCaster interface are used in 'casting relation to scalar' (PIG-1434), it can cause the query to fail or create incorrect results. The root cause of the problem is that there is a real dependency between the ReadScalars udf that returns the scalar value and the LogicalOperator that acts as its input. But the logicalplan does not capture this dependency. So in SchemaResetter visitor used by the optimizer, the order in which schema is reset and evaluated does not take this into consideration. If the schema of the input LogicalOperator does not get evaluated before the ReadScalar udf, the resutltype of ReadScalar udf becomes bytearray. POUserFunc will convert the input to bytearray using ' new DataByteArray(inp.toString().getBytes())'. But this bytearray encoding of other supported types might not be same for the LoadFunction associated with the column, and that can result in problems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1595) casting relation to scalar- problem with handling of data from non PigStorage loaders
[ https://issues.apache.org/jira/browse/PIG-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906952#action_12906952 ] Thejas M Nair commented on PIG-1595: PIG-1595.2.patch committed to trunk and 0.8 branch. casting relation to scalar- problem with handling of data from non PigStorage loaders - Key: PIG-1595 URL: https://issues.apache.org/jira/browse/PIG-1595 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1595.1.patch, PIG-1595.2.patch If load functions that don't follow the same bytearray format as PigStorage for other supported datatypes, or those that don't implement the LoadCaster interface are used in 'casting relation to scalar' (PIG-1434), it can cause the query to fail or create incorrect results. The root cause of the problem is that there is a real dependency between the ReadScalars udf that returns the scalar value and the LogicalOperator that acts as its input. But the logicalplan does not capture this dependency. So in SchemaResetter visitor used by the optimizer, the order in which schema is reset and evaluated does not take this into consideration. If the schema of the input LogicalOperator does not get evaluated before the ReadScalar udf, the resutltype of ReadScalar udf becomes bytearray. POUserFunc will convert the input to bytearray using ' new DataByteArray(inp.toString().getBytes())'. But this bytearray encoding of other supported types might not be same for the LoadFunction associated with the column, and that can result in problems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Does Pig Re-Use FileInputLoadFuncs Objects?
I'm not 100% sure I understand the question. Are you asking if it re- uses instances of a given load or store function? It should not. Alan. On Aug 31, 2010, at 7:28 PM, Russell Jurney wrote: Pardon the cross-post: Does Pig ever re-use FileInputLoadFunc objects? We suspect state is being retained between different stores, but we don't actually know this. Figured I'd ask to verify the hunch. Our load func for our in-house format works fine with Pig scripts normally... but I have a pig script that looks like this: LOAD thing1 SPLIT thing1 INTO thing2, thing3 STORE thing2 INTO thing2 STORE thing3 INTO thing3 LOAD thing4 SPLIT thing4 INTO thing5, thing6 STORE thing5 INTO thing5 STORE thing6 INTO thing6 And it works via PigStorage, but not via our FileInputLoadFunc. Russ
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12907049#action_12907049 ] Jeff Zhang commented on PIG-794: Doug, I am using avro trunk revision 988779 Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1178: Attachment: PIG-1178-11.patch PIG-1178-11.patch change the layout of explain, error code and comments, etc. No real functional changes. test-patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 11 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-10.patch, PIG-1178-11.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12907061#action_12907061 ] Daniel Dai commented on PIG-1178: - PIG-1178-11.patch committed to both trunk and 0.8 branch. LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-10.patch, PIG-1178-11.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1178: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-10.patch, PIG-1178-11.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.