[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904551#action_12904551 ] Jeff Zhang commented on PIG-794:

I did some experiments with Avro; AvroStorage_2.patch is the detailed implementation. Here I use Avro as the data storage between map-reduce jobs, replacing InterStorage, which is already optimized compared to BinStorage. I use a simple Pig script which is translated into 2 map-reduce jobs:

{code}
a = load '/a.txt';
b = load '/b.txt';
c = join a by $0, b by $0;
d = group c by $0;
dump d;
{code}

The following table shows my experiment results (1 master + 3 slaves):

|| Storage || Time spent on job_1 || Output size of job_1 || Mapper task number of job_2 || Time spent on job_2 || Total time spent on pig script ||
| AvroStorage | 5min 57sec | 7.97G | 120 | 16min 50sec | 22min 47sec |
| InterStorage | 4min 33sec | 9.55G | 143 | 17min 17sec | 21min 50sec |

The experiment shows that AvroStorage has a more compact format than InterStorage (according to the output size of job_1), but more serialization overhead (according to the time spent on job_1). I think the time spent on job_2 using AvroStorage is less than with InterStorage because the input size of job_2 (the output of job_1) is much smaller with AvroStorage, so it needs fewer mapper tasks. Overall, AvroStorage is not as good as expected. One reason may be that I am not using Avro's API correctly (I hope the Avro folks can review my code); another may be that Avro's serialization performance is not that good. BTW, I use Avro trunk.

Use Avro serialization in Pig
Key: PIG-794
URL: https://issues.apache.org/jira/browse/PIG-794
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, jackson-asl-0.9.4.jar, PIG-794.patch

We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-794:

Attachment: AvroStorage_3.patch
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904615#action_12904615 ] Dmitriy V. Ryaboy commented on PIG-794:

Jeff, have you checked out Scott Carey's work here: https://issues.apache.org/jira/browse/AVRO-592 ?
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904674#action_12904674 ] Scott Carey commented on PIG-794:

AVRO-592 creates an AvroStorage class for writing and reading M/R inputs and outputs, but does not deal with intermediate M/R output. I have some updates to it in progress that simplify it further; some aspects may be reusable for this too.

One thing to note is that Avro cannot be completely optimal for intermediate M/R output because the Hadoop API for this has a performance flaw that prevents efficient use of buffers and input/output streams there. This would affect InterStorage as well, though.

I'll take a look at the patch here and see if I can spot any performance optimizations. Note that there are still several performance optimizations left to do in Avro itself; for example, the BinaryDecoder has been optimized, but not the Encoder yet. Also, I'm somewhat blocked on AVRO-592 due to the lack of Pig 0.7 Maven availability.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904680#action_12904680 ] Scott Carey commented on PIG-794:

So a summary of the differences I can see quickly:

h5. Schema usage
This patch creates a 'generic' Avro schema that can be used for any Pig data. Each field in a Tuple is a union of all possible Pig types, and each Tuple is a list of fields. It does not preserve the field names or types -- these are not important for intermediate data anyway. AVRO-592 translates the Pig schema into a specific Avro schema that persists the field names and types, so that:
STORE foo INTO 'file' USING AvroStorage();
will create a file that
foo2 = LOAD 'file' USING AvroStorage();
will be able to re-create the exact schema from, for use in a script.

h5. Serialization and Deserialization
This patch uses the same style as Avro's GenericRecord: it traverses the schema on the fly and writes the fields of each record. AVRO-592 constructs a state machine for each specific schema to optimally traverse a Tuple when serializing a record, or to create a Tuple when deserializing. This should be faster, but the code is definitely harder to read (though easy to unit test -- AVRO-592 has 98% unit test code coverage on that portion).

Integrating these should not be too hard. I'll try to put my latest version of AVRO-592 up there late today or tomorrow.
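For reference, a minimal sketch of what such a 'generic' schema could look like (illustrative only -- the record name and the exact union membership are assumptions, not the actual schema used by either patch). Each Tuple is a record holding an array whose items are a union of possible types, including a recursive reference for nested tuples:

```json
{
  "type": "record",
  "name": "PigTuple",
  "fields": [
    { "name": "fields",
      "type": { "type": "array",
                "items": ["null", "int", "long", "float", "double",
                          "string", "bytes", "PigTuple"] } }
  ]
}
```

Because field names and types are erased into the union, this schema can carry any intermediate Pig data, at the cost of writing a union branch index per field.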
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904687#action_12904687 ] Doug Cutting commented on PIG-794:

A few comments about the attached code:
- Is there a reason you don't subclass GenericDatumReader and GenericDatumWriter, overriding readRecord() and writeRecord()? That would simplify things and better guarantee that you're conforming to a schema. Currently, e.g., your writeMap() doesn't appear to write a valid Avro map, writeArray() doesn't write a valid Avro array, etc., so the data written is not interoperable.
- My guess is that a lot of time is spent in findSchemaIndex(). If that's right, you might improve this in various ways, e.g.:
-- sort this by the most common types; the order in Pig's DataType.java is probably a good one.
-- try using a static Map<Class,Integer> cache of indexes.
- Have you run this under a profiler?

I don't see where this specifies an Avro schema for Pig data. It's possible to construct a generic schema for all Pig data. In this, a Bag should be a record with a single field, an array of Tuples. A Tuple should be a record with a single field, an array of a union of all types. Given such a schema, one could then write a DatumReader/Writer using the control logic of Pig's DataReaderWriter (i.e., a switch based on the value of DataType.findType()), but, instead of calling DataInput/Output methods, use Encoder/Decoder methods, with a ValidatingEncoder (at least while debugging) to ensure you conform to that schema.

Alternately, in Avro 1.4 (snapshot in Maven now, release this week, hopefully) Avro arrays can be arbitrary Collection implementations. Bag already implements all of the required Collection methods -- clear(), add(), size(), iterator() -- so there's no reason I can see for Bag not to implement Collection<Tuple>.
So then one could subclass GenericData and GenericDatumReader/Writer, overriding:
{code}
protected boolean isRecord(Object datum) {
  return datum instanceof Tuple || datum instanceof Bag;
}

protected void writeRecord(Schema schema, Object datum, Encoder out) throws IOException {
  if (TUPLE_NAME.equals(schema.getFullName()))
    datum = ((Tuple)datum).getAll();
  writeArray(schema.getFields().get(0).getType(), datum, out);
}

protected Object readRecord(Object old, Schema expected, ResolvingDecoder in) throws IOException {
  Object result;
  if (TUPLE_NAME.equals(expected.getFullName())) {
    old = new ArrayList();
    result = new Tuple(old);
  } else {
    old = result = new Bag();
  }
  readArray(old, expected.getFields().get(0).getType(), in);
  return result;
}
{code}
Finally, if you knew the schema for the dataset being processed, rather than using a fully general Pig schema, you could translate that schema to an Avro schema. In most cases this schema would not have a huge, compute-intensive-to-write union in it. Or you might use something like what Scott proposed in AVRO-592.
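A minimal sketch of the static Map<Class,Integer> cache idea suggested above (class and method names are illustrative, not Pig's actual code; the union layout here is a stand-in). The point is that the linear scan over union branches is paid once per runtime class instead of once per field written:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of caching the union-branch index per runtime class,
// so findSchemaIndex() does not re-scan the union for every field.
public class SchemaIndexCache {
    // Maps a datum's Java class to its position in the union schema.
    private static final Map<Class<?>, Integer> CACHE = new ConcurrentHashMap<>();

    // Stand-in for the expensive linear scan over the union's branches.
    private static int scanUnion(Object datum) {
        if (datum instanceof Integer) return 0;
        if (datum instanceof Long)    return 1;
        if (datum instanceof Double)  return 2;
        if (datum instanceof String)  return 3;
        return -1; // unknown type
    }

    public static int findSchemaIndex(Object datum) {
        // computeIfAbsent pays the scan cost once per class, then hits the map.
        return CACHE.computeIfAbsent(datum.getClass(), c -> scanUnion(datum));
    }
}
```

Sorting the scan by the most common types (as the comment suggests, following the order in DataType.java) composes with this: the cache removes repeat scans, the ordering cheapens the first one.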
[jira] Updated: (PIG-1429) Add Boolean Data Type to Pig
[ https://issues.apache.org/jira/browse/PIG-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1429: Fix Version/s: (was: 0.8.0) Unlinking because we are branching for release today Add Boolean Data Type to Pig Key: PIG-1429 URL: https://issues.apache.org/jira/browse/PIG-1429 Project: Pig Issue Type: New Feature Components: data Affects Versions: 0.7.0 Reporter: Russell Jurney Assignee: Russell Jurney Attachments: working_boolean.patch Original Estimate: 8h Remaining Estimate: 8h Pig needs a Boolean data type. Pig-1097 is dependent on doing this. I volunteer. Is there anything beyond the work in src/org/apache/pig/data/ plus unit tests to make this work? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1549) Provide utility to construct CNF form of predicates
[ https://issues.apache.org/jira/browse/PIG-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1549: Fix Version/s: (was: 0.8.0) Unlinking from 0.8 release since we are about to branch Provide utility to construct CNF form of predicates --- Key: PIG-1549 URL: https://issues.apache.org/jira/browse/PIG-1549 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.8.0 Reporter: Swati Jain Assignee: Swati Jain Attachments: 0001-Add-CNF-utility-class.patch Provide utility to construct CNF form of predicates -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1530) PIG Logical Optimization: Push LOFilter above LOCogroup
[ https://issues.apache.org/jira/browse/PIG-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1530.

Resolution: Duplicate

Xuefu is addressing this issue as part of https://issues.apache.org/jira/browse/PIG-1575.

PIG Logical Optimization: Push LOFilter above LOCogroup
Key: PIG-1530
URL: https://issues.apache.org/jira/browse/PIG-1530
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Swati Jain
Assignee: Swati Jain
Priority: Minor
Fix For: 0.8.0

Consider the following:
{noformat}
A = load 'any file' USING PigStorage(',') as (a1:int,a2:int,a3:int);
B = load 'any file' USING PigStorage(',') as (b1:int,b2:int,b3:int);
G = COGROUP A by (a1,a2) , B by (b1,b2);
D = Filter G by group.$0 + 5 > group.$1;
explain D;
{noformat}
In the above example, LOFilter can be pushed above LOCogroup. Note there are some tricky NULL issues to think about when the Cogroup is not of type INNER (similar to the issues that need to be thought through when pushing LOFilter on the right side of a LeftOuterJoin). Also note that typically the LOFilter in user programs will be below a ForEach-Cogroup pair. To make this really useful, we need to also implement pushing LOFilter across ForEach.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter
[ https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1494:

Unlinking from 0.8 since we are about to branch for release

PIG Logical Optimization: Use CNF in PushUpFilter
Key: PIG-1494
URL: https://issues.apache.org/jira/browse/PIG-1494
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Swati Jain
Assignee: Swati Jain
Priority: Minor

The PushUpFilter rule is not able to handle complicated boolean expressions. For example, the SplitFilter rule splits one LOFilter into two by AND, but it cannot split an LOFilter whose top-level operator is OR. For example:

*ex script:*
A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
J1 = JOIN B by b1, C by c1;
J2 = JOIN J1 by $0, A by a1;
D = *Filter J2 by ( (c1 > 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
explain D;

In the above example, PushUpFilter is not able to push any filter condition across any join, as the condition contains columns from all branches (inputs). But if we convert this expression into Conjunctive Normal Form (CNF) then we would be able to push the filter conditions on c1 > 10 and c2 == 5 below both joins. Here is the CNF expression for the highlighted line: ( (c1 > 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) )

*Suggestion:* It would be a good idea to convert LOFilter's boolean expression into CNF; it would then be easy to push parts (conjuncts) of the LOFilter boolean expression selectively. We would also no longer require the SplitFilter rule if we were to add this utility to the PushUpFilter rule itself.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
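The CNF step described above is the distribution of OR over AND: (A AND B) OR C becomes (A OR C) AND (B OR C). A toy sketch of that single rewrite (not Pig's optimizer code; class and method names are made up, and literals are modeled as plain strings):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of distributing OR over AND for CNF conversion:
// (A AND B) OR C  ==>  (A OR C) AND (B OR C).
public class CnfSketch {
    // andSide: the conjuncts A, B, ...; orLiteral: the disjunct C.
    // Returns a list of clauses; each inner list is one conjunct whose
    // members are OR'd together.
    static List<List<String>> distributeOr(List<String> andSide, String orLiteral) {
        List<List<String>> cnf = new ArrayList<>();
        for (String conjunct : andSide) {
            List<String> clause = new ArrayList<>();
            clause.add(conjunct);   // original conjunct
            clause.add(orLiteral);  // OR'd with the distributed literal
            cnf.add(clause);
        }
        return cnf;
    }
}
```

For the example predicate, distributeOr(["c1 > 10", "a3+b3 > 10"], "c2 == 5") yields the two clauses shown in the comment; the first references only input C's columns and so can be pushed below both joins.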
[jira] Created: (PIG-1582) upgrade commons-logging version with ivy
upgrade commons-logging version with ivy Key: PIG-1582 URL: https://issues.apache.org/jira/browse/PIG-1582 Project: Pig Issue Type: Improvement Components: build Reporter: Giridharan Kesavan to upgrade the commons-logging version for pig from 1.0.3 to 1.1.1 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1582) upgrade commons-logging version with ivy
[ https://issues.apache.org/jira/browse/PIG-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giridharan Kesavan updated PIG-1582:

Attachment: pig-1582.patch
[jira] Updated: (PIG-1582) upgrade commons-logging version with ivy
[ https://issues.apache.org/jira/browse/PIG-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giridharan Kesavan updated PIG-1582:

Status: Patch Available (was: Open)
[jira] Updated: (PIG-1582) upgrade commons-logging version with ivy
[ https://issues.apache.org/jira/browse/PIG-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giridharan Kesavan updated PIG-1582:

Status: Resolved (was: Patch Available)
Fix Version/s: 0.8.0
Resolution: Fixed
[jira] Created: (PIG-1583) piggybank unit test TestLookupInFiles is broken
piggybank unit test TestLookupInFiles is broken
---
Key: PIG-1583
URL: https://issues.apache.org/jira/browse/PIG-1583
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.8.0
Attachments: PIG-1583-1.patch

Error message:
10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from attempt_20100831093139211_0001_m_00_3: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles [LookupInFiles : Cannot open file one]
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.IOException: LookupInFiles : Cannot open file one
 at org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92)
 at org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115)
 at org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
 ... 10 more
Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one does not exist
 at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224)
 at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172)
 at org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89)
 ... 13 more

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken
[ https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1583:

Attachment: (was: PIG-1583-1.patch)
[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken
[ https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1583:

Attachment: PIG-1583-1.patch
[jira] Commented: (PIG-1583) piggybank unit test TestLookupInFiles is broken
[ https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904783#action_12904783 ] Xuefu Zhang commented on PIG-1583:

+1 Patch Looks Good.
[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken
[ https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated PIG-1583:

Status: Patch Available (was: Open)
[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP
[ https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904785#action_12904785 ] Olga Natkovich commented on PIG-1506: - This is what we need to document: In the case of GROUP/COGROUP, data with a NULL key from the same input is grouped together. For instance:

Input data:
joe 5 2.5
sam 3.0
bob 3.5

Script:
{code}
A = load 'small' as (name, age, gpa);
B = group A by age;
dump B;
{code}

Output:
(5,{(joe,5,2.5)})
(,{(sam,,3.0),(bob,,3.5)})

Note that both records with null age are grouped together. However, data with null keys from different inputs is considered different and will generate multiple tuples in the case of COGROUP. For instance, a self-cogroup on the same input:

{code}
A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = cogroup A by age, B by age;
dump C;
{code}

Output:
(5,{(joe,5,2.5)},{(joe,5,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})

Note that there are 2 tuples in the output corresponding to the null key: one that contains tuples from the first input (with no match from the second) and one the other way around. JOIN adds another interesting twist because it follows the SQL standard: JOIN by default is an inner join, which throws away all tuples with null keys.

Input: the same as for COGROUP. Script:
{code}
A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = join A by age, B by age;
dump C;
{code}

Output:
(joe,5,2.5,joe,5,2.5)

Note that all tuples that had a NULL key got filtered out.

Need to clarify the difference between null handling in JOIN and COGROUP Key: PIG-1506 URL: https://issues.apache.org/jira/browse/PIG-1506 Project: Pig Issue Type: Improvement Components: documentation Reporter: Olga Natkovich Assignee: Corinne Chandel Fix For: 0.8.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
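The GROUP and inner-JOIN null-key behavior described above can be modeled in a few lines of Python. This is only an illustrative sketch of the semantics (the function names and tuple representation are invented for this example; it is not Pig's implementation):

```python
# Records are (name, age, gpa) tuples, with None standing in for Pig's null.
from collections import defaultdict

data = [("joe", 5, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]

def group_by_age(rows):
    # Within a single input, null keys collapse into one group,
    # mirroring GROUP's behavior above.
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return dict(groups)

def inner_join_by_age(left, right):
    # SQL-style inner join: null never equals null, so null-keyed
    # rows are dropped entirely.
    return [(l, r) for l in left for r in right
            if l[1] is not None and l[1] == r[1]]

print(group_by_age(data))             # the two null-age rows share one group
print(inner_join_by_age(data, data))  # only the joe/joe pair survives
```

Running this shows sam and bob landing in a single null-keyed group, while the join output contains only the joe row, matching the dumps in the comment.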
[jira] Commented: (PIG-1399) Logical Optimizer: Expression optimizor rule
[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904808#action_12904808 ] Alan Gates commented on PIG-1399: -
[exec] +1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] +1 tests included. The patch appears to include 6 new or modified tests.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

Logical Optimizer: Expression optimizor rule Key: PIG-1399 URL: https://issues.apache.org/jira/browse/PIG-1399 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Yan Zhou Fix For: 0.8.0 Attachments: newPatchFindbugsWarnings.html, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch

We can optimize expressions in several ways:
1. Constant pre-calculation. Example: B = filter A by a0 > 5+7; => B = filter A by a0 > 12;
2. Boolean expression optimization. Example: B = filter A by not (not(a0 > 5) or a1 > 0); => B = filter A by a0 > 5 and a1 <= 0;
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
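Both rewrite rules above can be sketched over a toy expression tree, with tuples standing in for Pig's expression operators. This is an illustration of the two transformations (constant folding and De Morgan-style boolean simplification), not the optimizer's actual code:

```python
# Expressions are tuples: ('>', 'a0', ('+', 5, 7)) means a0 > 5+7.

def fold_constants(expr):
    """Constant pre-calculation: fold ('+', 5, 7) into 12."""
    if isinstance(expr, tuple):
        op, *args = expr
        args = [fold_constants(a) for a in args]
        if op == '+' and all(isinstance(a, int) for a in args):
            return sum(args)
        return (op, *args)
    return expr

def push_not(expr):
    """Boolean simplification: not(not(X) or Y) => X and not(Y),
    flipping comparison operators instead of keeping explicit nots."""
    negate = {'>': '<=', '<': '>=', '>=': '<', '<=': '>'}
    op, *args = expr
    if op == 'not':
        iop, *iargs = args[0]
        if iop == 'not':                       # double negation
            return iargs[0]
        if iop == 'or':                        # De Morgan
            return ('and',
                    push_not(('not', iargs[0])),
                    push_not(('not', iargs[1])))
        if iop in negate:                      # negate a comparison
            return (negate[iop], *iargs)
    return expr

print(fold_constants(('>', 'a0', ('+', 5, 7))))
e = ('not', ('or', ('not', ('>', 'a0', 5)), ('>', 'a1', 0)))
print(push_not(e))
```

The first call yields `('>', 'a0', 12)` and the second `('and', ('>', 'a0', 5), ('<=', 'a1', 0))`, i.e. the two rewrites described in the issue.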
[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP
[ https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904819#action_12904819 ] Scott Carey commented on PIG-1506: -- The SQL behavior of the above for an outer join would be to have five rows output -- just like COGROUP would if flattened. So that seems fine to me. A self-join should be the same as a COGROUP with yourself, which is different than a simple GROUP. However, there is a problem with inner join and nulls. Pig JOIN is not like SQL with respect to nulls on multi-column joins. (I have not tried on trunk however) In SQL, if ANY of the columns in a multi-column join is null, the row is not output. Try: {code} A = load 'small' as (name, age, gpa); B = load 'small' as (name, age, gpa); C = join A by (name,age), B by (name,age); dump C; {code} The result for SQL would be one row of the form joe 5 2.5 joe 5 2.5 Need to clarify the difference between null handling in JOIN and COGROUP Key: PIG-1506 URL: https://issues.apache.org/jira/browse/PIG-1506 Project: Pig Issue Type: Improvement Components: documentation Reporter: Olga Natkovich Assignee: Corinne Chandel Fix For: 0.8.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
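The SQL rule Scott describes for multi-column joins (a row is dropped from an inner join when ANY of its key columns is null) can be modeled with a minimal Python sketch; this is a model of the semantics, not Pig or SQL itself, and the function name is invented for the example:

```python
# Records are (name, age, gpa) tuples; None stands in for null.
rows = [("joe", 5, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]

def join_on_name_age(left, right):
    # A composite key qualifies only if every component is non-null.
    def key_ok(r):
        return r[0] is not None and r[1] is not None
    return [l + r for l in left for r in right
            if key_ok(l) and (l[0], l[1]) == (r[0], r[1])]

print(join_on_name_age(rows, rows))  # only joe joins with joe
```

Self-joining the three-row input yields the single row (joe,5,2.5,joe,5,2.5), matching the expected SQL result in the comment.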
[jira] Created: (PIG-1584) deal with inner cogroup
deal with inner cogroup --- Key: PIG-1584 URL: https://issues.apache.org/jira/browse/PIG-1584 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Fix For: 0.9.0 The current implementation of inner in the case of cogroup is in conflict with join. We need to decide whether to fix inner cogroup or just remove the functionality if it is not widely used. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken
[ https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giridharan Kesavan updated PIG-1583: Status: Open (was: Patch Available) submitting to hudson piggybank unit test TestLookupInFiles is broken --- Key: PIG-1583 URL: https://issues.apache.org/jira/browse/PIG-1583 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1583-1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken
[ https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giridharan Kesavan updated PIG-1583: Status: Patch Available (was: Open) piggybank unit test TestLookupInFiles is broken --- Key: PIG-1583 URL: https://issues.apache.org/jira/browse/PIG-1583 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1583-1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP
[ https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904829#action_12904829 ] Olga Natkovich commented on PIG-1506: - I verified that 0.8 code does deal correctly with multi-column keys with nulls Need to clarify the difference between null handling in JOIN and COGROUP Key: PIG-1506 URL: https://issues.apache.org/jira/browse/PIG-1506 Project: Pig Issue Type: Improvement Components: documentation Reporter: Olga Natkovich Assignee: Corinne Chandel Fix For: 0.8.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1501: -- Release Note: This feature saves the HDFS space used to store the intermediate data generated by Pig and can potentially improve query execution speed. In general, the more intermediate data a script generates, the greater the storage and speedup benefits. There are no backward compatibility issues as a result of this feature. Two Java properties control the behavior: pig.tmpfilecompression (default false) tells whether the temporary files should be compressed or not. If true, pig.tmpfilecompression.codec specifies which compression codec to use; currently Pig only accepts gz and lzo as possible values. Since LZO is under the GPL license, Hadoop may need to be configured to use the LZO codec. Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details. An example is the following test.pig script:

{code}
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent:long, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';
{code}

which is launched as follows:

{code}
java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar -Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main ./test.pig
{code}

need to investigate the impact of compression on pig performance Key: PIG-1501 URL: https://issues.apache.org/jira/browse/PIG-1501 Project: Pig Issue Type: Test Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch, PIG-1501.patch, PIG-1501.patch We would like to understand how compressing map results as well
[jira] Created: (PIG-1585) Add new properties to help and documentation
Add new properties to help and documentation Key: PIG-1585 URL: https://issues.apache.org/jira/browse/PIG-1585 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich Fix For: 0.8.0 New properties: Compression: pig.tmpfilecompression (default false) tells whether the temporary files should be compressed or not. If true, pig.tmpfilecompression.codec specifies which compression codec to use; currently Pig only accepts gz and lzo as possible values. Since LZO is under the GPL license, Hadoop may need to be configured to use the LZO codec. Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details. Combining small files: pig.noSplitCombination - disables combining multiple small files up to the block size. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1572) change default datatype when relations are used as scalar to bytearray
[ https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1572: --- Attachment: PIG-1572.2.patch PIG-1572.2.patch - Fixed loss of lineage information in translation during explain call - Added cast on output of ReadScalars so that type information is not lost during schema reset from optimizer. Unit tests and test-patch has passed. Patch is ready for review. [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. change default datatype when relations are used as scalar to bytearray -- Key: PIG-1572 URL: https://issues.apache.org/jira/browse/PIG-1572 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1572.1.patch, PIG-1572.2.patch When relations are cast to scalar, the current default type is chararray. This is inconsistent with the behavior in rest of pig-latin. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904848#action_12904848 ] Olga Natkovich commented on PIG-1501: - Ashutosh, the reason it is off by default is that the default compression codec is gzip, which is really slow and most of the time not what you want. Because of the licensing issue with LZO, users need to set it up on their own. Once they do the setup, they can enable the compression. need to investigate the impact of compression on pig performance Key: PIG-1501 URL: https://issues.apache.org/jira/browse/PIG-1501 Project: Pig Issue Type: Test Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch, PIG-1501.patch, PIG-1501.patch We would like to understand how compressing map results as well as reducer output in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1586) Parameter subsitution using -param option runs into problems when substituing entire pig statements in a shell script (maybe this is a bash problem)
Parameter subsitution using -param option runs into problems when substituing entire pig statements in a shell script (maybe this is a bash problem) Key: PIG-1586 URL: https://issues.apache.org/jira/browse/PIG-1586 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Viraj Bhat I have a Pig script as a template: {code} register Countwords.jar; A = $INPUT; B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO $OUTPUT; {code} I attempt to do parameter substitution using the following shell script: {code} #!/bin/bash java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file sub.pig \ -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \ -param OUTPUT=\'/user/viraj/output\' USING PigStorage() {code} The resulting script is: {code} register Countwords.jar; A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(runsub.sh,,))); B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO /user/viraj/output; {code} The shell substitutes the $0 before passing it to java. a) Is there a workaround for this? b) Is this a Pig param problem? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1586) Parameter subsitution using -param option runs into problems when substituing entire pig statements in a shell script (maybe this is a bash problem)
[ https://issues.apache.org/jira/browse/PIG-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1586: Description: I have a Pig script as a template: {code} register Countwords.jar; A = $INPUT; B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO $OUTPUT; {code} I attempt to do Parameter substitutions using the following: Using Shell script: {code} #!/bin/bash java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file sub.pig \ -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \ -param OUTPUT=\'/user/viraj/output\' USING PigStorage() {code} {code} register Countwords.jar; A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(runsub.sh,,))); B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO /user/viraj/output; {code} The shell substitutes the $0 before passing it to java. a) Is there a workaround for this? b) Is this is Pig param problem? 
Viraj was: I have a Pig script as a template: {code} register Countwords.jar; A = $INPUT; B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO $OUTPUT; {code} I attempt to do Parameter substitutions using the following: Using Shell script: {code} #!/bin/bash java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file sub.pig \ -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \ -param OUTPUT=\'/user/viraj/output\' USING PigStorage() {code} register Countwords.jar; A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(runsub.sh,,))); B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO /user/viraj/output; {code} The shell substitutes the $0 before passing it to java. a) Is there a workaround for this? b) Is this is Pig param problem? 
Viraj Parameter subsitution using -param option runs into problems when substituing entire pig statements in a shell script (maybe this is a bash problem) Key: PIG-1586 URL: https://issues.apache.org/jira/browse/PIG-1586 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Viraj Bhat I have a Pig script as a template: {code} register Countwords.jar; A = $INPUT; B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO $OUTPUT; {code} I attempt to do Parameter substitutions using the following: Using Shell script: {code} #!/bin/bash java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file sub.pig \ -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \ -param OUTPUT=\'/user/viraj/output\' USING PigStorage() {code} {code} register Countwords.jar; A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(runsub.sh,,))); B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO /user/viraj/output; {code} The shell substitutes the $0 before passing it to java. a) Is there a workaround for this? b) Is this is Pig param problem? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a
[jira] Created: (PIG-1587) Cloning utility functions for new logical plan
Cloning utility functions for new logical plan -- Key: PIG-1587 URL: https://issues.apache.org/jira/browse/PIG-1587 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Fix For: 0.9.0

We sometimes need to copy a logical operator/plan when writing an optimization rule. Currently, copying an operator/plan is awkward. We need to write some utilities to facilitate this process. Swati contributed PIG-1510 but we feel it still cannot address most use cases. I propose to add some more utilities to the new logical plan:

all LogicalExpressions:
{code}
copy(LogicalExpressionPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical expression operator (except for fieldSchema, uidOnlySchema, ProjectExpression.attachedRelationalOp)
* Set the plan to newPlan
* If keepUid is true, further copy uidOnlyFieldSchema

all LogicalRelationalOperators:
{code}
copy(LogicalPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical relational operator (except for schema and uid-related fields)
* Set the plan to newPlan
* If the operator has an inner plan/expression plan, copy the whole inner plan with the same keepUid flag (in particular, LOInnerLoad will copy its inner project with the same keepUid flag)
* If keepUid is true, further copy uid-related fields (LOUnion.uidMapping, LOCogroup.groupKeyUidOnlySchema, LOCogroup.generatedInputUids)

LogicalExpressionPlan.java
{code}
LogicalExpressionPlan copy(LogicalRelationalOperator attachedRelationalOp, boolean keepUid);
{code}
* Copy expression operators along with connections, with the same keepUid flag
* Set all ProjectExpression.attachedRelationalOp to the attachedRelationalOp parameter

{code}
List<Operator> merge(LogicalExpressionPlan plan);
{code}
* Merge plan into the current logical expression plan as an independent tree
* Return the sources of this independent tree

LogicalPlan.java
{code}
LogicalPlan copy(boolean keepUid);
{code}
* Main use case: copy the inner plan of ForEach
* Copy all relational operators along with connections
* Copy all expression plans inside relational operators, setting plan and attachedRelationalOp properly

{code}
List<Operator> merge(LogicalPlan plan);
{code}
* Merge plan into the current logical plan as an independent tree
* Return the sources of this independent tree

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
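The copy(newPlan, keepUid) semantics proposed in PIG-1587 can be sketched in a few lines. This is a hypothetical Python model with invented class and field names, not the actual Java API; it only illustrates the rule "shallow-copy the operator, re-attach it to the new plan, and carry uid fields over only when keepUid is true":

```python
import copy as _copy

class Operator:
    """Toy stand-in for a logical operator with uid-related state."""
    def __init__(self, name, plan=None, uid_fields=None):
        self.name = name
        self.plan = plan
        self.uid_fields = uid_fields or {}

    def copy(self, new_plan, keep_uid):
        clone = _copy.copy(self)   # shallow copy of ordinary fields
        clone.plan = new_plan      # re-attach to the target plan
        # uid-related fields are copied only on request
        clone.uid_fields = dict(self.uid_fields) if keep_uid else {}
        return clone

op = Operator("LOFilter", plan="oldPlan", uid_fields={"uid": 42})
c1 = op.copy("newPlan", keep_uid=True)
c2 = op.copy("newPlan", keep_uid=False)
print(c1.plan, c1.uid_fields, c2.uid_fields)
```

With keepUid true the clone keeps {"uid": 42}; with keepUid false the uid state is reset, which is the distinction the proposal draws for optimizer rewrites.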
[jira] Assigned: (PIG-1586) Parameter subsitution using -param option runs into problems when substituing entire pig statements in a shell script (maybe this is a bash problem)
[ https://issues.apache.org/jira/browse/PIG-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1586: --- Assignee: Viraj Bhat Viraj volunteered to print the line that pig gets as part of parameter substitution to see if the escapes and quotes are eaten by the shell. Thanks Viraj Parameter subsitution using -param option runs into problems when substituing entire pig statements in a shell script (maybe this is a bash problem) Key: PIG-1586 URL: https://issues.apache.org/jira/browse/PIG-1586 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Viraj Bhat Assignee: Viraj Bhat I have a Pig script as a template: {code} register Countwords.jar; A = $INPUT; B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO $OUTPUT; {code} I attempt to do Parameter substitutions using the following: Using Shell script: {code} #!/bin/bash java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file sub.pig \ -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \ -param OUTPUT=\'/user/viraj/output\' USING PigStorage() {code} {code} register Countwords.jar; A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(runsub.sh,,))); B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO /user/viraj/output; {code} The shell substitutes the $0 before passing it to java. a) Is there a workaround for this? b) Is this is Pig param problem? 
Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
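The quoting behavior suspected above can be reproduced without Pig at all. A minimal bash sketch (with made-up values, not taken from the ticket) showing that the shell expands positional variables inside double quotes but leaves them literal inside single quotes:
{code}
#!/bin/bash
set -- dataset1 42                 # give the shell positional parameters $1 and $2

expanded="CountWords($1,$2)"       # double quotes: bash substitutes before Pig ever runs
literal='CountWords($1,$2)'        # single quotes: the invoked program gets the literal text

echo "$expanded"                   # prints CountWords(dataset1,42)
echo "$literal"                    # prints CountWords($1,$2)
{code}
So whether Pig sees `$0` at all is decided by the quoting on the java command line, before `org.apache.pig.Main` is invoked.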
[jira] Created: (PIG-1588) Parameter pre-processing of values containing pig positional variables ($0, $1 etc)
Parameter pre-processing of values containing pig positional variables ($0, $1 etc)

Key: PIG-1588
URL: https://issues.apache.org/jira/browse/PIG-1588
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Laukik Chitnis
Fix For: 0.7.0

Pig 0.7 requires the positional variables to be escaped with a \\ when passed as part of a parameter value (either through a command-line param or through a param_file), which was not the case in Pig 0.6. Assuming this was not an intended break in backward compatibility (I could not find it in the release notes), this would be a bug. For example, we need to pass INPUT=CountWords(\\$0,\\$1,\\$2) instead of simply INPUT=CountWords($0,$1,$2).
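The escaping requirement described above is easy to check directly in bash (a sketch with hypothetical values, not from this ticket): backslash-escaping inside double quotes and plain single-quoting both deliver the literal $0, $1, $2 to the invoked program, so no change in Pig is needed if the value is quoted correctly:
{code}
#!/bin/bash
escaped="CountWords(\$0,\$1,\$2)"   # \$ survives double quotes as a literal $
quoted='CountWords($0,$1,$2)'       # single quotes need no escapes at all

echo "$escaped"                     # prints CountWords($0,$1,$2)
echo "$quoted"                      # prints CountWords($0,$1,$2)
{code}
This is consistent with the PIG-1586 finding: the shell, not Pig's parameter pre-processor, is what consumes the unescaped positional variables.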
[jira] Resolved: (PIG-1588) Parameter pre-processing of values containing pig positional variables ($0, $1 etc)
[ https://issues.apache.org/jira/browse/PIG-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1588. - Resolution: Duplicate

This is a duplicate of https://issues.apache.org/jira/browse/PIG-1586, and at this point we do not believe that either is a bug in Pig. Viraj is verifying that, but we think the shell removes the escapes before handing the value to Pig.

Parameter pre-processing of values containing pig positional variables ($0, $1 etc)

Key: PIG-1588 URL: https://issues.apache.org/jira/browse/PIG-1588 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Laukik Chitnis Fix For: 0.7.0
[jira] Resolved: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage
[ https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1537. - Resolution: Fixed

Column pruner causes wrong results when using both Custom Store UDF and PigStorage

Key: PIG-1537
URL: https://issues.apache.org/jira/browse/PIG-1537
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
Fix For: 0.8.0

I have a script of this pattern, and it uses two StoreFuncs:
{code}
register loader.jar;
register piggy-bank/java/build/storage.jar;

%DEFAULT OUTPUTDIR /user/viraj/prunecol/

ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);
ss_sc_filtered_0 = FILTER ss_sc_0 BY a#'id' matches '1.*' OR a#'id' matches '2.*' OR a#'id' matches '3.*' OR a#'id' matches '4.*';
ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);
ss_sc_filtered_1 = FILTER ss_sc_1 BY a#'id' matches '65.*' OR a#'id' matches '466.*' OR a#'id' matches '043.*' OR a#'id' matches '044.*' OR a#'id' matches '0650.*' OR a#'id' matches '001.*';
ss_sc_all = UNION ss_sc_filtered_0, ss_sc_filtered_1;
ss_sc_all_proj = FOREACH ss_sc_all GENERATE a#'query' as query, a#'testid' as testid, a#'timestamp' as timestamp, a, b, c;
ss_sc_all_ord = ORDER ss_sc_all_proj BY query, testid, timestamp PARALLEL 10;
ss_sc_all_map = FOREACH ss_sc_all_ord GENERATE a, b, c;
STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();
ss_sc_all_map_count = group ss_sc_all_map all;
count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as record_count, COUNT($1);
STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
{code}
I run this script using:
a) java -cp pig0.7.jar script.pig
b) java -cp pig0.7.jar -t PruneColumns script.pig
What I observe is that the alias count produces the same number of records, but ss_sc_all_map has different sizes when run with the above two options. Is this due to the fact that there are two StoreFuncs used?
Viraj
[jira] Updated: (PIG-747) Logical to Physical Plan Translation fails when temporary aliases are created within foreach
[ https://issues.apache.org/jira/browse/PIG-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-747: --- Fix Version/s: 0.9.0 (was: 0.8.0)

Logical to Physical Plan Translation fails when temporary aliases are created within foreach

Key: PIG-747
URL: https://issues.apache.org/jira/browse/PIG-747
Project: Pig
Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
Fix For: 0.9.0
Attachments: physicalplan.txt, physicalplanprob.pig, PIG-747-1.patch

Consider a Pig script which calculates a new column F inside the foreach:
{code}
A = load 'physicalplan.txt' as (col1,col2,col3);
B = foreach A {
    D = col1/col2;
    E = col3/col2;
    F = E - (D*D);
    generate F as newcol;
};
dump B;
{code}
This gives the following error:
{code}
Caused by: org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException: ERROR 2015: Invalid physical operators in the physical plan
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:377)
	at org.apache.pig.impl.logicalLayer.LOMultiply.visit(LOMultiply.java:63)
	at org.apache.pig.impl.logicalLayer.LOMultiply.visit(LOMultiply.java:29)
	at org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:68)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:908)
	at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:122)
	at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:41)
	at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
	at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
	at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:246)
	... 10 more
Caused by: org.apache.pig.impl.plan.PlanException: ERROR 0: Attempt to give operator of type org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide multiple outputs. This operator does not support multiple outputs.
	at org.apache.pig.impl.plan.OperatorPlan.connect(OperatorPlan.java:158)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan.connect(PhysicalPlan.java:89)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:373)
	... 19 more
{code}
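The failing connect happens because the intermediate alias D feeds the Multiply operator twice. Until the translator handles that, one possible workaround (a sketch only, not verified against this Pig version) is to inline the temporary aliases so that no single Divide operator needs multiple outputs:
{code}
A = load 'physicalplan.txt' as (col1,col2,col3);
B = foreach A {
    -- inline D = col1/col2 and E = col3/col2; each Divide now has one consumer
    generate (col3/col2) - (col1/col2)*(col1/col2) as newcol;
};
dump B;
{code}
This trades a little duplicated arithmetic for a plan in which every expression operator has exactly one output edge.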
[jira] Updated: (PIG-1319) New logical optimization rules
[ https://issues.apache.org/jira/browse/PIG-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1319: Fix Version/s: 0.9.0 (was: 0.8.0)

New logical optimization rules

Key: PIG-1319
URL: https://issues.apache.org/jira/browse/PIG-1319
Project: Pig
Issue Type: New Feature
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.9.0

In [PIG-1178|https://issues.apache.org/jira/browse/PIG-1178], we built a new logical optimization framework. One design goal of the new logical optimizer is to make it easier to add new logical optimization rules. In this JIRA, we keep track of the development of these new rules.
Does Pig Re-Use FileInputLoadFunc Objects?
Pardon the cross-post: does Pig ever re-use FileInputLoadFunc objects? We suspect state is being retained between different stores, but we don't actually know this; figured I'd ask to verify the hunch. Our load func for our in-house format works fine with Pig scripts normally... but I have a Pig script that looks like this:
{code}
LOAD thing1
SPLIT thing1 INTO thing2, thing3
STORE thing2 INTO thing2
STORE thing3 INTO thing3

LOAD thing4
SPLIT thing4 INTO thing5, thing6
STORE thing5 INTO thing5
STORE thing6 INTO thing6
{code}
And it works via PigStorage, but not via our FileInputLoadFunc.

Russ