Re: load files
Thanks, Jeff. In Pig, the file names look like part-m-x (for map results) or part-r-x (for reduce results), which are different from the Hadoop style (part-x). So, can we control the name of each generated file? How? Thanks, -Gang

----- Original Message -----
From: Jeff Zhang zjf...@gmail.com
To: pig-dev@hadoop.apache.org
Sent: Sunday, 2010/6/27 9:22:30 PM
Subject: Re: load files

Hi Gang, The path specified in load can be either a file or a directory; besides, you can also leverage Hadoop's globStatus. The path specified in store is a directory.

On Mon, Jun 28, 2010 at 4:44 AM, Gang Luo lgpub...@yahoo.com.cn wrote: Hi all, when we specify the path of input to a load operator, is it a file or a directory? Similarly, when we use store-load to connect two MR operators, is the path specified in the store and load a directory? Thanks, -Gang

-- Best Regards, Jeff Zhang
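The two naming schemes Gang describes can be distinguished mechanically. A minimal sketch in plain Java; the exact digit width of the numeric suffix is an assumption, so the regex is deliberately loose:

```java
import java.util.regex.Pattern;

public class PartFileNames {
    // Matches both the classic Hadoop style (part-00000) and the
    // Pig/new-MapReduce-API style (part-m-00000 for map output,
    // part-r-00000 for reduce output). Illustrative pattern only.
    static final Pattern PART = Pattern.compile("part-(?:[mr]-)?\\d+");

    public static boolean isPartFile(String name) {
        return PART.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPartFile("part-m-00000")); // map-side output
        System.out.println(isPartFile("part-00000"));   // classic style
    }
}
```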
[jira] Created: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)
map/red jobs fail using G1 GC (Couldn't find heap)
--------------------------------------------------
Key: PIG-1470
URL: https://issues.apache.org/jira/browse/PIG-1470
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.6.0
Environment: OS: 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64 x86_64 x86_64 GNU/Linux
Java: Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Hadoop: 0.20.1
Reporter: Randy Prager

Here is the hadoop map/red configuration (conf/mapred-site.xml) that fails:

{noformat}
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC</value>
</property>
{noformat}

Here is the hadoop map/red configuration that succeeds:

{noformat}
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops</value>
</property>
{noformat}

Here is the exception from the pig script:

{noformat}
Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to set up the load function.
        at org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:89)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper.makeReader(SliceWrapper.java:144)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getRecordReader(PigInputFormat.java:282)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[,]'
        at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:519)
        at org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:85)
        ... 5 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:487)
        ... 6 more
Caused by: java.lang.RuntimeException: Couldn't find heap
        at org.apache.pig.impl.util.SpillableMemoryManager.<init>(SpillableMemoryManager.java:95)
        at org.apache.pig.data.BagFactory.<init>(BagFactory.java:106)
        at org.apache.pig.data.DefaultBagFactory.<init>(DefaultBagFactory.java:71)
        at org.apache.pig.data.BagFactory.getInstance(BagFactory.java:76)
        at org.apache.pig.builtin.Utf8StorageConverter.<init>(Utf8StorageConverter.java:49)
        at org.apache.pig.builtin.PigStorage.<init>(PigStorage.java:69)
        at org.apache.pig.builtin.PigStorage.<init>(PigStorage.java:79)
        ... 11 more
{noformat}

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
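The "Couldn't find heap" error is thrown while constructing SpillableMemoryManager, which scans the JVM's memory pool MXBeans for a heap pool whose usage threshold can be monitored; under the early G1 builds no pool passes that check. A minimal sketch of that kind of scan, as an illustration of the mechanism rather than Pig's actual code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class HeapPoolScan {
    // Hypothetical stand-in for the scan SpillableMemoryManager performs:
    // find a heap pool that supports usage-threshold notifications.
    public static MemoryPoolMXBean findMonitorableHeapPool() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP
                    && pool.isUsageThresholdSupported()) {
                return pool;
            }
        }
        return null; // the "Couldn't find heap" situation
    }

    public static void main(String[] args) {
        MemoryPoolMXBean pool = findMonitorableHeapPool();
        System.out.println(pool == null ? "Couldn't find heap" : pool.getName());
    }
}
```

On collectors whose old-generation pool supports usage thresholds (the common case on HotSpot), the scan finds a pool; on the 1.6.0_18 G1 build reported here, it evidently did not.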
[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1389:
------------------------------
Attachment: PIG-1389.patch

Sync with the latest trunk.

Implement Pig counter to track number of rows for each input files
------------------------------------------------------------------
Key: PIG-1389
URL: https://issues.apache.org/jira/browse/PIG-1389
Project: Pig
Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.8.0
Attachments: PIG-1389.patch, PIG-1389.patch

A MR job generated by Pig can have not only multiple outputs (in the case of multiquery) but also multiple inputs (in the case of join or cogroup). In both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a given input or output. PIG-1299 addressed the case of multiple outputs. We need to add new counters for jobs with multiple inputs.
[jira] Commented: (PIG-1467) order by fail when set fs.file.impl.disable.cache to true
[ https://issues.apache.org/jira/browse/PIG-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883230#action_12883230 ]

Richard Ding commented on PIG-1467:
-----------------------------------
+1

order by fail when set fs.file.impl.disable.cache to true
---------------------------------------------------------
Key: PIG-1467
URL: https://issues.apache.org/jira/browse/PIG-1467
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.7.0, 0.8.0
Attachments: PIG-1467-1.patch, PIG-1467-2.patch

Order by fails with the message:

org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:135)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:551)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:630)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
        at org.apache.hadoop.mapred.Child.main(Child.java:211)

This happens with the following hadoop settings:
fs.file.impl.disable.cache=true
fs.hdfs.impl.disable.cache=true
[jira] Commented: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)
[ https://issues.apache.org/jira/browse/PIG-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883302#action_12883302 ]

Randy Prager commented on PIG-1470:
-----------------------------------
Thanks. We started testing with G1 GC on our hadoop cluster to avoid (which it seems to do) the exceptions

{noformat}
java.io.IOException: Task process exit with nonzero status of 134.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
{noformat}

which occur randomly on 6u18, 6u20 and the default GC. We are going to try some other Java version + GC combinations ... do you have any insight into a stable mix of Java versions and GC?
[jira] Created: (PIG-1471) inline UDFs in scripting languages
inline UDFs in scripting languages
----------------------------------
Key: PIG-1471
URL: https://issues.apache.org/jira/browse/PIG-1471
Project: Pig
Issue Type: New Feature
Reporter: Aniket Mokashi
Assignee: Aniket Mokashi
Fix For: 0.8.0

It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java. It should be possible to write these scripts inline as part of pig scripts. This feature is an extension of https://issues.apache.org/jira/browse/PIG-928
[jira] Commented: (PIG-1471) inline UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883327#action_12883327 ]

Aniket Mokashi commented on PIG-1471:
-------------------------------------
The proposed syntax is

{code}
define hellopig using org.apache.pig.scripting.jython.JythonScriptEngine as '@outputSchema(x:{t:(word:chararray)})\ndef helloworld():\n\treturn ('Hello, World')';
{code}
[jira] Created: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs
Optimize serialization/deserialization between Map and Reduce and between MR jobs
---------------------------------------------------------------------------------
Key: PIG-1472
URL: https://issues.apache.org/jira/browse/PIG-1472
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.8.0

In certain types of pig queries, most of the execution time is spent serializing/deserializing (sedes) records between Map and Reduce and between MR jobs. For example, if PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, L10 in pigmix v1) that have records with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime increase of a few times has been seen).

There are a few optimizations that have been shown to improve the performance of sedes in my tests:

1. Use a smaller number of bytes to store the length of the column. For example, if a bytearray is smaller than 255 bytes, a single byte can be used to store the length instead of the integer that is currently used.
2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than 1/2.

Zebra and BinStorage are known to use DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format is going to be different from the format used between M/R boundaries.
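Optimization #1 above can be sketched in a few lines: use a single length byte for short byte arrays, with an escape marker for the long form. The exact marker convention here is an illustrative assumption, not Pig's actual on-the-wire format:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SmallLengthEncoding {
    // Encode a column as [length][payload]: one length byte when the
    // payload is under 255 bytes, else a 255 marker followed by a 4-byte
    // int length. Saves 3 bytes per small column versus always writing
    // an int length.
    public static byte[] encode(byte[] column) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            if (column.length < 255) {
                out.writeByte(column.length); // 1 length byte
            } else {
                out.writeByte(255);           // marker: long form follows
                out.writeInt(column.length);  // 4 length bytes
            }
            out.write(column);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);    // cannot happen in-memory
        }
    }

    public static void main(String[] args) {
        // 1 length byte + 10 payload bytes, instead of 4 + 10
        System.out.println(encode(new byte[10]).length); // prints 11
    }
}
```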
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1295:
------------------------------------------------
Attachment: PIG-1295_0.6.patch

Ok, if the user does not use DefaultTuple we fall back to the default deserialization case. I added handling of nested tuples via recursion, and appropriate unit tests.

Binary comparator for secondary sort
------------------------------------
Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch

When the hadoop framework does the sorting, it will try to use the binary version of the comparator if one is available. The benefit of a binary comparator is that we do not need to instantiate the objects before comparing them. We saw a ~30% speedup after switching to a binary comparator. Currently, Pig uses a binary comparator in the following cases:

1. When the semantics of the order don't matter. For example, in distinct we need to sort in order to filter out duplicate values; however, we do not care how the comparator orders the keys. Group-by also shares this characteristic. In this case, we rely on hadoop's default binary comparator.
2. When the semantics of the order matter but the key is of a simple type. In this case, we have implementations for simple types, such as integer, long, float, chararray, databytearray, string.

However, if the key is a tuple and the sort semantics matter, we do not have a binary comparator implementation. This especially matters when we switch to secondary sort. In secondary sort, we convert the inner sort of a nested foreach into the secondary key and rely on hadoop to sort on both the main key and the secondary key. The sorting key becomes a two-item tuple. Since the secondary key is the sorting key of the nested foreach, the sorting semantics matter. It turns out we do not have a binary comparator once we use secondary sort, and we see a significant slowdown.

A binary comparator for tuples should be doable once we understand the binary structure of the serialized tuple. We can focus on the most common use case first, which is a group-by followed by a nested sort. In that case, we will use secondary sort; the semantics of the first key do not matter, but the semantics of the secondary key do. We need to identify the boundary between the main key and the secondary key in the binary tuple buffer without instantiating the tuple itself. Then, if the first keys are equal, we use a binary comparator to compare the secondary keys. The secondary key can also be of a complex data type, but for the first step we focus on a simple secondary key, which is the most common use case.

We mark this issue as a candidate project for the Google Summer of Code 2010 program.
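The core trick behind such a binary comparator can be shown on a single int key: serialize it big-endian with the sign bit flipped, and a plain unsigned byte-by-byte comparison then agrees with numeric order, so no object is ever instantiated. A self-contained sketch, illustrative only and not Pig's actual tuple format:

```java
public class RawIntComparator {
    // Big-endian encoding with the sign bit flipped, so that unsigned
    // lexicographic byte order equals signed numeric order.
    public static byte[] serialize(int v) {
        int u = v ^ 0x80000000; // flip sign bit
        return new byte[] {
            (byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u
        };
    }

    // Compare serialized keys directly on the bytes (what a binary
    // comparator does instead of deserializing both keys).
    public static int compareBytes(byte[] a, byte[] b) {
        for (int i = 0; i < a.length && i < b.length; i++) {
            int x = a[i] & 0xff, y = b[i] & 0xff;
            if (x != y) return x < y ? -1 : 1;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // -3 sorts before 7 without instantiating any Integer objects
        System.out.println(compareBytes(serialize(-3), serialize(7)) < 0); // true
    }
}
```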
[jira] Commented: (PIG-1399) Logical Optimizer: Expression optimizor rule
[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883348#action_12883348 ]

Yan Zhou commented on PIG-1399:
-------------------------------
Other expression optimizations include:

3. Erasure of a logically implied expression in AND
Example: B = filter A by (a0 > 5 and a0 > 7); => B = filter A by a0 > 7;

4. Erasure of a logically implied expression in OR
Example: B = filter A by ((a0 > 5) or (a0 > 6 and a1 > 15)); => B = filter A by a0 > 5;

A comprehensive example of optimizations 2, 3 and 4 together:
B = filter A by NOT((a0 > 1 and a0 > 0) or (a1 > 3 and a0 > 5)); => B = filter A by a0 <= 1;

Logical Optimizer: Expression optimizor rule
--------------------------------------------
Key: PIG-1399
URL: https://issues.apache.org/jira/browse/PIG-1399
Project: Pig
Issue Type: Sub-task
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou

We can optimize expressions in several ways:

1. Constant pre-calculation
Example: B = filter A by a0 > 5+7; => B = filter A by a0 > 12;

2. Boolean expression optimization
Example: B = filter A by not (not(a0 > 5) or a1 > 0); => B = filter A by a0 > 5 and a1 <= 0;
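Optimization 3 (erasure of implied terms in AND) can be illustrated on its simplest case, a conjunction of greater-than predicates on one column, where only the largest constant survives. This toy sketch shows the arithmetic behind the rule, not how the optimizer is implemented (it rewrites logical plan nodes, not arrays of constants):

```java
public class FilterSimplifier {
    // For "col > c1 and col > c2 and ..." over a single column, every
    // term is implied by the one with the largest constant, so the whole
    // conjunction collapses to "col > max(c1, c2, ...)".
    public static int simplifyAndOfGreaterThan(int[] constants) {
        int max = constants[0];
        for (int c : constants) {
            if (c > max) max = c;
        }
        return max;
    }

    public static void main(String[] args) {
        // "a0 > 5 and a0 > 7" collapses to "a0 > 7"
        System.out.println(simplifyAndOfGreaterThan(new int[] {5, 7})); // prints 7
    }
}
```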
[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883354#action_12883354 ]

Richard Ding commented on PIG-1389:
-----------------------------------
It seems there is no good solution for Merge Join and Merge Cogroup in this case. So I'm going to treat them the same way as Replicated Join and not add counters for all side files.
[jira] Created: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
-----------------------------------------------------------------------------------------------------
Key: PIG-1473
URL: https://issues.apache.org/jira/browse/PIG-1473
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Fix For: 0.8.0

The cost of serialization/deserialization (sedes) can be very high, and avoiding it will improve performance. Avoid sedes when possible by implementing approach #3 proposed in http://wiki.apache.org/pig/AvoidingSedes . The load function uses subclasses of Map and DataBag which hold the serialized copy. The load function delays deserialization of map and bag types until a member function of java.util.Map or DataBag is called.

Example of a query where this will help:

{CODE}
l = LOAD 'file1' AS (a : int, b : map [ ]);
f = FOREACH l GENERATE udf1(a), b;
fil = FILTER f BY $0 > 5;
dump fil; -- De-serialization of column b can be delayed until here using this approach.
{CODE}
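A minimal sketch of the lazy-deserialization idea in approach #3: hold the serialized bytes and parse them only when a map operation is first invoked. The "k#v" text format and the class shape are illustrative assumptions, not PigStorage's actual wire format (a real version would subclass java.util.Map so existing code sees an ordinary map):

```java
import java.util.HashMap;
import java.util.Map;

public class LazyMap {
    private final String serialized;        // e.g. "k1#v1,k2#v2"
    private Map<String, String> parsed;     // filled in on first use

    public LazyMap(String serialized) {
        this.serialized = serialized;       // no parsing at load time
    }

    // Deserialize on first access only; rows whose map column is never
    // touched (e.g. filtered out earlier) never pay the parsing cost.
    private Map<String, String> force() {
        if (parsed == null) {
            parsed = new HashMap<String, String>();
            for (String kv : serialized.split(",")) {
                String[] parts = kv.split("#", 2);
                parsed.put(parts[0], parts[1]);
            }
        }
        return parsed;
    }

    public String get(String key) { return force().get(key); }
    public int size()             { return force().size(); }
}
```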
[jira] Assigned: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
[ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair reassigned PIG-1473:
----------------------------------
Assignee: Thejas M Nair
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883361#action_12883361 ]

Daniel Dai commented on PIG-1295:
---------------------------------
Thanks, is the patch ready for review?
Avoiding serialization/de-serialization in pig
I have created a wiki page that puts together some ideas that can help improve performance by avoiding or delaying serialization/de-serialization: http://wiki.apache.org/pig/AvoidingSedes

These are ideas that don't involve changes to the optimizer. Most of them involve changes to the load/store functions. Your feedback is welcome.

Thanks,
Thejas
[jira] Created: (PIG-1474) Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple
Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple
--------------------------------------------------------------------------------
Key: PIG-1474
URL: https://issues.apache.org/jira/browse/PIG-1474
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.8.0

Avoid sedes when possible for data loaded using PigStorage by implementing approach #4 proposed in http://wiki.apache.org/pig/AvoidingSedes . The write() and readFields() functions of the tuple returned by TupleFactory are used to serialize data between Map and Reduce. By using a tuple that knows the serialization format of the loader, we avoid sedes at the Map/Reduce boundary and use the load function's serialized format between Map and Reduce.

To use a new custom tuple for this purpose, a custom TupleFactory that returns tuples of this type has to be specified using the property pig.data.tuple.factory.name.

This approach will work only for a set of load functions in the query that share the same serialization format for maps and bags. If this approach proves to be very useful, it will build a case for a more extensible approach.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1295:
----------------------------
Status: Patch Available (was: Open)
Fix Version/s: 0.8.0
[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1389:
------------------------------
Attachment: PIG-1389_1.patch
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883367#action_12883367 ] Gianmarco De Francisci Morales commented on PIG-1295: - I think it is Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. 
A binary comparator for tuples should be doable once we understand the binary structure of the serialized tuple. We can focus on the most common use case first: a group by followed by a nested sort, in which case we use secondary sort. The semantics of the first key do not matter, but the semantics of the secondary key do. We need to identify the boundary between the main key and the secondary key in the binary tuple buffer without instantiating the tuple itself. Then, if the first keys are equal, we use a binary comparator to compare the secondary keys. The secondary key can also be a complex data type, but as a first step we focus on simple secondary keys, which are the most common use case. We mark this issue as a candidate project for the Google Summer of Code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
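The boundary idea above can be sketched in plain Java. This is only an illustration under a toy assumption: each serialized tuple is two big-endian 4-byte ints (main key, then secondary key). Real Pig tuples use a richer variable-length format, and the class and method names here are hypothetical, not Pig's actual comparator API. The point is that both keys are compared straight out of the byte buffers, with no Tuple object ever instantiated:

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: assumes a toy fixed-width serialization of
// (mainKey:int, secondaryKey:int), both big-endian. Real Pig tuples are
// variable-length; the idea shown is comparing raw bytes without
// deserializing into Tuple objects.
public class RawTupleComparator {

    // Compare two serialized (main, secondary) tuples directly from bytes.
    public static int compare(byte[] b1, byte[] b2) {
        ByteBuffer t1 = ByteBuffer.wrap(b1);
        ByteBuffer t2 = ByteBuffer.wrap(b2);
        // Main key occupies bytes 0..3 in this toy format.
        int main = Integer.compare(t1.getInt(0), t2.getInt(0));
        if (main != 0) {
            return main;                       // main keys differ: done
        }
        // Main keys equal: fall through to the secondary key (bytes 4..7).
        return Integer.compare(t1.getInt(4), t2.getInt(4));
    }

    // Helper to build a serialized tuple in the toy format.
    public static byte[] serialize(int main, int secondary) {
        return ByteBuffer.allocate(8).putInt(main).putInt(secondary).array();
    }
}
```

In the toy format the main/secondary boundary is a fixed offset; the hard part PIG-1295 describes is finding that boundary in Pig's real variable-length tuple encoding.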
[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1389: -- Status: Open (was: Patch Available) Implement Pig counter to track number of rows for each input files --- Key: PIG-1389 URL: https://issues.apache.org/jira/browse/PIG-1389 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1389.patch, PIG-1389.patch An MR job generated by Pig can have not only multiple outputs (in the case of multiquery) but also multiple inputs (in the case of join or cogroup). In both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a given input or output. PIG-1299 addressed the case of multiple outputs. We need to add new counters for jobs with multiple inputs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1389: -- Status: Patch Available (was: Open) Implement Pig counter to track number of rows for each input files --- Key: PIG-1389 URL: https://issues.apache.org/jira/browse/PIG-1389 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch An MR job generated by Pig can have not only multiple outputs (in the case of multiquery) but also multiple inputs (in the case of join or cogroup). In both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a given input or output. PIG-1299 addressed the case of multiple outputs. We need to add new counters for jobs with multiple inputs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
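A minimal sketch of what such per-input counters would track. In a real MR job these would be Hadoop Counters, one per input path; here a plain map stands in for the counter group, and the class and method names are illustrative, not Pig's actual API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: one record counter per input path, as PIG-1389 proposes.
// A plain map stands in for Hadoop's Counter mechanism.
public class InputRecordCounters {
    private final Map<String, Long> counters = new LinkedHashMap<>();

    // Called once per record read from the given input path.
    public void increment(String inputPath) {
        counters.merge(inputPath, 1L, Long::sum);
    }

    // Final count for one input; 0 if no records were read from it.
    public long get(String inputPath) {
        return counters.getOrDefault(inputPath, 0L);
    }
}
```

The key design point from the issue: because a join or cogroup job reads several inputs in one map phase, a single MAP_INPUT_RECORDS total cannot be attributed to any one input, so the counter must be keyed by input.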
Re: Avoiding serialization/de-serialization in pig
For what it's worth, I saw very significant speed improvements (an order of magnitude for wide tables with few projected columns) when I implemented (2) for our protocol-buffer-based loaders. I have a feeling that propagating schemas when known, and using them for (de)serialization instead of reflecting on every field, would also be a big win. Thoughts on just using Avro for the internal PigStorage? -D On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair te...@yahoo-inc.com wrote: I have created a wiki that puts together some ideas that can help improve performance by avoiding/delaying serialization/de-serialization: http://wiki.apache.org/pig/AvoidingSedes These are ideas that don't involve changes to the optimizer. Most of them involve changes in the load/store functions. Your feedback is welcome. Thanks, Thejas
Re: Avoiding serialization/de-serialization in pig
I don't fully understand the repercussions of this, but I like it. We're moving from our VoldemortStorage stuff to Avro and it would be great to pipe Avro all the way through. Russ On Mon, Jun 28, 2010 at 5:51 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: For what it's worth, I saw very significant speed improvements (an order of magnitude for wide tables with few projected columns) when I implemented (2) for our protocol-buffer-based loaders. I have a feeling that propagating schemas when known, and using them for (de)serialization instead of reflecting on every field, would also be a big win. Thoughts on just using Avro for the internal PigStorage? -D On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair te...@yahoo-inc.com wrote: I have created a wiki that puts together some ideas that can help improve performance by avoiding/delaying serialization/de-serialization: http://wiki.apache.org/pig/AvoidingSedes These are ideas that don't involve changes to the optimizer. Most of them involve changes in the load/store functions. Your feedback is welcome. Thanks, Thejas
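The schema-propagation idea from the thread above can be sketched as follows. This is a hypothetical illustration, not Pig's loader code: when the loader knows the schema up front, field conversion is dispatched once per column by declared type, instead of inspecting (or reflecting on) every value at runtime:

```java
// Sketch of schema-driven deserialization: the declared schema decides
// each field's conversion, so no per-value type inspection is needed.
// Class, enum, and method names are illustrative, not Pig's API.
public class SchemaDrivenParser {
    enum FieldType { INT, CHARARRAY }

    // Parse one tab-delimited line into typed fields per the schema.
    public static Object[] parse(String line, FieldType[] schema) {
        String[] raw = line.split("\t");
        Object[] out = new Object[schema.length];
        for (int i = 0; i < schema.length; i++) {
            // The schema, not the value, chooses the conversion.
            out[i] = (schema[i] == FieldType.INT)
                    ? (Object) Integer.valueOf(raw[i])
                    : (Object) raw[i];
        }
        return out;
    }
}
```

A schema-aware format like Avro takes this further by fixing the wire layout from the schema, so fields can be decoded positionally with no delimiters at all.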
[jira] Updated: (PIG-1350) [Zebra] Zebra column names cannot have leading _
[ https://issues.apache.org/jira/browse/PIG-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1350: Fix Version/s: (was: 0.8.0) [Zebra] Zebra column names cannot have leading _ -- Key: PIG-1350 URL: https://issues.apache.org/jira/browse/PIG-1350 Project: Pig Issue Type: Improvement Reporter: Xuefu Zhang Assignee: Xuefu Zhang Attachments: pig-1350.patch, pig-1350.patch Disallowing '_' as the leading character of column names in a Zebra schema is too restrictive; this restriction should be lifted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1120) [zebra] should support using org.apache.hadoop.zebra.pig.TableStorer() if user does not want to specify storage hint
[ https://issues.apache.org/jira/browse/PIG-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1120: Fix Version/s: (was: 0.8.0) [zebra] should support using org.apache.hadoop.zebra.pig.TableStorer() if user does not want to specify storage hint - Key: PIG-1120 URL: https://issues.apache.org/jira/browse/PIG-1120 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Jing Huang If the user doesn't want to specify a storage hint, the current Zebra implementation only supports using org.apache.hadoop.zebra.pig.TableStorer('') (note the empty string in TableStorer('')). We should support the format org.apache.hadoop.zebra.pig.TableStorer(), as we already do for org.apache.hadoop.zebra.pig.TableLoader().

sample pig script:
register /grid/0/dev/hadoopqa/jars/zebra.jar;
a = load '1.txt' as (a:int, b:float, c:long, d:double, e:chararray, f:bytearray, r1(f1:chararray, f2:chararray), m1:map[]);
b = load '2.txt' as (a:int, b:float, c:long, d:double, e:chararray, f:bytearray, r1(f1:chararray, f2:chararray), m1:map[]);
c = join a by a, b by a;
d = foreach c generate a::a, a::b, b::c;
describe d;
dump d;
store d into 'join3' using org.apache.hadoop.zebra.pig.TableStorer('');
-- this will fail:
-- store d into 'join3' using org.apache.hadoop.zebra.pig.TableStorer( );

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1137) [zebra] get* methods of Zebra Map/Reduce APIs need improvements
[ https://issues.apache.org/jira/browse/PIG-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1137: Fix Version/s: (was: 0.8.0) [zebra] get* methods of Zebra Map/Reduce APIs need improvements --- Key: PIG-1137 URL: https://issues.apache.org/jira/browse/PIG-1137 Project: Pig Issue Type: Improvement Affects Versions: 0.6.0 Reporter: Yan Zhou Assignee: Yan Zhou Currently the set* methods take external Zebra objects, namely objects of ZebraStorageHint, ZebraSchema, ZebraSortInfo or ZebraProjection. Correspondingly, the get* methods should return such objects instead of String or Zebra-internal objects like Schema. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1355) [Zebra] Zebra Multiple Outputs should enable application to skip records
[ https://issues.apache.org/jira/browse/PIG-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1355: Fix Version/s: (was: 0.8.0) Description: Applications may not always want to write a record to a table; Zebra should allow applications to skip such records. The Zebra Multiple Outputs interface allows users to stream data to different tables by inspecting the data Tuple. https://issues.apache.org/jira/browse/PIG- So, if ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that record and thus will not write it to any table. However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) will write every record to a table. was: Applications may not always want to write a record to a table; Zebra should allow applications to skip such records. The Zebra Multiple Outputs interface allows users to stream data to different tables by inspecting the data Tuple. https://issues.apache.org/jira/browse/PIG- So, if ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that record and thus will not write it to any table. However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) will write every record to a table. [Zebra] Zebra Multiple Outputs should enable application to skip records - Key: PIG-1355 URL: https://issues.apache.org/jira/browse/PIG-1355 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.8.0 Reporter: Gaurav Jain Assignee: Gaurav Jain Priority: Minor Applications may not always want to write a record to a table; Zebra should allow applications to skip such records. The Zebra Multiple Outputs interface allows users to stream data to different tables by inspecting the data Tuple. https://issues.apache.org/jira/browse/PIG- So, if ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that record and thus will not write it to any table. However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) will write every record to a table. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
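The proposed -1-to-skip contract can be sketched like this. The interface and names are illustrative stand-ins, not Zebra's actual ZebraOutputPartition API: the partitioner returns the index of the destination table, and -1 drops the record entirely:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of PIG-1355's skip semantics: a partitioner picks the output
// table per record, and -1 means the record is written nowhere.
public class SkippingWriter {
    // Stand-in for Zebra's output partitioner contract.
    interface OutputPartitioner {
        int partitionOf(String record);   // table index, or -1 to skip
    }

    // Route each record to its table's bucket, honoring -1 as "skip".
    public static List<String>[] route(List<String> records, int numTables,
                                       OutputPartitioner p) {
        @SuppressWarnings("unchecked")
        List<String>[] tables = new List[numTables];
        for (int i = 0; i < numTables; i++) {
            tables[i] = new ArrayList<>();
        }
        for (String r : records) {
            int idx = p.partitionOf(r);
            if (idx == -1) {
                continue;                 // skipped: not written to any table
            }
            tables[idx].add(r);
        }
        return tables;
    }
}
```

The issue's complaint is that BasicTableOutputFormat has no equivalent of the `continue` branch: every record must land in some table.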
[jira] Updated: (PIG-1411) [Zebra] Can Zebra use HAR to reduce file/block count for namenode
[ https://issues.apache.org/jira/browse/PIG-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1411: Fix Version/s: (was: 0.8.0) Description: Due to its column group structure, Zebra can create extra files for the namenode to track, which means the namenode uses more memory for Zebra-related files. The goal is to reduce the number of files/blocks. The idea, among various options, is to use HAR (Hadoop Archive). Hadoop Archive reduces the block and file count by copying data from small files (1M, 2M, ...) into an hdfs-block of larger size, thus reducing the total number of blocks and files. was: Due to its column group structure, Zebra can create extra files for the namenode to track, which means the namenode uses more memory for Zebra-related files. The goal is to reduce the number of files/blocks. The idea, among various options, is to use HAR (Hadoop Archive). Hadoop Archive reduces the block and file count by copying data from small files (1M, 2M, ...) into an hdfs-block of larger size, thus reducing the total number of blocks and files. [Zebra] Can Zebra use HAR to reduce file/block count for namenode - Key: PIG-1411 URL: https://issues.apache.org/jira/browse/PIG-1411 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.8.0 Reporter: Gaurav Jain Assignee: Gaurav Jain Priority: Minor Due to its column group structure, Zebra can create extra files for the namenode to track, which means the namenode uses more memory for Zebra-related files. The goal is to reduce the number of files/blocks. The idea, among various options, is to use HAR (Hadoop Archive). Hadoop Archive reduces the block and file count by copying data from small files (1M, 2M, ...) into an hdfs-block of larger size, thus reducing the total number of blocks and files. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc
[ https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1337: Fix Version/s: (was: 0.8.0) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc -- Key: PIG-1337 URL: https://issues.apache.org/jira/browse/PIG-1337 Project: Pig Issue Type: Improvement Affects Versions: 0.6.0 Reporter: Chao Wang The Zebra storage layer needs to use the distributed cache to reduce namenode load during job runs. To do this, Zebra needs to set up distributed-cache-related configuration information in TableLoader (which extends Pig's LoadFunc). It is doing this within getSchema(conf). The problem is that the conf object here is not the one that is serialized to the map/reduce backend, so the distributed cache is not set up properly. To work around this problem, Pig's LoadFunc needs to provide a way to set up distributed cache information in a conf object that is the one used by the map/reduce backend. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
[ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883382#action_12883382 ] Jeff Zhang commented on PIG-1473: - This sounds like the lazy deserialization in Hive. Great! Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation - Key: PIG-1473 URL: https://issues.apache.org/jira/browse/PIG-1473 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 The cost of serialization/deserialization (sedes) can be very high, and avoiding it will improve performance. Avoid sedes when possible by implementing approach #3 proposed in http://wiki.apache.org/pig/AvoidingSedes . The load function uses subclasses of Map and DataBag which hold the serialized copy. The load function delays deserialization of map and bag types until a member function of java.util.Map or DataBag is called. Example of a query where this will help:
{CODE}
l = LOAD 'file1' AS (a : int, b : map [ ]);
f = FOREACH l GENERATE udf1(a), b;
fil = FILTER f BY $0 > 5;
dump fil; -- Serialization of column b can be delayed until here using this approach.
{CODE}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
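A minimal sketch of approach #3: a Map that holds its serialized form and parses it only when a java.util.Map member function is first called. The `key#value,key#value` wire format mimics PigStorage's map notation, but the class itself is purely illustrative, not Pig's actual implementation:

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of a lazily-deserialized map: construction is just a string
// reference; parsing happens on the first real Map access.
public class LazyMap extends AbstractMap<String, String> {
    private final String serialized;          // PigStorage-style "k#v,k#v"
    private Map<String, String> parsed;       // null until first access

    public LazyMap(String serialized) {
        this.serialized = serialized;
    }

    // Deserialize on demand, exactly once.
    private Map<String, String> force() {
        if (parsed == null) {
            parsed = new HashMap<>();
            for (String kv : serialized.split(",")) {
                String[] pair = kv.split("#", 2);
                parsed.put(pair[0], pair[1]);
            }
        }
        return parsed;
    }

    // All AbstractMap operations (get, size, ...) route through entrySet.
    @Override
    public Set<Entry<String, String>> entrySet() {
        return force().entrySet();
    }

    // Exposed for illustration: has deserialization happened yet?
    public boolean isParsed() {
        return parsed != null;
    }
}
```

In the query above, rows filtered out by `$0 > 5` would carry their map column `b` through the pipeline as an unparsed string and never pay the deserialization cost.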