[jira] Commented: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order
[ https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887178#action_12887178 ]

Gianmarco De Francisci Morales commented on PIG-1468:

It is quite easy to fix DataType.compare() to take the unsigned logic into account. But I am starting to feel that all of this is probably not worth the trouble: it would make DataType.compare() for bytes behave differently from Byte.compareTo().

DataByteArray.compareTo() does not compare in lexicographic order

Key: PIG-1468
URL: https://issues.apache.org/jira/browse/PIG-1468
Project: Pig
Issue Type: Bug
Reporter: Gianmarco De Francisci Morales
Assignee: Gianmarco De Francisci Morales
Attachments: PIG-1468.patch

The compareTo() method of org.apache.pig.data.DataByteArray does not compare items in lexicographic order. Instead, it takes into account the sign of the bytes that compose the DataByteArray. So, for example, 0xff compares as less than 0x00.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
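The difference between the signed behavior reported in PIG-1468 and a lexicographic (unsigned) order can be illustrated with a small standalone sketch. This is not the code from PIG-1468.patch; the class and method names below are hypothetical and exist only for this example.

```java
final class UnsignedBytesSketch {
    /** Signed comparison: Java bytes >= 0x80 are negative, so 0xff sorts before 0x00. */
    public static int compareSigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return a[i] < b[i] ? -1 : 1;
            }
        }
        // On a common prefix, the shorter array sorts first.
        return Integer.compare(a.length, b.length);
    }

    /** Lexicographic comparison: each byte is widened to 0..255 before comparing. */
    public static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] high = {(byte) 0xff};
        byte[] low = {0x00};
        System.out.println("signed:   " + compareSigned(high, low));
        System.out.println("unsigned: " + compareUnsigned(high, low));
    }
}
```

The only difference between the two methods is the `& 0xff` mask, which is also why an unsigned variant would disagree with Byte.compareTo(), as the comment points out.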
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1295:
    Attachment: PIG-1295_0.8.patch

I added code so that, if the user does not use DefaultTuple, we fall back to the default deserialization case. I assume a user-defined tuple will have a DataType byte different from DataType.TUPLE. If this is not the case, I see no way of distinguishing DefaultTuple from any other Tuple implementation. In any case, I think this issue needs to be properly addressed in the context of [PIG-1472|https://issues.apache.org/jira/browse/PIG-1472].

I added support for BIGCHARARRAY. UTF-8 decoding is quite convoluted: it is a variable-length encoding, so we cannot avoid using a String. [UTF-8|http://en.wikipedia.org/wiki/UTF-8]

Before tackling the integration with [PIG-1472|https://issues.apache.org/jira/browse/PIG-1472] I need to familiarize myself with the code in the patch. I will write a proposal for the integration in the next few days.

I also made some changes to DataByteArray in order to encapsulate the comparison logic in a publicly accessible method. This way the raw comparison is consistent with the behavior of the class, similar to the other cases where I delegate comparison to the class.

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
Fix For: 0.8.0
Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch

When the Hadoop framework does the sorting, it tries to use the binary version of the comparator if one is available. The benefit of a binary comparator is that we do not need to instantiate the object before we compare. We see a ~30% speedup after switching to a binary comparator.

Currently, Pig uses a binary comparator in the following cases:
1. When the semantics of the order do not matter. For example, in distinct, we need to sort in order to filter out duplicate values; however, we do not care how the comparator sorts the keys. Group-by shares this characteristic. In this case, we rely on Hadoop's default binary comparator.
2. When the semantics of the order matter, but the key is of a simple type. In this case, we have implementations for simple types, such as integer, long, float, chararray, databytearray, string.

However, if the key is a tuple and the sort semantics matter, we do not have a binary comparator implementation. This especially matters when we switch to secondary sort. In secondary sort, we convert the inner sort of a nested foreach into the secondary key and rely on Hadoop to sort on both the main key and the secondary key. The sorting key becomes a two-item tuple. Since the secondary key is the sorting key of the nested foreach, the sorting semantics matter. It turns out that we do not have a binary comparator once we use secondary sort, and we see a significant slowdown.

A binary comparator for tuples should be doable once we understand the binary structure of the serialized tuple. We can focus on the most common use case first, which is a group-by followed by a nested sort. In this case, we will use secondary sort. The semantics of the first key do not matter, but the semantics of the secondary key do. We need to identify the boundary between the main key and the secondary key in the binary tuple buffer without instantiating the tuple itself. Then, if the first keys are equal, we use a binary comparator to compare the secondary keys. The secondary key can also be a complex data type, but as a first step we focus on simple secondary keys, which is the most common use case.

We mark this issue as a candidate project for the Google Summer of Code 2010 program.
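The idea in the issue description, finding the boundary between the main key and the secondary key in the serialized buffer and comparing without instantiating a tuple, can be sketched as follows. The byte layout used here ([4-byte length][main key bytes][4-byte int secondary key]) is an assumption made purely for illustration; it is not Pig's actual tuple serialization format, and the class name is hypothetical.

```java
import java.nio.ByteBuffer;

final class SecondarySortRawCompareSketch {
    /** Hypothetical layout: [int mainKeyLength][main key bytes][int secondaryKey]. */
    public static byte[] encode(byte[] mainKey, int secondaryKey) {
        return ByteBuffer.allocate(4 + mainKey.length + 4)
                .putInt(mainKey.length)
                .put(mainKey)
                .putInt(secondaryKey)
                .array();
    }

    /** Compares two serialized keys directly on their bytes, without building objects. */
    public static int compare(byte[] b1, byte[] b2) {
        ByteBuffer k1 = ByteBuffer.wrap(b1);
        ByteBuffer k2 = ByteBuffer.wrap(b2);
        int len1 = k1.getInt(0);
        int len2 = k2.getInt(0);

        // Main key: order semantics do not matter, so any consistent byte order will do.
        int n = Math.min(len1, len2);
        for (int i = 0; i < n; i++) {
            int x = b1[4 + i] & 0xff;
            int y = b2[4 + i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        if (len1 != len2) {
            return len1 < len2 ? -1 : 1;
        }

        // Secondary key: order semantics matter, so decode it as a signed int.
        // The length prefix tells us where the main key ends.
        return Integer.compare(k1.getInt(4 + len1), k2.getInt(4 + len2));
    }

    public static void main(String[] args) {
        byte[] a = encode(new byte[]{'k'}, -5);
        byte[] b = encode(new byte[]{'k'}, 3);
        System.out.println(compare(a, b));
    }
}
```

Note that comparing the secondary key byte by byte would place -5 after 3, because the two's-complement encoding of a negative int starts with 0xff; this is the sense in which the sort semantics of the secondary key matter.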
[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs
[ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887265#action_12887265 ]

Daniel Dai commented on PIG-1472:

Patch looks good. A couple of comments:
1. The following code is never used in BinStorage and InterStorage and should be removed.
{code}
public static final int RECORD_1 = 0x01;
public static final int RECORD_2 = 0x02;
public static final int RECORD_3 = 0x03;
{code}
2. In BinInterSedes, why do we have the type GENERIC_WRITABLECOMPARABLE? When will it be used?
3. It seems InterStorage is a replacement for BinStorage, so why do we make it private? Shall we encourage users to use InterStorage in place of BinStorage, and deprecate BinStorage?

Optimize serialization/deserialization between Map and Reduce and between MR jobs

Key: PIG-1472
URL: https://issues.apache.org/jira/browse/PIG-1472
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.8.0
Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch

In certain types of Pig queries, most of the execution time is spent serializing/deserializing (sedes) records between Map and Reduce and between MR jobs. For example, if PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, L10 in PigMix v1) that have records with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime increase of a few times has been seen).

There are a few optimizations that have been shown to improve the performance of sedes in my tests:
1. Use a smaller number of bytes to store the length of the column. For example, if a bytearray is smaller than 255 bytes, a byte can be used to store the length instead of the integer that is currently used.
2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than half.

Zebra and BinStorage are known to use the DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format is going to be different from the format used between M/R boundaries.
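The two optimizations listed in the PIG-1472 description can be sketched roughly as below. This is a minimal illustration, not the code from the PIG-1472 patches: the class name, the tag constants TINY_BYTES and BYTES, and their numeric values are hypothetical. DataOutput.writeUTF and DataInput.readUTF are the standard java.io methods named in the issue.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class CompactSedesSketch {
    // Hypothetical type tags: they record how many bytes the length field occupies.
    static final byte TINY_BYTES = 1; // length stored in 1 byte
    static final byte BYTES = 2;      // length stored in a 4-byte int

    static void writeBytes(DataOutput out, byte[] value) throws IOException {
        if (value.length < 255) {     // threshold taken from the issue text
            out.writeByte(TINY_BYTES);
            out.writeByte(value.length);
        } else {
            out.writeByte(BYTES);
            out.writeInt(value.length);
        }
        out.write(value);
    }

    static byte[] readBytes(DataInput in) throws IOException {
        byte tag = in.readByte();
        int length;
        if (tag == TINY_BYTES) {
            length = in.readUnsignedByte();
        } else if (tag == BYTES) {
            length = in.readInt();
        } else {
            throw new IOException("Unknown tag: " + tag);
        }
        byte[] value = new byte[length];
        in.readFully(value);
        return value;
    }

    public static byte[] serializeBytes(byte[] value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeBytes(new DataOutputStream(buf), value);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static byte[] deserializeBytes(byte[] data) {
        try {
            return readBytes(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Strings via the built-in writeUTF (2-byte length prefix, modified UTF-8). */
    public static byte[] serializeString(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static String deserializeString(byte[] data) {
        try {
            return new DataInputStream(new ByteArrayInputStream(data)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializeBytes(new byte[10]).length);
        System.out.println(serializeBytes(new byte[300]).length);
        System.out.println(deserializeString(serializeString("pig")));
    }
}
```

One caveat with this approach: writeUTF stores the encoded length in two bytes, so a string whose encoded form exceeds 65535 bytes cannot be written this way and needs a separate path.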
[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs
[ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887267#action_12887267 ]

Daniel Dai commented on PIG-1472:

Forget point 2: GENERIC_WRITABLECOMPARABLE is also in DataReaderWriter, so we just follow it.
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887270#action_12887270 ]

Daniel Dai commented on PIG-1295:

With the change in PIG-1472, we need to change the raw comparator accordingly:
1. Bag comparison should be changed to compare TINYBAG/SMALLBAG/BAG
2. Tuple comparison should be changed to compare TINYTUPLE/SMALLTUPLE/TUPLE
3. Map comparison should be changed to compare TINYMAP/SMALLMAP/MAP
4. Integer comparison should be changed to compare INTEGER_0/INTEGER_1/INTEGER_INBYTE/INTEGER_INSHORT/INTEGER
5. ByteArray comparison should be changed to compare TINYBYTEARRAY/SMALLBYTEARRAY/BYTEARRAY
6. Chararray comparison should be changed to compare SMALLCHARARRAY/CHARARRAY
7. The raw comparator now depends on the serialization format. We now have two serialization formats, DefaultTuple and BinSedesTuple. It is better to move PigTupleRawComparatorNew inside BinSedesTuple. In this project, however, we only focus on BinSedesTuple, which addresses most use cases.

In the integration code, we should check whether the TupleFactory is actually a BinSedesTupleFactory. If it is, use this raw comparator; otherwise, use the original comparator.

I was wrong about the customized tuple: we do not need a fallback scheme for it. In the serialized format, all Tuples, including customized Tuples, are serialized into the same format.

It looks like UTF-8 encoding is convoluted, so we can leave it for now.
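Daniel Dai's point 4 above, that integer comparison must handle INTEGER_0/INTEGER_1/INTEGER_INBYTE/INTEGER_INSHORT/INTEGER, means a raw comparator can no longer assume a fixed-width int: it has to read the type tag first and decode the value accordingly. The sketch below illustrates this under stated assumptions. The tag names come from the comment, but their numeric values, the exact byte layout, and the class name are hypothetical and do not reflect the real BinInterSedes encoding.

```java
import java.nio.ByteBuffer;

final class IntTagCompareSketch {
    // Tag names from the comment; the numeric values are assumptions for this example.
    static final byte INTEGER_0 = 0;
    static final byte INTEGER_1 = 1;
    static final byte INTEGER_INBYTE = 2;
    static final byte INTEGER_INSHORT = 3;
    static final byte INTEGER = 4;

    /** Encodes an int using the smallest representation that can hold it. */
    public static byte[] encode(int v) {
        if (v == 0) {
            return new byte[]{INTEGER_0};
        }
        if (v == 1) {
            return new byte[]{INTEGER_1};
        }
        if (v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) {
            return new byte[]{INTEGER_INBYTE, (byte) v};
        }
        if (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) {
            return ByteBuffer.allocate(3).put(INTEGER_INSHORT).putShort((short) v).array();
        }
        return ByteBuffer.allocate(5).put(INTEGER).putInt(v).array();
    }

    /** Reads the int that starts at the given offset, dispatching on its tag byte. */
    static int decode(byte[] buf, int offset) {
        switch (buf[offset]) {
            case INTEGER_0:
                return 0;
            case INTEGER_1:
                return 1;
            case INTEGER_INBYTE:
                return buf[offset + 1];
            case INTEGER_INSHORT:
                return ByteBuffer.wrap(buf).getShort(offset + 1);
            case INTEGER:
                return ByteBuffer.wrap(buf).getInt(offset + 1);
            default:
                throw new IllegalArgumentException("Not an integer tag: " + buf[offset]);
        }
    }

    /** Raw comparison of two serialized ints that may use different encodings. */
    public static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        return Integer.compare(decode(b1, s1), decode(b2, s2));
    }

    public static void main(String[] args) {
        System.out.println(compare(encode(1), 0, encode(300), 0));
        System.out.println(compare(encode(70000), 0, encode(100), 0));
    }
}
```

The same pattern would apply to the TINY/SMALL variants of bags, tuples, maps, and byte arrays: the tag decides how wide the size field is, and the comparator has to skip that many bytes before it reaches the payload.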
[jira] Created: (PIG-1493) Column Pruner throw exception inconsistent pruning
Column Pruner throw exception inconsistent pruning

Key: PIG-1493
URL: https://issues.apache.org/jira/browse/PIG-1493
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.8.0, 0.7.0

The following script fails:
{code}
a = load '1.txt' as (a0:chararray, a1:chararray, a2);
b = foreach a generate CONCAT(a0,a1) as b0, a0, a2;
c = foreach b generate a0, a2;
dump c;
{code}

Error message:
ERROR 2185: Column $0 of (Name: b: ForEach 1-50 Operator Key: 1-50) inconsistent pruning
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias c
	at org.apache.pig.PigServer.openIterator(PigServer.java:698)
	at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:595)
	at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:291)
	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
	at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
	at org.apache.pig.Main.run(Main.java:451)
	at org.apache.pig.Main.main(Main.java:103)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias c
	at org.apache.pig.PigServer.storeEx(PigServer.java:804)
	at org.apache.pig.PigServer.store(PigServer.java:760)
	at org.apache.pig.PigServer.openIterator(PigServer.java:680)
	... 7 more
Caused by: org.apache.pig.impl.plan.optimizer.OptimizerException: ERROR 2212: Unable to prune plan
	at org.apache.pig.impl.logicalLayer.optimizer.PruneColumns.prune(PruneColumns.java:826)
	at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:240)
	at org.apache.pig.PigServer.compileLp(PigServer.java:1180)
	at org.apache.pig.PigServer.storeEx(PigServer.java:799)
	... 9 more
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 2188: Cannot prune columns for (Name: b: ForEach 1-50 Operator Key: 1-50)
	at org.apache.pig.impl.logicalLayer.ColumnPruner.prune(ColumnPruner.java:177)
	at org.apache.pig.impl.logicalLayer.ColumnPruner.visit(ColumnPruner.java:202)
	at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:132)
	at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:47)
	at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
	at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
	at org.apache.pig.impl.logicalLayer.optimizer.PruneColumns.prune(PruneColumns.java:821)
	... 12 more
Caused by: org.apache.pig.impl.plan.optimizer.OptimizerException: ERROR 2185: Column $0 of (Name: b: ForEach 1-50 Operator Key: 1-50) inconsistent pruning
	at org.apache.pig.impl.logicalLayer.ColumnPruner.prune(ColumnPruner.java:148)
[jira] Updated: (PIG-1493) Column Pruner throw exception inconsistent pruning
[ https://issues.apache.org/jira/browse/PIG-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1493:
    Attachment: PIG-1493-1.patch
[jira] Updated: (PIG-1493) Column Pruner throw exception inconsistent pruning
[ https://issues.apache.org/jira/browse/PIG-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1493:
    Status: Patch Available (was: Open)