[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1295: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed All unit test pass. test-patch: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 8 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] -1 javac. The applied patch generated 156 javac compiler warnings (more than the trunk's current 145 warnings). [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 414 release audit warnings (more than the trunk's current 410 warnings). javac warnings is all about deprecate. For release audit warnings, I checked every new file have license header. Patch committed. Thanks Gianmarco! Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.16.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1295: Attachment: PIG-1295_0.16.patch Discuss with Alan, we agree to put getRawComparator into TupleFactory. Attach PIG-1295_0.16.patch. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.16.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1295: Status: Open (was: Patch Available) Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.16.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1295: Status: Patch Available (was: Open) Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.16.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1295: Attachment: PIG-1295_0.14.patch I reviewed and regenerate the patch. Couple of notes: 1. All unit test and end-to-end test pass, hudson warning are addressed 2. See consistent performance improvement (around 20%) in pigmix query L16 (using 10 reducers, on a cluster of 10 nodes) 3. Did some refactory, change some class names and move some code around, move getRawComparatorClass to Tuple instead of TupleFactory Gianmarco, can you take a look if my changes are good? Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1295: Attachment: PIG-1295_0.15.patch Attach another patch to address Thejas's first point. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1295: Status: Patch Available (was: Open) Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1295: Status: Open (was: Patch Available) Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1295: Attachment: PIG-1295_0.13.patch Fixed issues for PIG-927 as planned. Fixed a couple of bugs that came up with testing. Aligned the code path for the deserialization version. Ran full test suite, all tests pass. TODO: Align code for DefaultTuple (low priority). Speed tests. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1295: Attachment: PIG-1295_0.12.patch Ok, first working integration. Modified PigTupleRawComparatorNew to use the raw comparators via TupleFactory. Created a new class PigSecondaryKeyComparatorNew that should substitute the old one. This one uses the raw comparators. Modified JobControlCompiler to use the new comparators. Moved the null/index semantic outside the raw comparators and inside the wrappers. Modified BinSedesTupleComparator to correctly handle sort order. The sort order is applied to the first call to compare tuples. In case we are doing a secondary sort, the sort orders are propagated 1 level more (because we have a nested tuple with the keys, and we need to apply the sort orders to the content of the outermost tuple). The code is not the cleanest possible but TestPigTupleRawComparator and TestSecondarySort pass. TODO: Implement the logic for PIG-927. I plan to create a new interface (TupleRawComparator) and add a method to check if during the comparison a field of type NULL was encountered. This interface will be used instead of the simple RawComparator to hold the reference to our raw comparators. Write speed test. Is there something already made that can be used to test the speed improvement? The inputs for the unit test are of course too small. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1295: Attachment: PIG-1295_0.11.patch Here my first stab at integration. I split the class in two classes and put them into DefaultTuple and BinInterSedes. This asymmetry is not nice but the serialization code is in different places for different tuples, and I wanted to keep the comparison code as close as possible to the serialization code. I modified the TupleFactory interface adding a method to get the tuple comparator class. This method is implemented in BinSedesTupleFactory and overridden in DefaultTupleFactory. BinSedesTupleFactory delegates it to a package method in BinInterSedes (where the actual code is). DefaultTupleFactoru delegates it to DefaultTuple (where the actual code is). The actual code for both comparators is just a cut/paste of the methods in PigTupleRawComparatorNew, I just adjusted a bit the entry points. I left the new comparator and tests untouched for now. Please let me have any comments you may have. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1295: Attachment: PIG-1295_0.9.patch Implemented a new compareBinInterSedesTuple() method. This new method works with data serialized with [PIG-1472] format. It relies on DataType for data type comparison by translating the data types. I implemented some logic to read unsinged ints from bytes and shorts because I am using ByteBuffer that does not implement the DataInput interface. We might want to change that later. For now complex data types cause the whole tuple to be deserialized, but I have put some placeholders for methods to deserialize single complex objects and continue the normal execution flow. I plan to fill them in when I move the code inside BinInterSedes so that I can use the methods to read complex data types (readBag, readMap, etc..). This will involve some juggling around between DataInputBuffer and ByteBuffer (this is why we might consider to switch out ByteBuffer) to get the cursors consistent among them. I think the next step would be splitting the comparator and putting it into the serialization class (BinInterSedes). I don't know yet how to modify the class interface to make the comparator class available outside (but I think this relates to phase 2). I see no good place to put the class that implements the comparator for DefaultTuple, and Daniel said it is a minor case, so should I just throw away the code for DefaultTuple? Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1295: Attachment: PIG-1295_0.8.patch I added the code for if the user does not use DefaultTuple we fall back to the default deserialization case. I assume the user defined tuple will have a different DataType byte from DataType.TUPLE. If this is not the case, I see no way of discerning DefaultTuple from any other Tuple implementation. Anyway, I think this issue needs to be properly addressed in the context of [PIG-1472|https://issues.apache.org/jira/browse/PIG-1472]. I added support for BIGCHARARRAY. UTF-8 decoding is quite convoluted. It is a variable length encoding, so we cannot avoid using a String. [UTF-8|http://en.wikipedia.org/wiki/UTF-8] Before tackling the integration with [PIG-1472|https://issues.apache.org/jira/browse/PIG-1472] I need to familiarize with the code in the patch. I will write a proposal for the integration in the next days. I also made some changes to DataByteArray in order to encapsulate the logic for comparison in a publicly accessible method. This way the raw comparison is consistent with the behavior of the class, in a way similar to the other cases where I delegate comparison to the class. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1295: Attachment: PIG-1295_0.7.patch Addressed a small bug and added ASF license to source files Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1295: Attachment: PIG-1295_0.6.patch Ok, if the user does not use DefaultTuple we fall back to the default deserialization case. I added handling of nested tuples via recursion and appropriate unit tests. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1295: Status: Patch Available (was: Open) Fix Version/s: 0.8.0 Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1295: Attachment: PIG-1295_0.5.patch Added support for BYTEARRAY and CHARARRAY data types. Added fallback behaviour for unknown data types. Added support for non-raw comparison. I changed to public the visibility of a the charset field in DataType, because I need to know which encoding is being used for strings. I am testing the patch locally before submitting it. Given that performance tests look good, I think the next step would be to handle tuples different from DefaultTuple. This requires some changes to the infrastructure. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1295: Attachment: PIG-1295_0.4.patch I modified the main in order to include your code snippet. Thanks very much for the suggestion! I reduced the number of tuples in order to avoid OutOfMemory exceptions on my machine (only 1 GiB of RAM). I see the same numbers you reported: 6.5 times improve in speed only for the compare() method. I took care of the tabs issue. I will add support for ByteArrays and CharArrays next, together with fallback behaviour for complex datatypes. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1295: Attachment: PIG-1295_0.3.patch Sure, here it is the revised file. I left the random testing for unit testing purposes, and I moved the performance testing to a separate main method. This makes profiling *much* easier. FYI, for profiling I am using a mixture of hprof (the java internal profiler) and jrat (http://jrat.sourceforge.net). I can prepare a more detailed report of where most of the time is consumed, if needed. Roughly, 80% of the test time is spent in the org.apache.pig.data package to write tuples (DefaultTuple.write() , DataReaderWriter.writeDatum() , DataType.findType() are the most expensive methods). From targeted profiling I have seen that the raw version of the compare method is around 3x faster than the old one. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1295: Attachment: PIG-1295_0.2.patch I added some simple performance tests. The tests generate 1 million tuples modifying a prototypical tuple and compare them to the prototype. One test uses the new comparator and the other uses the old one. I generate exactly the same tuples using a fixed seed. I also check the correctness of the comparisons using the normal compareTo() method of the tuples. The logic to generate the tuples is a bit involved: I tried to exercise all the datatype comparisons in a uniform manner, so I mutate less the first elements of the tuple, in order to have more probability of getting the comparison further down the tuple. The probabilities are totally made up and do not make much sense. As a first approximation, I see a slight overall speedup in the test. I will do some profiling to see which margins of improvement we have. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1295: Attachment: PIG-1295_0.1.patch Here is my frist dratf for the binary comparator, together with some simple tests. This comparator should replace PigTupleRawComparator. I assume that the tuple we are comparing is a NullableTuple that wraps a DefaultTuple. There is a comment in the old comparator that says: {code} // Users are allowed to implement their own versions of tuples, which means we have no // idea what the underlying representation is. {code} This will need to be addressed. If I can know that the tuple is not a DefaultTuple looking at its data type byte, I can switch back to object deserialization. This means that the user must not use the same DataType.TUPLE as the default tuple. The comparator iterates over the serialized tuples and compares them following the same logic as the old comparator (first check if Null, then their sizes, then compare field by field, if the fields are of different types compare the datatypes, else compare the values). For now I implemented the logic only for simple types, I plan to add bytearrays and chararrays. I do not plan to add support for complex datatypes. In this case I will fall back to object deserialization. The implementation uses a ByteBuffer to easily iterate over the values. I don't know if it is acceptable or its performance impact but it is very handy. The alternative is manual bookkeeping of a cursor inside the byte arrays. To do this I would probably use a map that tells me how many bytes each data type uses. This map should probably go somewhere else though (DataType? DataReaderWriter?). I commented out an initial attempt for this at the end of the class. An alternative implementation could be to follow the strategy of the others PigRawComparators, that wrap a comparator from Hadoop. This would require keeping a number of comparators in memory, though. I commented out this approach at the beginning of the class. TODO: Implement fallback behaviour. Implement support for bytearrays and chararrays. Graceful handling of tuples different from DefaultTuple. Add performance tests. Any comment is more than welcome. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Attachments: PIG-1295_0.1.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to