[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1295: Attachment: PIG-1295_0.14.patch I reviewed and regenerate the patch. Couple of notes: 1. All unit test and end-to-end test pass, hudson warning are addressed 2. See consistent performance improvement (around 20%) in pigmix query L16 (using 10 reducers, on a cluster of 10 nodes) 3. Did some refactory, change some class names and move some code around, move getRawComparatorClass to Tuple instead of TupleFactory Gianmarco, can you take a look if my changes are good? Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1442) java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)
[ https://issues.apache.org/jira/browse/PIG-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair resolved PIG-1442. Resolution: Duplicate java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766) --- Key: PIG-1442 URL: https://issues.apache.org/jira/browse/PIG-1442 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0, 0.7.0 Environment: Apache-Hadoop 0.20.2 + Pig 0.7.0 and also for 0.8.0-dev (18/may) Hadoop-0.18.3 (cloudera RPMs) + PIG 0.2.0 Reporter: Dirk Schmid Assignee: Thejas M Nair Fix For: 0.8.0 As mentioned by Ashutosh this is a reopen of https://issues.apache.org/jira/browse/PIG-766 because there is still a problem which causes that PIG scales only by memory. For convenience here comes the last entry of the PIG-766-Jira-Ticket: {quote}1. Are you getting the exact same stack trace as mentioned in the jira?{quote} Yes the same and some similar traces: {noformat} java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2786) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94) at java.io.DataOutputStream.write(DataOutputStream.java:90) at java.io.FilterOutputStream.write(FilterOutputStream.java:80) at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:279) at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264) at org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:249) at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:214) at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264) at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:209) at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264) at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123) at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90) at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77) at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:179) at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:880) at org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1201) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:199) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176) at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501) java.lang.OutOfMemoryError: Java heap space at org.apache.pig.data.DefaultTuple.(DefaultTuple.java:58) at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35) at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:61) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136) at org.apache.pig.data.DefaultAbstractBag.readFields(DefaultAbstractBag.java:263) at org.apache.pig.data.DataReaderWriter.bytesToBag(DataReaderWriter.java:71) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:145) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136) at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:63) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136) at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:284) at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114) at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67) at
[jira] Commented: (PIG-1541) FR Join shouldn't match null values
[ https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897866#action_12897866 ] Richard Ding commented on PIG-1541: --- Results of test-patch: {code} [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to i [exec] nclude 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. {code} FR Join shouldn't match null values --- Key: PIG-1541 URL: https://issues.apache.org/jira/browse/PIG-1541 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1541.patch Here is an example: Data input: {code} 1 1 2 {code} the script {code} a = load 'input'; b = load 'input'; c = join a by $0, b by $0 using 'repl'; dump c; {code} generates results that matches null values: {code} (1,1,1,1) (,2,,2) {code} The regular join, on the other hand, gives the correct results. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1486) update ant eclipse-files target to include new jar and remove contrib dirs from build path
[ https://issues.apache.org/jira/browse/PIG-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1486: --- Attachment: PIG-1486.1.patch Updated patch which includes guava and jython. It might need more changes after PIG-1452 is committed. update ant eclipse-files target to include new jar and remove contrib dirs from build path -- Key: PIG-1486 URL: https://issues.apache.org/jira/browse/PIG-1486 Project: Pig Issue Type: Bug Components: tools Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Priority: Minor Fix For: 0.8.0 Attachments: PIG-1486.1.patch, PIG-1486.patch .eclipse.templates/.classpath needs to be updated to address following - 1. There is a new jar that is used by the code - guava-r03.jar 2. The jar ANT_HOME/lib/ant.jar gives an 'unbounded jar' error in eclipse. 3. Removing the contrib projects from class path as discussed in PIG-1390, until all libs necessary for the contribs are included in classpath. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897887#action_12897887 ] Yan Zhou commented on PIG-1518: --- During the merge process, any empty splits will be skipped. Currently empty splits will be generated on empty files, which is not necessary at the first place. multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 We frequently run in the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file which could be very inefficient. It would be greate to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing similar thing: MultifileInputFormat as well as CombinedInputFormat; howevere, neither works with ne Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897922#action_12897922 ] Thejas M Nair commented on PIG-1295: Comments about PIG-1295_0.14.patch - - The comparison logic for BinInterSedes relies on the serliazation format, so it think its better to have it closer to where the serialization format is implemented. Ie add a function to InterSedes interface (getComparator() ?) , and move the implementation logic to BinInterSedes class. - I think TupleFactory is a better place for getRawComparatorClass() for the following reasons- -- TupleFactory is a singleton class, Tuple is not. Having it in Tuple implies that you can have different values returned by different instances. -- Adding it to Tuple interface breaks backward compatibility, all Tuple implementations will need to add this function. Also, does not make sense for load functions that return a custom tuple to implement this method, because it is not related to that tuple implementation. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1543) IsEmpty returns the wrong value after using LIMIT
IsEmpty returns the wrong value after using LIMIT - Key: PIG-1543 URL: https://issues.apache.org/jira/browse/PIG-1543 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Justin Hu 1. Two input files: 1a: limit_empty.input_a 1 1 1 1b: limit_empty.input_b 2 2 2. The pig script: limit_empty.pig -- A contains only 1's B contains only 2's A = load 'limit_empty.input_a' as (a1:int); B = load 'limit_empty.input_a' as (b1:int); C =COGROUP A by a1, B by b1; D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), COUNT(B); store D into 'limit_empty.output/d'; -- After the script done, we see the right results: -- {(1),(1),(1)} {} 1 0 3 0 -- {} {(2),(2)} 0 1 0 2 C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; } D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 0:1), COUNT(Alim), COUNT(Blim); store D1 into 'limit_empty.output/d1'; -- After the script done, we see the unexpected results: -- {(1)} {}1 1 1 0 -- {} {(2)} 1 1 0 1 dump D; dump D1; 3. Run the scrip and redirect the stdout (2 dumps) file. There are two issues: The major one: IsEmpty() returns FALSE for empty bag in limit_empty.output/d1/*, while IsEmpty() returns correctly in limit_empty.output/d/*. The difference is that one has been applied with LIMIT before using IsEmpty(). The minor one: The redirected output only contains the first dump: ({(1),(1),(1)},{},1,0,3L,0L) ({},{(2),(2)},0,1,0L,2L) We expect two more lines like: ({(1)},{},1,1,1L,0L) ({},{(2)},1,1,0L,1L) Besides, there is error says: [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1544) proactive-spill bags should share the memory alloted for it
proactive-spill bags should share the memory alloted for it --- Key: PIG-1544 URL: https://issues.apache.org/jira/browse/PIG-1544 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Initially proactive spill bags were designed for use in (co)group (InternalCacheBag) and they knew the total number of proactive bags that were present, and shared the memory limit specified using the property pig.cachedbag.memusage . But the two proactive bag implementations were added later - InternalDistinctBag and InternalSortedBag are not aware of actual number of bags being used - their users always assume total-numbags = 3. This needs to be fixed and all proactive-spill bags should share the memory-limit . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897958#action_12897958 ] Thejas M Nair commented on PIG-1295: bq. Conceptually comparator is in the logic of Tuple. This comparator is part of only the *default* tuple implementation used internally within pig. So the class that is the source of truth for the default internal tuple implementation seems a good place to have this function. A tuple returned by a loadfunction has nothing to do with the comparator logic. bq. Ideally it should be a static method of Tuple, however Tuple interface do not allow me do that. Yes, a static method can't be overridden. Since this is supposed to return only one value per pig query, the singleton TupleFactory is a better place. bq. For backward compatibility, first, we will break either Tuple or TupleFactory, the impact is equivalent; No. TupleFactory is an abstract class, while Tuple is an interface. Users will not be forced to change their implementation if we add a function to TupleFactory. Also, users are more likely to have custom Tuple than custom TupleFactory - because they might implement different tuples as part of their load function implementation, and are unlikely to change the default Tuple implementation used in internally in pig. bq. second, in both PigSecondaryKeyComparator and PigTupleSortComparator, we will check if Tuple does not implement the new method, we fall back to the default serialize version. If Tuple interface is going to have this function, i think we should add in the javadoc that it makes sense to implement the function only if it is going to be used as the default internal tuple implementation. And that null value can be returned if user chooses to not implement it. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1295: Attachment: PIG-1295_0.15.patch Attach another patch to address Thejas's first point. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1295: Status: Patch Available (was: Open) Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1295: Status: Open (was: Patch Available) Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is group by followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case. We mark this issue to be a candidate project for Google summer of code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1466) Improve log messages for memory usage
[ https://issues.apache.org/jira/browse/PIG-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898020#action_12898020 ] Thejas M Nair commented on PIG-1466: I am planning to change as following - 1. Change the string low memory handler called to cleaner notified . 2. Print additional log message stating total spill object memory freed and total number of objects freed. This will be printed the first time a candidate to be freed is found, and then every 10 times the GC is called from the code. (GC is called after a threshold of memory suitable to be freed has been collected.) Improve log messages for memory usage - Key: PIG-1466 URL: https://issues.apache.org/jira/browse/PIG-1466 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Thejas M Nair Priority: Minor Fix For: 0.8.0 For anything more then a moderately sized dataset Pig usually spits following messages: {code} 2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Usage threshold exceeded) init = 4194304(4096K) used = 672012960(656262K) committed = 954466304(932096K) max = 954466304(932096K) 2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Collection threshold exceeded) init = 4194304(4096K) used = 954466304(932096K) committed = 954466304(932096K) max = 954466304(932096K) {code} This seems to confuse users a lot. Once these messages are printed, users tend to believe that Pig is having hard time with memory, is spilling to disk etc. but in fact Pig might be cruising along at ease. We should be little more careful what to print in logs. Currently these are printed when a notification is sent by JVM and some other conditions are met which may not necessarily indicate low memory condition. Furthermore, with {{InternalCachedBag}} embraced everywhere in favor of {{DefaultBag}}, these messages have lost their usefulness. At the every least, we should lower the log level at which these are printed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.