subject:"\[jira\] Updated\: \(PIG\-1295\) Binary comparator for secondary sort"

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-16 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Daniel Dai updated PIG-1295:

Status: Resolved (was: Patch Available)
Hadoop Flags: [Reviewed]
Resolution: Fixed

All unit test pass.

test-patch:
[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 8 new or
modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning
messages.
[exec]
[exec] -1 javac. The applied patch generated 156 javac compiler
warnings (more than the trunk's current 145 warnings).
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs
warnings.
[exec]
[exec] -1 release audit. The applied patch generated 414 release
audit warnings (more than the trunk's current 410 warnings).

javac warnings is all about deprecate. For release audit warnings, I checked
every new file have license header.

Patch committed. Thanks Gianmarco!

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch,
PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch,
PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.16.patch,
PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch,
PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch,
PIG-1295_0.8.patch, PIG-1295_0.9.patch

When hadoop framework doing the sorting, it will try to use binary version of
comparator if available. The benefit of binary comparator is we do not need
to instantiate the object before we compare. We see a ~30% speedup after we
switch to binary comparator. Currently, Pig use binary comparator in
following case:
1. When semantics of order doesn't matter. For example, in distinct, we need
to do a sort in order to filter out duplicate values; however, we do not care
how comparator sort keys. Groupby also share this character. In this case, we
rely on hadoop's default binary comparator
2. Semantics of order matter, but the key is of simple type. In this case, we
have implementation for simple types, such as integer, long, float,
chararray, databytearray, string
However, if the key is a tuple and the sort semantics matters, we do not have
a binary comparator implementation. This especially matters when we switch to
use secondary sort. In secondary sort, we convert the inner sort of nested
foreach into the secondary key and rely on hadoop to sorting on both main key
and secondary key. The sorting key will become a two items tuple. Since the
secondary key the sorting key of the nested foreach, so the sorting semantics
matters. It turns out we do not have binary comparator once we use secondary
sort, and we see a significant slow down.
Binary comparator for tuple should be doable once we understand the binary
structure of the serialized tuple. We can focus on most common use cases
first, which is group by followed by a nested sort. In this case, we will
use secondary sort. Semantics of the first key does not matter but semantics
of secondary key matters. We need to identify the boundary of main key and
secondary key in the binary tuple buffer without instantiate tuple itself.
Then if the first key equals, we use a binary comparator to compare secondary
key. Secondary key can also be a complex data type, but for the first step,
we focus on simple secondary key, which is the most common use case.
We mark this issue to be a candidate project for Google summer of code 2010
program.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-13 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Daniel Dai updated PIG-1295:

Attachment: PIG-1295_0.16.patch

Discuss with Alan, we agree to put getRawComparator into TupleFactory. Attach
PIG-1295_0.16.patch.

Binary comparator for secondary sort

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-13 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Daniel Dai updated PIG-1295:

Status: Open (was: Patch Available)

Binary comparator for secondary sort

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-13 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Daniel Dai updated PIG-1295:

Status: Patch Available (was: Open)

Binary comparator for secondary sort

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-12 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Daniel Dai updated PIG-1295:

Attachment: PIG-1295_0.14.patch

I reviewed and regenerate the patch. Couple of notes:
1. All unit test and end-to-end test pass, hudson warning are addressed
2. See consistent performance improvement (around 20%) in pigmix query L16
(using 10 reducers, on a cluster of 10 nodes)
3. Did some refactory, change some class names and move some code around, move
getRawComparatorClass to Tuple instead of TupleFactory

Gianmarco, can you take a look if my changes are good?

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch,
PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch,
PIG-1295_0.14.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch,
PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch,
PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-12 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Daniel Dai updated PIG-1295:

Attachment: PIG-1295_0.15.patch

Attach another patch to address Thejas's first point.

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch,
PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch,
PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-12 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Daniel Dai updated PIG-1295:

Status: Patch Available (was: Open)

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch,
PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch,
PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-12 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Daniel Dai updated PIG-1295:

Status: Open (was: Patch Available)

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch,
PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch,
PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-06 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gianmarco De Francisci Morales updated PIG-1295:

Attachment: PIG-1295_0.13.patch

Fixed issues for PIG-927 as planned.
Fixed a couple of bugs that came up with testing.
Aligned the code path for the deserialization version.
Ran full test suite, all tests pass.

TODO:
Align code for DefaultTuple (low priority).
Speed tests.

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch,
PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch,
PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch,
PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch,
PIG-1295_0.8.patch, PIG-1295_0.9.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-04 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gianmarco De Francisci Morales updated PIG-1295:

Attachment: PIG-1295_0.12.patch

Ok, first working integration.
Modified PigTupleRawComparatorNew to use the raw comparators via TupleFactory.
Created a new class PigSecondaryKeyComparatorNew that should substitute the old
one. This one uses the raw comparators.
Modified JobControlCompiler to use the new comparators.

Moved the null/index semantic outside the raw comparators and inside the
wrappers.

Modified BinSedesTupleComparator to correctly handle sort order. The sort order
is applied to the first call to compare tuples. In case we are doing a
secondary sort, the sort orders are propagated 1 level more (because we have a
nested tuple with the keys, and we need to apply the sort orders to the content
of the outermost tuple).
The code is not the cleanest possible but TestPigTupleRawComparator and
TestSecondarySort pass.

TODO:
Implement the logic for PIG-927.
I plan to create a new interface (TupleRawComparator) and add a method to check
if during the comparison a field of type NULL was encountered. This interface
will be used instead of the simple RawComparator to hold the reference to our
raw comparators.

Write speed test.
Is there something already made that can be used to test the speed improvement?
The inputs for the unit test are of course too small.

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch,
PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-07-31 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gianmarco De Francisci Morales updated PIG-1295:

Attachment: PIG-1295_0.11.patch

Here my first stab at integration.
I split the class in two classes and put them into DefaultTuple and
BinInterSedes.
This asymmetry is not nice but the serialization code is in different places
for different tuples, and I wanted to keep the comparison code as close as
possible to the serialization code.

I modified the TupleFactory interface adding a method to get the tuple
comparator class.
This method is implemented in BinSedesTupleFactory and overridden in
DefaultTupleFactory.

BinSedesTupleFactory delegates it to a package method in BinInterSedes (where
the actual code is).
DefaultTupleFactoru delegates it to DefaultTuple (where the actual code is).

The actual code for both comparators is just a cut/paste of the methods in
PigTupleRawComparatorNew, I just adjusted a bit the entry points.

I left the new comparator and tests untouched for now.

Please let me have any comments you may have.

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch,
PIG-1295_0.11.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch,
PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch,
PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-07-17 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gianmarco De Francisci Morales updated PIG-1295:

Attachment: PIG-1295_0.9.patch

Implemented a new compareBinInterSedesTuple() method.
This new method works with data serialized with [PIG-1472] format.
It relies on DataType for data type comparison by translating the data types.
I implemented some logic to read unsinged ints from bytes and shorts because I
am using ByteBuffer that does not implement the DataInput interface. We might
want to change that later.
For now complex data types cause the whole tuple to be deserialized, but I have
put some placeholders for methods to deserialize single complex objects and
continue the normal execution flow.
I plan to fill them in when I move the code inside BinInterSedes so that I can
use the methods to read complex data types (readBag, readMap, etc..). This will
involve some juggling around between DataInputBuffer and ByteBuffer (this is
why we might consider to switch out ByteBuffer) to get the cursors consistent
among them.

I think the next step would be splitting the comparator and putting it into the
serialization class (BinInterSedes). I don't know yet how to modify the class
interface to make the comparator class available outside (but I think this
relates to phase 2).

I see no good place to put the class that implements the comparator for
DefaultTuple, and Daniel said it is a minor case, so should I just throw away
the code for DefaultTuple?

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-07-11 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gianmarco De Francisci Morales updated PIG-1295:

Attachment: PIG-1295_0.8.patch

I added the code for if the user does not use DefaultTuple we fall back to the
default deserialization case. I assume the user defined tuple will have a
different DataType byte from DataType.TUPLE. If this is not the case, I see no
way of discerning DefaultTuple from any other Tuple implementation.
Anyway, I think this issue needs to be properly addressed in the context of
[PIG-1472|https://issues.apache.org/jira/browse/PIG-1472].

I added support for BIGCHARARRAY.

UTF-8 decoding is quite convoluted. It is a variable length encoding, so we
cannot avoid using a String. [UTF-8|http://en.wikipedia.org/wiki/UTF-8]

Before tackling the integration with
[PIG-1472|https://issues.apache.org/jira/browse/PIG-1472] I need to familiarize
with the code in the patch. I will write a proposal for the integration in the
next days.

I also made some changes to DataByteArray in order to encapsulate the logic for
comparison in a publicly accessible method. This way the raw comparison is
consistent with the behavior of the class, in a way similar to the other cases
where I delegate comparison to the class.

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-07-01 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gianmarco De Francisci Morales updated PIG-1295:

Attachment: PIG-1295_0.7.patch

Addressed a small bug and added ASF license to source files

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-06-28 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gianmarco De Francisci Morales updated PIG-1295:

Attachment: PIG-1295_0.6.patch

Ok, if the user does not use DefaultTuple we fall back to the default
deserialization case.

I added handling of nested tuples via recursion and appropriate unit tests.

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-06-28 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Daniel Dai updated PIG-1295:

Status: Patch Available (was: Open)
Fix Version/s: 0.8.0

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-06-27 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gianmarco De Francisci Morales updated PIG-1295:

Attachment: PIG-1295_0.5.patch

Added support for BYTEARRAY and CHARARRAY data types.
Added fallback behaviour for unknown data types.
Added support for non-raw comparison.

I changed to public the visibility of a the charset field in DataType, because
I need to know which encoding is being used for strings.

I am testing the patch locally before submitting it.

Given that performance tests look good, I think the next step would be to
handle tuples different from DefaultTuple. This requires some changes to the
infrastructure.

Binary comparator for secondary sort

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-06-17 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gianmarco De Francisci Morales updated PIG-1295:

Attachment: PIG-1295_0.4.patch

I modified the main in order to include your code snippet. Thanks very much for
the suggestion!

I reduced the number of tuples in order to avoid OutOfMemory exceptions on my
machine (only 1 GiB of RAM). I see the same numbers you reported: 6.5 times
improve in speed only for the compare() method.

I took care of the tabs issue.

I will add support for ByteArrays and CharArrays next, together with fallback
behaviour for complex datatypes.

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-06-12 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gianmarco De Francisci Morales updated PIG-1295:

Attachment: PIG-1295_0.3.patch

Sure,
here it is the revised file.
I left the random testing for unit testing purposes, and I moved the
performance testing to a separate main method.
This makes profiling *much* easier.

FYI, for profiling I am using a mixture of hprof (the java internal profiler)
and jrat (http://jrat.sourceforge.net).

I can prepare a more detailed report of where most of the time is consumed, if
needed.
Roughly, 80% of the test time is spent in the org.apache.pig.data package to
write tuples (DefaultTuple.write() , DataReaderWriter.writeDatum() ,
DataType.findType() are the most expensive methods).
From targeted profiling I have seen that the raw version of the compare method
is around 3x faster than the old one.

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-06-07 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gianmarco De Francisci Morales updated PIG-1295:

Attachment: PIG-1295_0.2.patch

I added some simple performance tests.
The tests generate 1 million tuples modifying a prototypical tuple and compare
them to the prototype.
One test uses the new comparator and the other uses the old one. I generate
exactly the same tuples using a fixed seed. I also check the correctness of the
comparisons using the normal compareTo() method of the tuples.

The logic to generate the tuples is a bit involved: I tried to exercise all the
datatype comparisons in a uniform manner, so I mutate less the first elements
of the tuple, in order to have more probability of getting the comparison
further down the tuple. The probabilities are totally made up and do not make
much sense.

As a first approximation, I see a slight overall speedup in the test.
I will do some profiling to see which margins of improvement we have.

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-06-06 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gianmarco De Francisci Morales updated PIG-1295:

Attachment: PIG-1295_0.1.patch

Here is my frist dratf for the binary comparator, together with some simple
tests.

This comparator should replace PigTupleRawComparator.

I assume that the tuple we are comparing is a NullableTuple that wraps a
DefaultTuple.
There is a comment in the old comparator that says:
{code}
// Users are allowed to implement their own versions of tuples, which means we
have no
// idea what the underlying representation is.
{code}
This will need to be addressed. If I can know that the tuple is not a
DefaultTuple looking at its data type byte, I can switch back to object
deserialization. This means that the user must not use the same DataType.TUPLE
as the default tuple.

The comparator iterates over the serialized tuples and compares them following
the same logic as the old comparator (first check if Null, then their sizes,
then compare field by field, if the fields are of different types compare the
datatypes, else compare the values).
For now I implemented the logic only for simple types, I plan to add bytearrays
and chararrays. I do not plan to add support for complex datatypes. In this
case I will fall back to object deserialization.

The implementation uses a ByteBuffer to easily iterate over the values. I don't
know if it is acceptable or its performance impact but it is very handy. The
alternative is manual bookkeeping of a cursor inside the byte arrays.
To do this I would probably use a map that tells me how many bytes each data
type uses. This map should probably go somewhere else though (DataType?
DataReaderWriter?). I commented out an initial attempt for this at the end of
the class.
An alternative implementation could be to follow the strategy of the others
PigRawComparators, that wrap a comparator from Hadoop. This would require
keeping a number of comparators in memory, though. I commented out this
approach at the beginning of the class.

TODO:
Implement fallback behaviour.
Implement support for bytearrays and chararrays.
Graceful handling of tuples different from DefaultTuple.
Add performance tests.

Any comment is more than welcome.

Binary comparator for secondary sort

--
This message is automatically generated by JIRA.
-
You can reply to

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

21 matches

Site Navigation

Mail list logo

Footer information