subject:"\[jira\] Commented\: $PIG\-1295$ Binary comparator for secondary sort"

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-08-16 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898845#action_12898845
]

Gianmarco De Francisci Morales commented on PIG-1295:
-

Thanks for your support Daniel.
It was a great experience.

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
Fix For: 0.8.0

Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch,
PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch,
PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.16.patch,
PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch,
PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch,
PIG-1295_0.8.patch, PIG-1295_0.9.patch

When hadoop framework doing the sorting, it will try to use binary version of
comparator if available. The benefit of binary comparator is we do not need
to instantiate the object before we compare. We see a ~30% speedup after we
switch to binary comparator. Currently, Pig use binary comparator in
following case:
1. When semantics of order doesn't matter. For example, in distinct, we need
to do a sort in order to filter out duplicate values; however, we do not care
how comparator sort keys. Groupby also share this character. In this case, we
rely on hadoop's default binary comparator
2. Semantics of order matter, but the key is of simple type. In this case, we
have implementation for simple types, such as integer, long, float,
chararray, databytearray, string
However, if the key is a tuple and the sort semantics matters, we do not have
a binary comparator implementation. This especially matters when we switch to
use secondary sort. In secondary sort, we convert the inner sort of nested
foreach into the secondary key and rely on hadoop to sorting on both main key
and secondary key. The sorting key will become a two items tuple. Since the
secondary key the sorting key of the nested foreach, so the sorting semantics
matters. It turns out we do not have binary comparator once we use secondary
sort, and we see a significant slow down.
Binary comparator for tuple should be doable once we understand the binary
structure of the serialized tuple. We can focus on most common use cases
first, which is group by followed by a nested sort. In this case, we will
use secondary sort. Semantics of the first key does not matter but semantics
of secondary key matters. We need to identify the boundary of main key and
secondary key in the binary tuple buffer without instantiate tuple itself.
Then if the first key equals, we use a binary comparator to compare secondary
key. Secondary key can also be a complex data type, but for the first step,
we focus on simple secondary key, which is the most common use case.
We mark this issue to be a candidate project for Google summer of code 2010
program.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-08-12 Thread Thejas M Nair (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897922#action_12897922
]

Thejas M Nair commented on PIG-1295:

Comments about PIG-1295_0.14.patch -
- The comparison logic for BinInterSedes relies on the serliazation format, so
it think its better to have it closer to where the serialization format is
implemented. Ie add a function to InterSedes interface (getComparator() ?) ,
and move the implementation logic to BinInterSedes class.
- I think TupleFactory is a better place for getRawComparatorClass() for the
following reasons-
-- TupleFactory is a singleton class, Tuple is not. Having it in Tuple implies
that you can have different values returned by different instances.
-- Adding it to Tuple interface breaks backward compatibility, all Tuple
implementations will need to add this function. Also, does not make sense for
load functions that return a custom tuple to implement this method, because it
is not related to that tuple implementation.

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch,
PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch,
PIG-1295_0.14.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch,
PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch,
PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-08-12 Thread Thejas M Nair (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897958#action_12897958
 ] 

Thejas M Nair commented on PIG-1295:


bq. Conceptually comparator is in the logic of Tuple. 
This comparator is part of only the *default* tuple implementation used 
internally within pig. So the class that is the source of truth for the default 
internal tuple implementation seems a good place to have this function. A tuple 
returned by a loadfunction has nothing to do with the comparator logic. 

bq. Ideally it should be a static method of Tuple, however Tuple interface do 
not allow me do that.
Yes, a static method can't be overridden. Since this is supposed to return only 
one value per pig query, the singleton TupleFactory is a better place.

bq. For backward compatibility, first, we will break either Tuple or 
TupleFactory, the impact is equivalent;
No. TupleFactory is an abstract class, while Tuple is an interface. Users will 
not be forced to change their implementation if we add a function to 
TupleFactory. Also, users are more likely to have custom Tuple than custom 
TupleFactory - because they might implement different tuples as part of their 
load function implementation, and are unlikely to change the default Tuple 
implementation used in internally in pig.

bq. second, in both PigSecondaryKeyComparator and PigTupleSortComparator, we 
will check if Tuple does not implement the new method, we fall back to the 
default serialize version. 
If Tuple interface is going to have this function, i think we should add in the 
javadoc that it makes sense to implement the function only if it is going to be 
used as the default internal tuple implementation. And that
null value can be returned if user chooses to not implement it.



 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
 PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, 
 PIG-1295_0.14.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, 
 PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, 
 PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch


 When hadoop framework doing the sorting, it will try to use binary version of 
 comparator if available. The benefit of binary comparator is we do not need 
 to instantiate the object before we compare. We see a ~30% speedup after we 
 switch to binary comparator. Currently, Pig use binary comparator in 
 following case:
 1. When semantics of order doesn't matter. For example, in distinct, we need 
 to do a sort in order to filter out duplicate values; however, we do not care 
 how comparator sort keys. Groupby also share this character. In this case, we 
 rely on hadoop's default binary comparator
 2. Semantics of order matter, but the key is of simple type. In this case, we 
 have implementation for simple types, such as integer, long, float, 
 chararray, databytearray, string
 However, if the key is a tuple and the sort semantics matters, we do not have 
 a binary comparator implementation. This especially matters when we switch to 
 use secondary sort. In secondary sort, we convert the inner sort of nested 
 foreach into the secondary key and rely on hadoop to sorting on both main key 
 and secondary key. The sorting key will become a two items tuple. Since the 
 secondary key the sorting key of the nested foreach, so the sorting semantics 
 matters. It turns out we do not have binary comparator once we use secondary 
 sort, and we see a significant slow down.
 Binary comparator for tuple should be doable once we understand the binary 
 structure of the serialized tuple. We can focus on most common use cases 
 first, which is group by followed by a nested sort. In this case, we will 
 use secondary sort. Semantics of the first key does not matter but semantics 
 of secondary key matters. We need to identify the boundary of main key and 
 secondary key in the binary tuple buffer without instantiate tuple itself. 
 Then if the first key equals, we use a binary comparator to compare secondary 
 key. Secondary key can also be a complex data type, but for the first step, 
 we focus on simple secondary key, which is the most common use case.
 We mark this issue to be a candidate project for Google summer of code 2010 
 program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-08-08 Thread Gianmarco De Francisci Morales (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896339#action_12896339
 ] 

Gianmarco De Francisci Morales commented on PIG-1295:
-

I performed a some speed tests using PigMix2.
I used query L16 and generated 2 datasets: one with 1M rows (1.6GiB) and the 
other with 10M rows (16GiB).
I took the times end to end using the time utility.
I do not have a cluster so I ran pig locally.
Here the results.

Trunk   1M  0m53.469s
Patched 1M  0m39.076s
Trunk   10M 9m49.507s
Patched 10M 8m0.048s

We have a 20~30% improvement end-to-end for this query.
I think this is consistent with the expectations.

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
 PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, 
 PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, 
 PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, 
 PIG-1295_0.8.patch, PIG-1295_0.9.patch


 When hadoop framework doing the sorting, it will try to use binary version of 
 comparator if available. The benefit of binary comparator is we do not need 
 to instantiate the object before we compare. We see a ~30% speedup after we 
 switch to binary comparator. Currently, Pig use binary comparator in 
 following case:
 1. When semantics of order doesn't matter. For example, in distinct, we need 
 to do a sort in order to filter out duplicate values; however, we do not care 
 how comparator sort keys. Groupby also share this character. In this case, we 
 rely on hadoop's default binary comparator
 2. Semantics of order matter, but the key is of simple type. In this case, we 
 have implementation for simple types, such as integer, long, float, 
 chararray, databytearray, string
 However, if the key is a tuple and the sort semantics matters, we do not have 
 a binary comparator implementation. This especially matters when we switch to 
 use secondary sort. In secondary sort, we convert the inner sort of nested 
 foreach into the secondary key and rely on hadoop to sorting on both main key 
 and secondary key. The sorting key will become a two items tuple. Since the 
 secondary key the sorting key of the nested foreach, so the sorting semantics 
 matters. It turns out we do not have binary comparator once we use secondary 
 sort, and we see a significant slow down.
 Binary comparator for tuple should be doable once we understand the binary 
 structure of the serialized tuple. We can focus on most common use cases 
 first, which is group by followed by a nested sort. In this case, we will 
 use secondary sort. Semantics of the first key does not matter but semantics 
 of secondary key matters. We need to identify the boundary of main key and 
 secondary key in the binary tuple buffer without instantiate tuple itself. 
 Then if the first key equals, we use a binary comparator to compare secondary 
 key. Secondary key can also be a complex data type, but for the first step, 
 we focus on simple secondary key, which is the most common use case.
 We mark this issue to be a candidate project for Google summer of code 2010 
 program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-08-08 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896390#action_12896390
]

Daniel Dai commented on PIG-1295:
-

That is terrific! I will review your patch shortly.

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch,
PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch,
PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch,
PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch,
PIG-1295_0.8.patch, PIG-1295_0.9.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-08-03 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894841#action_12894841
]

Daniel Dai commented on PIG-1295:
-

Hi, Gianmarco,
When you do an explain, you will see Secondary sort: true if the MR plan will
use secondary sort. I am not yet sure why your script not using secondary sort,
I will check it tomorrow. However, the following script will use secondary sort:

{code}
A = LOAD '1.txt' AS (a0, a1, a2);
B = group A by $0 parallel 2;
C = foreach B { D = limit A 10; E = order D by $1; generate group, E;};
explain C;
{code}

When secondary sort is used, Pig will use PigSecondaryKeyComparator, which is
not actually raw. We need to replace it with your raw comparator.

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch,
PIG-1295_0.11.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch,
PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch,
PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-08-03 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894961#action_12894961
]

Gianmarco De Francisci Morales commented on PIG-1295:
-

Thanks for the suggestion Daniel.

My idea to integrate the RawTupleComparator is to modify
PigSecondaryKeyComparator to delegate the comparison to it. I cannot mimic the
current behavior of making 2 calls (one for the main and one for the secondary
key) because I do not know the boundaries between the keys. So I need to make a
single call to compare the compound key. There are some issues though.

1) There are some inconsistencies between the behavior of PigTupleRawComparator
(the original one) and PigSecondaryKeyComparator.
Specifically, when tuples are null the first one returns 0 while the second one
compares the indexes.
Furthermore, indexes are compared also when one of the fields in the tuples is
null, in order not to join them (if I understood correctly PIG-927). This is
done in PigSecondaryKeyComparator but not in PigTupleRawComparator.
Is it designed to be like this or is it a bug?
I suppose the behaviors of the two comparators should be more or less the same.

2) In PigSecondaryKeyComparator the key is assumed to be a 2-field tuple where
the 0th field is the main key and the 1st field is the secondary key.
We can directly feed the binary representation of this tuple inside our new raw
comparator, but we need to consolidate the sort orders. Right now there are 2
different and independent sort orders serialized in the jobConf (pig.sortOrder
and pig.secondarySortOrder). In the simple case, when sort orders for all the
columns are specified, we can just concatenate them together (sort of). There
are some problems when we have WholeTuple sort orders as they might differ.
I would like to keep all of this out of the tuple comparator and define some
clean interface to pass the sort orders. One problem I see is that I probably
need the tuple sizes (recursively) to do this, and this is not known at
configuration time. I also need to fix this in the current comparator, in order
to take into account the recursion inside nested tuples.

3) Should I keep all the mIndex/mNull handling outside the RawTupleComparator
and write wrappers that deal with them?
That is, should the RawTupleComparator know how to deal with a NullableTuple or
should it just know its kind of Tuple (BinInterSedes or Default)

Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-08-03 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895093#action_12895093
]

Daniel Dai commented on PIG-1295:
-

Hi, Gianmarco,
1. Notice currently PigTupleRawComparator is only used in order-by job,
PigSecondaryKeyComparator only used by non-order-by job, there is a semantic
difference in null handling. In order-by, different null key are treated equal;
In group-by, however, null from different relation cannot merge (this is to
conform with SQL standard). In your case, RawTupleComparator can be used in
both order-by tuple case and non-order-by with secondary key case. So there
will be semantic gap between this two. We need to deal with it separately. For
simplicity, we can focus on secondary key case first (which means, different
null keys from different relations are not equal)

2. We need to deal with sort order. When we use PigTupleRawComparator for
secondary sort, it's quite clear the structure will be (main_key,
secondary_key, value). I think we can limit PigTupleRawComparator specific to
this structure currently (don't think about order-by tuple case), so you can
pass the sort order and deal with it in PigTupleRawComparator

3. Yes, I think it might be better to keep mIndex/mNull handling outside
RawTupleComparator

Binary comparator for secondary sort

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-08-01 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894382#action_12894382
]

Gianmarco De Francisci Morales commented on PIG-1295:
-

I have some problems understanding the SecondaryKeyOptimizer.

For example, given this query:

{code}

A = LOAD 'input1' AS (a0,a1,a2);
B = LOAD 'input2' AS (b0,b1,b2);
C = cogroup A by (a0,a1), B by (b0,b1) parallel 2;
D = foreach C { E = limit A 10; F = E.a2; G = DISTINCT F; generate group,
COUNT(G);};

{code}

The key type is correctly recognized to be a tuple, the
so.getNumMRUseSecondaryKey() is 1, but when I get to JobControlCompiler
mro.getUseSecondaryKey() is false.

Then, when it chooses the comparator in selectComparator(), hasOrderBy , which
is (mro.isGlobalSort() || mro.isLimitAfterSort() || mro.usingTypedComparator())
, is false.

So I get into the second switch statement and I get these comparators

case DataType.TUPLE:
job.setSortComparatorClass(PigTupleWritableComparator.class);

job.setGroupingComparatorClass(PigGroupingTupleWritableComparator.class);

that, to me, look like they are already comparing tuples in raw format (they
use WritableComparator.compareBytes).

Is this because the query is one in which order semantics do not matter, so it
is already optimized?
Should it change if I add an ORDER BY somewhere before the LIMIT?
Is this the relevant case we want to optimize?

Binary comparator for secondary sort

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-07-26 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892417#action_12892417
]

Gianmarco De Francisci Morales commented on PIG-1295:
-

Thanks, here my observations:

1. If you look at the old comparator, it uses DataType.compare(). In
DataType:454-458 if the two data types are not equal, the value returned is the
difference between the datatypes. I retained the same behavior in the patch.
2. I think we already do that. There is an additional guard in the for loop,
that goes on only if rc == 0 on line 452
{code}
for (int i = 0; i tsz1 rc == 0; i++) {
{code}
3. Yes, I somehow changed the number of tuples without noticing. I got back to
10e3 and I see a 10x improvement in time now
4. Sure!

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch,
PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch,
PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch,
PIG-1295_0.8.patch, PIG-1295_0.9.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-07-20 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890384#action_12890384
]

Daniel Dai commented on PIG-1295:
-

Patch looks pretty good. Thanks Gianmarco! Couple of comments:
1.
PigTupleRawComparatorNew:324,332,343,357,367,377,387,399,416,474,483,501,512,etc,
if GeneralizedDataType is not equal, we should throw exception to contain the
error
2. PigTupleRawComparatorNew:455-464, if the comparison of two items is not
equal, we shall return the result without comparing additional items, that's
how we get performance gain
3. I am unable to run TestPigTupleRawComparator.main due to OOM, what is the
speed up after the change?
4. PigTupleRawComparatorNew:132, we shall move the logic of choosing the right
comparator to Pig code, and move comparator into BinSedesTuple and
DefaultTuple. This is part of integration work and let's mark it as the first
thing for phase 2.

Binary comparator for secondary sort

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-07-16 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12889231#action_12889231
]

Gianmarco De Francisci Morales commented on PIG-1295:
-

In DataType the type bytes are sorted in such a way that the comparison between
different data types yields a standard order. This is achieved by carefully
assigning the byte values to the types.
In BinInterSedes this does not happen. So, to reproduce the same order, I need
to sort the bytes somehow. The easiest way is to reassign the values in a way
that is coherent with DataType. The hard way would be to implement a comparison
method with all the possible combinations taken into account, but this is crazy
to maintain.
I have also the same problem for costants: because for INTEGER_0/1 and
BOOLEAN_TRUE/FALSE there is no value to read, and the two data type bytes are
different, with the current design I need to ensure that BOOLEAN_TRUE
BOOLEAN_FALSE and INTEGER_1 INTEGER_0.
Furthermore, It would be good to sort the byte types so that INTEGER
INTEGER_INSHORT INTEGER_INBYTE etc...

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-07-16 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12889236#action_12889236
]

Daniel Dai commented on PIG-1295:
-

Hi, Gianmarco,
I think it's better not to assume INTEGER INTEGER_INSHORT. What type tells
you is how to read the next data correctly. So if you see a INTEGER_INSHORT,
read INTEGER_INSHORT into an integer, so you can compare with other integer
type correctly.

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-07-16 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12889334#action_12889334
]

Gianmarco De Francisci Morales commented on PIG-1295:
-

To follow a backwards compatible behaviour I should group all the integers (for
example) into the same case statement and then implement all the logic there
(if it is INTEGER_0/1 do not read anything, if it is INTEGER_INBYTE read a
byte, etc...).
So in each case statement I would need to compare each data type with its
siblings, implement the logic to tell which one sorts first, then if the other
data type is not a sibling, impose the global sorting.
This will result in some quite convoluted code compared to the actual patch,
because I will not be able to compare data types directly.
Is this the desired behaviour?

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-07-16 Thread Daniel Dai (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12889368#action_12889368
 ] 

Daniel Dai commented on PIG-1295:
-

We need to compare a category of data type, not data type itself. For example:
{code}
if 
(bt1==INTEGER_0||bt1==INTEGER_1||bt1==INTEGER_INBYTE||bt1==INTEGER_INSHORT||bt1==INTEGER)
 {
type1 = INTEGER;
value1 = readInteger(bt1);
}
if 
(bt2==INTEGER_0||bt2==INTEGER_1||bt2==INTEGER_INBYTE||bt2==INTEGER_INSHORT||bt2==INTEGER)
 {
type2 = INTEGER;
value2 = readInteger(bt1);
}
if (type1==type2)
return (value1  value2 ? -1 : (value1 == value2 ? 0 : 1));
else {
return dt1-dt2;
}
{code}

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, 
 PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, 
 PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch


 When hadoop framework doing the sorting, it will try to use binary version of 
 comparator if available. The benefit of binary comparator is we do not need 
 to instantiate the object before we compare. We see a ~30% speedup after we 
 switch to binary comparator. Currently, Pig use binary comparator in 
 following case:
 1. When semantics of order doesn't matter. For example, in distinct, we need 
 to do a sort in order to filter out duplicate values; however, we do not care 
 how comparator sort keys. Groupby also share this character. In this case, we 
 rely on hadoop's default binary comparator
 2. Semantics of order matter, but the key is of simple type. In this case, we 
 have implementation for simple types, such as integer, long, float, 
 chararray, databytearray, string
 However, if the key is a tuple and the sort semantics matters, we do not have 
 a binary comparator implementation. This especially matters when we switch to 
 use secondary sort. In secondary sort, we convert the inner sort of nested 
 foreach into the secondary key and rely on hadoop to sorting on both main key 
 and secondary key. The sorting key will become a two items tuple. Since the 
 secondary key the sorting key of the nested foreach, so the sorting semantics 
 matters. It turns out we do not have binary comparator once we use secondary 
 sort, and we see a significant slow down.
 Binary comparator for tuple should be doable once we understand the binary 
 structure of the serialized tuple. We can focus on most common use cases 
 first, which is group by followed by a nested sort. In this case, we will 
 use secondary sort. Semantics of the first key does not matter but semantics 
 of secondary key matters. We need to identify the boundary of main key and 
 secondary key in the binary tuple buffer without instantiate tuple itself. 
 Then if the first key equals, we use a binary comparator to compare secondary 
 key. Secondary key can also be a complex data type, but for the first step, 
 we focus on simple secondary key, which is the most common use case.
 We mark this issue to be a candidate project for Google summer of code 2010 
 program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-07-14 Thread Thejas M Nair (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888517#action_12888517
]

Thejas M Nair commented on PIG-1295:

This sounds fine. In case of BinSedesTupleFactory/BinSedesTuple the
serialization code is in the subclass (BinInterSedes) of InterSedes class.
Since the comparator function is closely tied to serlialization logic, i think
that would be the appropriate place to have the comparator implementation
code. The BinSedesTupleFactory can return the class obtained from
InterSedesFactory.getInterSedesInstance().

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-07-14 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888529#action_12888529
]

Daniel Dai commented on PIG-1295:
-

If pig.data.tuple.factory.name is BinSedesTupleFactory, we should use your
raw comparator. DefaultTupleFactory is a minor case, we can use deserialization.

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-07-12 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887420#action_12887420
]

Daniel Dai commented on PIG-1295:
-

More clarification for custom Tuple. There two cases for custom tuple:
1. User create custom tuple inside UDF. In this case, we do not have a special
serialized format for custom tuple. After serialization, we cannot tell if it
is a custom tuple. That is say, we lose track of tuple implementation after
se/des. Since serialized format is the same, we can still use the same raw
comparator.
2. If user use a custom tuple factory (by overriding
pig.data.tuple.factory.name), then serialized format may be changed. If we
detect that tuple factory is not BinSedesTupleFactory, we shall not use this
raw comparator.

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-07-11 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887270#action_12887270
]

Daniel Dai commented on PIG-1295:
-

With the change of PIG-1472, we need to change raw comparator accordingly:
1. Bag comparison should be changed to compare TINYBAG/SMALLBAG/BAG
2. Tuple comparison should be changed to compare TINYTUPLE/SMALLTUPLE/TUPLE
3. Map comparison should be changed to compare TINYMAP/SMALLMAP/MAP
4. Integer comparison should be changed to compare
INTEGER_0/INTEGER_1/INTEGER_INBYTE/INTEGER_INSHORT/INTEGER
5. ByteArray comparison should be changed to compare
TINYBYTEARRAY/SMALLBYTEARRAY/BYTEARRAY
6. Chararray comparison should be changed to compare SMALLCHARARRAY/CHARARRAY
7. Raw comparator is now depend on the serialization format. Now we have two
serialization format, DefaultTuple and BinSedesTuple. It's better to move
PigTupleRawComparatorNew inside BinSedesTuple. But in this project, we only
focus on BinSedesTuple, which addres most use cases

In the integration code, we shall check if TupleFactory is actually
BinSedesTupleFactory, if it is, use this raw comparator; otherwise, use the
original comparator.

I was wrong for the customized tuple. we do not need a fall back scheme for
customized tuple. In the serialized format, all Tuples including customized
Tuple will be serialized into the same format.

Looks like UTF-8 encoding is convoluted, we can leave it for now.

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-07-06 Thread Thejas M Nair (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12885831#action_12885831
]

Thejas M Nair commented on PIG-1295:

1. If utf8 encoded ordering is same as java string compareTo ordering (i could
not find a quick answer), it will possible to compare the strings by just
comparing the bytes , instead of creating additional string objects.
2. The comparison does not handle chararrays 65k in length, ie when
BIGCHARARRAY is used as the type in encoding. This is a problem even in
existing pig code.
3. The comparison function is based on the DefaultTuple serialization format,
we need to make sure that it gets used only when DefaultTuple is the default
tuple. I am making changes to use a new tuple with different serialization
format in PIG-1472 . I think we should have this comparison logic defined in
the class/interface where the serialization format is defined. I think it
should be part of the InterSedes interface in the patch in PIG-1472 .

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-06-29 Thread Hadoop QA (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883486#action_12883486
]

Hadoop QA commented on PIG-1295:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12448251/PIG-1295_0.6.patch
against trunk revision 958666.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

-1 javac. The applied patch generated 150 javac compiler warnings (more
than the trunk's current 145 warnings).

+1 findbugs. The patch does not introduce any new Findbugs warnings.

-1 release audit. The applied patch generated 402 release audit warnings
(more than the trunk's current 399 warnings).

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/testReport/
Release audit warnings:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/console

This message is automatically generated.

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-06-28 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883361#action_12883361
]

Daniel Dai commented on PIG-1295:
-

Thanks, is the patch ready for review?

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-06-28 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883367#action_12883367
]

Gianmarco De Francisci Morales commented on PIG-1295:
-

I think it is

Binary comparator for secondary sort

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-06-16 Thread Daniel Dai (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879576#action_12879576
 ] 

Daniel Dai commented on PIG-1295:
-

I run your main problem. It also counts the tuple bytes generation time, which 
dominate the cpu profile. I run it in another way, generate the bytes first and 
don't include in the timing, and then compare the bytes using different 
comparator, Here is my code snippet:
{code}
byte[][] toCompare1 = new byte[TIMES][];
byte[][] toCompare2 = new byte[TIMES][];
NullableTuple t;
for (int i=0;iTIMES;i++) {
t = new NullableTuple(test.getRandomTuple(rand));
t.write(test.dos1);
toCompare1[i] = test.baos1.toByteArray();
}
for (int i=0;iTIMES;i++) {
t = new NullableTuple(test.getRandomTuple(rand));
t.write(test.dos2);
toCompare2[i] = test.baos2.toByteArray();
}

before = System.currentTimeMillis();
for (int loop = 0; loop  1; loop++) {
for (int i = 0; i  TIMES; i++) {
test.comparator.compare(toCompare1[i], 0, toCompare1[i].length, 
toCompare2[i], 0, toCompare2[i].length);
}
}
after = System.currentTimeMillis();

before = System.currentTimeMillis();
for (int loop = 0; loop  1; loop++) {
for (int i = 0; i  TIMES; i++) {
test.oldComparator.compare(toCompare1[i], 0, toCompare1[i].length, 
toCompare2[i], 0, toCompare2[i].length);
}
}
after = System.currentTimeMillis();
{code}

In this comparison, I see 6.5 times speedup.

Also notice the code style for Apache is always use space instead of tab, make 
sure you take care of this.

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, 
 PIG-1295_0.3.patch


 When hadoop framework doing the sorting, it will try to use binary version of 
 comparator if available. The benefit of binary comparator is we do not need 
 to instantiate the object before we compare. We see a ~30% speedup after we 
 switch to binary comparator. Currently, Pig use binary comparator in 
 following case:
 1. When semantics of order doesn't matter. For example, in distinct, we need 
 to do a sort in order to filter out duplicate values; however, we do not care 
 how comparator sort keys. Groupby also share this character. In this case, we 
 rely on hadoop's default binary comparator
 2. Semantics of order matter, but the key is of simple type. In this case, we 
 have implementation for simple types, such as integer, long, float, 
 chararray, databytearray, string
 However, if the key is a tuple and the sort semantics matters, we do not have 
 a binary comparator implementation. This especially matters when we switch to 
 use secondary sort. In secondary sort, we convert the inner sort of nested 
 foreach into the secondary key and rely on hadoop to sorting on both main key 
 and secondary key. The sorting key will become a two items tuple. Since the 
 secondary key the sorting key of the nested foreach, so the sorting semantics 
 matters. It turns out we do not have binary comparator once we use secondary 
 sort, and we see a significant slow down.
 Binary comparator for tuple should be doable once we understand the binary 
 structure of the serialized tuple. We can focus on most common use cases 
 first, which is group by followed by a nested sort. In this case, we will 
 use secondary sort. Semantics of the first key does not matter but semantics 
 of secondary key matters. We need to identify the boundary of main key and 
 secondary key in the binary tuple buffer without instantiate tuple itself. 
 Then if the first key equals, we use a binary comparator to compare secondary 
 key. Secondary key can also be a complex data type, but for the first step, 
 we focus on simple secondary key, which is the most common use case.
 We mark this issue to be a candidate project for Google summer of code 2010 
 program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-06-11 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878120#action_12878120
]

Gianmarco De Francisci Morales commented on PIG-1295:
-

I did some more detailed profiling and testing. I ran a slightly modified
version of the performance tests in the patch.
On my machine, the new comparator takes between 8 and 10 seconds. The old one
between 11 and 14.
The performance advantage is not so big, but I found also that most of the time
of the test is not spent comparting.
The profiler shows that most of the time is spent in cloning (probably JUnit
internals?):

{code}
[junit] 84.0% 0 + 1769java.lang.Object.clone
{code}

For what concerns user code instead, most of the time is spent writing data.
Randomization of the input also takes some time.
The old comparator takes (0.9% + 0.3% | compareTuple + compare) more than 1% of
the time. The new one takes only 0.2% of the time.

{code}
[junit] 2.9%62 + 0
org.apache.pig.data.DataReaderWriter.writeDatum
[junit] 1.1%23 + 0java.io.DataOutputStream.writeInt
[junit] 0.9%19 + 0java.util.Random.next
[junit] 0.9%18 + 0java.io.DataOutputStream.writeLong
[junit] 0.9%18 + 0
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleRawComparator.compareTuple
[junit] 0.6%11 + 1
org.apache.pig.data.DataReaderWriter.readDatum
[junit] 0.5%11 + 0java.io.DataInputStream.readInt
[junit] 0.5%11 + 0org.apache.pig.data.DefaultTuple.readFields
[junit] 0.3% 7 + 0
org.apache.pig.impl.io.PigNullableWritable.write
[junit] 0.3% 7 + 0org.apache.pig.data.DefaultTuple.init
[junit] 0.3% 3 + 3
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleRawComparator.compare
[junit] 0.2% 5 + 0java.util.ArrayList.get
[junit] 0.2% 5 + 0org.apache.pig.data.DefaultTuple.write
[junit] 0.2% 3 + 2
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleRawComparatorNew.compare
{code}

All in all, these tests are probably not much representative, but I also feel
that the speedup we may get with a raw comparator is somewhat limited.
I am open to suggestion on how to modify the performance tests to make them
more representative.

Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-06-11 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878127#action_12878127
]

Daniel Dai commented on PIG-1295:
-

the new comparator takes between 8 and 10 seconds. The old one between 11 and
14, that is a 137% or 140% speedup. That's significant enough for us to go
forward.

Binary comparator for secondary sort

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-06-11 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878130#action_12878130
]

Daniel Dai commented on PIG-1295:
-

Can you also attach performance test code? I want to take a look. Thanks.

Binary comparator for secondary sort

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-06-07 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876354#action_12876354
]

Daniel Dai commented on PIG-1295:
-

I briefly review the patch, looks good. This is the approach we expected. Can
we do some initial performance test first?

Binary comparator for secondary sort

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-04-07 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854607#action_12854607
]

Gianmarco De Francisci Morales commented on PIG-1295:
-

I have drafted my proposal at
http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/azaroth/t127030843242

Any feedback is more than welcome.

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-04-07 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854649#action_12854649
]

Daniel Dai commented on PIG-1295:
-

Thanks Gianmarco,
Proposal looks good. Besides unit test, we need to add some performance test in
both phase 1 and phase 2.

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-03-31 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851736#action_12851736
]

Daniel Dai commented on PIG-1295:
-

Thanks Gianmarco,
My suggestion is to divide it into two step:
1. make binary comparator works
2. integrate it into the current Pig code

It is better to make sure we have quality deliverable for step 1 before we move
to step 2.

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-03-30 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851453#action_12851453
]

Gianmarco De Francisci Morales commented on PIG-1295:
-

Hi,
I have been reading the source code and the referenced PIG-1038 issue.

Probably Avro integration is too big of a project for GSoC, but implementing
the tuple binary comparator seems doable.
I will write a proposal, any advices for it?

My idea of the project's breakdown would be like this:

Identify the cases that can be optimized and the appropriate visitor for those.
Write a test unit for this optimization.
Implement the comparator knowing the data types of the tuple.
Write a second test unit with different types.
Write the logic to extract tuple boundary from schema information (I suppose
this optimization is possible only if the schema is known)
Try to extend it to the general case of complex data type as secondary key.

Thoughts?

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-03-25 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849645#action_12849645
]

Gianmarco De Francisci Morales commented on PIG-1295:
-

I will start to look at the documentation and source code of both Pig and Avro
to get a rough idea of the work to do.
Maybe having a look at the old patch would be good to identify the points in
the source code where changes are needed?

Where can I discuss this issue and ideas to solve it? mailing list? irc?
Do you have any other suggestion?

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-03-25 Thread Dmitriy V. Ryaboy (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849730#action_12849730
]

Dmitriy V. Ryaboy commented on PIG-1295:

Gianmarco, the ticket for Avro loader (PIG-794) is where you would discuss
this. The old patch can be more or less discarded -- the whole loader
interface changed recently to make this sort of thing much easier and more
powerful -- but it's the issue folks who are interested in this are tracking.

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-03-25 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849817#action_12849817
]

Daniel Dai commented on PIG-1295:
-

Hi, Gianmarco,
Mailing list and Jira should be a good place to discuss your ideas. Dmitriy is
right, we should discuss Avro issues in PIG-794. If you are talking about
Google Summer of Code, Avro is definitely a candidate project for you to
work.

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-03-23 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848809#action_12848809
]

Gianmarco De Francisci Morales commented on PIG-1295:
-

Avro provides efficient binary comparison for tuples with generic schemas.
A simple way to implement this would be to adapt Pig to use Avro for
intermediate files.
This would allow to use generic schemas for keys and solve the genral problem
in an efficient and elegant way.

I would be glad to give it a try.

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

36 matches

Site Navigation

Mail list logo

Footer information