date:20100711

[jira] Commented: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order

2010-07-11 Thread Gianmarco De Francisci Morales (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887178#action_12887178
 ] 

Gianmarco De Francisci Morales commented on PIG-1468:
-

It is quite easy to fix DataType.compare() to take the unsigned logic into 
account.
But I am starting to feel that all of this is probably not worth the trouble.
This would make DataType.compare() for Bytes behave differently from Byte.compareTo().


 DataByteArray.compareTo() does not compare in lexicographic order
 -

 Key: PIG-1468
 URL: https://issues.apache.org/jira/browse/PIG-1468
 Project: Pig
  Issue Type: Bug
Reporter: Gianmarco De Francisci Morales
Assignee: Gianmarco De Francisci Morales
 Attachments: PIG-1468.patch


 The compareTo() method of org.apache.pig.data.DataByteArray does not compare 
 items in lexicographic order.
 Instead, it takes into account the sign of the bytes that compose the 
 DataByteArray.
 So, for example, 0xff compares as less than 0x00.
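
The behavior described above can be reproduced outside Pig. The following is a minimal sketch, not the actual DataByteArray code: the class and method names are hypothetical, and it only contrasts a signed byte-by-byte comparison with an unsigned (lexicographic) one.

```java
class ByteOrderDemo {
    // Signed comparison: Java bytes range from -128 to 127, so 0xff is -1.
    static int compareSigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return a[i] < b[i] ? -1 : 1;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Lexicographic comparison: mask each byte to the range 0..255 first.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] high = {(byte) 0xff};
        byte[] low = {(byte) 0x00};
        System.out.println(compareSigned(high, low));   // prints -1
        System.out.println(compareUnsigned(high, low)); // prints 1
    }
}
```

With the signed version, 0xff sorts before 0x00; masking with 0xff restores the lexicographic order.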

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-07-11 Thread Gianmarco De Francisci Morales (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Gianmarco De Francisci Morales updated PIG-1295:

Attachment: PIG-1295_0.8.patch

I added code so that, if the user does not use DefaultTuple, we fall back to the
default deserialization case. I assume a user-defined tuple will have a
DataType byte different from DataType.TUPLE. If this is not the case, I see no
way of distinguishing DefaultTuple from any other Tuple implementation.
Anyway, I think this issue needs to be properly addressed in the context of
[PIG-1472|https://issues.apache.org/jira/browse/PIG-1472].

I added support for BIGCHARARRAY.

UTF-8 decoding is quite convoluted. It is a variable length encoding, so we
cannot avoid using a String. [UTF-8|http://en.wikipedia.org/wiki/UTF-8]
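
As a small illustration of the variable-length point (a sketch using only the standard library; the class name is hypothetical), the number of bytes per character depends on the character, so character boundaries cannot be found with fixed-offset arithmetic:

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));      // prints 1 (ASCII letter)
        System.out.println(utf8Length("\u00e9")); // prints 2 (e with acute accent)
        System.out.println(utf8Length("\u20ac")); // prints 3 (euro sign)
    }
}
```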

Before tackling the integration with
[PIG-1472|https://issues.apache.org/jira/browse/PIG-1472] I need to familiarize
myself with the code in the patch. I will write a proposal for the integration in
the next few days.

I also made some changes to DataByteArray in order to encapsulate the logic for
comparison in a publicly accessible method. This way the raw comparison is
consistent with the behavior of the class, in a way similar to the other cases
where I delegate comparison to the class.
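
The design idea can be sketched as follows. This is not the code from the patch: the class name, the method signature, and the unsigned ordering are all assumptions made for illustration. The point is only that the object comparison and the raw (byte-level) comparison go through the same public method, so they cannot drift apart.

```java
class SketchByteArray implements Comparable<SketchByteArray> {
    private final byte[] data;

    SketchByteArray(byte[] data) {
        this.data = data;
    }

    // Single public entry point holding the comparison logic.
    // It works on slices, so a raw comparator can call it on a shared buffer.
    // The unsigned ordering used here is an assumption of this sketch.
    public static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        return Integer.compare(l1, l2);
    }

    // The object-level comparison delegates to the same method.
    @Override
    public int compareTo(SketchByteArray other) {
        return compare(data, 0, data.length, other.data, 0, other.data.length);
    }
}
```

A raw comparator would call `SketchByteArray.compare(...)` directly on the serialized buffers, without creating any objects.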

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
Fix For: 0.8.0

Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch

When the Hadoop framework does the sorting, it will try to use the binary version of
the comparator if one is available. The benefit of a binary comparator is that we do
not need to instantiate the object before we compare. We see a ~30% speedup after we
switch to the binary comparator. Currently, Pig uses a binary comparator in the
following cases:
1. When the semantics of the order do not matter. For example, in distinct, we need
to do a sort in order to filter out duplicate values; however, we do not care
how the comparator sorts the keys. Group by shares this characteristic. In this
case, we rely on Hadoop's default binary comparator.
2. When the semantics of the order matter, but the key is of a simple type. In this
case, we have implementations for simple types, such as integer, long, float,
chararray, databytearray, string.
However, if the key is a tuple and the sort semantics matter, we do not have
a binary comparator implementation. This especially matters when we switch to
secondary sort. In secondary sort, we convert the inner sort of a nested
foreach into the secondary key and rely on Hadoop to sort on both the main key
and the secondary key. The sorting key becomes a two-item tuple. Since the
secondary key is the sorting key of the nested foreach, the sorting semantics
matter. It turns out that we do not have a binary comparator once we use secondary
sort, and we see a significant slowdown.
A binary comparator for tuples should be doable once we understand the binary
structure of the serialized tuple. We can focus on the most common use case
first, which is group by followed by a nested sort. In this case, we will
use secondary sort. The semantics of the first key do not matter, but the semantics
of the secondary key do. We need to identify the boundary between the main key and
the secondary key in the binary tuple buffer without instantiating the tuple itself.
Then, if the first keys are equal, we use a binary comparator to compare the
secondary keys. The secondary key can also be a complex data type, but as a first
step we focus on simple secondary keys, which are the most common use case.
We mark this issue as a candidate project for the Google Summer of Code 2010
program.
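
The approach described above can be sketched with a toy format. Everything about the layout here is an assumption made for illustration (a 4-byte main key length, the main key bytes, then a 4-byte integer secondary key); it is not Pig's real tuple serialization. The method signature mirrors the shape of Hadoop's RawComparator.compare, but the sketch does not depend on Hadoop.

```java
import java.nio.ByteBuffer;

class SecondarySortSketch {
    // Hypothetical layout: [4-byte main key length][main key bytes][4-byte int secondary key]
    static byte[] encode(byte[] mainKey, int secondaryKey) {
        ByteBuffer buf = ByteBuffer.allocate(4 + mainKey.length + 4);
        buf.putInt(mainKey.length).put(mainKey).putInt(secondaryKey);
        return buf.array();
    }

    // Compares two serialized records without building any key objects.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Locate the boundary between main key and secondary key.
        int len1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int len2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        int main1 = s1 + 4;
        int main2 = s2 + 4;

        // Main key: any consistent order will do, so compare the raw bytes.
        int n = Math.min(len1, len2);
        for (int i = 0; i < n; i++) {
            int x = b1[main1 + i] & 0xff;
            int y = b2[main2 + i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        if (len1 != len2) {
            return len1 < len2 ? -1 : 1;
        }

        // Main keys are equal: the order of the secondary key matters,
        // so decode it as an int instead of comparing bytes blindly.
        int sec1 = ByteBuffer.wrap(b1, main1 + len1, 4).getInt();
        int sec2 = ByteBuffer.wrap(b2, main2 + len2, 4).getInt();
        return Integer.compare(sec1, sec2);
    }
}
```

Decoding the secondary key matters because a plain byte comparison of two's-complement integers would place negative values after positive ones.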


[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-11 Thread Daniel Dai (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887265#action_12887265
 ] 

Daniel Dai commented on PIG-1472:
-

The patch looks good. A couple of comments:
1. The following constants are never used in BinStorage and InterStorage and 
should be removed.
{code}
public static final int RECORD_1 = 0x01;
public static final int RECORD_2 = 0x02;
public static final int RECORD_3 = 0x03;
{code}

2. In BinInterSedes, why do we have the type GENERIC_WRITABLECOMPARABLE? When 
will it be used?

3. It seems InterStorage is a replacement for BinStorage, so why do we make it 
private? Shall we encourage users to use InterStorage in place of BinStorage, 
and deprecate BinStorage?

 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch


 In certain types of Pig queries, most of the execution time is spent in 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2, L3, L9, L10 in 
 PigMix v1) that have records with bags and maps being transmitted across map 
 or reduce boundaries run a lot longer (a runtime increase of a few times has 
 been seen).
 There are a few optimizations that have been shown to improve the performance 
 of sedes in my tests:
 1. Use a smaller number of bytes to store the length of the column. For example, 
 if a bytearray is smaller than 255 bytes, a byte can be used to store the 
 length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF. This reduces the cost of serialization by more than 1/2. 
 Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
 serialization format that these loaders use cannot change, so after the 
 optimization their format is going to be different from the format used 
 between M/R boundaries.
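
The two optimizations can be sketched with the standard java.io classes. This is an illustration under stated assumptions, not the code in the patch: the marker constants and their values are hypothetical, as are the class and method names. Only DataOutput.writeUTF and DataInput.readUTF are taken from the description.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class CompactSedesSketch {
    // Hypothetical type markers; the real marker values are not given in the text.
    static final byte MARKER_TINY = 1; // length stored in one byte
    static final byte MARKER_FULL = 2; // length stored in a four-byte int

    // Optimization 1: pick the length field size from the actual length.
    static byte[] writeByteArray(byte[] value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            if (value.length < 255) { // threshold as in the description
                out.writeByte(MARKER_TINY);
                out.writeByte(value.length);
            } else {
                out.writeByte(MARKER_FULL);
                out.writeInt(value.length);
            }
            out.write(value);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Optimization 2: let DataOutput.writeUTF handle string serialization.
    static byte[] writeString(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(s); // two-byte length prefix plus encoded characters
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String readString(byte[] serialized) {
        try {
            return new DataInputStream(new ByteArrayInputStream(serialized)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a 10-byte array costs 12 bytes on the wire (marker, one length byte, payload) instead of 15 with a fixed four-byte length. Note that writeUTF limits the encoded string to 65535 bytes, so longer strings would need a separate path.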


[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-11 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887267#action_12887267
]

Daniel Dai commented on PIG-1472:
-

Disregard point 2: GENERIC_WRITABLECOMPARABLE is also in DataReaderWriter, and the patch just follows it.


[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-07-11 Thread Daniel Dai (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887270#action_12887270
]

Daniel Dai commented on PIG-1295:
-

With the changes in PIG-1472, we need to change the raw comparator accordingly:
1. Bag comparison should be changed to compare TINYBAG/SMALLBAG/BAG
2. Tuple comparison should be changed to compare TINYTUPLE/SMALLTUPLE/TUPLE
3. Map comparison should be changed to compare TINYMAP/SMALLMAP/MAP
4. Integer comparison should be changed to compare
INTEGER_0/INTEGER_1/INTEGER_INBYTE/INTEGER_INSHORT/INTEGER
5. ByteArray comparison should be changed to compare
TINYBYTEARRAY/SMALLBYTEARRAY/BYTEARRAY
6. Chararray comparison should be changed to compare SMALLCHARARRAY/CHARARRAY
7. The raw comparator now depends on the serialization format. We now have two
serialization formats, DefaultTuple and BinSedesTuple. It would be better to move
PigTupleRawComparatorNew inside BinSedesTuple. But in this project, we only
focus on BinSedesTuple, which addresses most use cases.
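
To illustrate why the comparator has to know about the new type markers, here is a minimal sketch for the integer case. The marker names come from the list above, but their byte values, the encoding rules, and the class itself are assumptions made for illustration, not the actual BinInterSedes format.

```java
import java.nio.ByteBuffer;

class IntMarkerSketch {
    // Marker names follow the list above; the values are hypothetical.
    static final byte INTEGER_0 = 10;
    static final byte INTEGER_1 = 11;
    static final byte INTEGER_INBYTE = 12;
    static final byte INTEGER_INSHORT = 13;
    static final byte INTEGER = 14;

    // Writes the value using the smallest representation that fits.
    static byte[] write(int v) {
        if (v == 0) return new byte[] {INTEGER_0};
        if (v == 1) return new byte[] {INTEGER_1};
        if (v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) {
            return new byte[] {INTEGER_INBYTE, (byte) v};
        }
        if (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) {
            return ByteBuffer.allocate(3).put(INTEGER_INSHORT).putShort((short) v).array();
        }
        return ByteBuffer.allocate(5).put(INTEGER).putInt(v).array();
    }

    // Decodes the value at the given offset according to its marker.
    static int read(byte[] buf, int offset) {
        switch (buf[offset]) {
            case INTEGER_0: return 0;
            case INTEGER_1: return 1;
            case INTEGER_INBYTE: return buf[offset + 1];
            case INTEGER_INSHORT: return ByteBuffer.wrap(buf, offset + 1, 2).getShort();
            case INTEGER: return ByteBuffer.wrap(buf, offset + 1, 4).getInt();
            default: throw new IllegalArgumentException("not an integer marker: " + buf[offset]);
        }
    }

    // Two integers may be stored under different markers, so the raw
    // comparator decodes each side by its own marker before comparing.
    static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        return Integer.compare(read(b1, s1), read(b2, s2));
    }
}
```

The same pattern would apply to the bag, tuple, map, bytearray, and chararray variants: read the marker first, then interpret the length or value field that follows it.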

In the integration code, we should check whether the TupleFactory is actually
BinSedesTupleFactory. If it is, use this raw comparator; otherwise, use the
original comparator.

I was wrong about the customized tuple. We do not need a fallback scheme for
customized tuples. In the serialized format, all Tuples, including customized
Tuples, will be serialized into the same format.

It looks like UTF-8 encoding is convoluted, so we can leave it for now.


[jira] Created: (PIG-1493) Column Pruner throw exception inconsistent pruning

2010-07-11 Thread Daniel Dai (JIRA)

Column Pruner throw exception inconsistent pruning


 Key: PIG-1493
 URL: https://issues.apache.org/jira/browse/PIG-1493
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0, 0.7.0


The following script fails:
{code}
a = load '1.txt' as (a0:chararray, a1:chararray, a2);
b = foreach a generate CONCAT(a0,a1) as b0, a0, a2;
c = foreach b generate a0, a2;
dump c;
{code}

Error message:
ERROR 2185: Column $0 of (Name: b: ForEach 1-50 Operator Key: 1-50) inconsistent pruning

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias c
    at org.apache.pig.PigServer.openIterator(PigServer.java:698)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:595)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:291)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
    at org.apache.pig.Main.run(Main.java:451)
    at org.apache.pig.Main.main(Main.java:103)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias c
    at org.apache.pig.PigServer.storeEx(PigServer.java:804)
    at org.apache.pig.PigServer.store(PigServer.java:760)
    at org.apache.pig.PigServer.openIterator(PigServer.java:680)
    ... 7 more
Caused by: org.apache.pig.impl.plan.optimizer.OptimizerException: ERROR 2212: Unable to prune plan
    at org.apache.pig.impl.logicalLayer.optimizer.PruneColumns.prune(PruneColumns.java:826)
    at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:240)
    at org.apache.pig.PigServer.compileLp(PigServer.java:1180)
    at org.apache.pig.PigServer.storeEx(PigServer.java:799)
    ... 9 more
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 2188: Cannot prune columns for (Name: b: ForEach 1-50 Operator Key: 1-50)
    at org.apache.pig.impl.logicalLayer.ColumnPruner.prune(ColumnPruner.java:177)
    at org.apache.pig.impl.logicalLayer.ColumnPruner.visit(ColumnPruner.java:202)
    at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:132)
    at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:47)
    at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
    at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
    at org.apache.pig.impl.logicalLayer.optimizer.PruneColumns.prune(PruneColumns.java:821)
    ... 12 more
Caused by: org.apache.pig.impl.plan.optimizer.OptimizerException: ERROR 2185: Column $0 of (Name: b: ForEach 1-50 Operator Key: 1-50) inconsistent pruning
    at org.apache.pig.impl.logicalLayer.ColumnPruner.prune(ColumnPruner.java:148)


[jira] Updated: (PIG-1493) Column Pruner throw exception inconsistent pruning

2010-07-11 Thread Daniel Dai (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1493:


Attachment: PIG-1493-1.patch


[jira] Updated: (PIG-1493) Column Pruner throw exception inconsistent pruning

2010-07-11 Thread Daniel Dai (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1493:


Status: Patch Available  (was: Open)

