[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-12 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887441#action_12887441
 ] 

Thejas M Nair commented on PIG-1472:


bq. 1. The following code are never used in BinStorage and InterStorage, should 
be removed. 
I will remove that.

bq. 3. Seems InterStorage is a replacement for BinStorage, why do we make it 
private? Shall we encourage user use InterStorage in the place of BinStorage, 
and make BinStorage deprecate?
In future, we are likely to find better ways to serialize data between MR jobs 
of a pig query. ie the InterSedes serialization format is likely to change, and 
the change is not likely to be compatible with its old format. So it will not 
be suitable for storing persistent data. 
This replaces BinStorage only for its use within pig. Since BinStorage is used 
in pig queries and it should be easy to maintain the code, I think we don't 
have to deprecate BinStorage.



 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch


 In certain types of pig queries most of the execution time is spent in 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2,L3,L9, L10 in 
 pigmix v1) that have records with bags and maps being transmitted across map 
 or reduce boundaries run a lot longer (runtime increase of few times has been 
 seen.
 There are a few optimizations that have shown to improve the performance of 
 sedes in my tests -
 1. Use smaller number of bytes to store length of the column . For example if 
 a bytearray is smaller than 255 bytes , a byte can be used to store the 
 length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF.  This reduces the cost of serialization by more than 1/2. 
 Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
 serialization format that these loaders use cannot change, so after the 
 optimization their format is going to be different from the format used 
 between M/R boundaries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-12 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887545#action_12887545
 ] 

Daniel Dai commented on PIG-1472:
-

+1 for commit.

 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.4.patch, 
 PIG-1472.patch


 In certain types of pig queries most of the execution time is spent in 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2,L3,L9, L10 in 
 pigmix v1) that have records with bags and maps being transmitted across map 
 or reduce boundaries run a lot longer (runtime increase of few times has been 
 seen.
 There are a few optimizations that have shown to improve the performance of 
 sedes in my tests -
 1. Use smaller number of bytes to store length of the column . For example if 
 a bytearray is smaller than 255 bytes , a byte can be used to store the 
 length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF.  This reduces the cost of serialization by more than 1/2. 
 Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
 serialization format that these loaders use cannot change, so after the 
 optimization their format is going to be different from the format used 
 between M/R boundaries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-11 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887265#action_12887265
 ] 

Daniel Dai commented on PIG-1472:
-

Patch looks good. Couple of comments:
1. The following code are never used in BinStorage and InterStorage, should be 
removed.
{code}
public static final int RECORD_1 = 0x01;
public static final int RECORD_2 = 0x02;
public static final int RECORD_3 = 0x03;
{code}

2. In BinInterSedes, why do we have type GENERIC_WRITABLECOMPARABLE? When it 
will be used?

3. Seems InterStorage is a replacement for BinStorage, why do we make it 
private? Shall we encourage user use InterStorage in the place of BinStorage, 
and make BinStorage deprecate?

 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch


 In certain types of pig queries most of the execution time is spent in 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2,L3,L9, L10 in 
 pigmix v1) that have records with bags and maps being transmitted across map 
 or reduce boundaries run a lot longer (runtime increase of few times has been 
 seen.
 There are a few optimizations that have shown to improve the performance of 
 sedes in my tests -
 1. Use smaller number of bytes to store length of the column . For example if 
 a bytearray is smaller than 255 bytes , a byte can be used to store the 
 length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF.  This reduces the cost of serialization by more than 1/2. 
 Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
 serialization format that these loaders use cannot change, so after the 
 optimization their format is going to be different from the format used 
 between M/R boundaries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-11 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887267#action_12887267
 ] 

Daniel Dai commented on PIG-1472:
-

Forget 2, GENERIC_WRITABLECOMPARABLE also in DataReaderWriter, we just follow.

 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch


 In certain types of pig queries most of the execution time is spent in 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2,L3,L9, L10 in 
 pigmix v1) that have records with bags and maps being transmitted across map 
 or reduce boundaries run a lot longer (runtime increase of few times has been 
 seen.
 There are a few optimizations that have shown to improve the performance of 
 sedes in my tests -
 1. Use smaller number of bytes to store length of the column . For example if 
 a bytearray is smaller than 255 bytes , a byte can be used to store the 
 length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF.  This reduces the cost of serialization by more than 1/2. 
 Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
 serialization format that these loaders use cannot change, so after the 
 optimization their format is going to be different from the format used 
 between M/R boundaries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886647#action_12886647
 ] 

Hadoop QA commented on PIG-1472:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12449033/PIG-1472.3.patch
  against trunk revision 960062.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 69 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

-1 release audit.  The applied patch generated 395 release audit warnings 
(more than the trunk's current 394 warnings).

+1 core tests.  The patch passed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/console

This message is automatically generated.

 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch


 In certain types of pig queries most of the execution time is spent in 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2,L3,L9, L10 in 
 pigmix v1) that have records with bags and maps being transmitted across map 
 or reduce boundaries run a lot longer (runtime increase of few times has been 
 seen.
 There are a few optimizations that have shown to improve the performance of 
 sedes in my tests -
 1. Use smaller number of bytes to store length of the column . For example if 
 a bytearray is smaller than 255 bytes , a byte can be used to store the 
 length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF.  This reduces the cost of serialization by more than 1/2. 
 Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
 serialization format that these loaders use cannot change, so after the 
 optimization their format is going to be different from the format used 
 between M/R boundaries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-09 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886820#action_12886820
 ] 

Thejas M Nair commented on PIG-1472:


The audit warning diff looks bogus. The contrib tests passed when i ran them on 
my machine, failures seem to be caused by hudson environment.

The changes in PIG-1295 will need to be ported to work with this new 
serialization format. For that patch, I think we should introduce a new 
functions in InterSedes that can compare two serialized tuples. Also add a 
function to BinSedesTuple that returns corresponding InterSedes class. 
Then while selecting the comparator, add a check to see if the default tuple 
type is BinSedesTuple, if yes, use the corresponding InterSedes function as the 
comparator class.  


 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch


 In certain types of pig queries most of the execution time is spent in 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2,L3,L9, L10 in 
 pigmix v1) that have records with bags and maps being transmitted across map 
 or reduce boundaries run a lot longer (runtime increase of few times has been 
 seen.
 There are a few optimizations that have shown to improve the performance of 
 sedes in my tests -
 1. Use smaller number of bytes to store length of the column . For example if 
 a bytearray is smaller than 255 bytes , a byte can be used to store the 
 length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF.  This reduces the cost of serialization by more than 1/2. 
 Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
 serialization format that these loaders use cannot change, so after the 
 optimization their format is going to be different from the format used 
 between M/R boundaries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886281#action_12886281
 ] 

Hadoop QA commented on PIG-1472:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448937/PIG-1472.2.patch
  against trunk revision 960062.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 69 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated 1 warning messages.

-1 javac.  The applied patch generated 148 javac compiler warnings (more 
than the trunk's current 145 warnings).

-1 findbugs.  The patch appears to introduce 2 new Findbugs warnings.

-1 release audit.  The applied patch generated 400 release audit warnings 
(more than the trunk's current 399 warnings).

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/console

This message is automatically generated.

 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.patch


 In certain types of pig queries most of the execution time is spent in 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2,L3,L9, L10 in 
 pigmix v1) that have records with bags and maps being transmitted across map 
 or reduce boundaries run a lot longer (runtime increase of few times has been 
 seen.
 There are a few optimizations that have shown to improve the performance of 
 sedes in my tests -
 1. Use smaller number of bytes to store length of the column . For example if 
 a bytearray is smaller than 255 bytes , a byte can be used to store the 
 length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF.  This reduces the cost of serialization by more than 1/2. 
 Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
 serialization format that these loaders use cannot change, so after the 
 optimization their format is going to be different from the format used 
 between M/R boundaries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-06 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12885835#action_12885835
 ] 

Thejas M Nair commented on PIG-1472:


I ran the pigmix1 queries with changes to specify types for columns in load 
function, with this patch, and I see significant performance improvements for 
following queries -
The time is in secs, and the time in table is the best out of 3 runs. The 
queries that show significant improvement ( 10%) are ones where bag/map 
columns are written across Map/Reduce boundaries. (In pigmix v1 L2, L3 queries 
don't prune the bag and map columns before group-by or join.)

||pigmix1 query || before patch || after patch|| % diff ||
|L1 | 211 | 198 | 6%|
| L2 | 514|424 | 17% |
| L3| 670|541 | 19%|
| L4 | 133|123 | 7.5%| 
| L5| 118|113 | | 
| L6 | 139|134 | |
| L7 | 114|114 | | 
| L8 | 68|69 | | 
| L9| 1113|957 | 14%| 
| L10 | 1153| 998| 13.4%| 
| L11 | 317| 317| | 
| L12 | 124| 123| | 


 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.patch


 In certain types of pig queries most of the execution time is spent in 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2,L3,L9, L10 in 
 pigmix v1) that have records with bags and maps being transmitted across map 
 or reduce boundaries run a lot longer (runtime increase of few times has been 
 seen.
 There are a few optimizations that have shown to improve the performance of 
 sedes in my tests -
 1. Use smaller number of bytes to store length of the column . For example if 
 a bytearray is smaller than 255 bytes , a byte can be used to store the 
 length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF.  This reduces the cost of serialization by more than 1/2. 
 Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
 serialization format that these loaders use cannot change, so after the 
 optimization their format is going to be different from the format used 
 between M/R boundaries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.