date:20100709

[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-09 Thread Hadoop QA (JIRA)

[
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886647#action_12886647
]

Hadoop QA commented on PIG-1472:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12449033/PIG-1472.3.patch
against trunk revision 960062.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 69 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac
compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

-1 release audit. The applied patch generated 395 release audit warnings
(more than the trunk's current 394 warnings).

+1 core tests. The patch passed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/testReport/
Release audit warnings:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/console

This message is automatically generated.

Optimize serialization/deserialization between Map and Reduce and between MR
jobs
-

Key: PIG-1472
URL: https://issues.apache.org/jira/browse/PIG-1472
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.8.0

Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch

In certain types of Pig queries, most of the execution time is spent
serializing and deserializing (sedes) records between Map and Reduce and
between MR jobs.
For example, if PigMix queries are modified to specify types for all the
fields in the load statement schema, some of the queries (L2, L3, L9, L10 in
PigMix v1) that have records with bags and maps being transmitted across map
or reduce boundaries run a lot longer (a runtime increase of a few times has
been seen).
There are a few optimizations that have been shown to improve the performance
of sedes in my tests:
1. Use a smaller number of bytes to store the length of the column. For
example, if a bytearray is smaller than 255 bytes, a byte can be used to store
the length instead of the integer that is currently used.
2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and
DataInput.readUTF. This reduces the cost of serialization by more than half.
Zebra and BinStorage are known to use DefaultTuple sedes functionality. The
serialization format that these loaders use cannot change, so after the
optimization their format is going to be different from the format used
between M/R boundaries.
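
The two optimizations listed above can be sketched roughly as follows. This is
only an illustration, not the code in the attached patches: the class name
SedesSketch, the marker constants TINY_BYTEARRAY and BYTEARRAY, and the helper
methods are all hypothetical, and the cutoff of 255 (the largest value an
unsigned byte can hold) is an assumption. Only the java.io calls
(DataOutput.writeUTF, DataInput.readUTF and friends) are real APIs. Note that
writeUTF uses a 2-byte length prefix, so it rejects strings whose encoded form
exceeds 65535 bytes; a real implementation would need a fallback for those.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class SedesSketch {
    // Hypothetical type markers; the real format may use different values.
    static final byte TINY_BYTEARRAY = 1; // length stored in one unsigned byte
    static final byte BYTEARRAY = 2;      // length stored in a 4-byte int

    // Optimization 1: pick the smallest length field that fits.
    static void writeBytes(DataOutput out, byte[] value) throws IOException {
        if (value.length <= 255) {
            out.writeByte(TINY_BYTEARRAY);
            out.writeByte(value.length); // low 8 bits, read back as unsigned
        } else {
            out.writeByte(BYTEARRAY);
            out.writeInt(value.length);
        }
        out.write(value);
    }

    static byte[] readBytes(DataInput in) throws IOException {
        byte marker = in.readByte();
        int length;
        if (marker == TINY_BYTEARRAY) {
            length = in.readUnsignedByte();
        } else if (marker == BYTEARRAY) {
            length = in.readInt();
        } else {
            throw new IOException("Unexpected marker: " + marker);
        }
        byte[] value = new byte[length];
        in.readFully(value);
        return value;
    }

    // Number of bytes writeBytes produces for the given value.
    static int encodedSize(byte[] value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeBytes(new DataOutputStream(buf), value);
            return buf.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] roundTripBytes(byte[] value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeBytes(new DataOutputStream(buf), value);
            return readBytes(new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Optimization 2: delegate String sedes to writeUTF/readUTF.
    static String roundTripString(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            return new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray())).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodedSize(new byte[10]));  // 1 marker + 1 length + 10 = 12
        System.out.println(encodedSize(new byte[300])); // 1 marker + 4 length + 300 = 305
        System.out.println(roundTripString("pig"));
    }
}
```

In this sketch a short bytearray saves three bytes per field compared with
always writing a 4-byte length, at the cost of one extra branch on read.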

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1434) Allow casting relations to scalars

2010-07-09 Thread Daniel Dai (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886772#action_12886772
 ] 

Daniel Dai commented on PIG-1434:
-

We could also add a sanity check instead of just applying a limit.
{code}
C = foreach C generate CheckSingular(*);
Z = join X by 1, C by 1 using 'replicated';
Y = foreach Z generate X::$1/(long) C.count, X::$2-(long) C.max;
{code}

CheckSingular will check that C has only one record.
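
The check itself could look roughly like the sketch below. It is plain Java
and deliberately does not use Pig's UDF classes, so the class name
CheckSingularSketch, the method signature, and the choice of
IllegalStateException are all assumptions made for illustration; an actual
CheckSingular UDF would operate on Pig tuples and bags instead.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

class CheckSingularSketch {
    // Returns the only record, or fails if there are zero or several records.
    static <T> T checkSingular(Iterable<T> records) {
        Iterator<T> it = records.iterator();
        if (!it.hasNext()) {
            throw new IllegalStateException(
                    "expected exactly one record, found none");
        }
        T only = it.next();
        if (it.hasNext()) {
            throw new IllegalStateException(
                    "expected exactly one record, found more than one");
        }
        return only;
    }

    public static void main(String[] args) {
        // Stand-in for relation C holding a single COUNT value.
        List<Long> c = Collections.singletonList(42L);
        System.out.println(checkSingular(c));
    }
}
```

Unlike a limit, which would silently keep the first record, this fails loudly
when the relation holds more than one record.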

 Allow casting relations to scalars
 --

 Key: PIG-1434
 URL: https://issues.apache.org/jira/browse/PIG-1434
 Project: Pig
 Issue Type: Improvement
 Reporter: Olga Natkovich
 Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: scalarImpl.patch


 This jira is to implement a simplified version of the functionality described 
 in https://issues.apache.org/jira/browse/PIG-801.
 The proposal is to allow casting relations to scalar types in foreach.
 Example:
 A = load 'data' as (x, y, z);
 B = group A all;
 C = foreach B generate COUNT(A);
 .
 X = 
 Y = foreach X generate $1/(long) C;
 A couple of additional comments:
 (1) You can only cast relations containing a single value; otherwise an error 
 will be reported.
 (2) Name resolution is needed, since relation X might have a field named C, in 
 which case that field takes precedence.
 (3) Y will look for the C closest to it.
 Implementation thoughts:
 The idea is to store C into a file and then convert it into a scalar via a UDF. 
 I believe we already have a UDF that Ben Reed contributed for this purpose. 
 Most of the work would be to update the logical plan to
 (1) Store C
 (2) Convert the cast to the UDF

