[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2010-01-12 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799370#action_12799370
 ] 

Ying He commented on PIG-480:
-

I ran some tests with a larger data set, and the results are consistent 
with what we saw before.  I didn't run the skewed-data case without a 
combiner, because it kept running out of space.

1. skewed data
combiner                   job 1          job 2          total
patch                      46min          3min 38sec     49min 38sec
trunk                      24min 32sec    6min 53sec     31min 25sec

combiner and skewed join
patch                      6min 40sec     3min 58sec     10min 38sec
trunk                      8min 41sec     8min 32sec     17min 13sec

2. uniform data
combiner
patch                      13min 18sec    7min 9sec      20min 27sec
trunk                      19min 1sec     13min 25sec    32min 26sec

no combiner
patch                      18min 21sec    37min 4sec     55min 25sec
trunk                      16min 31sec    40min 3sec     56min 34sec
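
To make the patch/trunk comparison easier to read, the totals from the 
larger-data-set tables can be reduced to one ratio per case. The sketch 
below is only illustrative: the helper names (to_seconds, 
patch_over_trunk) are made up for this example, and the durations are 
copied from the "total" column of the tables. A ratio below 1.0 means 
the patch was faster than trunk.

```python
import re


def to_seconds(text):
    """Convert a duration such as '3min 38sec', '46min' or '59sec' to seconds."""
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"(\d+)\s*sec", text)
    total = 0
    if minutes:
        total += int(minutes.group(1)) * 60
    if seconds:
        total += int(seconds.group(1))
    return total


def patch_over_trunk(patch, trunk):
    """Return patch time divided by trunk time (below 1.0: patch is faster)."""
    return to_seconds(patch) / to_seconds(trunk)


# (patch total, trunk total), copied from the "total" column above.
totals = {
    "skewed, combiner": ("49min 38sec", "31min 25sec"),
    "skewed, combiner and skewed join": ("10min 38sec", "17min 13sec"),
    "uniform, combiner": ("20min 27sec", "32min 26sec"),
    "uniform, no combiner": ("55min 25sec", "56min 34sec"),
}

for case, (patch, trunk) in totals.items():
    print(f"{case}: patch/trunk = {patch_over_trunk(patch, trunk):.2f}")
```

With these numbers the patch is slower only for skewed data with a 
combiner and no skewed join, and roughly even for uniform data without 
a combiner.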

 PERFORMANCE: Use identity mapper in a chain of M-R jobs
 ---

 Key: PIG-480
 URL: https://issues.apache.org/jira/browse/PIG-480
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
Assignee: Ying He
 Attachments: PIG_480.patch, PIG_480.patch, PIG_480.patch


 For jobs with two or more MR jobs, use the identity mapper wherever possible 
 in the second and subsequent MR jobs. The identity mapper is about 50% faster 
 than Pig's empty map job because it doesn't parse the data. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2010-01-12 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799373#action_12799373
 ] 

Alan Gates commented on PIG-480:


So this code definitely wins in some instances and loses in others.  I propose 
that we include the functionality, but define a property that turns it off 
(something like -Dpig.exec.noidentitymap) and clearly document the cases where 
users would want to turn it off.

Thoughts?




[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2010-01-12 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799376#action_12799376
 ] 

Ying He commented on PIG-480:
-

The option to turn it off is already there. Use

-Dopt.identitymap=false 

to disable it.




[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2010-01-06 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797412#action_12797412
 ] 

Ying He commented on PIG-480:
-

I did more performance tests. They show that performance depends on the 
nature of the data. If the data is skewed, performance is very bad in the 
combiner case. If the data is uniform, the combiner case gets the largest 
performance gain.  The tests use a join followed by a group by statement.

For skewed data, the result is much better if I use skewed join.  I 
think the reason for the bad performance on skewed data is that the map 
plan of the second job is moved into the reducer of the first job. If the 
data is skewed, a single reducer has to execute the extra logic for all of 
its tuples, whereas without this patch that logic would be executed in 
multiple mappers. So we lose parallelism.  The more skewed the data is, 
the worse the performance. 
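
The loss of parallelism described above can be illustrated with a toy 
model. This is a sketch under simplifying assumptions, not a model of 
Pig itself: the function names and tuple counts are invented, every 
tuple is assumed to cost the same, and the I/O and parsing time that the 
identity mapper saves is ignored. It only compares where the second 
job's map logic runs.

```python
def moved_to_reducer_time(tuples_per_reducer, cost_per_tuple=1.0):
    """Map logic runs inside the first job's reducers (with the patch).

    Reducers run in parallel, so the most heavily loaded reducer
    determines when the extra logic finishes.
    """
    return max(tuples_per_reducer) * cost_per_tuple


def separate_mappers_time(tuples_per_reducer, num_mappers, cost_per_tuple=1.0):
    """Map logic runs in the second job's mappers (without the patch).

    Mappers read evenly sized splits of the intermediate output, so key
    skew in the first job's reducers does not affect their load.
    """
    total = sum(tuples_per_reducer)
    per_mapper = -(-total // num_mappers)  # ceiling division
    return per_mapper * cost_per_tuple


# Hypothetical distributions of 1000 tuples over 4 reducers.
uniform = [250, 250, 250, 250]
skewed = [850, 50, 50, 50]

for name, load in (("uniform", uniform), ("skewed", skewed)):
    print(name,
          "in reducer:", moved_to_reducer_time(load),
          "in mappers:", separate_mappers_time(load, num_mappers=4))
```

In this model the two placements cost the same for uniform data, while 
for skewed data the reducer holding most of the tuples becomes the 
bottleneck, which matches the direction of the measurements below.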

1. skewed data
combiner                            job 1         job 2         total
patch                               7min 53sec    1min 1sec     8min 54sec
trunk                               4min 43sec    1min 37sec    6min 20sec

combiner and using skewed join
patch                               1min 55sec    1min 1sec     2min 56sec
trunk                               1min 44sec    1min 40sec    3min 24sec

no combiner
patch                               2min 26sec    2min 28sec    4min 54sec
trunk                               1min 25sec    3min 24sec    4min 49sec

no combiner and using skewed join
patch                               1min 17sec    3min 5sec     4min 22sec
trunk                               59sec         3min 7sec     4min 6sec

2. uniform data
combiner
patch                               6min 48sec    3min 43sec    10min 31sec
trunk                               7min 32sec    7min 3sec     14min 35sec

no combiner
patch                               1min 25sec    2min 25sec    3min 50sec
trunk                               1min 24sec    2min 28sec    3min 52sec

Each group of tests may use different data, so don't compare results 
across groups.





[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2010-01-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797518#action_12797518
 ] 

Hadoop QA commented on PIG-480:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12429598/PIG_480.patch
  against trunk revision 896606.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 230 javac compiler warnings (more 
than the trunk's current 212 warnings).

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

-1 release audit.  The applied patch generated 482 release audit warnings 
(more than the trunk's current 481 warnings).

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/168/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/168/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/168/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/168/console

This message is automatically generated.




[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2009-12-18 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12792569#action_12792569
 ] 

Alan Gates commented on PIG-480:


What kind of performance gain do we get from this?  The only PigMix query that 
looks like it would be directly affected is PigMix_3.  It would be interesting 
to run that and a few other queries that we expect to benefit from this, to 
measure the performance improvements.




[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2009-12-07 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787060#action_12787060
 ] 

Ying He commented on PIG-480:
-

The javac warnings are caused by references to deprecated Hadoop APIs. The 
release audit warning is for an HTML file.




[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2009-12-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12785801#action_12785801
 ] 

Hadoop QA commented on PIG-480:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12426804/PIG_480.patch
  against trunk revision 887049.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 217 javac compiler warnings (more 
than the trunk's current 213 warnings).

-1 findbugs.  The patch appears to cause Findbugs to fail.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/88/testReport/
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/88/console

This message is automatically generated.
