[jira] [Comment Edited] (CRUNCH-602) Combiner initialization repeatedly retrieves RT nodes from DistCache, leading to high NN load

2016-10-14 Thread Dmitry Goldin (JIRA)

[ https://issues.apache.org/jira/browse/CRUNCH-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576009#comment-15576009 ]

Dmitry Goldin edited comment on CRUNCH-602 at 10/14/16 5:56 PM:


To illustrate the impact of this issue, here is some data from two runs of one
of our larger jobs (on minimally different data, but at the same stage of the
job, despite the different temp-folder numbers):

*Without fix*
{code}
Num Mappers: 12879
Num Reducers: 500

dmi@namenode$ grep '/tmp/crunch-1068439135/p10/COMBINE' hdfs-audit.log* | wc -l
12923584
{code}

*With fix*
{code}
Num Mappers: 12843
Num Reducers: 500

dmi@namenode$ grep '/tmp/crunch-336925217/p9/COMBINE' hdfs-audit.log* | wc -l
4912
{code}
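
To put the two counts in proportion, the figures above can be normalized by the number of mappers and by M*R. The following is only a rough sketch: the class and method names are made up for illustration, and how many audit-log entries a single DistCache read produces is not stated in this thread, so the per-pair ratio should be read as an indication rather than an exact call count.

```java
public class AuditLogStats {

    /** Average number of matching audit-log entries per mapper. */
    public static double entriesPerMapper(long auditEntries, int mappers) {
        return (double) auditEntries / mappers;
    }

    /** Matching audit-log entries relative to M*R (mappers times reducers). */
    public static double entriesPerPair(long auditEntries, int mappers, int reducers) {
        // Widen to long before multiplying to avoid int overflow on big jobs.
        return (double) auditEntries / ((long) mappers * reducers);
    }

    public static void main(String[] args) {
        // Figures copied from the two runs quoted above.
        long entriesWithoutFix = 12_923_584L;
        int mappersWithoutFix = 12_879;
        long entriesWithFix = 4_912L;
        int mappersWithFix = 12_843;
        int reducers = 500;

        System.out.printf("without fix: %.2f entries per mapper, %.3f per mapper-reducer pair%n",
                entriesPerMapper(entriesWithoutFix, mappersWithoutFix),
                entriesPerPair(entriesWithoutFix, mappersWithoutFix, reducers));
        System.out.printf("with fix:    %.2f entries per mapper%n",
                entriesPerMapper(entriesWithFix, mappersWithFix));
        System.out.printf("reduction:   about %.0fx fewer entries%n",
                (double) entriesWithoutFix / entriesWithFix);
    }
}
```

Without the fix this works out to roughly 1,003 entries per mapper, or about two per mapper-reducer pair, which is in line with the M*R behaviour described in the issue. With the fix it drops to well under one entry per mapper.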


was (Author: dmi):
To illustrate the impact of this issue, this is some data from two runs of one
of our larger jobs (on minimally different data):

*Without fix*
{code}
Num Mappers: 12879
Num Reducers: 500

dmi@namenode$ grep '/tmp/crunch-1068439135/p10/COMBINE' hdfs-audit.log* | wc -l
12923584
{code}

*With fix*
{code}
Num Mappers: 12843
Num Reducers: 500

dmi@namenode$ grep '/tmp/crunch-336925217/p9/COMBINE' hdfs-audit.log* | wc -l
4912
{code}

> Combiner initialization repeatedly retrieves RT nodes from DistCache, leading to high NN load
> ---------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-602
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-602
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.12.0, 0.13.0, 0.14.0
>         Environment: Crunch 0.14-SNAPSHOT, CDH5.6.0
>            Reporter: Michael Rose
>            Assignee: Josh Wills
>              Labels: performance
>         Attachments: crunch-602.patch, quickfix_crunch_distcache.2.patch, quickfix_crunch_distcache.patch
>
>
> When running one of our Crunch pipelines, we noticed our NameNode under very 
> heavy load. We run our masters on fairly light hardware, so our NN was 
> sitting at 100% CPU.
>
> Crunch reads the RTNodes during creation of a CrunchTaskContext. A 
> CrunchTaskContext is created whenever a Mapper or Reducer is created. 
> Importantly, CrunchCombiner is a subclass of Reducer, so each mapper creates 
> R combiners, where R is the number of reducers, and thus R 
> CrunchTaskContexts. Consequently, in highly parallel jobs with M mappers, 
> this means M*R semi-expensive calls to the NameNode.
>
> In the constructor for CrunchTaskContext, this is the read from the DistCache:
>
> this.nodes = (List) DistCache.read(conf, path);
>
> This in turn leads to a read against the NN plus deserialization.
>
> For now, we took the overly simplistic approach of caching the results of the 
> DistCache read in a Guava cache. With the cache, combiners reuse the RTNodes 
> and pay only the cost of deserialization, which is somewhat unavoidable 
> because RTNodes are stateful and not reusable. However, the cache is not 
> configurable except by modifying code.
>
> I'll attach the patch, but given that it's not yet configurable I wouldn't 
> call it a "fix available." There may be much better ways of fixing this issue 
> as well. If you have some guidance, I'd be happy to do the legwork on a 
> patch.
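
The caching idea in the description can be sketched roughly as follows. This is not the attached patch: it swaps the Guava cache for a plain `ConcurrentHashMap` so that it runs without extra dependencies, `readFromRemote` is a hypothetical stand-in for the NameNode-backed DistCache read, and the path used in `main` is invented. The raw bytes are cached once per path and every caller gets its own copy, mirroring the point that each task still has to deserialize its own, non-shared RTNodes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class CachedReadSketch {

    /** Counts how often the expensive (simulated) remote read actually runs. */
    public static final AtomicInteger remoteReads = new AtomicInteger();

    /** JVM-wide cache: path -> serialized payload. */
    private static final Map<String, byte[]> CACHE = new ConcurrentHashMap<>();

    /** Hypothetical stand-in for the expensive NameNode-backed read. */
    static byte[] readFromRemote(String path) {
        remoteReads.incrementAndGet();
        return ("serialized-nodes-for:" + path).getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Returns the payload for a path, hitting the remote store at most once
     * per path. Each caller receives a private copy, so per-task state
     * derived from it is never shared between callers.
     */
    public static byte[] readCached(String path) {
        byte[] bytes = CACHE.computeIfAbsent(path, CachedReadSketch::readFromRemote);
        return bytes.clone();
    }

    public static void main(String[] args) {
        // Simulate R = 500 combiner contexts being created inside one mapper JVM.
        // The path is made up for this example.
        for (int i = 0; i < 500; i++) {
            readCached("/tmp/example-job/p1/COMBINE");
        }
        System.out.println("remote reads: " + remoteReads.get()); // prints "remote reads: 1"
    }
}
```

A real change would also need to decide on eviction and on whether the cache should be configurable, which is the open question raised in the description.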



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CRUNCH-602) Combiner initialization repeatedly retrieves RT nodes from DistCache, leading to high NN load

2016-10-14 Thread Dmitry Goldin (JIRA)

[ https://issues.apache.org/jira/browse/CRUNCH-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575926#comment-15575926 ]

Dmitry Goldin edited comment on CRUNCH-602 at 10/14/16 5:23 PM:


I noticed that I uploaded a slightly older version of my attempted fix than 
intended. This one actually uses the existing {{getPathToCacheFile}} method to 
retrieve the paths. Maybe this makes it a bit clearer.


was (Author: dmi):
I noticed that I uploaded a slighly older version of my attempted fix than 
intended. This one actually uses the existing {{getPathToCacheFile}} method to 
retrieve the paths. Maybe this makes it a bit clearer.
