[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13912144#comment-13912144
 ] 

Karthik Kambatla commented on YARN-1492:
----------------------------------------

Thanks for sharing this, [~ctrezzo]. The document is nicely written. A few 
comments:
* Would SCM be a single point of failure? If yes, would any of the following 
approaches make sense?
** Make SCM an AM. From YARN-896, the only sub-task that affects this would be 
the delegation tokens. 
** Add an SCMMonitorService to the RM. If SCM is enabled, this service would 
start the SCM on one of the nodes and monitor it. 
* SCM Cleaner Service - the doc mentions the tension between the frequency of 
the cleaner and the load on the RM. Can you elaborate? I was of the opinion 
that the RM is not involved in the caching at all. 
* The cleaner protocol doesn't mention when the cleaner lock is cleared. I 
assume it is cleared on each exit path. 
* Nit: ZK-based store - what would this look like? We can maybe discuss this 
in the JIRA corresponding to the sub-task. 
* More nit-picking: The rationale for not using an in-memory store and 
reconstructing it seems to come from long-running applications. Given that 
long-running applications don't benefit from the shared cache as much as 
shorter ones, is this a huge concern? 

> truly shared cache for jars (jobjar/libjar)
> -------------------------------------------
>
>                 Key: YARN-1492
>                 URL: https://issues.apache.org/jira/browse/YARN-1492
>             Project: Hadoop YARN
>          Issue Type: New Feature
>    Affects Versions: 2.0.4-alpha
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: shared_cache_design.pdf, shared_cache_design_v2.pdf, 
> shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, 
> shared_cache_design_v5.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and 
> files so that attempts from the same job can reuse them. However, sharing is 
> limited with the distributed cache because it is normally on a per-job basis. 
> On a large cluster, sometimes copying of jobjars and libjars becomes so 
> prevalent that it consumes a large portion of the network bandwidth, not to 
> speak of defeating the purpose of "bringing compute to where data is". This 
> is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared 
> cache so that multiple jobs from multiple users can share and cache jars. 
> This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
