[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921270#comment-13921270 ]

Chris Trezzo commented on YARN-1492:
------------------------------------

Thanks [~kkambatl] for the comments!

bq. Would SCM be a single point of failure? If yes, would anyone of the 
following approaches make sense.

Currently, yes, but it will be able to restart while preserving its current 
metadata about the cache. While the SCM is down, the shared cache would not be 
usable, but this would not affect job submission or execution (i.e., 
applications could simply continue operating without it, using normal resource 
submission).
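
To make the fallback concrete, a rough sketch of the client-side behavior is 
below. The SharedCacheClient interface and its use() method are hypothetical 
placeholders, not API from the design doc.

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.Path;

/** Hypothetical SCM client interface, for illustration only. */
interface SharedCacheClient {
  /** Returns the cached path for the given checksum, or null on a miss. */
  Path use(String checksum) throws IOException;
}

public class CacheAwareSubmitter {
  private final SharedCacheClient scm;

  public CacheAwareSubmitter(SharedCacheClient scm) {
    this.scm = scm;
  }

  /** Never fails the job on an SCM outage; falls back to normal submission. */
  public Path resolveResource(Path localJar, String checksum)
      throws IOException {
    try {
      Path cached = scm.use(checksum);
      if (cached != null) {
        return cached; // shared cache hit
      }
    } catch (IOException e) {
      // SCM down or unreachable: behave as if the shared cache did not exist
    }
    return uploadToStagingDir(localJar);
  }

  private Path uploadToStagingDir(Path localJar) {
    // stand-in for the pre-existing per-job resource upload, unchanged today
    return localJar;
  }
}
{code}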

bq. Make SCM an AM. From YARN-896, the only sub-task that affects this would be 
the delegation tokens.

This is an interesting idea. We shied away from this initially because the SCM 
will be a long-running process. Also, with this approach, what would be your 
recommendation for service discovery? Clients would need to know which machine 
the SCM is running on. [~sjlee0] also brought up an interesting point: running 
it as an AM would make local persisted state tricky.
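
For contrast, discovery for a standalone SCM daemon is just a static config 
lookup; the property name and port below are hypothetical placeholders.

{code:java}
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;

public class ScmAddressResolver {
  // Hypothetical property name and port, for illustration only.
  public static final String SCM_ADDRESS_KEY = "yarn.sharedcache.address";
  public static final String DEFAULT_SCM_ADDRESS = "0.0.0.0:8045";

  // With a standalone daemon this lookup is a static config entry; an
  // AM-hosted SCM would come up on an arbitrary node on every launch.
  public static InetSocketAddress resolve(Configuration conf) {
    return conf.getSocketAddr(SCM_ADDRESS_KEY, DEFAULT_SCM_ADDRESS, 8045);
  }
}
{code}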

bq. Add an SCMMonitorService to the RM. If SCM is enabled, this service would 
start the SCM on one of the nodes and monitor it.

This sounds like a good idea. Originally, we wanted to avoid having 
dependencies from the RM to the SCM. The HA/failover part of the design needs 
to be expanded upon in the document (the first priority was stateful SCM 
restart).
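
A rough sketch of how such a monitor could sit on the Hadoop service framework 
is below; the health probe and relaunch calls are placeholders, not a proposed 
API.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.service.AbstractService;

public class SCMMonitorService extends AbstractService {
  private ScheduledExecutorService scheduler;

  public SCMMonitorService() {
    super(SCMMonitorService.class.getName());
  }

  @Override
  protected void serviceStart() throws Exception {
    scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleWithFixedDelay(new Runnable() {
      @Override
      public void run() {
        if (!scmIsHealthy()) {
          // A stateful restart lets the relaunched SCM recover its metadata.
          relaunchScmOnAnotherNode();
        }
      }
    }, 0, 60, TimeUnit.SECONDS);
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    if (scheduler != null) {
      scheduler.shutdownNow();
    }
    super.serviceStop();
  }

  private boolean scmIsHealthy() { return true; }  // placeholder health probe
  private void relaunchScmOnAnotherNode() { }      // placeholder relaunch
}
{code}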

bq. SCM Cleaner Service - the doc mentions the tension between frequency of 
cleaner and load on the RM. Can you elaborate? I was of the opinion that the RM 
is not involved in the caching at all.

When the cleaner service runs, it must ask the RM which application ids belong 
to applications that are no longer running. The more frequently the cleaner 
runs, the more often the SCM has to query the RM, which is where the tension 
between cleaner frequency and RM load comes from. I can update the document 
and talk about this more explicitly.
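
Roughly, each cleaner pass needs a per-application check like the sketch below 
(the terminal-state handling is illustrative only):

{code:java}
import java.io.IOException;
import java.util.EnumSet;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class CleanerRmCheck {
  private static final EnumSet<YarnApplicationState> TERMINAL = EnumSet.of(
      YarnApplicationState.FINISHED,
      YarnApplicationState.FAILED,
      YarnApplicationState.KILLED);

  /**
   * One RM round trip per referenced application id; the more often the
   * cleaner runs, the more of these RPCs the RM has to absorb.
   */
  public static boolean isFinished(YarnClient rm, ApplicationId appId)
      throws IOException, YarnException {
    try {
      return TERMINAL.contains(
          rm.getApplicationReport(appId).getYarnApplicationState());
    } catch (ApplicationNotFoundException e) {
      // An application the RM no longer knows about is safe to treat as done.
      return true;
    }
  }
}
{code}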

bq. Cleaner protocol doesn't mention when the cleaner lock is cleared. I assume 
it is cleared on each exit path.

Correct. When the cleaner deletes the entry row in the SCM table, the cleaner 
lock is deleted for that entry as well. I will update the document to clarify.
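
In sketch form, the lock discipline looks like the following; the ScmStore 
interface is a hypothetical stand-in for the SCM metadata store:

{code:java}
public class CleanerEntryProcessor {
  /** Hypothetical stand-in for the SCM metadata store. */
  interface ScmStore {
    boolean acquireCleanerLock(String key);
    void releaseCleanerLock(String key);
    boolean isOrphaned(String key);
    void deleteEntry(String key); // deleting the row deletes its lock too
  }

  public void process(ScmStore store, String key) {
    if (!store.acquireCleanerLock(key)) {
      return; // another cleaner holds the lock for this entry
    }
    boolean deleted = false;
    try {
      if (store.isOrphaned(key)) {
        store.deleteEntry(key); // the lock goes away with the entry row
        deleted = true;
      }
    } finally {
      if (!deleted) {
        store.releaseCleanerLock(key); // every other exit path clears it here
      }
    }
  }
}
{code}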

bq. Nit: ZK-based store - we can maybe do this in the JIRA corresponding to 
the sub-task - how would this look?

I can write up a more detailed design section for this in the document. We are 
also planning to have a local store (leveraging something like LevelDB or 
HSQLDB).
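
As a very rough sketch of the local-store flavor, using the leveldbjni binding 
(the checksum-to-metadata key/value layout here is illustrative only, not from 
the design doc):

{code:java}
import java.io.File;
import java.io.IOException;

import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

public class LevelDbScmStore {
  private final DB db;

  public LevelDbScmStore(File dir) throws IOException {
    Options options = new Options().createIfMissing(true);
    db = JniDBFactory.factory.open(dir, options);
  }

  /** Persist one cache entry: resource checksum -> reference metadata. */
  public void putEntry(String checksum, String metadata) {
    db.put(JniDBFactory.bytes(checksum), JniDBFactory.bytes(metadata));
  }

  /** Returns the stored metadata for a checksum, or null if absent. */
  public String getEntry(String checksum) {
    byte[] value = db.get(JniDBFactory.bytes(checksum));
    return value == null ? null : JniDBFactory.asString(value);
  }

  public void close() throws IOException {
    db.close();
  }
}
{code}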

bq. More nit-picking: The rationale for not using in-memory and reconstructing 
seems to come from long-running applications. Given long-running applications 
don't benefit from the shared cache as much as the shorter ones, is this a huge 
concern?

The concern around the in-memory/reconstruction approach was more about 
long-running applications holding the cleaner service hostage. It seems 
difficult for the SCM to prevent a long-running application from using the 
shared cache. If a long-running application decides to use the shared cache 
for a resource and the SCM restarts/crashes, then the cleaner service will not 
be able to run until the application has terminated. This seemed like a big 
enough vulnerability to block this approach.

> truly shared cache for jars (jobjar/libjar)
> -------------------------------------------
>
>                 Key: YARN-1492
>                 URL: https://issues.apache.org/jira/browse/YARN-1492
>             Project: Hadoop YARN
>          Issue Type: New Feature
>    Affects Versions: 2.0.4-alpha
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: shared_cache_design.pdf, shared_cache_design_v2.pdf, 
> shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, 
> shared_cache_design_v5.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and 
> files so that attempts from the same job can reuse them. However, sharing is 
> limited with the distributed cache because it is normally on a per-job basis. 
> On a large cluster, sometimes copying of jobjars and libjars becomes so 
> prevalent that it consumes a large portion of the network bandwidth, not to 
> speak of defeating the purpose of "bringing compute to where data is". This 
> is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared 
> cache so that multiple jobs from multiple users can share and cache jars. 
> This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
