[ https://issues.apache.org/jira/browse/SPARK-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201621#comment-14201621 ]
Xuefu Zhang edited comment on SPARK-4290 at 11/7/14 5:37 AM:
-------------------------------------------------------------

Yes, SparkContext#addFile() seems to be what we need. If the files can be broadcast to every executor more efficiently, that's even better than the distributed cache. In the meantime, we can set a large replication factor for the files to mitigate the problem.

To clarify, [~sandyr], [~rxin]: are files added via SparkContext#addFile() automatically downloaded to the executors, or does SparkFiles.get() have to be called to make that happen?

was (Author: xuefuz):
Yes, SparkContext#addFile() seems to be what we need. If the files can be broadcast to every executor more efficiently, that's even better than the distributed cache. In the meantime, we can set a large replication factor for the files to mitigate the problem.

> Provide an equivalent functionality of distributed cache as MR does
> -------------------------------------------------------------------
>
>                 Key: SPARK-4290
>                 URL: https://issues.apache.org/jira/browse/SPARK-4290
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Xuefu Zhang
>
> MapReduce allows a client to specify files to be put in the distributed cache
> for a job, and the framework guarantees that each file will be available in
> the local file system of a node where a task of the job runs, before the task
> actually starts. While this might be achievable on YARN via hacks, it's not
> available in other cluster modes. It would be nice to have equivalent
> functionality in Spark.
> It would also complement Spark's broadcast variables, which may not be
> suitable in certain scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
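For reference, the addFile()/SparkFiles.get() pattern the comment asks about can be sketched in local mode roughly as follows (an illustrative sketch only: it assumes a Spark dependency on the classpath, and the file name "cached.txt" and its contents are made up):

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object AddFileSketch {
  // Returns what each task read from its local copy of the shipped file.
  def run(): Seq[String] = {
    // Create a small file to stand in for a "distributed cache" file.
    val tmp = Files.createTempDirectory("cache").resolve("cached.txt")
    Files.write(tmp, "hello".getBytes("UTF-8"))

    val sc = new SparkContext(
      new SparkConf().setAppName("addFile-sketch").setMaster("local[2]"))
    try {
      // Ship the file to every node that runs tasks for this application.
      sc.addFile(tmp.toString)

      // Inside a task, SparkFiles.get resolves the path of the local copy.
      sc.parallelize(Seq(1, 2), 2).map { _ =>
        val localPath = SparkFiles.get("cached.txt")
        new String(Files.readAllBytes(Paths.get(localPath)), "UTF-8")
      }.collect().toSeq
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(AddFileSketch.run().mkString(","))
}
```

In local mode both tasks read the same local copy, so the original question (whether the download happens eagerly or only on SparkFiles.get()) only shows up on a real cluster.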