[ https://issues.apache.org/jira/browse/SPARK-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201621#comment-14201621 ]
Xuefu Zhang edited comment on SPARK-4290 at 11/7/14 5:37 AM:
-------------------------------------------------------------

Yes, SparkContext#addFile() seems to be what we need. If the files can be broadcast to every executor more efficiently, that's even better than the distributed cache. In the meantime, we can set a large replication factor for the files to mitigate the problem.

To clarify, [~sandyr], [~rxin]: are files added via SparkContext#addFile() automatically downloaded to the executors, or does SparkFiles.get() have to be called to make that happen?

was (Author: xuefuz):
Yes, SparkContext#addFile() seems to be what we need. If the files can be broadcast to every executor more efficiently, that's even better than the distributed cache. In the meantime, we can set a large replication factor for the files to mitigate the problem.

> Provide an equivalent functionality of distributed cache as MR does
> -------------------------------------------------------------------
>
>                 Key: SPARK-4290
>                 URL: https://issues.apache.org/jira/browse/SPARK-4290
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Xuefu Zhang
>
> MapReduce allows a client to specify files to be put in the distributed cache
> for a job, and the framework guarantees that each file will be available in
> the local file system of a node where a task of the job runs, before the task
> actually starts. While this might be achievable on YARN via hacks, it's not
> available in other cluster modes. It would be nice to have equivalent
> functionality in Spark.
> It would also complement Spark's broadcast variables, which may not be
> suitable in certain scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
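For reference, the addFile()/SparkFiles.get() pattern the comment asks about can be sketched in local mode roughly as follows (an illustrative sketch only: it assumes a Spark dependency on the classpath, and the file name "cached.txt" and its contents are made up):

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object AddFileSketch {
  // Returns what each task read from its local copy of the shipped file.
  def run(): Seq[String] = {
    // Create a small file to stand in for a "distributed cache" file.
    val tmp = Files.createTempDirectory("cache").resolve("cached.txt")
    Files.write(tmp, "hello".getBytes("UTF-8"))

    val sc = new SparkContext(
      new SparkConf().setAppName("addFile-sketch").setMaster("local[2]"))
    try {
      // Ship the file to every node that runs tasks for this application.
      sc.addFile(tmp.toString)

      // Inside a task, SparkFiles.get resolves the path of the local copy.
      sc.parallelize(Seq(1, 2), 2).map { _ =>
        val localPath = SparkFiles.get("cached.txt")
        new String(Files.readAllBytes(Paths.get(localPath)), "UTF-8")
      }.collect().toSeq
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(AddFileSketch.run().mkString(","))
}
```

In local mode both tasks read the same local copy, so the original question (whether the download happens eagerly or only on SparkFiles.get()) only shows up on a real cluster.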