I'm not positive, but I think this is very unlikely to work. First, when you call sc.objectFile(...), I think the *driver* will need to know something about the file, e.g. to decide how many tasks to create. But it won't even be able to see the file, since it only lives on the local filesystem of the cluster nodes.
If you really wanted to, you could probably write out some small metadata about the files and write your own version of objectFile that uses it. But I think there is a bigger conceptual issue: in general you can't be sure that you are running on the same nodes when you read the file back in as when you saved it, so the file might not be present on the local filesystem of the active executors. You might be able to guarantee it for the specific cluster setup you have now, but it could limit you down the road.

What are you trying to achieve? There might be a better way. I believe writing to HDFS will usually place one replica on the local node, so you'd still get a local read when you reload the data.

Imran

On Jan 16, 2015 6:19 AM, "Wang, Ningjun (LNG-NPV)" <[email protected]> wrote:

> I have asked this question before but got no answer. Asking again.
>
> Can I save an RDD to the local file system and then read it back on a
> Spark cluster with multiple nodes?
>
> rdd.saveAsObjectFile("file:///home/data/rdd1")
>
> val rdd2 = sc.objectFile("file:///home/data/rdd1")
>
> This works if the cluster has only one node. But my cluster has 3 nodes,
> and each node has a local dir called /home/data. Is the RDD saved to the
> local dir across the 3 nodes? If so, is sc.objectFile(...) smart enough to
> read the local dirs on all 3 nodes and merge them into a single RDD?
>
> Ningjun
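To make the HDFS suggestion above concrete, here is a minimal sketch. It assumes a working SparkContext `sc` and a reachable HDFS namenode; the URL `hdfs://namenode:8020` and the path `/data/rdd1` are placeholders for your cluster, not values from this thread.

```scala
// Sketch: save to HDFS (a shared filesystem) instead of file:///...
// Every executor can then read the data back regardless of which
// node it is scheduled on, and the driver can list the directory
// to plan tasks.
val rdd = sc.parallelize(1 to 100)

// Each partition is written as a separate part-XXXXX file under
// the output directory.
rdd.saveAsObjectFile("hdfs://namenode:8020/data/rdd1")

// objectFile takes the element type as a type parameter and reads
// all part files in the directory back into a single RDD.
val rdd2 = sc.objectFile[Int]("hdfs://namenode:8020/data/rdd1")
```

With HDFS's default replica placement, one copy of each block is written on the node that produced it, so a later read often stays node-local anyway.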
