If the dataset is not huge (a few GB), you can set up NFS instead of HDFS (which is much harder to set up):
1. Export a directory on the master (or any node in the cluster).
2. Mount it at the same path on all slaves.
3. Read/write from it via file:///path/to/mountpoint.

On Tue, Jan 20, 2015 at 7:55 AM, Wang, Ningjun (LNG-NPV) <[email protected]> wrote:

> Can anybody answer this? Do I have to have HDFS to achieve this?
>
> Regards,
>
> Ningjun Wang
> Consulting Software Engineer
> LexisNexis
> 121 Chanlon Road
> New Providence, NJ 07974-1541
>
> From: Wang, Ningjun (LNG-NPV) [mailto:[email protected]]
> Sent: Friday, January 16, 2015 1:15 PM
> To: Imran Rashid
> Cc: [email protected]
> Subject: RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?
>
> I need to save an RDD to the file system and then restore it from the file system in the future. I don't have any HDFS file system and don't want the hassle of setting one up. So how can I achieve this? The application needs to run on a cluster with multiple nodes.
>
> Regards,
>
> Ningjun
>
> From: [email protected] [mailto:[email protected]] On Behalf Of Imran Rashid
> Sent: Friday, January 16, 2015 12:14 PM
> To: Wang, Ningjun (LNG-NPV)
> Cc: [email protected]
> Subject: Re: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?
>
> I'm not positive, but I think this is very unlikely to work.
>
> First, when you call sc.objectFile(...), I think the *driver* will need to know something about the file, e.g. to know how many tasks to create. But it won't even be able to see the file, since it only lives on the local filesystem of the cluster nodes.
>
> If you really wanted to, you could probably write out some small metadata about the files and write your own version of objectFile that uses it. But I think there is a bigger conceptual issue.
> You might not, in general, be sure that you are running on the same nodes when you save the file as when you read it back in. So the file might not be present on the local filesystem for the active executors. You might be able to guarantee it for the specific cluster setup you have now, but it might limit you down the road.
>
> What are you trying to achieve? There might be a better way. I believe writing to HDFS will usually write one local copy, so you'd still be doing a local read when you reload the data.
>
> Imran
>
> On Jan 16, 2015 6:19 AM, "Wang, Ningjun (LNG-NPV)" <[email protected]> wrote:
>
> I have asked this question before but got no answer. Asking again.
>
> Can I save an RDD to the local file system and then read it back on a Spark cluster with multiple nodes?
>
> rdd.saveAsObjectFile("file:///home/data/rdd1")
>
> val rdd2 = sc.objectFile("file:///home/data/rdd1")
>
> This works if the cluster has only one node. But my cluster has 3 nodes, and each node has a local dir called /home/data. Is the RDD saved to the local dir across all 3 nodes? If so, is sc.objectFile(...) smart enough to read the local dirs on all 3 nodes and merge them into a single RDD?
>
> Ningjun

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
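For reference, the three NFS steps suggested at the top of the thread can be sketched roughly as follows. This is a configuration sketch, not a definitive recipe: the package names assume a Debian/Ubuntu system, and the hostname `master`, directory `/home/data`, and subnet `192.168.1.0/24` are placeholders for illustration.

```shell
# Step 1: on the master (NFS server), export the shared directory.
# Assumes Debian/Ubuntu; package names and export options vary by distro.
sudo apt-get install nfs-kernel-server
echo '/home/data 192.168.1.0/24(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra    # re-export everything listed in /etc/exports

# Step 2: on each slave (NFS client), mount it at the SAME path as on the master.
sudo apt-get install nfs-common
sudo mkdir -p /home/data
sudo mount -t nfs master:/home/data /home/data

# Step 3: every node now sees the same files, so Spark can read/write via
# file:/// URIs, e.g. rdd.saveAsObjectFile("file:///home/data/rdd1") and
# sc.objectFile("file:///home/data/rdd1") from the thread above.
```

Mounting at an identical path on every node matters because the `file:///` URI is resolved independently by each executor; if the mountpoints differ, some tasks will fail to find the data.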
