If the dataset is not huge (a few GB), you can set up NFS instead of HDFS (which is much harder to set up):
1. Export a directory on the master (or any node in the cluster).
2. Mount it at the same path on all slaves.
3. Read/write from it via file:///path/to/mountpoint.

On Tue, Jan 20, 2015 at 7:55 AM, Wang, Ningjun (LNG-NPV) <[email protected]> wrote:

> Can anybody answer this? Do I have to have HDFS to achieve this?
>
> Regards,
>
> Ningjun Wang
> Consulting Software Engineer
> LexisNexis
> 121 Chanlon Road
> New Providence, NJ 07974-1541
>
> From: Wang, Ningjun (LNG-NPV) [mailto:[email protected]]
> Sent: Friday, January 16, 2015 1:15 PM
> To: Imran Rashid
> Cc: [email protected]
> Subject: RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?
>
> I need to save an RDD to the file system and then restore it from the file system in the future. I don't have any HDFS file system and don't want the hassle of setting one up. So how can I achieve this? The application needs to run on a cluster with multiple nodes.
>
> Regards,
>
> Ningjun
>
> From: [email protected] [mailto:[email protected]] On Behalf Of Imran Rashid
> Sent: Friday, January 16, 2015 12:14 PM
> To: Wang, Ningjun (LNG-NPV)
> Cc: [email protected]
> Subject: Re: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?
>
> I'm not positive, but I think this is very unlikely to work.
>
> First, when you call sc.objectFile(...), I think the *driver* will need to know something about the file, e.g. to know how many tasks to create. But it won't even be able to see the file, since it only lives on the local filesystem of the cluster nodes.
>
> If you really wanted to, you could probably write out some small metadata about the files and write your own version of objectFile that uses it. But I think there is a bigger conceptual issue.
> You might not, in general, be sure that you are running on the same nodes when you save the file as when you read it back in. So the file might not be present on the local filesystem for the active executors. You might be able to guarantee it for the specific cluster setup you have now, but it might limit you down the road.
>
> What are you trying to achieve? There might be a better way. I believe writing to HDFS will usually write one local copy, so you'd still be doing a local read when you reload the data.
>
> Imran
>
> On Jan 16, 2015 6:19 AM, "Wang, Ningjun (LNG-NPV)" <[email protected]> wrote:
>
> I have asked this question before but got no answer. Asking again.
>
> Can I save an RDD to the local file system and then read it back on a Spark cluster with multiple nodes?
>
> rdd.saveAsObjectFile("file:///home/data/rdd1")
>
> val rdd2 = sc.objectFile("file:///home/data/rdd1")
>
> This works if the cluster has only one node. But my cluster has 3 nodes, and each node has a local dir called /home/data. Is the RDD saved to the local dir across all 3 nodes? If so, is sc.objectFile(...) smart enough to read the local dirs on all 3 nodes and merge them into a single RDD?
>
> Ningjun

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
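For reference, the three NFS steps suggested at the top of the thread can be sketched roughly as follows. This is a configuration sketch, not a definitive recipe: the package names assume a Debian/Ubuntu system, and the hostname `master`, directory `/home/data`, and subnet `192.168.1.0/24` are placeholders for illustration.

```shell
# Step 1: on the master (NFS server), export the shared directory.
# Assumes Debian/Ubuntu; package names and export options vary by distro.
sudo apt-get install nfs-kernel-server
echo '/home/data 192.168.1.0/24(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra    # re-export everything listed in /etc/exports

# Step 2: on each slave (NFS client), mount it at the SAME path as on the master.
sudo apt-get install nfs-common
sudo mkdir -p /home/data
sudo mount -t nfs master:/home/data /home/data

# Step 3: every node now sees the same files, so Spark can read/write via
# file:/// URIs, e.g. rdd.saveAsObjectFile("file:///home/data/rdd1") and
# sc.objectFile("file:///home/data/rdd1") from the thread above.
```

Mounting at an identical path on every node matters because the `file:///` URI is resolved independently by each executor; if the mountpoints differ, some tasks will fail to find the data.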
