Thanks Ognen, HDFS is the plan; I am just hitting an issue when building for HDFS, hence local files for now.
On Wed, Jan 22, 2014 at 1:03 PM, Ognen Duzlevski <[email protected]> wrote:
> Manoj,
>
> "large" is a relative term ;)
>
> NFS is a rather slow solution, at least that's always been my experience. However, it will work for smaller files.
>
> One way to do it is to put the files in S3 on Amazon. However, then your network becomes a limiting factor.
>
> The other way is to replicate all the files on each node, but that can get tedious and, depending on how much disk space you have, may not be an option.
>
> Finally, there are things like http://code.google.com/p/mogilefs/ but they seem to need a special library to read a file - it would probably need some kind of patching of Spark to make it work, since it may not expose the usual filesystem interface. However, it could be a viable solution; I am just starting to play with it.
>
> Ognen
>
> On Wed, Jan 22, 2014 at 8:37 PM, Manoj Samel <[email protected]> wrote:
>> I have a set of CSV files that I want to read as a single RDD using a standalone cluster.
>>
>> These files reside on one machine right now. If I start a cluster with multiple worker nodes, how do I use these worker nodes to read the files and do the RDD computation? Do I have to copy the files on every worker node?
>>
>> Assume that copying these into HDFS is not an option for now.
>>
>> Thanks,
>
> --
> "Le secret des grandes fortunes sans cause apparente est un crime oublié, parce qu'il a été proprement fait" - Honore de Balzac
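For what it's worth, whichever storage option ends up working, the read side looks the same from Spark: SparkContext.textFile accepts a file://, s3n://, or hdfs:// URL, and a glob lets several CSVs come back as one RDD. Here is a minimal sketch assuming a standalone cluster; the master URL, bucket, and paths are placeholders, and the file:// variant only works if the same path exists on every worker.

```scala
import org.apache.spark.SparkContext

object CsvReadSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder master URL for a standalone cluster.
    val sc = new SparkContext("spark://master:7077", "csv-read-sketch")

    // Option 1: path replicated on every worker node; the glob pulls all
    // CSVs in the directory into a single RDD[String] of lines.
    val localRdd = sc.textFile("file:///data/csv/*.csv")

    // Option 2: S3 (no per-node copying, but network-bound). Bucket name
    // is a placeholder; AWS credentials must be configured separately.
    val s3Rdd = sc.textFile("s3n://my-bucket/csv/*.csv")

    // Option 3: HDFS, once the build issue is sorted out.
    val hdfsRdd = sc.textFile("hdfs://namenode:9000/data/csv/*.csv")

    // Each call yields one RDD with one element per line across all files.
    println(localRdd.count())

    sc.stop()
  }
}
```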
