Thanks Ognen, HDFS is the plan; I am just hitting an issue when building for HDFS, hence local files for now.
On Wed, Jan 22, 2014 at 1:03 PM, Ognen Duzlevski <[email protected]> wrote:
> Manoj,
>
> "large" is a relative term ;)
>
> NFS is a rather slow solution, at least that's always been my experience. However, it will work for smaller files.
>
> One way to do it is to put the files in S3 on Amazon. However, then your network becomes a limiting factor.
>
> The other way is to replicate all the files on each node, but that can get tedious and, depending on how much disk space you have, may not be an option.
>
> Finally, there are things like http://code.google.com/p/mogilefs/ but they seem to need a special library to read a file - it would probably need some kind of patching of Spark to make it work, since it may not expose the usual filesystem interface. However, it could be a viable solution; I am just starting to play with it.
>
> Ognen
>
> On Wed, Jan 22, 2014 at 8:37 PM, Manoj Samel <[email protected]> wrote:
>> I have a set of CSV files that I want to read as a single RDD using a standalone cluster.
>>
>> These files reside on one machine right now. If I start a cluster with multiple worker nodes, how do I use these worker nodes to read the files and do the RDD computation? Do I have to copy the files on every worker node?
>>
>> Assume that copying these into HDFS is not an option for now.
>>
>> Thanks,
>
> --
> "Le secret des grandes fortunes sans cause apparente est un crime oublié, parce qu'il a été proprement fait" - Honore de Balzac
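For what it's worth, whichever storage option ends up working, the read side looks the same from Spark: SparkContext.textFile accepts a file://, s3n://, or hdfs:// URL, and a glob lets several CSVs come back as one RDD. Here is a minimal sketch assuming a standalone cluster; the master URL, bucket, and paths are placeholders, and the file:// variant only works if the same path exists on every worker.

```scala
import org.apache.spark.SparkContext

object CsvReadSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder master URL for a standalone cluster.
    val sc = new SparkContext("spark://master:7077", "csv-read-sketch")

    // Option 1: path replicated on every worker node; the glob pulls all
    // CSVs in the directory into a single RDD[String] of lines.
    val localRdd = sc.textFile("file:///data/csv/*.csv")

    // Option 2: S3 (no per-node copying, but network-bound). Bucket name
    // is a placeholder; AWS credentials must be configured separately.
    val s3Rdd = sc.textFile("s3n://my-bucket/csv/*.csv")

    // Option 3: HDFS, once the build issue is sorted out.
    val hdfsRdd = sc.textFile("hdfs://namenode:9000/data/csv/*.csv")

    // Each call yields one RDD with one element per line across all files.
    println(localRdd.count())

    sc.stop()
  }
}
```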
