When you do foreach(println) on a cluster, that calls println *on the worker 
nodes*, so the output goes to their stdout and stderr files instead of to your 
shell. To verify that the file loaded, use operations that return the data to 
the driver, like .first() or .take().
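
For example, a minimal spark-shell sketch (the file path here is hypothetical):

  // take(5) fetches the first 5 elements to the driver, so println
  // runs locally and the output appears in your shell:
  val lines = sc.textFile("/path/to/data.csv")
  lines.take(5).foreach(println)

  // collect() pulls the whole RDD back to the driver; only safe for small data:
  lines.collect().foreach(println)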

Matei

On Jan 22, 2014, at 12:56 PM, Manoj Samel <[email protected]> wrote:

> Thanks Matei.
> 
> One thing I noticed after doing this and starting MASTER=spark://xxxx 
> spark-shell is that everything works, BUT xxx.foreach(println) prints blank 
> lines. All other logic seems to work. If I do xx.count etc., I can see the 
> value; just the println does not seem to be working.
> 
> 
> On Wed, Jan 22, 2014 at 12:39 PM, Matei Zaharia <[email protected]> 
> wrote:
> Hi Manoj,
> 
> You’d have to make the files available at the same path on each machine 
> through something like NFS. You don’t need to copy them, though that would 
> also work.
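> 
> A minimal sketch, assuming the files are mounted at the same (hypothetical) 
> path /mnt/shared/csv on every node:
> 
>   // A glob reads all the CSVs as one RDD; each worker reads its
>   // assigned splits from the shared mount:
>   val data = sc.textFile("/mnt/shared/csv/*.csv")
>   data.count()  // triggers the distributed read; the result returns to the driver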
> 
> Matei
> 
> On Jan 22, 2014, at 12:37 PM, Manoj Samel <[email protected]> wrote:
> 
> > I have a set of CSV files that I want to read as a single RDD using a 
> > standalone cluster.
> >
> > These files reside on one machine right now. If I start a cluster with 
> > multiple worker nodes, how do I use those worker nodes to read the files 
> > and do the RDD computation? Do I have to copy the files to every worker 
> > node?
> >
> > Assume that copying these into HDFS is not an option for now ..
> >
> > Thanks,
> 
> 
