Re: sc.parallelize with defaultParallelism=1
If you want to process the data locally, why do you need to use sc.parallelize? Store the data in regular Scala collections and use their methods to process it (they offer much the same set of methods as Spark RDDs). Then, when you're happy, use Spark to process the pre-processed input data. Or you can run Spark in "local" mode, in which case the executor(s) run in the same JVM as the driver. Unless I'm misunderstanding what you're trying to achieve here?

On Wed, Sep 30, 2015 at 10:25 AM, Nicolae Marasoiu wrote:
> That's exactly what I am doing, but my question is whether parallelize
> sends the data to a worker node. From a performance perspective on small
> sets, the ideal would be to load the data into the local JVM memory of the
> driver. Even designating the current machine as a worker node, besides the
> driver, would still mean localhost loopback/network communication. I guess
> Spark is a batch-oriented system, and I am still checking whether there are
> ways to use it like this too: load data manually but process it with the
> functional and other Spark libraries, without the distribution or
> map/reduce part.
>
> From: Andy Dang
> Sent: Wednesday, September 30, 2015 8:17 PM
> To: Nicolae Marasoiu
> Cc: user@spark.apache.org
> Subject: Re: sc.parallelize with defaultParallelism=1
>
> Can't you just load the data from HBase first, and then call
> sc.parallelize on your dataset?
>
> -Andy
>
> ---
> Regards,
> Andy (Nam) Dang
>
> On Wed, Sep 30, 2015 at 12:52 PM, Nicolae Marasoiu wrote:
>> Hi,
>>
>> When calling sc.parallelize(data, 1), is there a preference where to put
>> the data? I see 2 possibilities: sending it to a worker node, or keeping
>> it on the driver program.
>>
>> I would prefer to keep the data local to the driver. The use case is
>> when I need to load just a bit of data from HBase and then compute over
>> it, e.g. aggregate, using Spark.
>>
>> Thanks,
>> Nicu

--
Marcelo
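Marcelo's suggestion could be sketched roughly as follows; the dataset and the aggregation are made-up examples, standing in for whatever small result set comes back from HBase:

```scala
// Sketch of the suggestion above: pre-process small data with plain Scala
// collections (they offer the same higher-order methods as RDDs, e.g. map,
// filter, groupBy), with no cluster or network involved.
object LocalPreprocess {
  def main(args: Array[String]): Unit = {
    // Hypothetical small dataset held on the driver (e.g. rows fetched
    // from HBase as key/value pairs).
    val rows = Seq(("a", 1), ("b", 2), ("a", 3))

    // Aggregate locally: sum the values per key.
    val totals = rows
      .groupBy(_._1)
      .map { case (key, vs) => key -> vs.map(_._2).sum }

    println(totals)

    // Only once local pre-processing is done would you hand the result to
    // Spark, e.g.: val rdd = sc.parallelize(totals.toSeq, 1)
  }
}
```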
Re: sc.parallelize with defaultParallelism=1
That's exactly what I am doing, but my question is whether parallelize sends the data to a worker node. From a performance perspective on small sets, the ideal would be to load the data into the local JVM memory of the driver. Even designating the current machine as a worker node, besides the driver, would still mean localhost loopback/network communication. I guess Spark is a batch-oriented system, and I am still checking whether there are ways to use it like this too: load data manually but process it with the functional and other Spark libraries, without the distribution or map/reduce part.

From: Andy Dang
Sent: Wednesday, September 30, 2015 8:17 PM
To: Nicolae Marasoiu
Cc: user@spark.apache.org
Subject: Re: sc.parallelize with defaultParallelism=1

Can't you just load the data from HBase first, and then call sc.parallelize on your dataset?

-Andy

---
Regards,
Andy (Nam) Dang

On Wed, Sep 30, 2015 at 12:52 PM, Nicolae Marasoiu <nicolae.maras...@adswizz.com> wrote:

Hi,

When calling sc.parallelize(data, 1), is there a preference where to put the data? I see 2 possibilities: sending it to a worker node, or keeping it on the driver program.

I would prefer to keep the data local to the driver. The use case is when I need to load just a bit of data from HBase and then compute over it, e.g. aggregate, using Spark.

Thanks,
Nicu
Re: sc.parallelize with defaultParallelism=1
Can't you just load the data from HBase first, and then call sc.parallelize on your dataset?

-Andy

---
Regards,
Andy (Nam) Dang

On Wed, Sep 30, 2015 at 12:52 PM, Nicolae Marasoiu <nicolae.maras...@adswizz.com> wrote:
> Hi,
>
> When calling sc.parallelize(data, 1), is there a preference where to put
> the data? I see 2 possibilities: sending it to a worker node, or keeping
> it on the driver program.
>
> I would prefer to keep the data local to the driver. The use case is
> when I need to load just a bit of data from HBase and then compute over
> it, e.g. aggregate, using Spark.
>
> Thanks,
> Nicu
sc.parallelize with defaultParallelism=1
Hi,

When calling sc.parallelize(data, 1), is there a preference where to put the data? I see 2 possibilities: sending it to a worker node, or keeping it on the driver program.

I would prefer to keep the data local to the driver. The use case is when I need to load just a bit of data from HBase and then compute over it, e.g. aggregate, using Spark.

Thanks,
Nicu
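For reference, a minimal sketch of the call in question, assuming Spark is on the classpath (the app name and data are arbitrary): with a local[1] master, driver and executor share one JVM, and parallelize is lazy, so the collection stays in driver memory until an action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeLocal {
  def main(args: Array[String]): Unit = {
    // "local[1]" runs the executor inside the driver JVM -- no network hop.
    val conf = new SparkConf().setMaster("local[1]").setAppName("parallelize-demo")
    val sc = new SparkContext(conf)

    val data = Seq(1, 2, 3, 4)
    // numSlices = 1: a single partition, processed by the in-process executor.
    val rdd = sc.parallelize(data, 1)

    // Nothing is shipped until an action runs; in cluster mode, this is the
    // point where partitions would be serialized into tasks for executors.
    println(rdd.sum())

    sc.stop()
  }
}
```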