Re: sc.parallelize with defaultParallelism=1
If you want to process the data locally, why do you need to use sc.parallelize? Store the data in regular Scala collections and use their methods to process it (they offer much the same set of methods as Spark RDDs). Then, when you're happy, use Spark to process the pre-processed input data. Or you can run Spark in "local" mode, in which case the executor(s) run in the same JVM as the driver. Unless I'm misunderstanding what you're trying to achieve here?

On Wed, Sep 30, 2015 at 10:25 AM, Nicolae Marasoiu wrote:
> That's exactly what I am doing, but my question is whether parallelize
> sends the data to a worker node. From a performance perspective on small
> sets, the ideal would be to load the data into the local JVM memory of the
> driver. Even designating the current machine as a worker node, besides the
> driver, would still mean localhost loopback/network communication. I guess
> Spark is a batch-oriented system, and I am still checking whether there are
> ways to use it like this too: load data manually but process it with the
> functional and other Spark libraries, without the distribution or
> map/reduce part.
>
> From: Andy Dang
> Sent: Wednesday, September 30, 2015 8:17 PM
> To: Nicolae Marasoiu
> Cc: user@spark.apache.org
> Subject: Re: sc.parallelize with defaultParallelism=1
>
> Can't you just load the data from HBase first, and then call
> sc.parallelize on your dataset?
>
> -Andy
>
> ---
> Regards,
> Andy (Nam) Dang
>
> On Wed, Sep 30, 2015 at 12:52 PM, Nicolae Marasoiu wrote:
>> Hi,
>>
>> When calling sc.parallelize(data, 1), is there a preference where to put
>> the data? I see 2 possibilities: sending it to a worker node, or keeping
>> it on the driver program.
>>
>> I would prefer to keep the data local to the driver. The use case is
>> when I need to load just a bit of data from HBase and then compute over
>> it, e.g. aggregate, using Spark.
>>
>> Thanks,
>> Nicu

--
Marcelo
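Marcelo's suggestion could be sketched roughly as follows; the dataset and the aggregation are made-up examples, standing in for whatever small result set comes back from HBase:

```scala
// Sketch of the suggestion above: pre-process small data with plain Scala
// collections (they offer the same higher-order methods as RDDs, e.g. map,
// filter, groupBy), with no cluster or network involved.
object LocalPreprocess {
  def main(args: Array[String]): Unit = {
    // Hypothetical small dataset held on the driver (e.g. rows fetched
    // from HBase as key/value pairs).
    val rows = Seq(("a", 1), ("b", 2), ("a", 3))

    // Aggregate locally: sum the values per key.
    val totals = rows
      .groupBy(_._1)
      .map { case (key, vs) => key -> vs.map(_._2).sum }

    println(totals)

    // Only once local pre-processing is done would you hand the result to
    // Spark, e.g.: val rdd = sc.parallelize(totals.toSeq, 1)
  }
}
```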
Re: sc.parallelize with defaultParallelism=1
That's exactly what I am doing, but my question is whether parallelize sends the data to a worker node. From a performance perspective on small sets, the ideal would be to load the data into the local JVM memory of the driver. Even designating the current machine as a worker node, besides the driver, would still mean localhost loopback/network communication. I guess Spark is a batch-oriented system, and I am still checking whether there are ways to use it like this too: load data manually but process it with the functional and other Spark libraries, without the distribution or map/reduce part.

From: Andy Dang
Sent: Wednesday, September 30, 2015 8:17 PM
To: Nicolae Marasoiu
Cc: user@spark.apache.org
Subject: Re: sc.parallelize with defaultParallelism=1

Can't you just load the data from HBase first, and then call sc.parallelize on your dataset?

-Andy

---
Regards,
Andy (Nam) Dang

On Wed, Sep 30, 2015 at 12:52 PM, Nicolae Marasoiu <nicolae.maras...@adswizz.com> wrote:

Hi,

When calling sc.parallelize(data, 1), is there a preference where to put the data? I see 2 possibilities: sending it to a worker node, or keeping it on the driver program.

I would prefer to keep the data local to the driver. The use case is when I need to load just a bit of data from HBase and then compute over it, e.g. aggregate, using Spark.

Thanks,
Nicu
Re: sc.parallelize with defaultParallelism=1
Can't you just load the data from HBase first, and then call sc.parallelize on your dataset?

-Andy

---
Regards,
Andy (Nam) Dang

On Wed, Sep 30, 2015 at 12:52 PM, Nicolae Marasoiu <nicolae.maras...@adswizz.com> wrote:
> Hi,
>
> When calling sc.parallelize(data, 1), is there a preference where to put
> the data? I see 2 possibilities: sending it to a worker node, or keeping
> it on the driver program.
>
> I would prefer to keep the data local to the driver. The use case is
> when I need to load just a bit of data from HBase and then compute over
> it, e.g. aggregate, using Spark.
>
> Thanks,
> Nicu
sc.parallelize with defaultParallelism=1
Hi,

When calling sc.parallelize(data, 1), is there a preference where to put the data? I see 2 possibilities: sending it to a worker node, or keeping it on the driver program.

I would prefer to keep the data local to the driver. The use case is when I need to load just a bit of data from HBase and then compute over it, e.g. aggregate, using Spark.

Thanks,
Nicu
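For reference, a minimal sketch of the call in question, assuming Spark is on the classpath (the app name and data are arbitrary): with a local[1] master, driver and executor share one JVM, and parallelize is lazy, so the collection stays in driver memory until an action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeLocal {
  def main(args: Array[String]): Unit = {
    // "local[1]" runs the executor inside the driver JVM -- no network hop.
    val conf = new SparkConf().setMaster("local[1]").setAppName("parallelize-demo")
    val sc = new SparkContext(conf)

    val data = Seq(1, 2, 3, 4)
    // numSlices = 1: a single partition, processed by the in-process executor.
    val rdd = sc.parallelize(data, 1)

    // Nothing is shipped until an action runs; in cluster mode, this is the
    // point where partitions would be serialized into tasks for executors.
    println(rdd.sum())

    sc.stop()
  }
}
```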