Hi,
In the documentation I found the following:
spark.default.parallelism
· Local mode: number of cores on the local machine
· Mesos fine grained mode: 8
· Others: total number of cores on all executor nodes or 2, whichever
is larger
I am using a 2-node cluster with 48 cores (24 + 24). As per the above, the data
should be split into 48 partitions of about 1000/48 = 20.83 elements, i.e. around
20 or 21 each.
But it is being divided into 2 partitions of 500 elements each.
I then used sc.parallelize(data, 10), which created 10 partitions of 100 elements
each. 8 partitions executed on one node and 2 on the other.
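For reference, the even split described above can be sketched in plain Java. This is a rough sketch of the index arithmetic (partition i covers indices [i*length/numSlices, (i+1)*length/numSlices)), not the actual Spark API; the class and method names here are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class SliceDemo {
    // Sketch of even slicing: partition i covers indices
    // [i*length/numSlices, (i+1)*length/numSlices), so partition
    // sizes differ by at most 1 element.
    static List<Integer> sliceSizes(int length, int numSlices) {
        List<Integer> sizes = new ArrayList<>();
        for (int i = 0; i < numSlices; i++) {
            int start = (int) ((long) i * length / numSlices);
            int end = (int) ((long) (i + 1) * length / numSlices);
            sizes.add(end - start);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 1000 elements over 48 partitions -> every partition holds 20 or 21
        List<Integer> sizes = sliceSizes(1000, 48);
        System.out.println(sizes.stream().mapToInt(Integer::intValue).sum()); // 1000
        System.out.println(sizes.get(0) + " " + sizes.get(47)); // 20 21
    }
}
```

With numSlices = 2 the same arithmetic yields two partitions of 500, matching the behaviour observed above.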
How can I check how many cores are being used to run the tasks for those 8
partitions? (Is there a command or UI to check that?)
Regards,
Naveen.
From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of
Holden Karau
Sent: Friday, November 07, 2014 12:46 PM
To: Naveen Kumar Pokala
Cc: user@spark.apache.org
Subject: Re: Parallelize on spark context
Hi Naveen,
So by default, when we call parallelize, the data will be split into the default
number of partitions (which we can control with the property
spark.default.parallelism); or, if we want a specific call to parallelize to use
a different number of partitions, we can instead call sc.parallelize(data,
numPartitions). The
default value of this is documented in
http://spark.apache.org/docs/latest/configuration.html#spark-properties
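For example, the default can be raised cluster-wide via the property Holden mentions (the value 48 below is only an illustration matching the cluster size discussed in this thread, not a recommendation):

```
# spark-defaults.conf (or --conf on spark-submit): the partition count used
# when parallelize is called without an explicit numPartitions argument
spark.default.parallelism   48
```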
Cheers,
Holden :)
On Thu, Nov 6, 2014 at 10:43 PM, Naveen Kumar Pokala
npok...@spcapitaliq.com wrote:
Hi,
JavaRDD<Integer> distData = sc.parallelize(data);
On what basis does parallelize split the data into multiple partitions? And how
can we control how many partitions are executed per executor?
For example, my data is a list of 1000 integers and I have a 2-node YARN
cluster. It is being divided into 2 batches of 500 each.
Regards,
Naveen.
--
Cell : 425-233-8271