Re: Maximum Core Utilization

2015-05-05 Thread ayan guha
Also, if not already done, you may want to try repartitioning your data into 50
partitions.
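
A minimal sketch of that, assuming a Scala RDD named data (the name is
hypothetical; substitute your own):

    // Spread the existing data across 50 partitions so that up to 50 tasks
    // (and therefore up to 50 cores) can run in parallel in later stages.
    val repartitioned = data.repartition(50)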
On 6 May 2015 05:56, Manu Kaul manohar.k...@gmail.com wrote:

 Hi All,
 For a job I am running on Spark with a dataset of say 350,000 lines (not
 big), I am finding that even though my cluster has a large number of cores
 available (around 100), Spark seems to stop at just 4 cores, and beyond that
 the runtime stays essentially flat no matter how many more cores are thrown
 at it. I am wondering if Spark tries to figure out the maximum number of
 cores to use based on the size of the dataset? If so, is there a way to
 disable this feature and force it to use all the cores available?

 Thanks,
 Manu

 --

 The greater danger for most of us lies not in setting our aim too high and
 falling short; but in setting our aim too low, and achieving our mark.
 - Michelangelo



Maximum Core Utilization

2015-05-05 Thread Manu Kaul
Hi All,
For a job I am running on Spark with a dataset of say 350,000 lines (not
big), I am finding that even though my cluster has a large number of cores
available (around 100), Spark seems to stop at just 4 cores, and beyond that
the runtime stays essentially flat no matter how many more cores are thrown
at it. I am wondering if Spark tries to figure out the maximum number of
cores to use based on the size of the dataset? If so, is there a way to
disable this feature and force it to use all the cores available?

Thanks,
Manu

-- 

The greater danger for most of us lies not in setting our aim too high and
falling short; but in setting our aim too low, and achieving our mark.
- Michelangelo


Re: Maximum Core Utilization

2015-05-05 Thread Richard Marscher
Hi,

Do you have information on how many partitions/tasks the stage/job is
running? By default there is one core per task, so the number of concurrent
tasks may be limiting core utilization.
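
A quick way to check, assuming an RDD named data (hypothetical name): the
partition count below caps how many tasks can run at once for that stage.

    // Number of partitions == maximum number of concurrent tasks for stages
    // computed from this RDD.
    println(data.partitions.length)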

There are a few settings you could play with, assuming your issue is
related to the above:
spark.default.parallelism
spark.cores.max
spark.task.cpus
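
For reference, a rough sketch of setting those through SparkConf (the values
are purely illustrative, not recommendations for your cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only; tune to your cluster and cluster manager.
    val conf = new SparkConf()
      .setAppName("core-utilization-test")
      .set("spark.default.parallelism", "100") // default partition count for shuffles
      .set("spark.cores.max", "100")           // total cores the app may claim (standalone/Mesos)
      .set("spark.task.cpus", "1")             // cores reserved per task (default is 1)
    val sc = new SparkContext(conf)

The same keys can also be passed with --conf on spark-submit.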

On Tue, May 5, 2015 at 3:55 PM, Manu Kaul manohar.k...@gmail.com wrote:

 Hi All,
 For a job I am running on Spark with a dataset of say 350,000 lines (not
 big), I am finding that even though my cluster has a large number of cores
 available (around 100), Spark seems to stop at just 4 cores, and beyond that
 the runtime stays essentially flat no matter how many more cores are thrown
 at it. I am wondering if Spark tries to figure out the maximum number of
 cores to use based on the size of the dataset? If so, is there a way to
 disable this feature and force it to use all the cores available?

 Thanks,
 Manu

 --

 The greater danger for most of us lies not in setting our aim too high and
 falling short; but in setting our aim too low, and achieving our mark.
 - Michelangelo