Suppose I also want to run n concurrent jobs of the following type:
each RDD has the same form (JavaPairRDD), and I would like to run the
same transformation on all of them.
The brute force way would be to instantiate n threads and submit a job
from each thread.
Would this approach be valid as well: create a new RDD that is a
combination of the n RDDs (something like a group-by across multiple RDDs)?
Is there a way to implement this using the existing Java API?
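A minimal sketch of the second approach, assuming the standard Java API (JavaSparkContext.parallelizePairs, JavaPairRDD.union); the class name, the keys, and the reduceByKey transformation are made up for illustration:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;

public class UnionSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[2]", "union-sketch");

        // Two pair RDDs of the same form (stand-ins for the n input RDDs).
        JavaPairRDD<String, Integer> a = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>("x", 1), new Tuple2<>("y", 1)));
        JavaPairRDD<String, Integer> b = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>("x", 2)));

        // union() concatenates the RDDs; a single transformation then
        // covers all of them in one job.
        JavaPairRDD<String, Integer> combined = a.union(b)
                .reduceByKey(new Function2<Integer, Integer, Integer>() {
                    public Integer call(Integer i, Integer j) { return i + j; }
                });

        List<Tuple2<String, Integer>> out = combined.collect();
        System.out.println(out);
        sc.stop();
    }
}
```

Note that the transformation's per-key semantics must tolerate records from all n inputs being mixed together, since after the union they are indistinguishable.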
Yadid
On 11/19/13 12:20 PM, Mark Hamstra wrote:
No, it's my fault for not reading more carefully. We do use a
somewhat overloaded and specialized lexicon to describe Spark, which
helps when it is used uniformly, but penalizes those who leap to
misunderstanding. Prashant is correct that the largest granularity
thing that a user launches to do Spark work and that is associated
with its own SparkContext is what we call an application. A job is
what is launched by invoking a Spark action on an RDD. There can be
multiple jobs within an application, and those jobs are scheduled
either FIFO or with the fair scheduler. Going to even smaller
granularities, jobs can contain multiple stages (defined or broken up
at shuffle boundaries), and stages are associated with task sets
containing multiple tasks, the units of work that actually run on
worker nodes.
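Mark's point that one application can contain multiple concurrently scheduled jobs can be sketched as below; the thread-per-action pattern and the FAIR setting are illustrative rather than prescriptive (the SparkContext is thread-safe for job submission):

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentJobsSketch {
    public static void main(String[] args) throws InterruptedException {
        // FAIR mode lets the jobs share the application's executors
        // instead of queueing FIFO behind one another.
        System.setProperty("spark.scheduler.mode", "FAIR");
        JavaSparkContext sc = new JavaSparkContext("local[4]", "concurrent-jobs-sketch");

        // Stand-ins for the n same-shaped input RDDs.
        List<JavaPairRDD<String, Integer>> rdds = Arrays.asList(
                sc.parallelizePairs(Arrays.asList(new Tuple2<>("a", 1))),
                sc.parallelizePairs(Arrays.asList(new Tuple2<>("b", 2))));

        // One thread per RDD; each action (here, count()) launched from
        // its own thread becomes a separate job within the application.
        ExecutorService pool = Executors.newFixedThreadPool(rdds.size());
        for (final JavaPairRDD<String, Integer> rdd : rdds) {
            pool.submit(new Runnable() {
                public void run() {
                    System.out.println("records: " + rdd.count());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        sc.stop();
    }
}
```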
Anyway, Prashant's response about spreadOut is appropriate for
application-level scheduling.
On Tue, Nov 19, 2013 at 8:03 AM, Yadid Ayzenberg <[email protected]
<mailto:[email protected]>> wrote:
My bad - I should have stated that up front. I guess it was kind
of implicit within my question.
Thanks for your help,
Yadid
On 11/19/13 10:59 AM, Mark Hamstra wrote:
Ah, sorry -- misunderstood the question.
On Nov 19, 2013, at 7:48 AM, Prashant Sharma
<[email protected] <mailto:[email protected]>> wrote:
I think that is Scheduling Within an Application, and he asked about
scheduling across apps. Spark standalone actually supports two ways of
scheduling across applications, and both are FIFO type.
http://spark.incubator.apache.org/docs/latest/spark-standalone.html
One is spread-out mode and the other is to use as few nodes as
possible [1]
1.
https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L383
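For reference, the master's choice between the two modes is controlled by a single system property (the property name here is taken from the linked Master.scala of the 0.8.x line; check your version before relying on it):

```properties
# Standalone master: spread an application's executors across many
# workers (the default) ...
spark.deploy.spreadOut=true
# ... or pack them onto as few workers as possible.
spark.deploy.spreadOut=false
```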
On Tue, Nov 19, 2013 at 9:02 PM, Mark Hamstra
<[email protected] <mailto:[email protected]>> wrote:
>>
>> According to the documentation, spark standalone currently
only supports a FIFO scheduling system.
>
>
> That's not true.
>
> [sorry for the prior misfire]
>
>
>
> On Tue, Nov 19, 2013 at 7:30 AM, Mark Hamstra
<[email protected] <mailto:[email protected]>> wrote:
>>
>>
>>
>>
>> On Tue, Nov 19, 2013 at 6:50 AM, Yadid Ayzenberg
<[email protected] <mailto:[email protected]>> wrote:
>>>
>>> Hi all,
>>>
>>> According to the documentation, spark standalone currently
only supports a FIFO scheduling system.
>>> I understand it's possible to limit the number of cores a job
uses by setting spark.cores.max.
>>> When running a job, will Spark try to use the maximum number of
cores on each machine until it reaches the set limit, or will it
do this round-robin style: use a single core on each machine, and
once it has used a core on all of the slaves without reaching the
limit, use an additional core on each machine, and so on?
>>>
>>> I think the latter makes more sense, but I want to be sure
that is the case.
>>>
>>> Thanks,
>>> Yadid
>>>
>>
>