Suppose I also want to run n concurrent jobs of the following type:
each RDD has the same form (JavaPairRDD), and I would like to run the
same transformation on all of them.
The brute force way would be to instantiate n threads and submit a job
from each thread.
Would this approach be valid as well: create a new RDD that is a
combination of the n RDDs (something like a group-by across multiple RDDs)?
Is there a way to implement this using the existing Java API?
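A minimal sketch of the second approach, assuming the standard Java API (JavaSparkContext.parallelizePairs, JavaPairRDD.union); the class name, the keys, and the reduceByKey transformation are made up for illustration:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;

public class UnionSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[2]", "union-sketch");

        // Two pair RDDs of the same form (stand-ins for the n input RDDs).
        JavaPairRDD<String, Integer> a = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>("x", 1), new Tuple2<>("y", 1)));
        JavaPairRDD<String, Integer> b = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>("x", 2)));

        // union() concatenates the RDDs; a single transformation then
        // covers all of them in one job.
        JavaPairRDD<String, Integer> combined = a.union(b)
                .reduceByKey(new Function2<Integer, Integer, Integer>() {
                    public Integer call(Integer i, Integer j) { return i + j; }
                });

        List<Tuple2<String, Integer>> out = combined.collect();
        System.out.println(out);
        sc.stop();
    }
}
```

Note that the transformation's per-key semantics must tolerate records from all n inputs being mixed together, since after the union they are indistinguishable.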
Yadid
On 11/19/13 12:20 PM, Mark Hamstra wrote:
No, it's my fault for not reading more carefully. We do use a
somewhat overloaded and specialized lexicon to describe Spark, which
helps when it is used uniformly, but penalizes those who leap to
misunderstanding. Prashant is correct that the largest granularity
thing that a user launches to do Spark work and that is associated
with its own SparkContext is what we call an application. A job is
what is launched by invoking a Spark action on an RDD. There can be
multiple jobs within an application, and those jobs are scheduled
either FIFO or with the fair scheduler. Going to even smaller
granularities, jobs can contain multiple stages (defined or broken up
at shuffle boundaries), and stages are associated with task sets
containing multiple tasks, the units of work that actually run on
worker nodes.
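Mark's point that one application can contain multiple concurrently scheduled jobs can be sketched as below; the thread-per-action pattern and the FAIR setting are illustrative rather than prescriptive (the SparkContext is thread-safe for job submission):

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentJobsSketch {
    public static void main(String[] args) throws InterruptedException {
        // FAIR mode lets the jobs share the application's executors
        // instead of queueing FIFO behind one another.
        System.setProperty("spark.scheduler.mode", "FAIR");
        JavaSparkContext sc = new JavaSparkContext("local[4]", "concurrent-jobs-sketch");

        // Stand-ins for the n same-shaped input RDDs.
        List<JavaPairRDD<String, Integer>> rdds = Arrays.asList(
                sc.parallelizePairs(Arrays.asList(new Tuple2<>("a", 1))),
                sc.parallelizePairs(Arrays.asList(new Tuple2<>("b", 2))));

        // One thread per RDD; each action (here, count()) launched from
        // its own thread becomes a separate job within the application.
        ExecutorService pool = Executors.newFixedThreadPool(rdds.size());
        for (final JavaPairRDD<String, Integer> rdd : rdds) {
            pool.submit(new Runnable() {
                public void run() {
                    System.out.println("records: " + rdd.count());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        sc.stop();
    }
}
```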
Anyway, Prashant's response about spreadOut is appropriate for
application-level scheduling.
On Tue, Nov 19, 2013 at 8:03 AM, Yadid Ayzenberg <[email protected]
<mailto:[email protected]>> wrote:
My bad - I should have stated that up front. I guess it was kind
of implicit within my question.
Thanks for your help,
Yadid
On 11/19/13 10:59 AM, Mark Hamstra wrote:
Ah, sorry -- misunderstood the question.
On Nov 19, 2013, at 7:48 AM, Prashant Sharma
<[email protected] <mailto:[email protected]>> wrote:
I think that is Scheduling Within an Application, and he asked about
scheduling across apps. Spark standalone actually supports two ways of
scheduling across applications, and both are FIFO type.
http://spark.incubator.apache.org/docs/latest/spark-standalone.html
One is spread-out mode and the other is to use as few nodes as
possible [1]
1.
https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L383
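For reference, the master's choice between the two modes is controlled by a single system property (the property name here is taken from the linked Master.scala of the 0.8.x line; check your version before relying on it):

```properties
# Standalone master: spread an application's executors across many
# workers (the default) ...
spark.deploy.spreadOut=true
# ... or pack them onto as few workers as possible.
spark.deploy.spreadOut=false
```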
On Tue, Nov 19, 2013 at 9:02 PM, Mark Hamstra
<[email protected] <mailto:[email protected]>> wrote:
>>
>> According to the documentation, spark standalone currently
only supports a FIFO scheduling system.
>
>
> That's not true.
>
> [sorry for the prior misfire]
>
>
>
> On Tue, Nov 19, 2013 at 7:30 AM, Mark Hamstra
<[email protected] <mailto:[email protected]>> wrote:
>>
>>
>>
>>
>> On Tue, Nov 19, 2013 at 6:50 AM, Yadid Ayzenberg
<[email protected] <mailto:[email protected]>> wrote:
>>>
>>> Hi all,
>>>
>>> According to the documentation, spark standalone currently
only supports a FIFO scheduling system.
>>> I understand it's possible to limit the number of cores a job
uses by setting spark.cores.max.
>>> When running a job, will Spark try to use the maximum number of
cores on each machine until it reaches the set limit, or will it
do this round-robin style: use a single core on each machine, and
once it has used a core on all of the slaves without reaching the
limit, use an additional core on each machine, and so on?
>>>
>>> I think the latter makes more sense, but I want to be sure
that is the case.
>>>
>>> Thanks,
>>> Yadid
>>>
>>
>