You might have a look at the Spark docs to start. 1 batch = 1 RDD, but 1 RDD can have many partitions, and should, for scale. You do not submit multiple jobs to get parallelism; parallelism comes from the number of partitions in the RDD.

The number of partitions in a streaming RDD is determined by the block interval and the batch interval. If you have a batch interval of 10s and a block interval of 1s, you'll get 10 partitions of data in the RDD.
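To make that concrete, here's a minimal sketch; the app name, master, and socket source are made up for illustration, not taken from your job. Note this applies to receiver-based sources; the direct Kafka API instead maps Kafka partitions to RDD partitions.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingPartitionsDemo {
      def main(args: Array[String]): Unit = {
        // 1s blocks within a 10s batch -> each batch's RDD arrives with
        // roughly 10 partitions for a receiver-based source.
        val conf = new SparkConf()
          .setAppName("StreamingPartitionsDemo")  // hypothetical name
          .setMaster("local[4]")
          .set("spark.streaming.blockInterval", "1s")
        val ssc = new StreamingContext(conf, Seconds(10))

        val lines = ssc.socketTextStream("localhost", 9999)

        // One RDD per batch, many partitions per RDD; repartition if the
        // natural count is too low to keep your cores busy.
        lines.repartition(32).foreachRDD { rdd =>
          println(s"partitions in this batch: ${rdd.partitions.length}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }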
On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
> Understood. We'll use the multi-threaded code we already have.
>
> How are these execution slots filled up? I assume each slot is dedicated
> to one submitted task. If that's the case, how is each task distributed,
> i.e. how is that task run in a multi-node fashion? Say 1000 batches/RDDs
> are extracted out of Kafka; how does that relate to the number of
> executors vs. task slots?
>
> Presumably we can fill up the slots with multiple instances of the same
> task... How do we know how many to launch?
>
> On Mon, May 11, 2015 at 5:20 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> BTW I think my comment was wrong, as Marcelo demonstrated. In
>> standalone mode you'd have one worker, and you do have one executor,
>> but his explanation is right. But you certainly have execution slots
>> for each core.
>>
>> Are you talking about your own user code? You can make threads, but
>> that's nothing to do with Spark then. If you run code on your driver,
>> it's not distributed. If you run Spark over an RDD with 1 partition,
>> only one task works on it.
>>
>> On Mon, May 11, 2015 at 10:16 PM, Dmitry Goldenberg
>> <dgoldenberg...@gmail.com> wrote:
>> > Sean,
>> >
>> > How does this model actually work? Let's say we want to run one job
>> > as N threads executing one particular task, e.g. streaming data out
>> > of Kafka into a search engine. How do we configure our Spark job
>> > execution?
>> >
>> > Right now, I'm seeing this job run as a single thread. And it's
>> > quite a bit slower than just running a simple utility with a thread
>> > executor with a thread pool of N threads doing the same task.
>> >
>> > The performance I'm seeing of running the Kafka-Spark Streaming job
>> > is 7 times slower than that of the utility. What's pulling Spark
>> > back?
>> >
>> > Thanks.
>> >
>> > On Mon, May 11, 2015 at 4:55 PM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> You have one worker with one executor with 32 execution slots.
>> >>
>> >> On Mon, May 11, 2015 at 9:52 PM, dgoldenberg
>> >> <dgoldenberg...@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > Is there anything special one must do, running locally and
>> >> > submitting a job like so:
>> >> >
>> >> > spark-submit \
>> >> >   --class "com.myco.Driver" \
>> >> >   --master local[*] \
>> >> >   ./lib/myco.jar
>> >> >
>> >> > In my logs, I'm only seeing log messages with the thread
>> >> > identifier of "Executor task launch worker-0".
>> >> >
>> >> > There are 4 cores on the machine, so I expected 4 threads to be
>> >> > at play. Running with local[32] did not yield 32 worker threads.
>> >> >
>> >> > Any recommendations? Thanks.
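On the last question above: with local[32] you will still only see "Executor task launch worker-0" if the RDD has a single partition, because one partition means one task, no matter how many slots exist. A throwaway sketch (names made up) that should show the slots actually filling once there are enough partitions:

    import org.apache.spark.{SparkConf, SparkContext}

    object SlotCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[4]").setAppName("SlotCheck"))

        // 8 partitions -> 8 tasks. With local[4] you should see up to
        // four distinct "Executor task launch worker" thread names
        // interleaved; with a 1-partition RDD, only worker-0 ever runs.
        sc.parallelize(1 to 8, 8).foreachPartition { _ =>
          println(Thread.currentThread().getName)
          Thread.sleep(2000)  // hold the slot so the overlap is visible
        }

        sc.stop()
      }
    }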