You might have a look at the Spark docs to start. 1 batch = 1 RDD, but 1 RDD can have many partitions, and should, for scale. You do not submit multiple jobs to get parallelism; parallelism comes from the number of partitions in the RDD.

The number of partitions in a streaming RDD is determined by the block interval and the batch interval. If you have a batch interval of 10s and a block interval of 1s, you'll get 10 partitions of data in the RDD.
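To make that concrete, here's a minimal sketch; the app name, master, and socket source are made up for illustration, not taken from your job. Note this applies to receiver-based sources; the direct Kafka API instead maps Kafka partitions to RDD partitions.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingPartitionsDemo {
      def main(args: Array[String]): Unit = {
        // 1s blocks within a 10s batch -> each batch's RDD arrives with
        // roughly 10 partitions for a receiver-based source.
        val conf = new SparkConf()
          .setAppName("StreamingPartitionsDemo")  // hypothetical name
          .setMaster("local[4]")
          .set("spark.streaming.blockInterval", "1s")
        val ssc = new StreamingContext(conf, Seconds(10))

        val lines = ssc.socketTextStream("localhost", 9999)

        // One RDD per batch, many partitions per RDD; repartition if the
        // natural count is too low to keep your cores busy.
        lines.repartition(32).foreachRDD { rdd =>
          println(s"partitions in this batch: ${rdd.partitions.length}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }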
On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
> Understood. We'll use the multi-threaded code we already have.
>
> How are these execution slots filled up? I assume each slot is dedicated
> to one submitted task. If that's the case, how is each task distributed,
> i.e. how is that task run in a multi-node fashion? Say 1000 batches/RDDs
> are extracted out of Kafka; how does that relate to the number of
> executors vs. task slots?
>
> Presumably we can fill up the slots with multiple instances of the same
> task... How do we know how many to launch?
>
> On Mon, May 11, 2015 at 5:20 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> BTW I think my comment was wrong, as Marcelo demonstrated. In
>> standalone mode you'd have one worker, and you do have one executor,
>> but his explanation is right. But you certainly have execution slots
>> for each core.
>>
>> Are you talking about your own user code? You can make threads, but
>> that's nothing to do with Spark then. If you run code on your driver,
>> it's not distributed. If you run Spark over an RDD with 1 partition,
>> only one task works on it.
>>
>> On Mon, May 11, 2015 at 10:16 PM, Dmitry Goldenberg
>> <dgoldenberg...@gmail.com> wrote:
>> > Sean,
>> >
>> > How does this model actually work? Let's say we want to run one job
>> > as N threads executing one particular task, e.g. streaming data out
>> > of Kafka into a search engine. How do we configure our Spark job
>> > execution?
>> >
>> > Right now, I'm seeing this job run as a single thread. And it's
>> > quite a bit slower than just running a simple utility with a thread
>> > executor with a thread pool of N threads doing the same task.
>> >
>> > The performance I'm seeing of running the Kafka-Spark Streaming job
>> > is 7 times slower than that of the utility. What's pulling Spark
>> > back?
>> >
>> > Thanks.
>> >
>> > On Mon, May 11, 2015 at 4:55 PM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> You have one worker with one executor with 32 execution slots.
>> >>
>> >> On Mon, May 11, 2015 at 9:52 PM, dgoldenberg
>> >> <dgoldenberg...@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > Is there anything special one must do, running locally and
>> >> > submitting a job like so:
>> >> >
>> >> > spark-submit \
>> >> >   --class "com.myco.Driver" \
>> >> >   --master local[*] \
>> >> >   ./lib/myco.jar
>> >> >
>> >> > In my logs, I'm only seeing log messages with the thread
>> >> > identifier of "Executor task launch worker-0".
>> >> >
>> >> > There are 4 cores on the machine, so I expected 4 threads to be
>> >> > at play. Running with local[32] did not yield 32 worker threads.
>> >> >
>> >> > Any recommendations? Thanks.
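On the last question above: with local[32] you will still only see "Executor task launch worker-0" if the RDD has a single partition, because one partition means one task, no matter how many slots exist. A throwaway sketch (names made up) that should show the slots actually filling once there are enough partitions:

    import org.apache.spark.{SparkConf, SparkContext}

    object SlotCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[4]").setAppName("SlotCheck"))

        // 8 partitions -> 8 tasks. With local[4] you should see up to
        // four distinct "Executor task launch worker" thread names
        // interleaved; with a 1-partition RDD, only worker-0 ever runs.
        sc.parallelize(1 to 8, 8).foreachPartition { _ =>
          println(Thread.currentThread().getName)
          Thread.sleep(2000)  // hold the slot so the overlap is visible
        }

        sc.stop()
      }
    }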