Regarding your first question: having more partitions than executor cores usually gives you better utilization, because the workload is distributed more evenly across tasks. There is some per-task scheduling overhead, but as long as the number of tasks isn't wildly out of proportion to the number of cores, that shouldn't be a significant problem.
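To make that concrete, here's a minimal sketch (not Spark code, just arithmetic) of how tasks run in "waves" of cores. The class and method names are mine, chosen for illustration; the numbers match the question below (300 partitions, 10 executors with 3 cores each):

```java
// Sketch: with P partitions and C total cores, tasks run in roughly
// ceil(P / C) waves. A modest P > C smooths out skew between tasks;
// a huge ratio mostly adds per-task scheduling overhead.
public class TaskWaves {
    static int waves(int partitions, int totalCores) {
        // ceiling division: number of scheduling waves
        return (partitions + totalCores - 1) / totalCores;
    }

    public static void main(String[] args) {
        int cores = 10 * 3;                    // 10 executors x 3 cores
        System.out.println(waves(300, cores)); // 300 partitions -> 10 waves
        System.out.println(waves(100, cores)); // 100 partitions -> 4 waves
    }
}
```

Note that dropping from 300 to 100 partitions cuts the number of waves, but whether the job actually finishes faster depends on per-partition work and skew, not just the wave count.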
I don't really understand your second question.

On Sat, Jul 11, 2015 at 5:00 AM, Shushant Arora <shushantaror...@gmail.com> wrote:

> 1. Spark Streaming 1.3 creates as many RDD partitions as there are Kafka
> partitions in the topic. Say I have 300 partitions in the topic and 10
> executors, each with 3 cores. Does that mean only 10*3 = 30 partitions are
> processed at a time, then the next 30, and so on? Since executors launch
> one task per RDD partition, I need 300 tasks in total, but since I have
> 30 cores (10 executors, each with 3 cores), these tasks will execute 30
> at a time until all 300 are done.
>
> So will reducing the number of Kafka partitions to, say, 100 speed up the
> processing?
>
> 2. In a Spark Streaming job, I processed the Kafka stream using
> foreachRDD:
>
> directKafkaStream.foreachRDD(new Function() { public void call(v1) {
>     v1.foreachPartition(new Function() { public void call() {
>         // ..process partition
>     }});
> }});
>
> Since foreachRDD is an action, it spawns a Spark job, but these job
> timings are not coming on the driver console like they do with map and
> print, e.g.:
>
> -------------------------------------------
> Time: 1429054870000 ms
> -------------------------------------------
> -------------------------------------------
> Time: 1429054871000 ms
> -------------------------------------------
>
> ..................
>
> Why is it so?
>
> Thanks
> Shushant