Thanks, Cody. My case is #2. Just to confirm: when you say different Spark jobs, do you mean one spark-submit per topic, or using different threads in the driver to submit the jobs?
Thanks!

On Mon, Dec 21, 2015 at 8:05 AM, Cody Koeninger <c...@koeninger.org> wrote:

> Spark streaming by default won't start the next batch until the current
> batch is completely done, even if only a few cores are still working. This
> is generally a good thing, otherwise you'd have weird ordering issues.
>
> Each topicpartition is separate.
>
> Unbalanced partitions can happen either because 1. your hash while
> producing into kafka is bad (e.g. hash on customer id and 90% of your
> traffic is from one customer); or 2. because you're doing a map over
> several topics with wildly different message rates. If you're having the
> first problem, you just need to fix it. If you're having the second
> problem, use different spark jobs for different topics.
>
> On Sun, Dec 20, 2015 at 2:28 PM, Neelesh <neele...@gmail.com> wrote:
>
>> @Chris,
>> There is a 1-1 mapping b/w spark partitions & kafka partitions out of
>> the box. One can break it by repartitioning of course and add more
>> parallelism, but that has its own issues around consumer offset management
>> (when do I commit the offsets, for example). While it's trivial to increase
>> the number of Kafka partitions to achieve higher parallelism, it is
>> inherently static and cannot solve the problem of traffic bursts. If it
>> were possible to repartition data from a single kafka partition and still
>> be able to handle consumer offset management when all child partitions that
>> make up the kafka partition succeeded/failed, it would solve the problem.
>>
>> @Cody, would love to hear your thoughts around this.
>>
>> On Sun, Dec 20, 2015 at 8:37 AM, Chris Fregly <ch...@fregly.com> wrote:
>>
>>> separating out your code into separate streaming jobs - especially when
>>> there are no dependencies between the jobs - is almost always the best
>>> route. it's easier to combine atoms (fusion) than to split them (fission).
>>>
>>> I recommend splitting out jobs along batch window, stream window, and
>>> state-tracking characteristics.
>>>
>>> for example, imagine 3 separate jobs for the following:
>>>
>>> 1) light storing of raw data into Cassandra (500ms batch interval)
>>> 2) medium aggregations/window roll ups (2000ms batch interval)
>>> 3) heavy training of an ML model (10000ms batch interval).
>>>
>>> and reminder that you can control (isolate or combine) the spark
>>> resources used by these separate, single purpose streaming jobs using
>>> scheduler pools just like your batch spark jobs.
>>>
>>> @cody: curious about neelesh's question, as well. does the Kafka
>>> Direct Stream API treat each Kafka Topic Partition separately in terms of
>>> parallel retrieval?
>>>
>>> more context: within a Kafka Topic partition, Kafka guarantees order,
>>> but not total ordering across partitions. this is normal and expected.
>>>
>>> so I assume the Kafka Direct Streaming connector can retrieve (and
>>> recover/retry) from separate partitions in parallel and still maintain the
>>> ordering guarantees offered by Kafka.
>>>
>>> if this is true, then I'd suggest @neelesh create more partitions within
>>> the Kafka Topic to improve parallelism - same as any distributed,
>>> partitioned data processing engine including spark.
>>>
>>> if this is not true, is there a technical limitation to prevent this
>>> parallelism within the connector?
>>>
>>> On Dec 19, 2015, at 5:51 PM, Neelesh <neele...@gmail.com> wrote:
>>>
>>> A related issue - When I put multiple topics in a single stream, the
>>> processing delay is as bad as the slowest task in the number of tasks
>>> created. Even though the topics are unrelated to each other, RDD at time
>>> "t1" has to wait until the RDD at "t0" is fully executed, even if most
>>> cores are idling, and just one task is still running and the rest of them
>>> have completed.
>>> Effectively, a lightly loaded topic gets the worst deal
>>> because of a heavily loaded topic.
>>>
>>> Is my understanding correct?
>>>
>>> On Thu, Dec 17, 2015 at 9:53 AM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>
>>>> You could stick them all in a single stream, and do mapPartitions, then
>>>> switch on the topic for that partition. It's probably cleaner to do
>>>> separate jobs, just depends on how you want to organize your code.
>>>>
>>>> On Thu, Dec 17, 2015 at 11:11 AM, Jean-Pierre OCALAN <
>>>> jpoca...@gmail.com> wrote:
>>>>
>>>>> Hi Cody,
>>>>>
>>>>> First of all thanks for the note about spark.streaming.concurrentJobs.
>>>>> I guess this is why it's not mentioned in the actual spark streaming doc.
>>>>> Since those 3 topics contain completely different data on which I need
>>>>> to apply different kinds of transformations, I am not sure joining them
>>>>> would be really efficient, unless you know something that I don't.
>>>>>
>>>>> As I really don't need any interaction between those streams, I think
>>>>> I might end up running 3 different streaming apps instead of one.
>>>>>
>>>>> Thanks again!
>>>>>
>>>>> On Thu, Dec 17, 2015 at 11:43 AM, Cody Koeninger <c...@koeninger.org>
>>>>> wrote:
>>>>>
>>>>>> Using spark.streaming.concurrentJobs for this probably isn't a good
>>>>>> idea, as it allows the next batch to start processing before the
>>>>>> current one is finished, which may have unintended consequences.
>>>>>>
>>>>>> Why can't you use a single stream with all the topics you care about,
>>>>>> or multiple streams if you're e.g. joining them?
>>>>>>
>>>>>> On Wed, Dec 16, 2015 at 3:00 PM, jpocalan <jpoca...@gmail.com> wrote:
>>>>>>
>>>>>>> Nevermind, I found the answer to my questions.
>>>>>>> The following spark configuration property will allow you to process
>>>>>>> multiple KafkaDirectStream in parallel:
>>>>>>> --conf spark.streaming.concurrentJobs=<something greater than 1>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-streaming-from-multiple-topics-tp8678p25723.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com <http://nabble.com>.
>>>>>
>>>>> --
>>>>> jean-pierre ocalan
>>>>> jpoca...@gmail.com
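
For anyone reading this thread later: the "single stream, mapPartitions, then switch on the topic for that partition" idea can be illustrated roughly as below. This is a plain-Python sketch of the dispatch logic only, not Spark code. The topic names ("clicks", "orders"), the handler functions, and the in-memory `batch` list are all made up for illustration. In a real job the topic for each partition would have to come from the stream's offset metadata, and that wiring is left out here.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical per-topic transformations; topic names are illustrative only.
def handle_clicks(records: Iterable[str]) -> Iterator[str]:
    for record in records:
        yield record.upper()

def handle_orders(records: Iterable[str]) -> Iterator[str]:
    for record in records:
        yield f"order:{record}"

# Lookup table playing the role of the "switch on the topic" step.
HANDLERS: Dict[str, Callable[[Iterable[str]], Iterator[str]]] = {
    "clicks": handle_clicks,
    "orders": handle_orders,
}

def process_partition(topic: str, records: Iterable[str]) -> Iterator[str]:
    """Choose the transformation from the partition's topic, then apply it lazily."""
    try:
        handler = HANDLERS[topic]
    except KeyError:
        raise ValueError(f"no handler registered for topic {topic!r}")
    return handler(records)

# Stand-in for one batch: each entry plays the role of one Kafka topic-partition
# as (topic, partition id, records). Real data would come from the stream.
batch: List[Tuple[str, int, List[str]]] = [
    ("clicks", 0, ["a", "b"]),
    ("orders", 0, ["42"]),
    ("clicks", 1, ["c"]),
]

# Every partition is processed independently, with its own topic's handler.
results = [
    (topic, partition, list(process_partition(topic, records)))
    for topic, partition, records in batch
]
```

Note that this only keeps the per-topic code separate. All partitions still belong to the same batch, so the batch is not done until its slowest partition finishes, which is the reason given above for preferring separate jobs when message rates differ a lot.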