Re: Kafka - streaming from multiple topics

Neelesh Mon, 21 Dec 2015 09:38:00 -0800

Thanks Cody.  My case is #2. Just wanted to confirm when you say different
spark jobs, do you mean one  spark-submit per topic, or just use different
threads in the driver to submit the job?


Thanks!

On Mon, Dec 21, 2015 at 8:05 AM, Cody Koeninger <c...@koeninger.org> wrote:

> Spark streaming by default wont start the next batch until the current
> batch is completely done, even if only a few cores are still working.  This
> is generally a good thing, otherwise you'd have weird ordering issues.
>
> Each topicpartition is separate.
>
> Unbalanced partitions can happen either because 1. your hash while
> producing into kafka is bad (e.g. hash on customer id and 90% of your
> traffic is from one customer);  or 2. because you're doing a map over
> several topics with wildly different message rates.  If you're having the
> first problem, you just need to fix it.  If you're having the second
> problem, use different spark jobs for different topics.
>
> On Sun, Dec 20, 2015 at 2:28 PM, Neelesh <neele...@gmail.com> wrote:
>
>> @Chris,
>>     There is a 1-1 mapping b/w spark partitions & kafka partitions out of
>> the box . One can break it by repartitioning of course and add more
>> parallelism, but that has its own issues around consumer offset management-
>>  when do I commit the offsets, for example. While its trivial to increase
>> the number of Kafka partitions to achieve higher parallelism, it is
>> inherently static and cannot solve the problem of traffic bursts.  If it
>> were possible to repartition data from a single kafka partition and still
>> be able to handle consumer offset management when all child partitions that
>> make up the kafka partition succeeded/failed, it would solve the problem.
>>
>> @Cody, would love to hear your thoughts around this.
>>
>> On Sun, Dec 20, 2015 at 8:37 AM, Chris Fregly <ch...@fregly.com> wrote:
>>
>>> separating out your code into separate streaming jobs - especially when
>>> there are no dependencies between the jobs - is almost always the best
>>> route.  it's easier to combine atoms (fusion), then split them (fission).
>>>
>>> I recommend splitting out jobs along batch window, stream window, and
>>> state-tracking characteristics.
>>>
>>> for example, imagine 3 separate jobs for the following:
>>>
>>> 1) light storing of raw data into Cassandra (500ms batch interval)
>>> 2) medium aggregations/window roll ups (2000ms batch interval)
>>> 3) heavy training a ML model (10000ms batch interval).
>>>
>>> and reminder that you can control (isolate or combine) the spark
>>> resources used by these separate, single purpose streaming jobs using
>>> scheduler pools just like your batch spark jobs.
>>>
>>> @cody:  curious about neelesh's question, as well.  does the Kafka
>>> Direct Stream API treat each Kafka Topic Partition separately in terms of
>>> parallel retrieval?
>>>
>>> more context:  within a Kafka Topic partition, Kafka guarantees order,
>>> but not total ordering across partitions.  this is normal and expected.
>>>
>>> so I assume the the Kafka Direct Streaming connector can retrieve (and
>>> recover/retry) from separate partitions in parallel and still maintain the
>>> ordering guarantees offered by Kafka.
>>>
>>> if this is true, then I'd suggest @neelesh create more partitions within
>>> the Kafka Topic to improve parallelism - same as any distributed,
>>> partitioned data processing engine including spark.
>>>
>>> if this is not true, is there a technical limitation to prevent this
>>> parallelism within the connector?
>>>
>>> On Dec 19, 2015, at 5:51 PM, Neelesh <neele...@gmail.com> wrote:
>>>
>>> A related issue -  When I put multiple topics in a single stream, the
>>> processing delay is as bad as the slowest task in the number of tasks
>>> created. Even though the topics are unrelated to each other, RDD at time
>>> "t1" has to wait for the RDD at "t0"  is fully executed,  even if most
>>> cores are idling, and  just one task is still running and the rest of them
>>> have completed. Effectively, a lightly loaded topic gets the worst deal
>>> because of a heavily loaded topic
>>>
>>> Is my understanding correct?
>>>
>>>
>>>
>>> On Thu, Dec 17, 2015 at 9:53 AM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>
>>>> You could stick them all in a single stream, and do mapPartitions, then
>>>> switch on the topic for that partition.  It's probably cleaner to do
>>>> separate jobs, just depends on how you want to organize your code.
>>>>
>>>> On Thu, Dec 17, 2015 at 11:11 AM, Jean-Pierre OCALAN <
>>>> jpoca...@gmail.com> wrote:
>>>>
>>>>> Hi Cody,
>>>>>
>>>>> First of all thanks for the note about spark.streaming.concurrentJobs.
>>>>> I guess this is why it's not mentioned in the actual spark streaming doc.
>>>>> Since those 3 topics contain completely different data on which I need
>>>>> to apply different kind of transformations, I am not sure joining them
>>>>> would be really efficient, unless you know something that I don't.
>>>>>
>>>>> As I really don't need any interaction between those streams, I think
>>>>> I might end up running 3 different streaming apps instead of one.
>>>>>
>>>>> Thanks again!
>>>>>
>>>>> On Thu, Dec 17, 2015 at 11:43 AM, Cody Koeninger <c...@koeninger.org>
>>>>> wrote:
>>>>>
>>>>>> Using spark.streaming.concurrentJobs for this probably isn't a good
>>>>>> idea, as it allows the next batch to start processing before current one 
>>>>>> is
>>>>>> finished, which may have unintended consequences.
>>>>>>
>>>>>> Why can't you use a single stream with all the topics you care about,
>>>>>> or multiple streams if you're e.g. joining them?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Dec 16, 2015 at 3:00 PM, jpocalan <jpoca...@gmail.com> wrote:
>>>>>>
>>>>>>> Nevermind, I found the answer to my questions.
>>>>>>> The following spark configuration property will allow you to process
>>>>>>> multiple KafkaDirectStream in parallel:
>>>>>>> --conf spark.streaming.concurrentJobs=<something greater than 1>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-streaming-from-multiple-topics-tp8678p25723.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com <http://nabble.com>.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> jean-pierre ocalan
>>>>> jpoca...@gmail.com
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Kafka - streaming from multiple topics

Reply via email to