Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

Cody Koeninger Thu, 11 Feb 2016 06:45:18 -0800

The real way to fix this is by changing partitioning, so you don't have a
hot partition.  It would be better to do this at the time you're producing
messages, but you can also do it with a shuffle / repartition during
consuming.


There is a setting to allow another batch to start in parallel, but that's
likely to have unintended consequences.

On Thu, Feb 11, 2016 at 7:59 AM, p pathiyil <[email protected]> wrote:

> Hi,
>
> I am looking at a way to isolate the processing of messages from each
> Kafka partition within the same driver.
>
> Scenario: A DStream is created with the createDirectStream call by passing
> in a few partitions. Let us say that the streaming context is defined to
> have a time duration of 2 seconds. If the processing of messages from a
> single partition takes more than 2 seconds (while all the others finish
> much quicker), it seems that the next set of jobs get scheduled only after
> the processing of that last partition. This means that the delay is
> effective for all partitions and not just the partition that was truly the
> cause of the delay. What I would like to do is to have the delay only
> impact the 'slow' partition.
>
> Tried to create one DStream per partition and then do a union of all
> partitions, (similar to the sample in
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#reducing-the-batch-processing-times),
> but that didn't seem to help.
>
> Please suggest the correct approach to solve this issue.
>
> Thanks,
> Praveen.
>

Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

Reply via email to