The real way to fix this is by changing partitioning, so you don't have a hot partition. It would be better to do this at the time you're producing messages, but you can also do it with a shuffle / repartition during consuming.
There is a setting to allow another batch to start in parallel, but that's likely to have unintended consequences. On Thu, Feb 11, 2016 at 7:59 AM, p pathiyil <[email protected]> wrote: > Hi, > > I am looking at a way to isolate the processing of messages from each > Kafka partition within the same driver. > > Scenario: A DStream is created with the createDirectStream call by passing > in a few partitions. Let us say that the streaming context is defined to > have a time duration of 2 seconds. If the processing of messages from a > single partition takes more than 2 seconds (while all the others finish > much quicker), it seems that the next set of jobs get scheduled only after > the processing of that last partition. This means that the delay is > effective for all partitions and not just the partition that was truly the > cause of the delay. What I would like to do is to have the delay only > impact the 'slow' partition. > > Tried to create one DStream per partition and then do a union of all > partitions, (similar to the sample in > http://spark.apache.org/docs/latest/streaming-programming-guide.html#reducing-the-batch-processing-times), > but that didn't seem to help. > > Please suggest the correct approach to solve this issue. > > Thanks, > Praveen. >
