Hi, Did you try another implementation of DirectStream where you give only topic. It would read all topic partitions in parallel under a batch interval . You need not create union explicitly.
Sent from Samsung Mobile. <div>-------- Original message --------</div><div>From: p pathiyil <[email protected]> </div><div>Date:11/02/2016 19:29 (GMT+05:30) </div><div>To: [email protected] </div><div>Cc: </div><div>Subject: Spark Streaming with Kafka: Dealing with 'slow' partitions </div><div> </div>Hi, I am looking at a way to isolate the processing of messages from each Kafka partition within the same driver. Scenario: A DStream is created with the createDirectStream call by passing in a few partitions. Let us say that the streaming context is defined to have a time duration of 2 seconds. If the processing of messages from a single partition takes more than 2 seconds (while all the others finish much quicker), it seems that the next set of jobs get scheduled only after the processing of that last partition. This means that the delay is effective for all partitions and not just the partition that was truly the cause of the delay. What I would like to do is to have the delay only impact the 'slow' partition. Tried to create one DStream per partition and then do a union of all partitions, (similar to the sample in http://spark.apache.org/docs/latest/streaming-programming-guide.html#reducing-the-batch-processing-times), but that didn't seem to help. Please suggest the correct approach to solve this issue. Thanks, Praveen.
