Hi,

Did  you try  another  implementation  of  DirectStream where you  give  only  
topic. It would  read all  topic partitions in parallel  under a batch  
interval . You  need  not create union explicitly. 


Sent from Samsung Mobile.

<div>-------- Original message --------</div><div>From: p pathiyil 
<[email protected]> </div><div>Date:11/02/2016  19:29  (GMT+05:30) 
</div><div>To: [email protected] </div><div>Cc:  </div><div>Subject: Spark 
Streaming with Kafka: Dealing with 'slow' partitions </div><div>
</div>Hi,

I am looking at a way to isolate the processing of messages from each Kafka 
partition within the same driver. 

Scenario: A DStream is created with the createDirectStream call by passing in a 
few partitions. Let us say that the streaming context is defined to have a time 
duration of 2 seconds. If the processing of messages from a single partition 
takes more than 2 seconds (while all the others finish much quicker), it seems 
that the next set of jobs get scheduled only after the processing of that last 
partition. This means that the delay is effective for all partitions and not 
just the partition that was truly the cause of the delay. What I would like to 
do is to have the delay only impact the 'slow' partition.

Tried to create one DStream per partition and then do a union of all 
partitions, (similar to the sample in 
http://spark.apache.org/docs/latest/streaming-programming-guide.html#reducing-the-batch-processing-times),
 but that didn't seem to help.

Please suggest the correct approach to solve this issue.

Thanks,
Praveen.

Reply via email to