GitHub user sidhavratha opened a pull request:

    https://github.com/apache/spark/pull/21685

    [SPARK-24707][DSTREAMS] Enable spark-kafka-streaming to maintain min …

    …buffer using async thread to avoid blocking kafka poll
    
    ## What changes were proposed in this pull request?
    
    Currently the Spark Kafka RDD blocks on the Kafka consumer poll. In a
    Spark-Kafka streaming job in particular, the poll duration is added to
    the batch processing time, which results in:
        * Increased batch processing time (on top of the time taken to
          process records)
        * Unpredictable batch processing time, since it varies with poll time.
    
    This PR maintains a minimum number of records in a buffer, filled by an
    async thread, so that streaming batch processing does not have to block
    on the Kafka poll.
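
To illustrate the general idea only, here is a minimal sketch of a buffer that a background thread keeps topped up, so the reader rarely waits on a slow poll. It is not the code in this PR (which changes Spark's Scala sources); it is written in Python for brevity, and `PrefetchBuffer`, `make_fake_poll`, `min_records` and `take` are hypothetical names introduced for this example. The fake poll function stands in for a Kafka consumer poll.

```python
import threading
import time
from collections import deque


class PrefetchBuffer:
    """Hypothetical helper: a background thread refills the buffer
    whenever it holds fewer than `min_records` records."""

    def __init__(self, poll, min_records):
        self._poll = poll                # assumed: returns a list of records, possibly empty
        self._min_records = min_records
        self._records = deque()
        self._cond = threading.Condition()
        self._closed = False
        self._thread = threading.Thread(target=self._refill, daemon=True)
        self._thread.start()

    def _refill(self):
        while True:
            with self._cond:
                # Sleep while the buffer already holds enough records.
                while not self._closed and len(self._records) >= self._min_records:
                    self._cond.wait()
                if self._closed:
                    return
            batch = self._poll()         # the slow call runs outside the lock
            with self._cond:
                self._records.extend(batch)
                self._cond.notify_all()

    def take(self, timeout=None):
        """Return the next record; waits only if the buffer is empty."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._records, timeout):
                raise TimeoutError("no record arrived in time")
            record = self._records.popleft()
            self._cond.notify_all()      # let the refill thread re-check the minimum
            return record

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()
        self._thread.join()


def make_fake_poll(total, batch_size, delay):
    """Stand-in for a consumer poll: yields `total` integers in batches."""
    offsets = iter(range(0, total, batch_size))

    def poll():
        time.sleep(delay)                # simulates poll latency
        start = next(offsets, None)
        return [] if start is None else list(range(start, start + batch_size))

    return poll


buffer = PrefetchBuffer(make_fake_poll(100, 10, 0.01), min_records=20)
first_five = [buffer.take(timeout=1.0) for _ in range(5)]
buffer.close()
```

In this sketch the reader pays the poll latency only when the buffer runs completely dry; otherwise the refill thread absorbs it in the background. A real implementation would also have to handle offsets, consumer thread-safety and error propagation, none of which are modeled here.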
    
    ## How was this patch tested?
    
    Unit test / manual test.
    
[Before_change.pdf](https://github.com/apache/spark/files/2152353/Before_change.pdf)
    
[After_change_2000_buffer_per_part.pdf](https://github.com/apache/spark/files/2152354/After_change_2000_buffer_per_part.pdf)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sidhavratha/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21685.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21685
    
----
commit 35d792a83f13291a99cd1bf3ce89f932614da9c0
Author: s0k00rv <sidhavratha.kumar@...>
Date:   2018-07-01T03:00:45Z

    [SPARK-24707][DSTREAMS] Enable spark-kafka-streaming to maintain min buffer 
using async thread to avoid blocking kafka poll

----

