[jira] [Created] (SPARK-23541) Allow Kafka source to read data with greater parallelism than the number of topic-partitions

Tathagata Das (JIRA) Wed, 28 Feb 2018 17:28:36 -0800

Tathagata Das created SPARK-23541:
-------------------------------------

             Summary: Allow Kafka source to read data with greater parallelism than the number of topic-partitions
                 Key: SPARK-23541
                 URL: https://issues.apache.org/jira/browse/SPARK-23541
             Project: Spark
          Issue Type: New Feature
          Components: Structured Streaming
    Affects Versions: 2.3.0
            Reporter: Tathagata Das
            Assignee: Tathagata Das



Currently, when the Kafka source reads from Kafka, it generates as many tasks 
as there are partitions in the topic(s) being read. In some cases, it may be 
beneficial to read the data with greater parallelism, that is, with more 
partitions/tasks than topic-partitions. To do so, the offset ranges must be 
divided into smaller ranges such that the number of records in each partition 
~= total records in batch / desired partitions. This would also balance out 
any data skew between topic-partitions.
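
A minimal sketch of the range splitting described above, written in Python for 
brevity. It is an illustration only, not Spark's implementation: the names 
OffsetRange and split_ranges are hypothetical, and the rounding rule (each 
topic-partition gets a number of pieces proportional to its share of the 
batch, with at least one piece) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical offset range for one topic-partition."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive

    @property
    def size(self) -> int:
        return self.until_offset - self.from_offset


def split_ranges(ranges: List[OffsetRange],
                 desired_partitions: int) -> List[OffsetRange]:
    """Split ranges so each piece holds roughly total / desired records."""
    if desired_partitions <= 0:
        raise ValueError("desired_partitions must be positive")

    total = sum(r.size for r in ranges)
    if total == 0:
        return list(ranges)

    # Target number of records per task.
    target = total / desired_partitions

    result: List[OffsetRange] = []
    for r in ranges:
        if r.size == 0:
            result.append(r)
            continue
        # Assumed rule: pieces proportional to this range's share of the
        # batch, at least 1, and never more pieces than records.
        pieces = min(max(1, round(r.size / target)), r.size)
        base, extra = divmod(r.size, pieces)
        start = r.from_offset
        for i in range(pieces):
            # The first `extra` pieces take one additional record.
            end = start + base + (1 if i < extra else 0)
            result.append(OffsetRange(r.topic, r.partition, start, end))
            start = end
    return result
```

For example, with two topic-partitions holding 900 and 100 records and 10 
desired partitions, the target is 100 records per task, so the first range is 
split into nine pieces and the second is left whole. Because of rounding, the 
resulting count is only approximately the desired number in general.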



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

