[jira] [Commented] (SPARK-17813) Maximum data per trigger

2016-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584515#comment-15584515
 ] 

Apache Spark commented on SPARK-17813:
--

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/15527

> Maximum data per trigger
> 
>
> Key: SPARK-17813
> URL: https://issues.apache.org/jira/browse/SPARK-17813
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> At any given point in a streaming query execution, we process all available 
> data.  This maximizes throughput at the cost of latency.  We should add 
> something similar to the {{maxFilesPerTrigger}} option available for files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17813) Maximum data per trigger

2016-10-14 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577037#comment-15577037
 ] 

Cody Koeninger commented on SPARK-17813:


To be clear, the current direct stream (and, as a result, the structured 
stream) simply will not work with compacted topics, because of the 
assumption that offset ranges are contiguous.  There's a ticket for it, 
SPARK-17147, with a prototype solution, waiting for feedback from a user.

So for a global maxOffsetsPerTrigger, are you saying a Spark configuration?  Is 
there a reason not to make it a maxRowsPerTrigger (or messages, or whatever 
name) so that it can potentially be reused by other sources?  I think a 
proportional distribution of offsets shouldn't be too hard for this.  I can 
pick this up once the assign stuff is stabilized.
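The proportional distribution mentioned above could be sketched roughly like 
this.  This is a hypothetical helper under assumed names and shapes, not the 
actual Spark implementation:

```python
def proportional_limits(max_total, available):
    """Cap each topic partition at a share of max_total proportional
    to the offsets it has available in this batch.

    available: dict mapping a partition key to its available offset count.
    Returns a dict of per-partition limits for the next batch.
    """
    total = sum(available.values())
    if total <= max_total:
        # Everything fits under the global cap: no limiting needed.
        return dict(available)
    # Integer proportional share; never exceeds what the partition has.
    return {tp: max_total * n // total for tp, n in available.items()}

limits = proportional_limits(500, {"a-0": 100, "b-0": 900})
# a-0 gets 50, b-0 gets 450: shares proportional to availability
```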







[jira] [Commented] (SPARK-17813) Maximum data per trigger

2016-10-14 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576826#comment-15576826
 ] 

Michael Armbrust commented on SPARK-17813:
--

I think it's okay to ignore compacted topics, at least initially.  You would 
still respect the "maximum" nature of the configuration, though you would 
waste some effort scheduling tasks smaller than the max.

I would probably start simple and just have a global {{maxOffsetsPerTrigger}} 
that bounds the total number of records in each batch and is distributed 
among the topic partitions.  Topic partitions that are skewed too small will 
not have enough offsets available, and we can spill that over to the ones that 
are skewed large.  We can always add something more complicated in the future.

An alternative proposal would be to spread out the max to each partition 
proportional to the total number of offsets available when planning.

Regarding [SPARK-17510], I would make this configuration an option to the 
DataStreamReader, so you'd be able to configure it per stream instead of 
globally.  So, I think we are good.
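The simple scheme above (an even split with spill-over from skewed-small 
partitions) could be sketched as follows.  This is an illustrative sketch 
under assumed names, not the eventual Spark code:

```python
def distribute_max_offsets(max_total, available):
    """Split a global per-trigger cap across topic partitions.

    Partitions are visited smallest-first; any share a small partition
    cannot use spills over to the remaining, larger partitions.
    """
    limits = {}
    budget = max_total
    # Smallest partitions first, so unused budget spills to larger ones.
    pending = sorted(available.items(), key=lambda kv: kv[1])
    while pending:
        tp, n = pending.pop(0)
        share = budget // (len(pending) + 1)  # even split of what's left
        limits[tp] = min(n, share)
        budget -= limits[tp]
    return limits

limits = distribute_max_offsets(3000, {"a-0": 100, "b-0": 1000, "b-1": 5000})
# a-0 takes all 100, b-0 all 1000, and b-1 absorbs the spilled budget (1900)
```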







[jira] [Commented] (SPARK-17813) Maximum data per trigger

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573806#comment-15573806
 ] 

Cody Koeninger commented on SPARK-17813:


So, issues to be worked out here (assuming we're still ignoring compacted 
topics):

maxOffsetsPerTrigger - how are these maximums distributed among partitions?  
What about skewed topics / partitions?

maxOffsetsPerTopicPartitionPerTrigger - (this isn't just hypothetical; see 
e.g. SPARK-17510.)  If we do this, how is the configuration communicated?

{noformat}
option("maxOffsetsPerTopicPartitionPerTrigger",
  """{"topicFoo": {"0": 600}, "topicBar": {"0": 300, "1": 600}}""")
{noformat}

{noformat}
option("maxOffsetsPerTopicPerTrigger", """{"topicFoo": 600, "topicBar": 300}""")
{noformat}
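For the per-topic-partition format, parsing the proposed JSON option value 
could look like this.  The option shape is the one proposed above, not a 
shipped API, and the helper name is made up for illustration:

```python
import json

def parse_per_partition_limits(option_value):
    """Turn the proposed JSON option value into a
    (topic, partition) -> max-offsets-per-trigger mapping."""
    raw = json.loads(option_value)
    return {(topic, int(partition)): limit
            for topic, partitions in raw.items()
            for partition, limit in partitions.items()}

limits = parse_per_partition_limits(
    '{"topicFoo": {"0": 600}, "topicBar": {"0": 300, "1": 600}}')
# limits[("topicBar", 1)] == 600
```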





