[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context

2015-08-31 Thread Sudarshan Kadambi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724038#comment-14724038
 ] 

Sudarshan Kadambi commented on SPARK-10320:
---

Good questions, Cody.

When adding a topic after the streaming context has started, we should at a 
minimum be able to start consumption from the beginning or end of each topic 
partition. When a topic is removed from the subscription, no offsets should be 
retained. When it is added again later, it is no different from a brand-new 
topic, and the same options (beginning, end, or a specific offset) are available.

When the driver restarts, consumption for all existing topics should resume 
from the saved offsets by default, but jobs should have the flexibility to 
choose a different consumption point (beginning, end, or a specific offset). 
If you restart the job and specify a new offset, that is where consumption 
should start, in effect overriding any saved offsets.
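For reference, in the current direct stream API (Spark 1.x), overriding saved 
offsets is effectively what you get by skipping checkpoint recovery and 
constructing the stream with an explicit fromOffsets map. A rough sketch; the 
topic name, offsets, and broker address below are made up:

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("offset-override")
val ssc = new StreamingContext(conf, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

// Explicitly chosen starting points per partition, ignoring any
// offsets a previous run may have saved.
val fromOffsets = Map(
  TopicAndPartition("events", 0) -> 5000L,
  TopicAndPartition("events", 1) -> 7500L
)

val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
)
```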

Topics can be repartitioned in Kafka today, so we need to handle partition 
count increases or decreases even in the absence of dynamic topic registration 
in Spark Streaming. How is this handled currently? I'd expect the same solution 
to carry over.

The topic changes happen in the same thread of execution where the initial list 
of topics was provided before starting the streaming context. I'm not sure of 
the implications of doing it in the onBatchCompleted handler.

> Support new topic subscriptions without requiring restart of the streaming 
> context
> --
>
> Key: SPARK-10320
> URL: https://issues.apache.org/jira/browse/SPARK-10320
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Sudarshan Kadambi
>
> Spark Streaming lacks the ability to subscribe to new topics or unsubscribe 
> from current ones once the streaming context has been started. Restarting the 
> streaming context increases the latency of update handling.
> Consider a streaming application subscribed to n topics. Let's say one of the 
> topics is no longer needed in streaming analytics and hence should be 
> dropped. We could do this by stopping the streaming context, removing that 
> topic from the topic list, and restarting the streaming context. Since with 
> some DStreams, such as DirectKafkaStream, the per-partition offsets are 
> maintained by Spark, we should be able to resume uninterrupted (I think?) 
> from where we left off with a minor delay. However, in instances where 
> expensive state initialization (from an external datastore) may be needed for 
> datasets published to all topics before streaming updates can be applied to 
> them, it is more convenient to subscribe or unsubscribe only the incremental 
> changes to the topic list. Without such a feature, updates go unprocessed for 
> longer than they need to, thus affecting QoS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context

2015-08-31 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724074#comment-14724074
 ] 

Cody Koeninger commented on SPARK-10320:


" If you restart the job and specify a new offset, that is where consumption 
should start, in effect overriding any saved offsets."

That's not the way checkpoints work.  You're either restarting from a 
checkpoint or you're not; the decision is up to you.  If you want to specify a 
new offset, start the job clean.

"The topic changes happen in the same thread of execution where the initial 
list of topics was provided before starting the streaming context."

Can you say a little more about what you're actually doing here?  How do you 
know when topics need to be modified?  Typically streaming jobs just call 
ssc.awaitTermination in their main thread, which seems incompatible with what 
you're describing.
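For context, the standard restart pattern that the checkpoint mechanism 
expects looks roughly like the sketch below; the checkpoint path, app name, 
and batch interval are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/app-checkpoint" // placeholder path

// Builds a fresh context; only invoked when no usable checkpoint exists.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("topic-consumer")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define DStreams here; on restart they are restored from the
  // checkpoint instead, along with their saved offsets.
  ssc
}

// Either recovers from the checkpoint or starts clean -- never a mix of both.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination() // blocks the main thread until stop or failure
```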







[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context

2015-08-27 Thread Sudarshan Kadambi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717220#comment-14717220
 ] 

Sudarshan Kadambi commented on SPARK-10320:
---

There is ingest-time analytics (the independent application of transforms to 
data published to individual topics) and query-time analytics (user queries 
that require joins across RDDs holding the transformed data). However, even 
ingest-time analytics will potentially require joins across data published to 
different topics. For these reasons, this needs to be a single Spark Streaming 
application.







[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context

2015-08-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717184#comment-14717184
 ] 

Sean Owen commented on SPARK-10320:
---

It sounds like you listen to topics and process them fairly independently 
(as you should). Why not run multiple streaming apps? Sure, you incur some 
overhead, but you gain isolation and simplicity.







[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context

2015-08-27 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717342#comment-14717342
 ] 

Cody Koeninger commented on SPARK-10320:


As I said on the list, the best way to deal with this currently is start a new 
app with your new code, before stopping the old app.

In terms of a potential feature addition, I think there are a number of 
questions that would need to be cleared up... e.g.

- when would you change topics?  During a StreamingListener onBatchCompleted 
handler?  From a separate thread?

- when adding a topic, what would the expectations around starting offset be?  
As in the current api, provide explicit offsets per partition, start at 
beginning, or start at end?

- if you add partitions for topics that currently exist, and specify a starting 
offset that's different from where the job is currently, what would the 
expectation be?
- if you add, later remove, then later re-add a topic, what would the 
expectation regarding saved checkpoints be?
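One way to make these design questions concrete is to sketch what such an API 
might look like. Everything below is hypothetical -- no such API exists in any 
Spark release -- and the stub only exercises the proposed semantics (an 
unsubscribed topic's offsets are discarded, so re-adding it is like adding a 
brand-new topic):

```scala
// Hypothetical sketch only; not part of Spark.
sealed trait StartPosition
case object FromBeginning extends StartPosition
case object FromEnd extends StartPosition
case class FromOffsets(offsets: Map[Int, Long]) extends StartPosition // partition -> offset

trait DynamicSubscription {
  // Begin consuming a new topic at the given position without a restart.
  def subscribe(topic: String, start: StartPosition): Unit
  // Stop consuming a topic; any saved offsets for it are discarded.
  def unsubscribe(topic: String): Unit
  def subscribedTopics: Set[String]
}

// Toy in-memory stub of the proposed semantics.
class StubSubscription extends DynamicSubscription {
  private var topics = Map.empty[String, StartPosition]
  def subscribe(topic: String, start: StartPosition): Unit = topics += (topic -> start)
  def unsubscribe(topic: String): Unit = topics -= topic
  def subscribedTopics: Set[String] = topics.keySet
}
```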



