Re: Plan on Structured Streaming in next major/minor release?

kant kodali Sat, 20 Oct 2018 22:22:21 -0700

+1 for raising all of this.
+1 for Queryable State (SPARK-16738 [3])

On Thu, Oct 18, 2018 at 9:59 PM Jungtaek Lim <kabh...@gmail.com> wrote:


> Small correction: when event time and a watermark are set, "timeout" in
> map/flatMapGroupsWithState does not behave like State TTL. That timeout
> guarantees removal of state once the state will no longer be used, similar
> to what we do for streaming aggregation, whereas State TTL works just as
> its name suggests (self-explanatory). Hence State TTL looks valid for all
> the cases.
>
> On Fri, Oct 19, 2018 at 12:20 PM Jungtaek Lim <kabh...@gmail.com> wrote:
>
>> Hi devs,
>>
>> While the Spark 2.4.0 release votes are still in progress, I see some
>> non-SS pull requests being reviewed and merged into the master branch,
>> so I guess it is OK to discuss the next release.
>>
>> It looks like one major TODO is left for structured streaming: allowing
>> stateful operations in continuous mode (watermark, stateful exactly-once).
>> No other major milestone has been shared. (Please let me know if I'm
>> missing something here!) From a structured streaming contributor's point
>> of view, there are other features we could discuss to see which are good
>> to have, and prioritize if possible (NOTE: this is just brainstorming, and
>> some items might not be valid for structured streaming):
>>
>> * Native support for session windows (SPARK-10816 [1])
>>   ** patch available
>> * Delegation token support for Kafka (SPARK-25501 [2])
>>   ** patch available
>> * Queryable State (SPARK-16738 [3])
>>   ** some discussion took place, but no action has been taken yet
>> * End-to-end exactly-once with the Kafka sink
>>   ** given that Kafka is a first-class streaming source/sink nowadays
>> * Custom window / custom watermark
>> * Physically scale (up/down) streaming state
>> * State TTL (especially for non-watermark state)
>>   ** "timeout" in map/flatMapGroupsWithState fits it, but just to check
>> whether we want to have it for normal streaming aggregation
>> * Expose events discarded for being late via a side output or a similar
>> feature
>>   ** this looks tricky to me, since Spark's RDD and SQL semantics provide
>> a single output
>> * more?
>>
>> I would like to hear others' opinions on this. Please also share any
>> ongoing efforts on other structured streaming items. Happy to help out
>> if another hand is needed.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-10816
>> 2. https://issues.apache.org/jira/browse/SPARK-25501
>> 3. https://issues.apache.org/jira/browse/SPARK-16738
>>
>>
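
To illustrate the distinction drawn in the quoted correction, here is a minimal toy sketch in plain Python. It is not Spark code: `ToyStateStore`, `expire_by_watermark`, and `expire_by_ttl` are hypothetical names, and the TTL rule used here (drop a key a fixed time after its last update, regardless of any watermark) is only an assumption about what a State TTL feature might do, since the thread does not specify one.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStateStore:
    """Toy model only; this is NOT Spark's state store or GroupState API."""

    # key -> (value, event-time timeout timestamp, wall-clock time of last update)
    entries: dict = field(default_factory=dict)

    def put(self, key, value, timeout_ts, now):
        # Record the value together with both expiry inputs.
        self.entries[key] = (value, timeout_ts, now)

    def expire_by_watermark(self, watermark):
        # Event-time timeout: a key expires once the watermark has advanced
        # beyond the timeout timestamp set for it.
        expired = sorted(k for k, (_, ts, _) in self.entries.items() if ts < watermark)
        for k in expired:
            del self.entries[k]
        return expired

    def expire_by_ttl(self, now, ttl):
        # Assumed TTL rule: a key expires when it has not been updated for
        # at least `ttl` units of wall-clock time; no watermark is involved.
        expired = sorted(
            k for k, (_, _, updated) in self.entries.items() if now - updated >= ttl
        )
        for k in expired:
            del self.entries[k]
        return expired
```

In this sketch the watermark-driven path only removes state when event time progresses, while the TTL path removes it purely on elapsed time, which is why a TTL would also apply to state that has no watermark.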
