Heads-up: I filed SPARK-39805 to deprecate Trigger.Once in
Spark 3.4.0.

Regarding Adam's use case: although there is no good general solution, it
is fairly easy to cover the specific use case of Trigger.Once by adding a
flag to Trigger.AvailableNow that enforces processing all available data in a
single microbatch. This can behave the same as Trigger.Once when
processing newly available data (watermark advancement happens after
processing all the data), and it can also handle previous uncommitted
batch(es) as well as the no-data batch.
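
To illustrate why the number of microbatches matters for backfill, here is a
minimal sketch in plain Python (not Spark code). It assumes a deliberately
simplified model: the watermark is "max event time seen so far minus the
delay", it only takes effect in the next batch, and any event older than the
current watermark is dropped. The names WatermarkSim and backfill, and the
sample event times, are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WatermarkSim:
    """Toy model of watermark handling across microbatches (assumption, not Spark)."""
    delay: int                 # allowed lateness, in event-time units
    watermark: int = 0         # watermark applied to the batch being processed
    kept: list = field(default_factory=list)
    dropped: list = field(default_factory=list)

    def run_batch(self, event_times):
        # Events older than the current watermark are treated as too late.
        for t in event_times:
            if t < self.watermark:
                self.dropped.append(t)
            else:
                self.kept.append(t)
        # The watermark is recomputed after the batch and applies to the next one.
        if event_times:
            self.watermark = max(self.watermark, max(event_times) - self.delay)


def backfill(batches, delay):
    """Run all batches in order and return the final simulator state."""
    sim = WatermarkSim(delay=delay)
    for batch in batches:
        sim.run_batch(batch)
    return sim


# Backlog whose arrival order does not follow event-time order.
events = [100, 105, 20, 30, 110, 25]

# One microbatch: the watermark only advances after everything is processed.
single = backfill([events], delay=10)

# Three microbatches: the watermark from batch 1 already applies to batch 2.
multi = backfill([events[0:2], events[2:4], events[4:6]], delay=10)

print("single batch dropped:", single.dropped)
print("three batches dropped:", multi.dropped)
```

In this toy model the single-batch run keeps every event, while the
three-batch run drops the old events that happen to arrive after newer ones.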

On Tue, Jul 12, 2022 at 9:43 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> Final reminder. I'll leave this thread open for a couple of days to see
> whether there is further feedback, and go forward if there is no
> outstanding comment.
>
> On Sat, Jul 9, 2022 at 9:54 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> It sounds like none of the approaches perfectly solve the issue of
>> backfill.
>>
>> 1. Trigger.Once: scale issue
>> 2. Trigger.AvailableNow: watermark advancement issue (data getting dropped
>> due to watermark) depending on the order of data
>> 3. Manual batch: state is not built from processing backfill
>>
>> Handling a huge amount of data (backfill) in a single microbatch without
>> advancing the watermark also requires thinking about "backfill-specific"
>> situations - state can grow unexpectedly since there is no way to purge it
>> without watermark advancement. There does not seem to be a good approach
>> that solves all of the issues smoothly. One easier way as of now is to use
>> the RocksDB state store provider to tolerate the huge state size while we
>> prevent the watermark from advancing, but the ideal approach still really
>> depends on the data source and the volume of the data to backfill.
>>
>> Btw, don't worry if you get the feeling that the deprecated API may get
>> removed too soon! Removing the API would require another serious discussion,
>> and the Spark community is generally not in favor of removing existing APIs.
>>
>> On Fri, Jul 8, 2022 at 11:21 PM, Adam Binford <adam...@gmail.com> wrote:
>>
>>> Dang I was hoping it was the second one. In our case the data is too
>>> large to run the whole backfill for the aggregation in a single batch (the
>>> shuffle is too big). We currently resort to manually batching (i.e. not
>>> streaming) the backlog (anything older than the watermark) when we need to
>>> reprocess, because we can't really know for sure our batches are processed
>>> in the correct event time order when starting from scratch.
>>>
>>> I'm not against deprecating Trigger.Once, just wanted to chime in that
>>> someone was using it! I'm itching to upgrade and try out the new stuff.
>>>
>>> Adam
>>>
>>> On Fri, Jul 8, 2022 at 9:16 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Thanks for the input, Adam! Replying inline.
>>>>
>>>> On Fri, Jul 8, 2022 at 8:48 PM Adam Binford <adam...@gmail.com> wrote:
>>>>
>>>>> We use Trigger.Once a lot, usually for backfilling data for new
>>>>> streams. I feel like I could see a continuing use case for "ignore trigger
>>>>> limits for this batch" (ignoring the whole issue with re-running the last
>>>>> failed batch vs a new batch), but we haven't actually been able to upgrade
>>>>> yet and try out Trigger.AvailableNow, so that could end up replacing all
>>>>> our use cases.
>>>>>
>>>>> One question I did have is how it does (or is supposed to) handle
>>>>> watermarking. Is the watermark determined for each batch independently
>>>>> like a normal stream, or is it kept constant for all batches in a single
>>>>> AvailableNow run? For example, we have a stateful job that we need to
>>>>> rerun occasionally, and it takes ~6 batches to backfill all the data
>>>>> before catching up to live data. With a Trigger.Once we know we won't
>>>>> accidentally drop any data due to the watermark when backfilling,
>>>>> because it's a single batch with no watermark yet. Would the same hold
>>>>> true if we backfill with Trigger.AvailableNow instead?
>>>>>
>>>>
>>>> The behavior is the former one. Each batch advances the watermark and
>>>> it's immediately reflected on the next batch.
>>>>
>>>> The number of batches Trigger.AvailableNow will execute depends on the
>>>> data source and the source options. For example, if you use the Kafka data
>>>> source with Trigger.AvailableNow and do not specify any source option that
>>>> limits the batch size, Trigger.AvailableNow will process all newly
>>>> available data as a single microbatch. The run may still consist of more
>>>> than one microbatch - it first handles the batch already logged in the
>>>> WAL, if any, and it handles a no-data batch after all other microbatches
>>>> have run. But I guess these additional batches wouldn't hurt your case.
>>>>
>>>> If the data source doesn't allow processing all available data within a
>>>> single microbatch (depending on the implementation of the default read
>>>> limit), you could probably either 1) set the source options that limit the
>>>> batch size to an unrealistically large value to enforce a single batch, or
>>>> 2) set the watermark delay to an unrealistically large value. Both
>>>> workarounds require you to use different source options/watermark
>>>> configuration for backfill vs. normal runs - I agree it wouldn't be a
>>>> smooth experience.
>>>>
>>>> This proposal does not aim to remove Trigger.Once in the near future. Once
>>>> we deprecate Trigger.Once, we would get reports of use cases where
>>>> Trigger.Once may work better (like your case) over the course of several
>>>> minor releases, and then we can really decide. (IMHO, handling backfill
>>>> with Trigger.Once sounds like a workaround. Backfill may warrant its own
>>>> design.)
>>>>
>>>>
>>>>>
>>>>> Adam
>>>>>
>>>>> On Fri, Jul 8, 2022 at 3:24 AM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Bump to get a chance to expose the proposal to wider audiences.
>>>>>>
>>>>>> Given that there are not many active contributors/maintainers in the
>>>>>> Structured Streaming area, I'd treat the discussion as "lazy consensus"
>>>>>> to avoid getting stuck. I'll give a final reminder early next week, and
>>>>>> move forward if there are no outstanding objections.
>>>>>>
>>>>>> On Wed, Jul 6, 2022 at 8:46 PM Jungtaek Lim <
>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi dev,
>>>>>>>
>>>>>>> I would like to hear opinions about deprecating Trigger.Once and
>>>>>>> promoting Trigger.AvailableNow as a replacement [1] in Structured
>>>>>>> Streaming. (It doesn't mean we remove Trigger.Once now or in the near
>>>>>>> future. That probably requires another discussion at some point.)
>>>>>>>
>>>>>>> Rationale:
>>>>>>>
>>>>>>> The expected behavior of Trigger.Once is to read all data that has
>>>>>>> become available since the last trigger and process it. This holds true
>>>>>>> when the last run was gracefully terminated, but there are cases where
>>>>>>> streaming queries are not terminated gracefully. The last run may have
>>>>>>> written the offset for a new batch before termination, in which case a
>>>>>>> new run of Trigger.Once only processes the data planned for that latest
>>>>>>> unfinished batch and doesn't process new data.
>>>>>>>
>>>>>>> The behavior is not deterministic from the users' point of view, as
>>>>>>> end users wouldn't know whether the last run wrote the offset or not,
>>>>>>> unless they look into the query's checkpoint by themselves.
>>>>>>>
>>>>>>> While Trigger.AvailableNow was introduced to solve the scalability
>>>>>>> issue of Trigger.Once, it also ensures that it tries to process all
>>>>>>> data available at the point in time it is triggered, which consistently
>>>>>>> matches the expected behavior of Trigger.Once.
>>>>>>>
>>>>>>> Another issue with Trigger.Once is that it does not trigger a no-data
>>>>>>> batch immediately. When the watermark is calculated in batch N, it takes
>>>>>>> effect in batch N + 1. If the query is scheduled to run once per day,
>>>>>>> you only see the output from the new watermark in the next day's run.
>>>>>>> Trigger.AvailableNow, by contrast, also handles the no-data batch
>>>>>>> before the query terminates.
>>>>>>>
>>>>>>> Please review and let us know if you have any feedback or concerns
>>>>>>> on the proposal.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Jungtaek Lim
>>>>>>>
>>>>>>> 1. https://issues.apache.org/jira/browse/SPARK-36533
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Adam Binford
>>>>>
>>>>
>>>
>>> --
>>> Adam Binford
>>>
>>
