It sounds like none of the approaches perfectly solve the issue of backfill.

1. Trigger.Once: scale issue
2. Trigger.AvailbleNow: watermark advancement issue (data getting dropped
due to watermark) depending on the order of data
3. Manual batch: state is not built from processing backfill

Handling a huge data (backfill) with a single microbatch without advancing
the watermark also requires thinking of "backfill-specific" situations -
state can grow unexpectedly since there is no way to purge without
watermark advancement. There seems to be not really a good approach to
solve all of the issues smoothly. One easier way as of now is to use
RocksDB state store provider to tolerate the huge size of state while we
enforce to not advance watermark, but the ideal approach still really
depends on the data source and the volume of the data to backfill.

Btw, don't worry if you get a feeling the deprecated API may get removed
too soon! Removing the API would require another serious discussion and
Spark community is generally not in favor of removing existing API.

2022년 7월 8일 (금) 오후 11:21, Adam Binford <adam...@gmail.com>님이 작성:

Dang I was hoping it was the second one. In our case the data is too large
> to run the whole backfill for the aggregation in a single batch (the
> shuffle is too big). We currently resort to manually batching (i.e. not
> streaming) the backlog (anything older than the watermark) when we need to
> reprocess, because we can't really know for sure our batches are processed
> in the correct event time order when starting from scratch.
>
> I'm not against deprecating Trigger.Once, just wanted to chime in that
> someone was using it! I'm itching to upgrade and try out the new stuff.
>
> Adam
>
> On Fri, Jul 8, 2022 at 9:16 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> Thanks for the input, Adam! Replying inline.
>>
>> On Fri, Jul 8, 2022 at 8:48 PM Adam Binford <adam...@gmail.com> wrote:
>>
>>> We use Trigger.Once a lot, usually for backfilling data for new streams.
>>> I feel like I could see a continuing use case for "ignore trigger limits
>>> for this batch" (ignoring the whole issue with re-running the last failed
>>> batch vs a new batch), but we haven't actually been able to upgrade yet and
>>> try out Trigger.AvailableNow, so that could end up replacing all our use
>>> cases.
>>>
>>> One question I did have is how it does (or is supposed to) handle
>>> watermarking. Is the watermark determined for each batch independently like
>>> a normal stream, or is it kept constant for all batches in a single
>>> AvailableNow run? For example, we have a stateful job that we need to rerun
>>> occasionally, and it takes ~6 batches to backfill all the data before
>>> catching up to live data. With a Trigger.Once we know we won't accidentally
>>> drop any data due to the watermark when backfilling, because it's a single
>>> batch with no watermark yet. Would the same hold true if we backfill with
>>> Trigger.AvailableNow instead?
>>>
>>
>> The behavior is the former one. Each batch advances the watermark and
>> it's immediately reflected on the next batch.
>>
>> The number of batches Trigger.AvailableNow will execute depends on the
>> data source and the source option. For example, if you use Kafka data
>> source and use Trigger.AvailableNow without specifying any source option on
>> limiting the size, Trigger.AvailableNow will process all newly available
>> data as a single microbatch. It may not be still a single microbatch - it
>> would also handle the batch already logged in WAL first if any, as well as
>> handle no-data batch after the run of all microbatches. But I guess these
>> additional batches wouldn't hurt your case.
>>
>> If the data source doesn't allow processing all available data within a
>> single microbatch (depending on the implementation of default read limit),
>> you could probably either 1) set source options regarding to limit size as
>> an unrealistic one to enforce a single batch or 2) set the delay of
>> watermark as an unrealistic one. Both of the workarounds require you to use
>> different source options/watermark configuration for backfill vs normal run
>> - I agree it wouldn’t be a smooth one.
>>
>> This proposal does not aim to remove Trigger.Once in near future. As long
>> as we deprecate Trigger.Once, we would get some reports for use cases
>> Trigger.Once may work better (like your case) for the time period across
>> several minor releases, and then we can really decide. (IMHO, handling
>> backfill with Trigger.Once sounds to me as a workaround. Backfill may
>> warrant its own design to deal with.)
>>
>>
>>>
>>> Adam
>>>
>>> On Fri, Jul 8, 2022 at 3:24 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Bump to get a chance to expose the proposal to wider audiences.
>>>>
>>>> Given that there are not many active contributors/maintainers in area
>>>> Structured Streaming, I'd consider the discussion as "lazy consensus" to
>>>> avoid being stuck. I'll give a final reminder early next week, and move
>>>> forward if there are no outstanding objections.
>>>>
>>>> On Wed, Jul 6, 2022 at 8:46 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Hi dev,
>>>>>
>>>>> I would like to hear voices about deprecating Trigger.Once, and
>>>>> promoting Trigger.AvailableNow as a replacement [1] in Structured 
>>>>> Streaming.
>>>>> (It doesn't mean we remove Trigger.Once now or near future. It
>>>>> probably requires another discussion at some time.)
>>>>>
>>>>> Rationalization:
>>>>>
>>>>> The expected behavior of Trigger.Once is like reading all available
>>>>> data after the last trigger and processing them. This holds true when the
>>>>> last run was gracefully terminated, but there are cases streaming queries
>>>>> to not be terminated gracefully. There is a possibility the last run may
>>>>> write the offset for the new batch before termination, then a new run of
>>>>> Trigger.Once only processes the data which was built in the latest
>>>>> unfinished batch and doesn't process new data.
>>>>>
>>>>> The behavior is not deterministic from the users' point of view, as
>>>>> end users wouldn't know whether the last run wrote the offset or not,
>>>>> unless they look into the query's checkpoint by themselves.
>>>>>
>>>>> While Trigger.AvailableNow came to solve the scalability issue on
>>>>> Trigger.Once, it also ensures that it tries to process all available data
>>>>> at the point of time it is triggered, which consistently works as expected
>>>>> behavior of Trigger.Once.
>>>>>
>>>>> Another issue on Trigger.Once is that it does not trigger a no-data
>>>>> batch immediately. When the watermark is calculated in batch N, it takes
>>>>> effect in batch N + 1. If the query is scheduled to be run per day, you 
>>>>> can
>>>>> see the output from the new watermark in the query run the next day. 
>>>>> Thanks
>>>>> to the behavior of Trigger.AvailableNow, it handles no-data batch as well
>>>>> before termination of the query.
>>>>>
>>>>> Please review and let us know if you have any feedback or concerns on
>>>>> the proposal.
>>>>>
>>>>> Thanks!
>>>>> Jungtaek Lim
>>>>>
>>>>> 1. https://issues.apache.org/jira/browse/SPARK-36533
>>>>>
>>>>
>>>
>>> --
>>> Adam Binford
>>>
>>
>
> --
> Adam Binford
>

Reply via email to