I think Spark is trying to ensure that it reads the input "continuously"
without any missing. Technically it may be valid to say the situation is a
kind of "data-loss", as the query couldn't process the offsets which are
being thrown out, and owner of the query needs to be careful as it affects
the result.

If your streaming query keeps up with input rate then it's pretty rare for
the query to go under retention. Even it lags a bit, it'd be safe if
retention is set to enough period. The ideal state would be ensuring your
query to process all offsets before they are thrown out by retention (don't
leave the query lagging behind - either increasing processing power or
increasing retention duration, though most probably you'll need to do
former), but if you can't make sure and if you understand the risk then yes
you can turn off the option and take the risk.


On Wed, Apr 15, 2020 at 9:24 AM Ruijing Li <liruijin...@gmail.com> wrote:

> I see, I wasn’t sure if that would work as expected. The docs seems to
> suggest to be careful before turning off that option, and I’m not sure why
> failOnDataLoss is true by default.
>
> On Tue, Apr 14, 2020 at 5:16 PM Burak Yavuz <brk...@gmail.com> wrote:
>
>> Just set `failOnDataLoss=false` as an option in readStream?
>>
>> On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li <liruijin...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have a spark structured streaming app that is consuming from a kafka
>>> topic with retention set up. Sometimes I face an issue where my query has
>>> not finished processing a message but the retention kicks in and deletes
>>> the offset, which since I use the default setting of “failOnDataLoss=true”
>>> causes my query to fail. The solution I currently have is manual, deleting
>>> the offsets directory and rerunning.
>>>
>>> I instead like to have spark automatically fall back to the earliest
>>> offset available. The solutions I saw recommend setting auto.offset =
>>> earliest, but for structured streaming, you cannot set that. How do I do
>>> this for structured streaming?
>>>
>>> Thanks!
>>> --
>>> Cheers,
>>> Ruijing Li
>>>
>> --
> Cheers,
> Ruijing Li
>

Reply via email to