Oh boy, this is an interesting pickle.

For *last-access-timestamp*, I think only *event-time-of-current-record* makes
sense. I’m looking at this from a GDPR/regulatory compliance perspective. If
you update state, say by storing the event you just received in state, you
want to use the exact timestamp of that event to determine expiration. Both
*max-timestamp-of-data-seen-so-far* and *last-watermark* suffer from problems
in edge cases: if the timestamp of an event you receive is quite a bit earlier
than other timestamps we have seen so far (i.e. the event is late), we would
artificially lengthen the TTL of that event (which is stored in state) and
would therefore break regulatory requirements. Always using the timestamp of
the event itself doesn’t suffer from that problem.
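
To make the edge case concrete, here is a tiny sketch (all values and names
made up for illustration) of how the two strategies diverge for a late event:

    import java.time.Duration;
    import java.time.Instant;

    // TTL of 7 days; a late event stamped Jan 1 arrives while the max
    // event timestamp seen so far is Jan 5.
    public class LateEventTtl {
        public static void main(String[] args) {
            long ttl       = Duration.ofDays(7).toMillis();
            long eventTs   = Instant.parse("2019-01-01T00:00:00Z").toEpochMilli();
            long maxSeenTs = Instant.parse("2019-01-05T00:00:00Z").toEpochMilli();

            // event-time-of-current-record: expires Jan 8, exactly 7 days
            // after the data was created
            System.out.println(Instant.ofEpochMilli(eventTs + ttl));

            // max-timestamp-of-data-seen-so-far: expires Jan 12, keeping
            // the Jan 1 record 4 days longer than its own timestamp allows
            System.out.println(Instant.ofEpochMilli(
                Math.max(eventTs, maxSeenTs) + ttl));
        }
    }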

For *expiration-check-time*, both *last-watermark* and
*current-processing-time* could make sense, but I’m leaning towards
*current-processing-time*. The reason is again the GDPR/compliance view: if we
have an old savepoint with data that should have expired by now but we
re-process it with *last-watermark* expiration, we will get to “see” that
state even though we shouldn’t be allowed to. If we use
*current-processing-time* for expiration, we wouldn’t have that problem,
because that old data (old according to its event-time timestamp) would be
properly cleaned up and access would be prevented.

To sum up:
last-access-timestamp: event-time-of-current-record
expiration-check-time: current-processing-time
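
In a hypothetical extension of the existing StateTtlConfig builder, that
combination could look like this (both setters and enums are invented for
illustration; today’s config is processing-time only):

    // hypothetical API sketch -- none of these setters exist today
    StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.days(30))
        // stamp state with the event time of the record that updates it
        .setLastAccessTimestamp(LastAccessTimestamp.EVENT_TIME_OF_RECORD)
        // compare against wall-clock time on access/cleanup
        .setExpirationCheckTime(ExpirationCheckTime.PROCESSING_TIME)
        .build();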
 
What do you think?

Aljoscha

> On 6. Apr 2019, at 01:30, Konstantin Knauf <konstan...@ververica.com> wrote:
> 
> Hi Andrey,
> 
> I agree with Elias. This would be the most natural behavior. I wouldn't add
> additional, slightly different notions of time to Flink.
> 
> As I can also see a use case for the combination
> 
> * Timestamp stored: Event timestamp
> * Timestamp to check expiration: Processing Time
> 
> we could (maybe in a second step) add the possibility to mix and match time
> characteristics for both aspects.
> 
> Cheers,
> 
> Konstantin
> 
> On Thu, Apr 4, 2019 at 7:59 PM Elias Levy <fearsome.lucid...@gmail.com>
> wrote:
> 
>> My 2c:
>> 
>> Timestamp stored with the state value: Event timestamp
>> Timestamp used to check expiration: Last emitted watermark
>> 
>> That follows the event time processing model used elsewhere in Flink.
>> E.g. events are segregated into windows based on their event time, but the
>> windows do not fire until the watermark advances past the end of the window.
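>> 
>> This is roughly what one can already hand-roll today with a
>> KeyedProcessFunction and an event-time timer (a sketch, assuming a
>> hypothetical MyEvent type and a fixed 7-day TTL; the built-in TTL does
>> none of this yet):
>> 
>>     import java.time.Duration;
>>     import org.apache.flink.api.common.state.ValueState;
>>     import org.apache.flink.api.common.state.ValueStateDescriptor;
>>     import org.apache.flink.configuration.Configuration;
>>     import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
>>     import org.apache.flink.util.Collector;
>> 
>>     // Stores the last event per key and clears it once the watermark
>>     // passes the event's own timestamp plus the TTL.
>>     public class EventTimeTtlFunction
>>             extends KeyedProcessFunction<String, MyEvent, MyEvent> {
>> 
>>         private static final long TTL_MILLIS = Duration.ofDays(7).toMillis();
>>         private transient ValueState<MyEvent> lastEvent;
>> 
>>         @Override
>>         public void open(Configuration parameters) {
>>             lastEvent = getRuntimeContext().getState(
>>                 new ValueStateDescriptor<>("lastEvent", MyEvent.class));
>>         }
>> 
>>         @Override
>>         public void processElement(MyEvent event, Context ctx,
>>                 Collector<MyEvent> out) throws Exception {
>>             lastEvent.update(event);
>>             // assumes records carry event-time timestamps; expiry is
>>             // relative to the record's own timestamp
>>             ctx.timerService().registerEventTimeTimer(
>>                 ctx.timestamp() + TTL_MILLIS);
>>             out.collect(event);
>>         }
>> 
>>         @Override
>>         public void onTimer(long timestamp, OnTimerContext ctx,
>>                 Collector<MyEvent> out) {
>>             // a newer event may have registered a later timer; a real
>>             // implementation would compare against the stored timestamp
>>             // before clearing
>>             lastEvent.clear();
>>         }
>>     }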
>> 
>> 
>> On Thu, Apr 4, 2019 at 7:55 AM Andrey Zagrebin <and...@ververica.com>
>> wrote:
>> 
>>> Hi All,
>>> 
>>> As you might have already seen, there is an effort tracked in FLINK-12005
>>> [1] to support the event time scale for state with time-to-live (TTL) [2].
>>> While thinking about the design, we realised that there can be multiple
>>> options for the semantics of this feature, depending on the use case.
>>> There is also sometimes confusion because of the out-of-order nature of
>>> event time in Flink. I am starting this thread to discuss potential use
>>> cases of this feature and their requirements for interested users and
>>> developers. There was already a discussion thread asking about event time
>>> for TTL, and it contains some thoughts [3].
>>> 
>>> There are two semantic cases where we use time for the TTL feature at the
>>> moment. Firstly, we store the timestamp of the state's last access/update.
>>> Secondly, we use this timestamp and the current timestamp to check
>>> expiration and garbage-collect state at some point later.
>>> 
>>> At the moment, Flink supports *only processing time* for both timestamps:
>>> the state's *last access and the current timestamp*. Both are basically
>>> the current local system Unix epoch time.
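>>> 
>>> For reference, this is roughly how the current processing-time-only TTL
>>> is configured with the existing StateTtlConfig API:
>>> 
>>>     import org.apache.flink.api.common.state.StateTtlConfig;
>>>     import org.apache.flink.api.common.state.ValueStateDescriptor;
>>>     import org.apache.flink.api.common.time.Time;
>>> 
>>>     StateTtlConfig ttlConfig = StateTtlConfig
>>>         .newBuilder(Time.days(30))
>>>         // refresh the stored timestamp on creation and writes
>>>         .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
>>>         // never hand out values that are past their TTL
>>>         .setStateVisibility(
>>>             StateTtlConfig.StateVisibility.NeverReturnExpired)
>>>         .build();
>>> 
>>>     ValueStateDescriptor<String> descriptor =
>>>         new ValueStateDescriptor<>("lastValue", String.class);
>>>     descriptor.enableTimeToLive(ttlConfig);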
>>> 
>>> When it comes to the event time scale, we also need to define what Flink
>>> should use for these two timestamps. Here I will list some options and
>>> their possible pros & cons for discussion. There might be more, depending
>>> on the use case.
>>> 
>>> *Last access timestamp (stored in backend with the actual state value):*
>>> 
>>>   - *Event timestamp of the record currently being processed.* This
>>>   seems to be the simplest option, and it allows user-defined timestamps
>>>   in the state backend. The problem here is the instability of event
>>>   time, which can not only increase but also decrease when records arrive
>>>   out of order. This can lead to rewriting the state timestamp to a
>>>   smaller value, which is unnatural for the notion of time.
>>>   - *Max event timestamp of records seen so far for this record key.*
>>>   This option is similar to the previous one, but it tries to fix the
>>>   notion of time so that it always increases. Maintaining this timestamp
>>>   also has performance implications, because the previous timestamp needs
>>>   to be read out to decide whether to overwrite it.
>>>   - *Last emitted watermark.* This is what we usually use to trigger
>>>   actions in Flink, like timers and windows, but it can be unrelated to
>>>   the record which actually triggers the state update.
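>>> 
>>> As a sketch, these three options boil down to different update functions
>>> for the stored last-access timestamp (method names are mine, for
>>> illustration only):
>>> 
>>>     // 1. event timestamp of the current record: may move backwards
>>>     //    for out-of-order records
>>>     long fromCurrentRecord(long storedTs, long recordTs) {
>>>         return recordTs;
>>>     }
>>> 
>>>     // 2. max event timestamp seen so far for this key: monotonic, but
>>>     //    requires reading the previous value on every update
>>>     long fromMaxSeen(long storedTs, long recordTs) {
>>>         return Math.max(storedTs, recordTs);
>>>     }
>>> 
>>>     // 3. last emitted watermark: monotonic, but unrelated to the
>>>     //    record that triggered the update
>>>     long fromWatermark(long storedTs, long currentWatermark) {
>>>         return currentWatermark;
>>>     }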
>>> 
>>> *Current timestamp to check expiration:*
>>> 
>>>   - *Event timestamp of the last processed record.* Again quite simple,
>>>   but an unpredictable option for out-of-order events. It can potentially
>>>   lead to undesirable expiration of late data buffered in state, without
>>>   any control.
>>>   - *Max event timestamp of records seen so far for the operator
>>>   backend.* Again similar to the previous one, more stable, but the user
>>>   still does not have much control over when state expires.
>>>   - *Last emitted watermark.* Again, this is what we usually use to
>>>   trigger actions in Flink, like timers and windows. It also gives the
>>>   user some control over when state expires (up to which point in event
>>>   time) by emitting a certain watermark. It is more flexible, but also
>>>   more complicated. If some watermark emitting strategy is already in use
>>>   for other operations, it might not be optimal for TTL and could delay
>>>   state cleanup.
>>>   - *Current processing time.* This option is quite simple: the user
>>>   decides which timestamp to store, but it expires in real time. For the
>>>   data privacy use case this might even be better, because we want state
>>>   to become unavailable at a particular real moment in time after the
>>>   associated piece of data was created in event time. For long-term
>>>   approximate garbage collection it is not a problem either. For quick
>>>   expiration, though, the skew between event and processing time can
>>>   again lead to premature deletion of late data, and the user cannot
>>>   delay it.
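>>> 
>>> The expiration check itself is the same in all of these cases; only the
>>> notion of "now" differs (sketch, names are mine):
>>> 
>>>     // expired if "now" has advanced at least ttlMillis past the
>>>     // stored last-access timestamp
>>>     boolean isExpired(long lastAccessTs, long now, long ttlMillis) {
>>>         return now - lastAccessTs >= ttlMillis;
>>>     }
>>> 
>>>     // now = event timestamp of the last processed record
>>>     // now = max event timestamp seen so far for the operator
>>>     // now = last emitted watermark
>>>     // now = System.currentTimeMillis() (processing time)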
>>> 
>>> We could also make this behaviour configurable. Another option is to make
>>> the time provider pluggable for users. The interface could give users
>>> context (the currently processed record, the watermark, etc.) and ask
>>> them which timestamp to use. This is more complicated, though.
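>>> 
>>> One possible shape for such a pluggable provider (entirely hypothetical,
>>> nothing like this exists yet):
>>> 
>>>     /** Decides which timestamps the TTL mechanism uses. */
>>>     public interface TtlTimeProvider<T> {
>>> 
>>>         /**
>>>          * Timestamp to store as last access when the given record
>>>          * updates state.
>>>          */
>>>         long lastAccessTimestamp(
>>>             T record, long recordTimestamp, long currentWatermark);
>>> 
>>>         /**
>>>          * The "current" timestamp to compare against stored timestamps
>>>          * when checking expiration.
>>>          */
>>>         long expirationCheckTimestamp(
>>>             long currentWatermark, long processingTime);
>>>     }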
>>> 
>>> Looking forward to your feedback.
>>> 
>>> Best,
>>> Andrey
>>> 
>>> [1] https://issues.apache.org/jira/browse/FLINK-12005
>>> [2]
>>> 
>>> https://docs.google.com/document/d/1SI_WoXAfOd4_NKpGyk4yh3mf59g12pSGNXRtNFi-tgM
>>> [3]
>>> 
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/State-TTL-in-Flink-1-6-0-td22509.html
>>> 
>> 
> 
