Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-11-06 Thread xiongqi wu
Dong, Thanks for the comments. I have updated the KIP based on your comments. below is reply to your questions: 1. We only calculate this metric for log compaction that is determined by max compaction lag. So we only collect non-negative metrics. The log cleaner is consistently running with

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-11-06 Thread Dong Lin
Hey Xiongqi, Thanks for the update. A few more comments below 1) According to the definition of kafka.log:type=LogCleaner,name=max-compaction-delay, it seems that the metric value will be a large negative number if max.compaction.lag.ms is MAX_LONG. Would this be a problem? Also, it seems weird

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-10-29 Thread xiongqi wu
Hi Dong, I have updated the KIP to address your comments. One correction to previous Email: after offline discussion with Dong, we decide to use MAX_LONG as default value for max.compaction.lag.ms. Xiongqi (Wesley) Wu On Mon, Oct 29, 2018 at 12:15 PM xiongqi wu wrote: > Hi Dong, > > Thank

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-10-29 Thread xiongqi wu
Hi Dong, Thank you for your comment. See my inline comments. I will update the KIP shortly. Xiongqi (Wesley) Wu On Sun, Oct 28, 2018 at 9:17 PM Dong Lin wrote: > Hey Xiongqi, > > Sorry for late reply. I have some comments below: > > 1) As discussed earlier in the email list, if the topic is

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-10-28 Thread Dong Lin
Hey Xiongqi, Sorry for late reply. I have some comments below: 1) As discussed earlier in the email list, if the topic is configured with both deletion and compaction, in some cases messages produced a long time ago can not be deleted based on time. This is a valid use-case because we actually

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-10-16 Thread xiongqi wu
Mayuresh, Thanks for the comments. The requirement is that we need to pick up segments that are older than maxCompactionLagMs for compaction. maxCompactionLagMs is an upper-bound, which implies that picking up segments for compaction earlier doesn't violated the policy. We use the creation time

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-10-15 Thread Mayuresh Gharat
Hi Wesley, Thanks for the KIP and sorry for being late to the party. I wanted to understand, the scenario you mentioned in Proposed changes : - > > Estimate the earliest message timestamp of an un-compacted log segment. we > only need to estimate earliest message timestamp for un-compacted log

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-09-03 Thread Brett Rann
Might also be worth moving to a vote thread? Discussion seems to have gone as far as it can. > On 4 Sep 2018, at 12:08, xiongqi wu wrote: > > Brett, > > Yes, I will post PR tomorrow. > > Xiongqi (Wesley) Wu > > > On Sun, Sep 2, 2018 at 6:28 PM Brett Rann wrote: > > > +1 (non-binding)

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-09-03 Thread xiongqi wu
Brett, Yes, I will post PR tomorrow. Xiongqi (Wesley) Wu On Sun, Sep 2, 2018 at 6:28 PM Brett Rann wrote: > +1 (non-binding) from me on the interface. I'd like to see someone familiar > with > the code comment on the approach, and note there's a couple of different > approaches: what's

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-09-02 Thread Brett Rann
+1 (non-binding) from me on the interface. I'd like to see someone familiar with the code comment on the approach, and note there's a couple of different approaches: what's documented in the KIP, and what Xiaohe Dong was working on here:

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-27 Thread xiongqi wu
Hi All, Do you have any additional comments on this KIP? On Thu, Aug 16, 2018 at 9:17 PM, xiongqi wu wrote: > on 2) > The offsetmap is built starting from dirty segment. > The compaction starts from the beginning of the log partition. That's how > it ensure the deletion of tomb keys. > I

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-16 Thread xiongqi wu
on 2) The offsetmap is built starting from dirty segment. The compaction starts from the beginning of the log partition. That's how it ensure the deletion of tomb keys. I will double check tomorrow. Xiongqi (Wesley) Wu On Thu, Aug 16, 2018 at 6:46 PM Brett Rann wrote: > To just clarify a bit

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-16 Thread Brett Rann
To just clarify a bit on 1. whether there's an external storage/DB isn't relevant here. Compacted topics allow a tombstone record to be sent (a null value for a key) which currently will result in old values for that key being deleted if some conditions are met. There are existing controls to

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-16 Thread xiongqi wu
1, Owner of data (in this sense, kafka is the not the owner of data) should keep track of lifecycle of the data in some external storage/DB. The owner determines when to delete the data and send the delete request to kafka. Kafka doesn't know about the content of data but to provide a mean for

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-16 Thread Dong Lin
Hey Xiongqi, Thanks for the update. I have two questions for the latest KIP. 1) The motivation section says that one use case is to delete PII (Personal Identifiable information) data within 7 days while keeping non-PII indefinitely in compacted format. I suppose the use-case depends on the

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-16 Thread xiongqi wu
Hi Xiaohe, Quick note: 1) Use minimum of segment.ms and max.compaction.lag.ms 2) I am not sure if I get your second question. first, we have jitter when we roll the active segment. second, on each compaction, we compact upto the offsetmap could allow. Those will not

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-16 Thread xiongqi wu
Hi Brett, In my opinion, we don't need very accurate estimation of first record timestamp because the compaction processing itself might take some time. There is no strictly guaranteed that after the time interval "T", the compaction is finished. Rather than, after time "T" + some compaction job

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-16 Thread XIAOHE DONG
Hi Xiongqi Thanks for thinking about implementing this as well. :) I was thinking about using `segment.ms` to trigger the segment roll. Also, its value can be the largest time bias for the record deletion. For example, if the `segment.ms` is 1 day and `max.compaction.ms` is 30 days, the

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-15 Thread Brett Rann
> (segment.largestTimestamp is lastModified time of the log segment or max timestamp we see for the log segment. Due to the lack of record timestamp, segment.largestTimestamp might be earlier than the actual timestamp of latest record of that segment.). I'm curious about the mention of last

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-15 Thread Brett Rann
An API was suggested by Gwen and James when I discussed it with them. For me I can think of it as a use case for scheduling compaction rather than relying on an config based time trigger. We're looking at creating some potentially very large compacted topics for event sourcing and from an

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-15 Thread xiongqi wu
Brett, Thank you for your comments. I was thinking since we already has immediate compaction setting by setting min dirty ratio to 0, so I decide to use "0" as disabled state. I am ok to go with -1(disable), 0 (immediate) options. For the implementation, there are a few differences between mine

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-15 Thread Brett Rann
Eno, For us as well the requirement is around compacted topics because they are the topics that already facilitate selective deletes. Currently they allow specifying a minimum life time, but lacks the ability to specify a maximum life time. For non compacted topics there's no ability to delete

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-15 Thread Brett Rann
We've been looking into this too. Mailing list: https://lists.apache.org/thread.html/ed7f6a6589f94e8c2a705553f364ef599cb6915e4c3ba9b561e610e4@%3Cdev.kafka.apache.org%3E jira wish: https://issues.apache.org/jira/browse/KAFKA-7137 confluent slack discussion:

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-15 Thread xiongqi wu
Eno, Dong, I have updated the KIP. We decide not to address the issue that we might have for both compaction and time retention enabled topics (see the rejected alternative item 2). This KIP will only ensure log can be compacted after a specified time-interval. As suggested by Dong, we will

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-14 Thread xiongqi wu
Per discussion with Dong, he made a very good point that if compaction and time based retention are both enabled on a topic, the compaction might prevent records from being deleted on time. The reason is when compacting multiple segments into one single segment, the newly created segment will

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-14 Thread xiongqi wu
Dong, Thanks for the comment. There are two retention policy: log compaction and time based retention. Log compaction: we have use cases to keep infinite retention of a topic (only compaction). GDPR cares about deletion of PII (personal identifiable information) data. Since Kafka doesn't know

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-14 Thread Eno Thereska
Adding to this, what about topics that are not log compacted? As Dong says, "one of the GDPR requirement is that we can not keep messages longer than e.g. 30 days in storage (e.g. Kafka)". The GDPR requirement must hold irrespective of the low level details, on whether the topic is compacted or

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-13 Thread Dong Lin
Hey Xiongqi, Thanks for the KIP. I have two questions regarding the use-case for meeting GDPR requirement. 1) If I recall correctly, one of the GDPR requirement is that we can not keep messages longer than e.g. 30 days in storage (e.g. Kafka). Say there exists a partition p0 which contains

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-13 Thread xiongqi wu
Hi Eno, The GDPR request we are getting here at linkedin is if we get a request to delete a record through a null key on a log compacted topic, we want to delete the record via compaction in a given time period like 2 days (whatever is required by the policy). There might be other issues (such

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-13 Thread xiongqi wu
Yes. we want to enforce a max time interval from a message arrival time to the time the corresponding log segment needs to be compacted. Today, if the message arriving rate is low for a log compacted topic, the dirty ratio increases very slowly. As a result, a log segment might be un-compacted

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-13 Thread Eno Thereska
Hello, Thanks for the KIP. I'd like to see a more precise definition of what part of GDPR you are targeting as well as some sort of verification that this KIP actually addresses the problem. Right now I find this a bit vague: "Ability to delete a log message through compaction in a timely manner

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-13 Thread Guozhang Wang
Guess I need to carefully read the wiki page before asking :) Thanks! Another qq after reading the proposal: is it a complimentary to KIP-58 ( https://cwiki.apache.org/confluence/display/KAFKA/KIP-58+-+Make+Log+Compaction+Point+Configurable), just that KIP-58 is a "upper-bound" on what messages

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-13 Thread xiongqi wu
HI Guozhang, As I mentioned in the motivation section, KIP-280 focuses on how to compact the log segment to resolve the out of order messages compaction issue. The issue we try to address in this KIP is different: we want to introduce a compaction policy so that a log segment can be pickup for

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-13 Thread Guozhang Wang
Hello Xiongqi, I think this KIP is already been covered in KIP-280? Could you check out that one and see if it is the case. Guozhang On Mon, Aug 13, 2018 at 1:23 PM, xiongqi wu wrote: > Hi Kafka, > > Just updated the confluence page to include the link to this KIP. > > Any comment will be

Re: [DISCUSS] KIP-354 Time-based log compaction policy

2018-08-13 Thread xiongqi wu
Hi Kafka, Just updated the confluence page to include the link to this KIP. Any comment will be appreciated: https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Time-based+log+compaction+policy Thank you. Xiongqi (Wesley) Wu On Thu, Aug 9, 2018 at 4:18 PM, xiongqi wu wrote: > Hi