Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-18 Thread Ismael Juma
On Thu, Aug 2, 2018 at 9:55 AM Colin McCabe wrote: > On Wed, Aug 1, 2018, at 11:35, James Cheng wrote: > > I’m a little confused about something. Is this KIP focused on log > > cleaner exceptions in general, or focused on log cleaner exceptions due > > to disk failures? > > > > Will

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-17 Thread Jason Gustafson
> > The issue with the __consumer_offsets topic is problematic, that is true. > Nevertheless, I have some concerns with having a certain threshold of > `uncleanable.bytes`. There is now a chance that a single error in a big > partition (other than __consumer_offsets) marks the directory as offline

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-15 Thread Stanislav Kozlovski
Hi Jason, I was thinking about your suggestion. I agree that it makes sense to cap it at a certain threshold and it doesn't sound *too* restrictive to me either, considering the common case. The issue with the __consumer_offsets topic is problematic, that is true. Nevertheless, I have some

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-14 Thread Jason Gustafson
Sorry for the noise. Let me try again: My initial suggestion was to *track *the uncleanable disk space. > I can see why marking a log directory as offline after a certain threshold > of uncleanable disk space is more useful. > I'm not sure if we can set that threshold to be of certain size (e.g

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-14 Thread Jason Gustafson
Hey Stanislav, responses below: My initial suggestion was to *track *the uncleanable disk space. > I can see why marking a log directory as offline after a certain threshold > of uncleanable disk space is more useful. > I'm not sure if we can set that threshold to be of certain size (e.g 100GB) >

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-10 Thread Stanislav Kozlovski
Hey Jason, My initial suggestion was to *track *the uncleanable disk space. I can see why marking a log directory as offline after a certain threshold of uncleanable disk space is more useful. I'm not sure if we can set that threshold to be of certain size (e.g 100GB) as log directories might

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-09 Thread Jason Gustafson
Hey Stanislav, Sorry, I was probably looking at an older version (I had the tab open for so long!). I have been thinking about `max.uncleanable.partitions` and wondering if it's what we really want. The main risk if the cleaner cannot clean a partition is eventually running out of disk space.

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-09 Thread Stanislav Kozlovski
Hey Jason, 1. *10* is the default value, it says so in the KIP 2. This is a good catch. As the current implementation stands, it's not a useful metric since the thread continues to run even if all log directories are offline (although I'm not sure what the broker's behavior is in that scenario).

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-07 Thread Jason Gustafson
Hi Stanislav, Just a couple quick questions: 1. I may have missed it, but what will be the default value for `max.uncleanable.partitions`? 2. It seems there will be some impact for users that monitoring "time-since-last-run-ms" in order to detect cleaner failures. Not sure it's a major concern,

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-06 Thread Ray Chiang
I'm okay with that. -Ray On 8/6/18 10:59 AM, Colin McCabe wrote: Perhaps we could start with max.uncleanable.partitions and then implement max.uncleanable.partitions.per.logdir in a follow-up change if it seemed to be necessary? What do you think? regards, Colin On Sat, Aug 4, 2018, at

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-06 Thread Colin McCabe
Perhaps we could start with max.uncleanable.partitions and then implement max.uncleanable.partitions.per.logdir in a follow-up change if it seemed to be necessary? What do you think? regards, Colin On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote: > Hey Ray, > > Thanks for the

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-04 Thread Stanislav Kozlovski
Hey Ray, Thanks for the explanation. In regards to the configuration property - I'm not sure. As long as it has sufficient documentation, I find "max.uncleanable.partitions" to be okay. If we were to add the distinction explicitly, maybe it should be `max.uncleanable.partitions.per.logdir` ? On

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-02 Thread Ray Chiang
One more thing occurred to me.  Should the configuration property be named "max.uncleanable.partitions.per.disk" instead? -Ray On 8/1/18 9:11 AM, Stanislav Kozlovski wrote: Yes, good catch. Thank you, James! Best, Stanislav On Wed, Aug 1, 2018 at 5:05 PM James Cheng wrote: Can you

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-02 Thread Ray Chiang
I see this as a fix for the LogCleaner.  Uncaught exceptions kill the CleanerThread and this is viewed as undesired behavior.  Some other ways to think of this fix: 1) If you have occasional corruption in some log segments, then with each broker restart, the LogCleaner will lose its state,

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-02 Thread Colin McCabe
On Wed, Aug 1, 2018, at 11:35, James Cheng wrote: > I’m a little confused about something. Is this KIP focused on log > cleaner exceptions in general, or focused on log cleaner exceptions due > to disk failures? > > Will max.uncleanable.partitions apply to all exceptions (including log >

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-01 Thread James Cheng
I’m a little confused about something. Is this KIP focused on log cleaner exceptions in general, or focused on log cleaner exceptions due to disk failures? Will max.uncleanable.partitions apply to all exceptions (including log cleaner logic errors) or will it apply to only disk I/o exceptions?

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-01 Thread Stanislav Kozlovski
Yes, good catch. Thank you, James! Best, Stanislav On Wed, Aug 1, 2018 at 5:05 PM James Cheng wrote: > Can you update the KIP to say what the default is for > max.uncleanable.partitions? > > -James > > Sent from my iPhone > > > On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski > wrote: > > > >

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-01 Thread James Cheng
Can you update the KIP to say what the default is for max.uncleanable.partitions? -James Sent from my iPhone > On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski > wrote: > > Hey group, > > I am planning on starting a voting thread tomorrow. Please do reply if you > feel there is anything

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-01 Thread Stanislav Kozlovski
Hey group, I just wanted to note that I have an implementation ready for review. Feel free to take a quick look and raise any concerns you might have in due time. I plan on starting the voting thread tomorrow. Best, Stanislav On Wed, Aug 1, 2018 at 10:01 AM Stanislav Kozlovski wrote: > Hey

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-08-01 Thread Stanislav Kozlovski
Hey Ray, * Yes, we'd have the logDir as a tag in the metric * The intention is to have Int.MaxValue as the maximum uncleanable partitions count * My idea is to store the marked logs (actually partitions) in memory instead of the ".skip" files to keep the change simple. I have also decided to omit

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-31 Thread Ray Chiang
I had one question that I was trying to do some investigation before I asked, but I'm having some issues with my JMX browser right now. * For the uncleanable-partitions-count metric, is that going to be per-logDir entry? * For max.uncleanable.partitions, is the intention to have -1 be

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-31 Thread Stanislav Kozlovski
Hey group, I am planning on starting a voting thread tomorrow. Please do reply if you feel there is anything left to discuss. Best, Stanislav On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski wrote: > Hey, Ray > > Thanks for pointing that out, it's fixed now > > Best, > Stanislav > > On

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-27 Thread Stanislav Kozlovski
Hey, Ray Thanks for pointing that out, it's fixed now Best, Stanislav On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang wrote: > Thanks. Can you fix the link in the "KIPs under discussion" table on > the main KIP landing page > < >

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-27 Thread Ray Chiang
Thanks.  Can you fix the link in the "KIPs under discussion" table on the main KIP landing page ?  I tried, but the Wiki won't let me. -Ray On 7/26/18 2:01 PM, Stanislav Kozlovski wrote: Hey guys, @Colin - good

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-26 Thread Stanislav Kozlovski
Hey guys, @Colin - good point. I added some sentences mentioning recent improvements in the introductory section. *Disk Failure* - I tend to agree with what Colin said - once a disk fails, you don't want to work with it again. As such, I've changed my mind and believe that we should mark the

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-26 Thread Ray Chiang
Thanks for creating this KIP Stanislav.  My observations: 1) I agree with Colin that threads automatically re-launching threads generally isn't a great idea.  Metrics and/or monitoring threads are generally much safer.  And there's always the issue of what happens if the re-launcher dies?

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-25 Thread Dhruvil Shah
For the cleaner thread specifically, I do not think respawning will help at all because we are more than likely to run into the same issue again which would end up crashing the cleaner. Retrying makes sense for transient errors or when you believe some part of the system could have healed itself,

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-25 Thread Ron Dagostino
<< wrote: > On Mon, Jul 23, 2018, at 23:20, James Cheng wrote: > > Hi Stanislav! Thanks for this KIP! > > > > I agree that it would be good if the LogCleaner were more tolerant of > > errors. Currently, as you said, once it dies, it stays dead. > > > > Things are better now than they used to be.

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-25 Thread Colin McCabe
On Mon, Jul 23, 2018, at 23:20, James Cheng wrote: > Hi Stanislav! Thanks for this KIP! > > I agree that it would be good if the LogCleaner were more tolerant of > errors. Currently, as you said, once it dies, it stays dead. > > Things are better now than they used to be. We have the metric >

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-24 Thread Stanislav Kozlovski
Hey James, Ted, @James - Thanks for showing me some of the changes, that was informative. * *Log Cleaner Thread Revival* - I also acknowledge that could be useful. My concern is that if the thread has died, there is most likely something wrong with either the disk or the software and since both

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-24 Thread Ted Yu
As James pointed out in his reply, topic-partition name can be long. It is not necessary to repeat the topic name for each of its partitions. How about the following format: topic-name1-{partition1, partition2, etc} That is, topic name only appears once. Cheers On Mon, Jul 23, 2018 at 9:08 PM

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-24 Thread James Cheng
Hi Stanislav! Thanks for this KIP! I agree that it would be good if the LogCleaner were more tolerant of errors. Currently, as you said, once it dies, it stays dead. Things are better now than they used to be. We have the metric

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-23 Thread Stanislav Kozlovski
Hi Ted, Yes, absolutely. Thanks for pointing that out! On Mon, Jul 23, 2018 at 6:12 PM Ted Yu wrote: > For `uncleanable-partitions`, should the example include topic name(s) ? > > Cheers > > On Mon, Jul 23, 2018 at 5:46 PM Stanislav Kozlovski < > stanis...@confluent.io> > wrote: > > > I

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-23 Thread Ted Yu
For `uncleanable-partitions`, should the example include topic name(s) ? Cheers On Mon, Jul 23, 2018 at 5:46 PM Stanislav Kozlovski wrote: > I renamed the KIP and that changed the link. Sorry about that. Here is the > new link: > >

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-23 Thread Stanislav Kozlovski
I renamed the KIP and that changed the link. Sorry about that. Here is the new link: https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski wrote: > Hey group, > > I created a new KIP about making log

[DISCUSS] KIP-346 - Limit blast radius of log compaction failure

2018-07-23 Thread Stanislav Kozlovski
Hey group, I created a new KIP about making log compaction more fault-tolerant. Please give it a look here and please share what you think, especially in regards to the points in the "Needs Discussion" paragraph. KIP: KIP-346