Odd number of files on one node during repair (was: To Repair or Not to Repair)
On Tue, Aug 13, 2019 at 6:14 PM Oleksandr Shulgin <oleksandr.shul...@zalando.de> wrote:
> I was wondering about this again, as I've noticed one of the nodes in our cluster accumulating ten times the number of files compared to the average across the rest of the cluster. The files are all coming from a table with TWCS, and repair (running with Reaper) is ongoing. The sudden growth started around 24 hours ago, when the affected node was restarted due to a failing AWS EC2 System check.

And now, as the next weekly repair has started, the same node shows the problem again. The number of files went up to 6,000 in the last 7 hours, compared to the average of ~1,500 on the rest of the nodes, which remains more or less constant.

Any advice on how to debug this?

Regards,
--
Alex
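A first debugging step would be to compare live SSTable counts across nodes while repair runs, to confirm which table the outlier files belong to. A minimal sketch, assuming the default package data-directory layout (`/var/lib/cassandra/data/<keyspace>/<table>-<id>/`); the keyspace and table names are placeholders, not from the thread:

```shell
# Count live SSTables for one table directory. Each SSTable has exactly
# one *-Data.db component, so counting those files gives the SSTable count.
count_sstables() {
    find "$1" -maxdepth 1 -name '*-Data.db' 2>/dev/null | wc -l
}

# Example invocation (run via ssh on every node and compare the numbers):
# count_sstables /var/lib/cassandra/data/mykeyspace/mytable-*/
```

Alternatively, `nodetool cfstats` (later renamed `tablestats`) reports an "SSTable count" per table without touching the filesystem, which is easier to compare across nodes.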
Re: To Repair or Not to Repair
On Thu, Mar 14, 2019 at 9:55 PM Jonathan Haddad wrote:
> My coworker Alex (from The Last Pickle) wrote an in-depth blog post on TWCS. We recommend not running repair on tables that use TWCS.
>
> http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html

Hi,

I was wondering about this again, as I've noticed one of the nodes in our cluster accumulating ten times the number of files compared to the average across the rest of the cluster. The files are all coming from a table with TWCS, and repair (running with Reaper) is ongoing. The sudden growth started around 24 hours ago, when the affected node was restarted due to a failing AWS EC2 System check.

Now I'm wondering again whether we should be running those repairs at all. ;-)

In the Summary of the blog post linked above, the following is written:

> It is advised to disable read repair on TWCS tables, and use an aggressive tombstone purging strategy, as digest mismatches during reads will still trigger read repairs.

Was it meant to read "disable anti-entropy repair" instead? I find it confusing otherwise.

Regards,
--
Alex
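For reference, the "disable read repair and purge tombstones aggressively" advice from the blog post could be sketched as a table alteration like the following. The table name and threshold values are illustrative only, and the `read_repair_chance` options shown here exist in the 2.x/3.x versions current at the time of the thread but were removed in Cassandra 4.0, so check against your version:

```shell
cqlsh -e "
ALTER TABLE mykeyspace.mytable
  WITH read_repair_chance = 0.0
   AND dclocal_read_repair_chance = 0.0
   AND compaction = {
     'class': 'TimeWindowCompactionStrategy',
     'compaction_window_unit': 'DAYS',
     'compaction_window_size': '1',
     'unchecked_tombstone_compaction': 'true',
     'tombstone_threshold': '0.05'
   };"
```

`unchecked_tombstone_compaction` and `tombstone_threshold` are the standard knobs for more aggressive tombstone purging; the window settings simply restate a typical TWCS configuration.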
RE: To Repair or Not to Repair
Beautiful, thank you very much!

From: Jonathan Haddad [mailto:j...@jonhaddad.com]
Sent: Thursday, March 14, 2019 4:55 PM
To: user
Subject: Re: To Repair or Not to Repair

My coworker Alex (from The Last Pickle) wrote an in-depth blog post on TWCS. We recommend not running repair on tables that use TWCS.

http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html

It's enough of a problem that we added a feature to Reaper to auto-blacklist TWCS/DTCS tables from being repaired; we wrote about it here:

http://thelastpickle.com/blog/2019/02/15/reaper-1_4-released.html

Hope this helps!
Jon

On Fri, Mar 15, 2019 at 9:48 AM Nick Hatfield <nick.hatfi...@metricly.com> wrote:
> It seems that running a repair works really well, quickly and efficiently, when repairing a column family that does not use TWCS. Has anyone else had a similar experience? I'm wondering if running repair on TWCS tables is doing more harm than good, as it chews up a lot of CPU for extended periods of time compared to CFs with a compaction strategy of STCS.
>
> Thanks,

--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
Re: To Repair or Not to Repair
My coworker Alex (from The Last Pickle) wrote an in-depth blog post on TWCS. We recommend not running repair on tables that use TWCS.

http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html

It's enough of a problem that we added a feature to Reaper to auto-blacklist TWCS/DTCS tables from being repaired; we wrote about it here:

http://thelastpickle.com/blog/2019/02/15/reaper-1_4-released.html

Hope this helps!
Jon

On Fri, Mar 15, 2019 at 9:48 AM Nick Hatfield wrote:
> It seems that running a repair works really well, quickly and efficiently, when repairing a column family that does not use TWCS. Has anyone else had a similar experience? I'm wondering if running repair on TWCS tables is doing more harm than good, as it chews up a lot of CPU for extended periods of time compared to CFs with a compaction strategy of STCS.
>
> Thanks,

--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
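The Reaper feature Jon mentions is a configuration switch in `cassandra-reaper.yaml`. A sketch of what that looks like, with the caveat that the option name below is taken from Reaper 1.4-era configuration and should be verified against the Reaper version you run:

```yaml
# cassandra-reaper.yaml (sketch): automatically exclude TWCS/DTCS tables
# from repair runs, per the Reaper 1.4 release notes linked above.
blacklistTwcsTables: true
```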
To Repair or Not to Repair
It seems that running a repair works really well, quickly and efficiently, when repairing a column family that does not use TWCS. Has anyone else had a similar experience? I'm wondering if running repair on TWCS tables is doing more harm than good, as it chews up a lot of CPU for extended periods of time compared to CFs with a compaction strategy of STCS.

Thanks,
Re: what's the difference between repair CF separately and repair the entire node?
On Wed, Sep 14, 2011 at 2:38 AM, Yan Chunlu <springri...@gmail.com> wrote:
> Me neither; I don't want to repair one CF at a time. The node repair took a week and was still running; compactionstats and netstats showed nothing running on any node, and there was no error message and no exception -- really no idea what it was doing.

To add to the list of things repair does wrong in 0.7: if one of the nodes participating in the repair (that is, any node that shares a range with the node on which the repair was started) goes down, even for a short time, then the repair will simply hang forever, doing nothing. And no specific error message will be logged. That could be what happened. Again, recent releases of 0.8 fix that too.

--
Sylvain

> I stopped it yesterday. Maybe I should run repair again with compaction disabled on all nodes? Thanks!
>
> On Wed, Sep 14, 2011 at 6:57 AM, Peter Schuller <peter.schul...@infidyne.com> wrote:
>> It is a serious issue if you really need to repair one CF at a time. [...]
Re: what's the difference between repair CF separately and repair the entire node?
Is 0.8 ready for production use? As far as I know, many companies, including reddit.com, are currently using 0.7; how do they get around the repair problem?

On Wed, Sep 14, 2011 at 2:47 PM, Sylvain Lebresne <sylv...@datastax.com> wrote:
> To add to the list of things repair does wrong in 0.7: if one of the nodes participating in the repair (that is, any node that shares a range with the node on which the repair was started) goes down, even for a short time, then the repair will simply hang forever, doing nothing. And no specific error message will be logged. That could be what happened. Again, recent releases of 0.8 fix that too.
>
> --
> Sylvain
> [...]
Re: what's the difference between repair CF separately and repair the entire node?
On Wed, Sep 14, 2011 at 9:27 AM, Yan Chunlu <springri...@gmail.com> wrote:
> is 0.8 ready for production use?

Some related discussion here: http://www.mail-archive.com/user@cassandra.apache.org/msg17055.html -- but my personal answer is yes.

> as I know currently many companies including reddit.com are using 0.7, how do they get around the repair problem?

Repair problems in 0.7 don't hit everyone equally. For some people it works relatively well, even if not in the most efficient way. Also, for some workloads (if you don't do many deletes, for instance), you can set a big gc_grace_seconds value (say, a month) and only run repair that often, which can make repair's inefficiencies more bearable. That being said, I can't speak for many companies, but I do advise evaluating an upgrade to 0.8.

--
Sylvain

> On Wed, Sep 14, 2011 at 2:47 PM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>> To add to the list of things repair does wrong in 0.7: if one of the nodes participating in the repair goes down, even for a short time, then the repair will simply hang forever, doing nothing. [...]
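Sylvain's "big gc_grace_seconds, infrequent repair" workaround could be sketched like this; the month value is his example, and the table name is a placeholder (in 0.7-era CQL this was set per column family, e.g. via cassandra-cli, but the idea is the same):

```shell
# Raise gc_grace_seconds to ~30 days so repair only needs to run that often
# to keep deleted data from resurrecting. 2592000 s = 30 days.
cqlsh -e "ALTER TABLE mykeyspace.mytable WITH gc_grace_seconds = 2592000;"
```

The trade-off is that tombstones are retained for the full month before they can be purged, so this fits workloads with few deletes, as Sylvain notes.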
Re: what's the difference between repair CF separately and repair the entire node?
It was mentioned in another thread that Twitter uses 0.8 in production; for me that was a fairly strong testimonial...

On Sep 14, 2011 9:28 AM, Yan Chunlu <springri...@gmail.com> wrote:
> Is 0.8 ready for production use? As far as I know, many companies, including reddit.com, are currently using 0.7; how do they get around the repair problem?
> [...]
Re: what's the difference between repair CF separately and repair the entire node?
Thanks a lot for the help! I have read the post and think 0.8 might be good enough for me, especially 0.8.5. Changing gc_grace_seconds is also an acceptable solution.

On Wed, Sep 14, 2011 at 4:03 PM, Sylvain Lebresne <sylv...@datastax.com> wrote:
> Some related discussion here: http://www.mail-archive.com/user@cassandra.apache.org/msg17055.html -- but my personal answer is yes.
>
> Repair problems in 0.7 don't hit everyone equally. For some people it works relatively well, even if not in the most efficient way. Also, for some workloads (if you don't do many deletes, for instance), you can set a big gc_grace_seconds value (say, a month) and only run repair that often, which can make repair's inefficiencies more bearable. That being said, I can't speak for many companies, but I do advise evaluating an upgrade to 0.8.
>
> --
> Sylvain
> [...]
Re: what's the difference between repair CF separately and repair the entire node?
On Tue, Sep 13, 2011 at 3:57 PM, Peter Schuller <peter.schul...@infidyne.com> wrote:
>> I think it is a serious problem since I cannot repair. I am using cassandra on production servers. Is there some way to fix it without upgrading? I heard that 0.8.x is still not quite ready for a production environment.
>
> It is a serious issue if you really need to repair one CF at a time.

Why is it serious to repair one CF at a time? If I cannot do it at the CF level, does that mean I cannot use more than 50% of my disk space? Is this specific to this problem, or is that a general statement? I ask because I am planning on doing this so I can limit the max disk overhead to one CF's worth (plus some factor). I am going to be testing this in the next couple of weeks or so.

> However, looking at your original post it seems this is not necessarily your issue. Do you need to, or was your concern rather the overall time repair took? [...]
Re: what's the difference between repair CF separately and repair the entire node?
>> It is a serious issue if you really need to repair one CF at a time.
>
> Why is it serious to repair one CF at a time? If I cannot do it at the CF level, does that mean I cannot use more than 50% of my disk space? Is this specific to this problem, or is that a general statement? I ask because I am planning on doing this so I can limit the max disk overhead to one CF's worth (plus some factor). I am going to be testing this in the next couple of weeks or so.

The bug in 0.7 causes data to be streamed for all CFs when doing a repair on one. So if you specifically need to repair one CF at a time, for instance because you're trying to repair a small CF quite often while leaving a huge CF with less frequent repairs, you have an issue. If you just want to repair the entire keyspace, it doesn't affect you. I'm not sure how this relates to the 50% disk space bit, though.

--
/ Peter Schuller (@scode on twitter)
Re: what's the difference between repair CF separately and repair the entire node?
>> It's okay but won't do what you want; due to a bug you'll see streaming of data for other column families than the one you're trying to repair. This will be fixed in 1.0.
>
> I think we might be running into this. Is CASSANDRA-2280 the issue you're referring to?

Yes. Sorry for not providing the reference.

--
/ Peter Schuller (@scode on twitter)
Re: what's the difference between repair CF separately and repair the entire node?
> I think it is a serious problem since I cannot repair. I am using cassandra on production servers. Is there some way to fix it without upgrading? I heard that 0.8.x is still not quite ready for a production environment.

It is a serious issue if you really need to repair one CF at a time. However, looking at your original post, it seems this is not necessarily your issue. Do you need to, or was your concern rather the overall time repair took?

There are other things that are improved in 0.8 relative to 0.7. In particular, (1) in 0.7, compaction, including the validating compactions that are part of repair, is non-concurrent, so if your repair starts while a long-running compaction is in progress it will have to wait; and (2) semi-related, the merkle tree calculation that is part of repair/anti-entropy may happen out of sync if one of the participating nodes happens to be busy with compaction. This in turn causes additional data to be sent as part of repair. That might be why your immediately following repair took a long time, but it's difficult to tell.

If you're having issues with repair and large data sets, I would generally say that upgrading to 0.8 is recommended. However, if you're on 0.7.4, beware of https://issues.apache.org/jira/browse/CASSANDRA-3166

--
/ Peter Schuller (@scode on twitter)
Re: what's the difference between repair CF separately and repair the entire node?
Me neither; I don't want to repair one CF at a time. The node repair took a week and is still running; compactionstats and netstats show nothing running on any node, and there is also no error message and no exception -- really no idea what it was doing. I stopped it yesterday. Maybe I should run repair again with compaction disabled on all nodes? Thanks!

On Wed, Sep 14, 2011 at 6:57 AM, Peter Schuller <peter.schul...@infidyne.com> wrote:
> It is a serious issue if you really need to repair one CF at a time. However, looking at your original post, it seems this is not necessarily your issue. Do you need to, or was your concern rather the overall time repair took? [...]
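Yan's "repair with compaction disabled" idea could be sketched with nodetool as below. Note these subcommands are assumptions for illustration: `disableautocompaction` did not exist in the 0.7 nodetool discussed here and was added in later releases, the keyspace name is a placeholder, and repair's own validation compactions still run regardless:

```shell
# On every node: pause regular compactions for the keyspace, run the
# repair, then re-enable compaction.
nodetool disableautocompaction mykeyspace
nodetool repair mykeyspace
nodetool enableautocompaction mykeyspace

# From another terminal, watch what repair is actually doing:
nodetool compactionstats   # validation compactions / merkle tree builds
nodetool netstats          # streaming between nodes
```

Seeing both commands idle for a long time while repair "runs", as Yan describes, is consistent with the hang-on-node-restart bug Sylvain mentions elsewhere in the thread.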
Re: what's the difference between repair CF separately and repair the entire node?
> I am using 0.7.4. So is it always okay to do the routine repair on a column family basis? Thanks!

It's okay, but it won't do what you want; due to a bug you'll see streaming of data for other column families than the one you're trying to repair. This will be fixed in 1.0.

--
/ Peter Schuller (@scode on twitter)
Re: what's the difference between repair CF separately and repair the entire node?
On Mon, Sep 12, 2011 at 1:44 PM, Peter Schuller <peter.schul...@infidyne.com> wrote:
>> I am using 0.7.4. So is it always okay to do the routine repair on a column family basis? Thanks!
>
> It's okay, but it won't do what you want; due to a bug you'll see streaming of data for other column families than the one you're trying to repair. This will be fixed in 1.0.

I think we might be running into this. Is CASSANDRA-2280 the issue you're referring to?

Jim
Re: what's the difference between repair CF separately and repair the entire node?
I think it is a serious problem since I cannot repair. I am using cassandra on production servers. Is there some way to fix it without upgrading? I heard that 0.8.x is still not quite ready for a production environment. Thanks!

On Tue, Sep 13, 2011 at 1:44 AM, Peter Schuller <peter.schul...@infidyne.com> wrote:
>> I am using 0.7.4. So is it always okay to do the routine repair on a column family basis? Thanks!
>
> It's okay, but it won't do what you want; due to a bug you'll see streaming of data for other column families than the one you're trying to repair. This will be fixed in 1.0.
>
> --
> / Peter Schuller (@scode on twitter)
Re: what's the difference between repair CF separately and repair the entire node?
On Fri, Sep 9, 2011 at 4:18 AM, Yan Chunlu <springri...@gmail.com> wrote:
> I have 3 nodes and RF=3. I tried to repair every node in the cluster by running "nodetool repair mykeyspace mycf" on every column family. It finished within 3 hours; the data size is no more than 50GB. After that, I tried using "nodetool repair" immediately to repair the entire node, but 48 hours have passed and it is still going on. compactionstats shows it is doing an SSTable rebuild. So I am frustrated: why is nodetool repair so slow? How is it different from repairing every CF?

What version of Cassandra are you using? If you are using something older than 0.8.2, it may be because nodetool repair used to schedule its sub-tasks poorly, in ways that were counter-productive (fixed by CASSANDRA-2816). If you are using a more recent version, then it's an interesting report.

> I didn't try to repair the system keyspace; does it also need to be repaired?

It doesn't.

--
Sylvain
what's the difference between repair CF separately and repair the entire node?
I have 3 nodes and RF=3. I tried to repair every node in the cluster by running "nodetool repair mykeyspace mycf" on every column family. It finished within 3 hours; the data size is no more than 50GB.

After that, I tried using "nodetool repair" immediately to repair the entire node, but 48 hours have passed and it is still going on. compactionstats shows it is doing an SSTable rebuild. So I am frustrated: why is nodetool repair so slow? How is it different from repairing every CF?

I didn't try to repair the system keyspace; does it also need to be repaired?

Thanks!
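For clarity, the two invocations being compared in this thread look like this (keyspace, CF, and host names are placeholders):

```shell
# Per column family: repair only mycf in mykeyspace on this node's ranges.
nodetool -h node1 repair mykeyspace mycf

# Entire node: repair every keyspace and every column family for the
# node's ranges in one run.
nodetool -h node1 repair
```

In principle the per-CF form should sum to roughly the same work as the whole-node form, which is why the huge time difference reported here pointed at a bug (CASSANDRA-2816's poor sub-task scheduling, discussed in the reply).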