[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17747681#comment-17747681 ]

Jon Haddad commented on CASSANDRA-8460:
---
{quote}
Tiering to spinning disk or cheaper block devices is fine. It's a win. It's easy to reason about - probably just implement it via compaction, and the read and write paths stay exactly the same. But I think the industry trends would suggest this is suboptimal - moving this to a fast object store (e.g. S3) would be even better. It's lower cost / higher durability, and it allows for other things "eventually", like sharing one sstable between replicas (or eventually erasure coding pieces of data). That turns this ticket from ~easy to ~hard, because you also have to touch the read path (or, more likely, change / add a new sstable reader that can read from object storage, and then figure out how you want to upload to object storage). So "is there interest" - probably, but in an S3 version of this feature, vs spinning disk.
{quote}
Tiering with an object store is a lot more interesting and useful to me as well. I know many teams that would make use of this, and it could dramatically reduce cost depending on the size of the active dataset.

> Make it possible to move non-compacting sstables to slow/big storage in DTCS
>
> Key: CASSANDRA-8460
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8460
> Project: Cassandra
> Issue Type: Improvement
> Components: Local/Compaction
> Reporter: Marcus Eriksson
> Assignee: Lerh Chuan Low
> Priority: Normal
> Labels: doc-impacting, dtcs
> Fix For: 5.x
>
> It would be nice if we could configure DTCS to have a set of extra data directories where we move the sstables once they are older than max_sstable_age_days.
> This would enable users to have a quick, small SSD for hot, new data, and big spinning disks for data that is rarely read and never compacted.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722803#comment-17722803 ]

Jeff Jirsa commented on CASSANDRA-8460:
---
I think a lot of people would still find it useful; however, I think since 2014 the way most people think about storage has changed.

Tiering to spinning disk or cheaper block devices is fine. It's a win. It's easy to reason about - probably just implement it via compaction, and the read and write paths stay exactly the same. But I think the industry trends would suggest this is suboptimal - moving this to a fast object store (e.g. S3) would be even better. It's lower cost / higher durability, and it allows for other things "eventually", like sharing one sstable between replicas (or eventually erasure coding pieces of data).

That turns this ticket from ~easy to ~hard, because you also have to touch the read path (or, more likely, change / add a new sstable reader that can read from object storage, and then figure out how you want to upload to object storage).

So "is there interest" - probably, but in an S3 version of this feature, vs spinning disk.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722765#comment-17722765 ]

Claude Warren commented on CASSANDRA-8460:
---
Is there still interest in moving this concept forward? I am interested in exploring this option.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16443412#comment-16443412 ]

Lerh Chuan Low commented on CASSANDRA-8460:
---
Hi [~rustyrazorblade],

Sorry for the delay - it's usually a little bit tricky to get the setup right, and we also had a few hiccups with trunk. Over the last few weeks we've done benchmarking on AWS using 4 different 3-node clusters (each with their own dedicated stress box): an LVM setup, a HDD setup, my code setup, and a SSD setup. The details are in here: [https://docs.google.com/document/d/164qZ3zpG5pm_j4r9yWccmMiZh6XK4LsqnZBP7Iu3gII/edit#|https://docs.google.com/document/d/164qZ3zpG5pm_j4r9yWccmMiZh6XK4LsqnZBP7Iu3gII/edit]

It details the way I've run the stress tests and how I've set LVM up (as write-through). We've done 5 takes, and in those runs LVM doesn't seem to perform very well, even when compared to the HDD setup. That said, the archiving code also doesn't do as well as the SSD setup; I think that may be related to partitions being spread across both the slow and the fast disks. For LVM I know the cache volume is being used (based on CloudWatch); it's unintuitive to me how this can be the case, because it did really well in the fio benchmark you ran.

Would you (or anyone, really) like to take a look and give your thoughts? Maybe the test is skewed against LVM and we can tune it better? Much appreciated :)
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414966#comment-16414966 ]

Ben Slater commented on CASSANDRA-8460:
---
OK. Talking to Lerh, his code is just about at the point where we can do some initial benchmarking, so we'll run some tests to compare the two approaches and report what we get.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414820#comment-16414820 ]

Jon Haddad commented on CASSANDRA-8460:
---
I'm not sure what else to tell you, [~slater_ben]. You just described a perfect use case for the Linux page cache, TWCS, and dm-cache. I understand what you're trying to say - that somehow dm-cache won't cache the right data, and Cassandra will somehow do a better job than the kernel at understanding the data we need to keep hot - but so far my experience leads me to disagree with you that there would be an issue.

For data that's recently been compacted & read, that data is going to be in your page cache. For data that's been recently "major compacted" in the previous TWCS window, that will either be in the page cache or the dm-cache. After that, the data is just sitting around, so the access patterns will keep it either in cache or out, depending on when it's accessed.

Ultimately what matters is keeping the miss rate on your disk to a minimum. You do that by keeping frequently accessed data in the cache. Using a time element (recency) without factoring in hot spots will actually get you a _worse_ cache hit ratio, which will put more pressure on the slow disks, driving up seeks and making it harder to meet the SLA.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414793#comment-16414793 ]

Ben Slater commented on CASSANDRA-8460:
---
Fair enough - the example was a bit of an oversimplification, even for how I would have guessed it works. Having read up a bit ([https://www.redhat.com/en/blog/improving-read-performance-dm-cache] and [https://www.kernel.org/doc/Documentation/device-mapper/cache-policies.txt]), I suspect we've actually got a bit of a different model of the use cases we are both imagining (and I haven't done a great job of describing what I have in mind).

Consider you're building an IoT application that collects sensor data and has some kind of UI for displaying readings. You want to provide an experience where accessing today's data (the most common use) is snappy, while still providing the ability to go back in time a year; as that's not common, it's fine for access to that data to be slower. In this scenario the recent data isn't "hot" in the sense that it is accessed many times (I'm not sure there is a well defined term for what it is - maybe "high priority" is better?), so it's hard for a caching algorithm based on frequency of access (like smq) to work effectively (in fact the first access is the one you want to be fast).

Does that make more sense as to where I'm coming from?
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414774#comment-16414774 ]

Jon Haddad commented on CASSANDRA-8460:
---
{quote}
With LVM (possibly depending on its rules about how and when to cache - I admit I don't know a lot about the tuning possibilities there) you could end up with issues like one of your users deciding to do some analysis / extract a heap of old data, evicting the recent data from your cache and causing what you expected to be hot data to slow down.
{quote}
Perhaps you should do some research about how lvmcache / dm-cache actually works before making arguments against it? What you described about the cache eviction is something dm-cache was specifically designed to avoid with its smq policy.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414766#comment-16414766 ]

Ben Slater commented on CASSANDRA-8460:
---
I'm not sure it's necessarily easier (because you now have two separate pools of disk to manage), but I think it is more predictable: your data will always be on the fast disk until it reaches the age you specify. With LVM (possibly depending on its rules about how and when to cache - I admit I don't know a lot about the tuning possibilities there) you could end up with issues like one of your users deciding to do some analysis / extract a heap of old data, evicting the recent data from your cache and causing what you expected to be hot data to slow down.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414748#comment-16414748 ]

Jon Haddad commented on CASSANDRA-8460:
---
{quote}
Thinking some more about this, I think the other (and perhaps most important) advantage of implementing this in Cassandra is predictability for operators. It's easy to say, for example: if I want data < 1 month old to be fast, I need enough fast disk space for that and I know it will be consistently fast; after that, I need X disk space for the older data and I know it will be slower (and can even clearly tell users that). Trying to tune performance of the hot data (and avoid latency spikes) with Cassandra + LVM sounds pretty hard.
{quote}
I don't see how having Cassandra manage this makes it easier. With LVM you just set up the cache, and it keeps as much hot data in the cache as it can. Maybe you only need a month's worth of data on your hot drive. If you can't fit a month, lvmcache will manage that just fine, because it's a cache for hot blocks. If you can fit 6 months on the cache, it'll do that fine too. There isn't any need for configuration; it's literally designed to handle hot data, and you don't need to guess when to tier data off to your cold storage layer.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414719#comment-16414719 ]

Ben Slater commented on CASSANDRA-8460:
---
Thinking some more about this, I think the other (and perhaps most important) advantage of implementing this in Cassandra is predictability for operators. It's easy to say, for example: if I want data < 1 month old to be fast, I need enough fast disk space for that and I know it will be consistently fast; after that, I need X disk space for the older data and I know it will be slower (and can even clearly tell users that). Trying to tune performance of the hot data (and avoid latency spikes) with Cassandra + LVM sounds pretty hard.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414714#comment-16414714 ]

Jon Haddad commented on CASSANDRA-8460:
---
{quote}
The requirement we're looking to target, as per the original JIRA, is people who have data that is hot for a short period but then needs to be kept around for a long time with infrequent access (i.e. well defined rules on hot vs cold, not deciding what is hot based on what was recently read). Typically when I've seen this requirement people want: 1) the best possible performance for the hot data, and 2) the lowest cost of storage for the cold data. It seems to me that with LVM we're not doing the best we could in terms of either of these.
{quote}
If you want the best possible read performance for hot data, there's not going to be a better option than the caching layer. Treating a disk as part of the Cassandra storage pool rather than a cache layer managed by the OS introduces the need for explicit configuration and the need to explicitly manage data placement. By this I mean you will need to keep some definition in the schema or code about when to keep things on the hot disk and when to move them off. My gut tells me this will result in an under-utilized disk, mostly because the more efficiently you use the fast disk, the greater the risk of failure. Imagine a large compaction happening on the hot disk - this patch will need to ensure it starts moving older data off to the slow drive, which is going to block compactions from happening on the hot disk.

Regarding the low cost, I agree with you: duplicating the data on a cache drive is going to cost more than the aggregate of the space of the two drives.
{quote}
For performance, there is the write-through slowdown you mentioned; depending on where you draw the line on moving to slow disk vs the final TWCS compaction, you might have compactions pushing data you want to be quick out of cache; and if you used EBS for both the hot disk and the slow disk, you are increasing usage of the EBS bandwidth to copy to and from cache (although using local SSD as the cache negates this last one).
{quote}
I'm not sure how much of a problem this is in practice. Cassandra's sequential writes are going to avoid a lot of performance issues related to spinning disks. In my experience the biggest performance problem limiting compaction throughput is going to be GC pauses, not the ability to write bytes to disk.

{quote}
In terms of cost, with LVM the fast disk is purely being used as cache rather than a primary store, so you are having to duplicate that amount of data storage - whether that is significant probably depends on your desired ratio of fast to slow disk and how cost sensitive you are.
{quote}
Agreed. To me, the main benefit to having the fast disk involved is the ability to increase density significantly at very low cost. If you were to have a small SSD backed by 3-5TB of slow storage, that's a pretty good win in my opinion.

{quote}
Whether these downsides are worth the extra complexity is of course a matter of judgement rather than facts, so I'm happy to go with the community consensus here, but thought I'd put in my POV.
{quote}
To be clear - I'm not shooting down the patch, or saying it's a bad idea. I think there are some interesting aspects to it with some valid use cases; I'd just like everyone to be aware of existing alternatives, as I didn't see anyone bring up lvmcache in the three years this ticket has existed.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414547#comment-16414547 ]

Ben Slater commented on CASSANDRA-8460:
---
Hi Jon,

I've been setting the requirements from our (Instaclustr) point of view for Lerh here, so I thought I'd weigh in on why I'd rather see a Cassandra-based solution than LVM.

The requirement we're looking to target, as per the original JIRA, is people who have data that is hot for a short period but then needs to be kept around for a long time with infrequent access (i.e. well defined rules on hot vs cold, not deciding what is hot based on what was recently read). Typically when I've seen this requirement people want: 1) the best possible performance for the hot data, and 2) the lowest cost of storage for the cold data. It seems to me that with LVM we're not doing the best we could in terms of either of these.

For performance, there is the write-through slowdown you mentioned; depending on where you draw the line on moving to slow disk vs the final TWCS compaction, you might have compactions pushing data you want to be quick out of cache; and if you used EBS for both the hot disk and the slow disk, you are increasing usage of the EBS bandwidth to copy to and from cache (although using local SSD as the cache negates this last one).

In terms of cost, with LVM the fast disk is purely being used as cache rather than a primary store, so you are having to duplicate that amount of data storage - whether that is significant probably depends on your desired ratio of fast to slow disk and how cost sensitive you are.

Whether these downsides are worth the extra complexity is of course a matter of judgement rather than facts, so I'm happy to go with the community consensus here, but thought I'd put in my POV.
Cheers, Ben
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414484#comment-16414484 ]

Jon Haddad commented on CASSANDRA-8460:
---
Hey [~Lerh Low], sorry for the delay. I've been running some tests on lvmcache, using fio for benchmarking rather than C*; Cassandra adds a layer of complexity that won't help when it comes to raw benchmarks. I ran some tests on EBS, SSD (i2.large), and EBS using SSD as a cache volume, with this simple configuration to start:

{code}
[global]
size=10G
runtime=30m
directory=/bench/
bs=4k

[random-read]
rw=randread
numjobs=4

[sequential-write]
rw=write
{code}

||Metric||EBS||SSD||EBS + Cache||
|Random Read IOPS|1509|5748|5347|
|Random Read Bandwidth|6MB/s|22MB/s|21MB/s|
|Seq Write IOPS|40K|145K|39K|
|Seq Write Bandwidth|163MB/s|580MB/s|156MB/s|

I've set up the cache as writethrough, meaning we're going to be bottlenecked on the slow disk for writes. Here's the setup:

{code}
root@ip-172-31-45-143:~# lvs -a
  LV              VG   Attr       LSize   Pool    Origin         Data%  Meta%  Move Log Cpy%Sync Convert
  [cache]         test Cwi---C--- 700.00g                        7.25   0.55            0.00
  [cache_cdata]   test Cwi-ao     700.00g
  [cache_cmeta]   test ewi-ao      40.00g
  [lvol0_pmspare] test ewi---      40.00g
  origin          test Cwi-aoC---   1.50t [cache] [origin_corig] 7.25   0.55            0.00
  [origin_corig]  test owi-aoC---   1.50t
{code}

Generally speaking, TWCS uses considerably less I/O than any other strategy, and it works fine with spinning disks on EBS already, so I'm inclined to _personally_ lean towards using LVM. It doesn't require any additional configuration once the volume is set up, and as I mentioned previously, it's been baked into the Linux kernel for a long time now. I haven't researched what's available on Windows, so that's something to keep in mind.

I'm not opposed to research, or new features, but this seems to me to be adding complexity to solve a problem that's already been solved.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413369#comment-16413369 ]

Lerh Chuan Low commented on CASSANDRA-8460:
---
https://github.com/apache/cassandra/compare/trunk...juiceblender:cassandra-8460-single-csm

Bump! Asking for kind souls to review / have a look at what I have at the moment - the branch has now been updated enough (also with tests) that it should reflect a few of the features I have decided on. I've also tried to make archiving as non-invasive as possible, but given the code organization (and also thinking about it intuitively, it sort of makes sense) some parts of archiving had to be known to superclasses, such as {{CompactionTask}} or {{CompactionAwareWriter}} being aware that there really are 2 different directories: one for the hot and one for the cold.

Highlights:
- New enumeration {{DirectoryType}}. Can be either {{STANDARD}} or {{ARCHIVE}}.
- The decision on whether or not something should be archived is made in {{TimeWindowCompactionStrategy}}. The decisions are:
* If somebody turns off archiving, candidates always get put into standard.
* If candidates are already in archive, put them back into archive.
* Otherwise, do the standard check: is their age past the archiving time?
- CSM is aware that there is such a thing as {{DirectoryType}}. It keeps a running CompactionStrategy instance for every single directory, both archive and standard.
- People can turn off archiving at any time via the archiving flags in the TWCS options. If they choose to do so, any archived SSTables, if compacted, will be moved back to the standard directories. Otherwise, they stay in archive. (Maybe I could write a nodetool command to move archive back to standard.)
- If people turn on archiving, SSTables are moved to archive the next time they are compacted.

I've also included comments in various places to try to say what I am trying to do.
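For anyone skimming, the decision rules described in the highlights can be sketched roughly like this - a Python model for illustration only, not the actual Java in the branch (the function and parameter names here are made up):

```python
from enum import Enum

class DirectoryType(Enum):
    STANDARD = "standard"
    ARCHIVE = "archive"

def choose_directory(age_seconds, current_type, archiving_enabled, archive_after_seconds):
    """Pick which directory type a compaction result should be written to."""
    if not archiving_enabled:
        # Archiving turned off: everything (including previously archived
        # sstables, once they next compact) goes back to standard.
        return DirectoryType.STANDARD
    if current_type is DirectoryType.ARCHIVE:
        # Already-archived sstables stay in archive.
        return DirectoryType.ARCHIVE
    # Standard check: has the data aged past the archiving threshold?
    if age_seconds > archive_after_seconds:
        return DirectoryType.ARCHIVE
    return DirectoryType.STANDARD
```

This is just a restatement of the three bullet points as code, so reviewers can object to the rules themselves rather than the implementation.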
Finally, if you don't have time to look through the code, please at least look through just this file: {{ArchivingCompactionTest}}. All the methods have long names describing a feature of this archiving compaction; please let me know if you disagree with any of them. There's still a lot left - dtests, Scrubber (and I don't know anything about how it works with Repair etc.), and it needs some functional testing. Potentially also a separate compaction executor, metrics and concurrent compactors. Thanks! Btw, any luck, Jon? I think I may look at writing some Terraform scripts to spin up Cassandra on Debian, which may be useful for you.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396263#comment-16396263 ] Lerh Chuan Low commented on CASSANDRA-8460: --- Read you loud and clear on stress. I favoured stress because it reports operation and latency rates, and I didn't want to dig through the code just yet to find out exactly what metric stress reports - I just trusted it as the default tool Cassandra ships with. I do have a custom Java class for doing inserts and reads (but it doesn't do much beyond that); let me know if you would like it...? I am also curious what metric you think would be an accurate measure. Off the top of my head, from the client side I can think of the time from executing the query to when I receive the answer, but I'm not sure I could make the case that the cached version is better than the uncached version just based off that (the alternative is to dig through the stress code for more). I am not very familiar with the low-level things in Linux, so help with fio will be really appreciated (it doesn't help that my nodes actually run on CoreOS). I relied on CloudWatch to verify that my cache is working. When I have time in the coming days I may write a Python script to model TWCS (if you haven't got to it by then). I agree it should be modeled with TWCS, which is why I was trying to make stress look like it. That said, it was interesting to see if it helped the other compaction strategies :)
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392190#comment-16392190 ] Jon Haddad commented on CASSANDRA-8460: --- I'll be honest, whenever I need to do performance testing, the last thing I reach for is stress, because I can't wrap my head around configuring it right. It's probably easier to create a ~50 line program to do the inserts. I'll try to throw something together tomorrow. Ultimately, if this is going to benefit TWCS, it *has* to be tested with it, so we might as well do that up front. It's the end of the day, and I'm not an expert in setting up lvmcache, so I'll have to try it out tomorrow as well.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392035#comment-16392035 ] Lerh Chuan Low commented on CASSANDRA-8460: --- Sure. In the commands below {{/dev/md0}} is my RAID array and {{/dev/xvdf}} is my SSD volume.
{code}
sudo pvcreate /dev/md0
sudo vgcreate VolGroupArray /dev/md0 /dev/xvdf
sudo lvcreate -n SadOldCache -L 99900M VolGroupArray /dev/xvdf
sudo lvcreate -n SadOldCacheMeta -L 100M VolGroupArray /dev/xvdf
sudo lvconvert --type cache-pool --poolmetadata VolGroupArray/SadOldCacheMeta VolGroupArray/SadOldCache
sudo lvcreate -l 100%FREE -n RaidHDD VolGroupArray /dev/md0
sudo lvs -a VolGroupArray
sudo lvconvert --type cache --cachepool VolGroupArray/SadOldCache VolGroupArray/RaidHDD
sudo vgchange -ay
{code}
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392030#comment-16392030 ] Lerh Chuan Low commented on CASSANDRA-8460: --- On second thought, maybe it isn't ideal to bundle it together with compactions; perhaps it should be a totally new {{Archiving}} operation type.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392024#comment-16392024 ] Jon Haddad commented on CASSANDRA-8460: --- Would you mind sharing the commands you used to set up the LVM cache? It's pretty easy to accidentally set up the pool as a balance of the 2 drives rather than using the SSD as a cache. Both implementations are going to work best with TWCS, so I think that testing the workload with TWCS using time series writes is going to be a lot more productive than random writes, as you've noted.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392001#comment-16392001 ] Lerh Chuan Low commented on CASSANDRA-8460: --- [~rustyrazorblade], About {{disk_failure_policy}}: I wasn't aware of that, so I'll have to look into it. With my current patch up to where it is now, there are also other things like Scrubber and streaming which I have yet to get to. Thanks for the heads up! If such an implication is necessary then maybe we will have to enforce it in the code. About LVM cache: I spent some time following the man page and trying it out with cassandra-stress. I had spun up a few EC2 clusters, each using a RAID array of 800GB; one was SSD-backed, another was magnetic-HDD-backed, and the final one was magnetic-HDD-backed with 100GB of LVM writethrough cache. I inserted ~200GB of data using cassandra-stress, waited for compactions to finish and then attempted a mixed (random) workload... the LVM cluster performed even worse than the HDD one. I guess this was to be expected, because the cache works best for hot data that is frequently read. I did briefly attempt a mixed workload where the queries always try to select the same data as much as possible (so {{gaussian(1..500M, 25000, 1000)}}), and there wasn't any noticeable difference between the LVM- and HDD-backed clusters. Not sure if you have used lvmcache with a workload before that worked out for you and you'd be willing to share details about it...? Thinking about it further, the cache is also slightly different from the original proposal: the cache duplicates the data, while making Cassandra understand archiving does not. There's also a slight point in favour of archiving, at least on AWS: the cache consumes extra IOPS on the volumes because reads and writes are duplicated (amplified) to and from the cache. Any thoughts?
(And thank you for your input once again :)) My clusters are still running, so I'm happy to try a few configurations if you have any to suggest. For now I'm just going to refresh myself on the code and look into making it more presentable in case someone else swings by and is willing to give their thoughts.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374864#comment-16374864 ] Jon Haddad commented on CASSANDRA-8460: --- Hey [~Lerh Low]! First off, let me thank you for being open to alternative ideas, especially after writing a large chunk of code. Not everyone is willing to take a step back and consider other options; I really appreciate it. {quote} Maybe you have stumbled upon the case where data has been resurrected in JBOD configuration in your experiences...? In theory since splitting by token range there should be no more such cases. It is safe. {quote} I had actually misremembered how CASSANDRA-6696 was implemented. Looking back at the code and testing it manually, I see the memtables are flushed to their respective disks initially. It's nice to be wrong about this. There's quite a bit going on here; I did a quick search but didn't see anything related to disk failure policy. One thing that's going to be a bit tricky: unless you have a 1:1 fast-disk-to-archive-disk relationship, you end up with some weird situations that can show up when using {{disk_failure_policy: best_effort}}, which is what CASSANDRA-6696 was all about in the first place. If you lose your fast disk, will you still be able to query data that's on the archive disk for a given token range? It seems to me that using this feature would have to imply {{disk_failure_policy: stop}}, since the failure of either the archive disk or one of the disks in {{data_file_directories}} would result in incorrect results being returned. lvmcache uses [dm-cache|https://www.kernel.org/doc/Documentation/device-mapper/cache.txt] under the hood, which keeps hot blocks on the faster device. It shipped in Linux kernel 3.9, which was released in April 2013.
Using lvmcache, if you were to create a logical volume per disk, with the SSD configured as a writethrough cache, you'd still honor the disk failure policy in the case of an archival or SSD failure, and you'd have the flexibility of keeping any hot data readily available without explicitly needing to move it off to another device while it's still active. It adapts to your read and write patterns rather than requiring configuration. Take a look at the [man page|http://man7.org/linux/man-pages/man7/lvmcache.7.html], it's pretty awesome.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373922#comment-16373922 ] Lerh Chuan Low commented on CASSANDRA-8460: --- Hi Jon, Thanks for pitching in :) Have you perhaps stumbled upon a case where data was resurrected in a JBOD configuration in your experience...? In theory, since SSTables are split by token range, there should be no more such cases - it is safe. That said, I have not heard of lvmcache, so I'll go and have a look at it. I do agree that this, as it stands, introduces a lot of code branches and complexity, and simple is a feature - which is why I was seeking feedback and becoming wary... This sounds good: readily available, works for every compaction strategy, and doesn't introduce all that complexity. I'll test it.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370760#comment-16370760 ] Jon Haddad commented on CASSANDRA-8460: --- Taking a look at this ticket, I've got a concern, and I'd like to suggest an alternative. Juggling multiple disks has been a bit of a pain so far and still has some weird behavior. We're a little better now that we split by token ranges, but there's still (IIRC) a point in time where the failure of a single disk can resurrect some data which had just been tombstoned. If this is fixed, apologies, but I haven't seen it. I'm not quite sure that adding complexity to this already long-lasting pain point is going to help the project overall. As an alternative, it's already possible to more or less get this behavior in a fashion that works with _every_ compaction strategy. LVM (Linux only) is already ubiquitous. Using lvmcache (backed by dm-cache) already provides the ability to put your cold data on the slower spinning disks and leverage the SSD for fast operations. The benefit here is that you can keep a lot of your hot data on the fast drive and LVM will automatically handle making room for newer files. A second benefit is that you are not exposing yourself to the above-mentioned issues with JBOD.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357936#comment-16357936 ] Lerh Chuan Low commented on CASSANDRA-8460: --- Bump! The branch has my latest updates (I've decided to go with a single CSM: https://github.com/apache/cassandra/compare/trunk...juiceblender:cassandra-8460-single-csm). I'm currently working my way through the unit tests; the weird thing is that when run in isolation they pass, but when run together they fail, as if something isn't being cleaned up properly. So far all the tests should work (as far as I can tell, compared to 3.11) and I still have yet to add some tests for the archiving compaction. There are definitely a lot more things that require checking, and I also haven't gotten round to checking what happens when you turn it off, etc. - just trying to get a lot of the compaction infrastructure to be aware that an archive directory exists; there's existing logic to actually perform the archiving compaction. I've also yet to test whether it's able to pick up compactions in the archive directory when there legitimately exist compactions to be done in that directory. As before, comments are welcome on whether this is going down the right path or whether there's a better way to do it.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346255#comment-16346255 ] Lerh Chuan Low commented on CASSANDRA-8460: --- I've tentatively started work on this, and it's turning out to be a relatively bigger code change than I was originally expecting, so I would really love to get some feedback from community members who know more (and a review of my initial patches). {{CompactionAwareWriter}}, {{DiskBoundaryManager}}, {{Directories}} and {{CompactionStrategyManager}} need to know about archives. I've gone ahead and created a new enumeration {{DirectoryType}} that can be either {{ARCHIVE}} or {{STANDARD}}. {{CompactionAwareWriter}} always calls {{maybeSwitchWriter(DecoratedKey)}} before calling {{realAppend}}. This is to handle the JBOD case: {{maybeSwitchWriter}} helps the writer write to the right location depending on the key, to make sure keys do not overlap across directories. So it needs to know which {{diskBoundaries}} it is actually using, so as not to get into the situation where it can't differentiate between an archive disk and an actual JBOD disk. It would be wise to re-use the logic in {{diskBoundaries}} to also handle the case where the archive directory has been configured as JBOD, so {{DiskBoundaryManager}} now also needs to know about archive directories. When it tries to {{getWriteableLocations}} or generate disk boundaries, it should be able to differentiate between archive and non-archive. The same goes for {{CompactionStrategyManager}}. We still need to be able to run separate compaction strategy instances in the archive directory to handle the case of repairs and streaming (so archived SSTables don't just accumulate indefinitely). Here's where I am not sure which way to proceed. Option 1: Have it so that {{ColumnFamilyStore}} still maintains only one CSM, one DBM and one {{Directories}}.
CSM, DBM and {{Directories}} all start knowing about the existence of an archive directory; this can either be an extra field, or an EnumMap:
{code}
Map<Directories.DirectoryType, DiskBoundaries> boundaries =
    new EnumMap<>(Directories.DirectoryType.class);
boundaries.put(Directories.DirectoryType.STANDARD,
               cfs.getDiskBoundaries(Directories.DirectoryType.STANDARD));
boundaries.put(Directories.DirectoryType.ARCHIVE,
               cfs.getDiskBoundaries(Directories.DirectoryType.ARCHIVE));
{code}
The worry here for me is that some things may subtly break even as I fix up everything else that gets logged as errors... The CSM's own internal fields of {{repaired}}, {{unrepaired}} and {{pendingRepaired}} will also need to become maps, otherwise the individual instances will again become confused, being unable to differentiate between an actual JBOD disk and an archive disk. Some of the APIs, e.g. reload, shutdown, enable etc., will all need some smarts about which directory type is needed (in some cases it won't matter). Every consumer of these APIs will also need to be updated. Here's how it looks in an initial attempt: https://github.com/apache/cassandra/compare/trunk...juiceblender:cassandra-8460?expand=1 Option 2: Have it so that {{ColumnFamilyStore}} keeps 2 CSMs and 2 DBMs, of which the archiving equivalents are {{null}} if not applicable/not loaded. In this case there's a reasonable level of confidence that each CSM and DBM will just 'do the right thing', regardless of whether it's an archive or not. On the other hand, every call that gets the DBM or CSM (and there are a lot that get the CSM) will need to be evaluated and checked.
Here's how it looks in an initial attempt: https://github.com/apache/cassandra/compare/trunk...juiceblender:cassandra-8460-single-csm?expand=1 Both still have work left on them (Scrubber, relocating SSTables, what happens when archiving is turned off, etc.), but before I continue down this track I'm wondering if anyone can point out which way is better, or whether this is all misguided. In the event these are the changes that need to happen (I can't seem to find a way for just TWCS to be aware that there's an archive directory; CFS needs to know as well), is this still worth the complexity introduced? [~pavel.trukhanov] Re "Why can't we simply allow a CS instance to spread across two disks - SSD and corresponding archival HDD": I think in this case you're back in the situation where you can have data resurrected. You can have other replicas compact away tombstones (because the CS can see both directories) and then have your last remaining replica's SSD, with the tombstone still on it, get corrupted before it manages to do the same. Upon replacing the SSD with a new one and issuing repair, the deleted data is resurrected. Of course, this can be mitigated by making it clear to operators that every time there's a corrupt disk, every single disk needs to be replaced. Even if we did so, there would still be large code changes to make the CSM and DBM able to differentiate between directory types.
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336502#comment-16336502 ] Lerh Chuan Low commented on CASSANDRA-8460: --- [~pavel.trukhanov] That's a really good question. I can't think of any reason why, other than it being a relic of my thoughts from JBOD/making sure unrepaired/repaired/pending-repair SSTables stay on different disks... so that if the user wanted to replace just the cold archive disk they could do so. Though I'm not sure having a separate CS actually allows that. Hmm... I guess it may become clearer to me as I dive into the code, but thank you for pointing it out :)
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335527#comment-16335527 ] Pavel Trukhanov commented on CASSANDRA-8460: Why can't we simply allow a CS instance to spread across two disks - SSD and corresponding archival HDD - so it will see all the data for any particular vnode at once and won't falsely resurrect something?
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333894#comment-16333894 ] Lerh Chuan Low commented on CASSANDRA-8460: --- Thinking about this further, it looks like this will be (reasonably) complex. The main issue is that by introducing an archival directory, we now have multiple data directories, which is like a JBOD setup. https://issues.apache.org/jira/browse/CASSANDRA-6696 (Partition SSTables by token range) seeks to prevent resurrected data - the scenario where data can be resurrected is described here: https://www.datastax.com/dev/blog/improving-jbod. However, with an archiving directory, we can no longer guarantee that a single token range (or vnode) will live in one directory (unless I'm missing something - archiving is based on SSTable age; it doesn't know anything about tokens). At a high level, the situation goes like this: 1. You have an SSD and an HDD. 2. Key x is written to the SSD. 3. After some time, x passes the archive age and ends up on the HDD. 4. For some reason not quite clear, the user decides to write a tombstone for x (they shouldn't, for TWCS). So we now have tomb(x) on the SSD. At this point, keep in mind that there are three separate {{CompactionStrategy}} (CS) instances running on each of the SSD and the HDD, managing repaired, unrepaired and pending-repair SSTables respectively - so 3 on the SSD and 3 on the HDD. These CS instances cannot see each other's candidates; when considering candidates for compaction, they see only the SSTables in their own directories. 5. gc_grace_seconds passes and tomb(x) is compacted away without ever seeing x on the HDD. So now x is resurrected. In an actual JBOD setup this can't happen, because a single token range or vnode can only live in one directory. This can't be guaranteed with an archiving setup. We can solve this issue by introducing a new flag.
This flag will make it so that a tombstone is only dropped if it lives in the archiving directory. Enforcing {{gc_grace > archive_days}} is not sufficient because the node can always be taken offline, compactions disabled, or similar. Consider the case where: 6. The SSD is corrupted and needs to be replaced. In this case the fix would be to replace the entire node, not just the SSD - partly to prevent tombstone resurrection, and partly because the system tables are gone (system tables live on the SSD), so a full replace is needed. This is the high level design we came up with:
* In the typical TTL use case, the TTL should always be greater than the archive age.
* Introduce a new YAML setting, possibly called cold_data_directories. The name signals that 'archive' doesn't mean we can just forget the data there; compactions still need to happen in that directory, for joining nodes, streaming nodes, and keeping disk usage low.
* An option on TWCS to specify that the cold directory should be used after a certain number of days.
* A new flag to handle the situation described above - tombstones cannot be dropped unless they are in the cold directory. This also implies that we can't drop data using tombstones on the non-archived data, which pretty much means manual deletions on the table can't be used; this should only be used when TTLing everything, writing once, and with read repair turned off.
* A separate compaction throughput and concurrent compactors setting for the cold directory.
Caveats with changes to flags/properties:
* Removing the cold flag from the yaml means we've lost the data in those directories.
* Removing the cold flag from the table only means data will no longer be archived to cold. Existing SSTables in the cold directory should still be loaded; however, if compacted, they move back to hot storage.
* Reducing the archive time on the table will just cause more data to be moved to the cold directory.
* Increasing the archive time means existing data that should no longer be archived could go back to the live set if compacted; otherwise it will stay in the cold directory with no negative impact.
* When promoting data to the cold directory, we need to check that there is no overlapping SSTable with a max timestamp greater than the candidate's minimum timestamp, same as TWCS expiry.
There will still be significant I/O when it comes to compacting/repairing/streaming the SSTables in the cold directory, and it adds reasonable complexity to the code base. It's not trivial to reason about either - it took my colleagues and me 3 hours. The only leftover question we had: when the table-level property is changed, does Cassandra need to be restarted for it to take effect, or is there a hook/property that is checked constantly? Did anybody notice anything we missed, or have any thoughts so far on the feature itself and whether the value it adds justifies the complexity introduced? Feedback before we go ahead with it will be really appreciated! [~krummas] [~bdeggleston] [~jjirsa] [~stone]
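The tombstone rule proposed above could be sketched roughly as follows. This is an illustrative sketch only - the class, method and parameter names are made up and this is not Cassandra's actual purge logic:

```java
// Sketch of the proposed flag: a tombstone may only be purged once its
// SSTable lives in a cold (archive) data directory AND gc_grace has elapsed.
// All names here are hypothetical, for illustration only.
public class ColdTombstonePolicy {
    private final boolean restrictDropsToColdDir;

    public ColdTombstonePolicy(boolean restrictDropsToColdDir) {
        this.restrictDropsToColdDir = restrictDropsToColdDir;
    }

    /**
     * @param inColdDirectory      whether the tombstone's sstable is in a cold data directory
     * @param localDeletionSeconds when the tombstone was written (epoch seconds)
     * @param gcGraceSeconds       the table's gc_grace_seconds
     * @param nowSeconds           current time (epoch seconds)
     */
    public boolean canDropTombstone(boolean inColdDirectory,
                                    long localDeletionSeconds,
                                    long gcGraceSeconds,
                                    long nowSeconds) {
        boolean pastGcGrace = nowSeconds - localDeletionSeconds > gcGraceSeconds;
        if (!pastGcGrace)
            return false;
        // With the flag on, a tombstone on the hot tier is never purged, so it
        // survives long enough to be archived next to the data it shadows.
        return !restrictDropsToColdDir || inColdDirectory;
    }
}
```

With the flag off, the rule degenerates to the usual gc_grace check, which is what makes it safe to introduce as an opt-in table option.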
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331512#comment-16331512 ] Jeff Jirsa commented on CASSANDRA-8460: --- I no longer have a personal need for it, and it's not in my queue of things I plan on working on in the next 2 years. By all means, feel free to start with some of my code, but I haven't thought about specifics for quite some time. > Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson >Priority: Major > Labels: doc-impacting, dtcs > Fix For: 4.x > > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331469#comment-16331469 ] Lerh Chuan Low commented on CASSANDRA-8460: --- Also just bumping this, wondering if you still have plans for it, [~jjirsa] or [~bdeggleston]? Looks like with the patch you had previously (https://github.com/jeffjirsa/cassandra/commit/cc0ab8f733eef63ed0eaea30cc6f471b467c3ec5#diff-f628011a74763c0d0abc369bc8f5762bR126) most of the code changes are still applicable. I am willing to give it a go. It sounds like we may still be uncertain about how to implement this. My original thinking aligns with Jeff's, where the archive directories also keep an instance of {{XCompactionStrategy}} running for each of the repaired, unrepaired and pending-repair sets. The archived data will still have to be read and used eventually when doing repairs, or when streaming while adding a new node...so it increasingly looks like it will not be ideal to put it into the archiving directory and just never touch it again, though I'm happy to implement it however people think is better, because there may be things that are not obvious to me. Flushing won't be aware that an archiving directory exists in this case...and will keep flushing to the actual {{data_directories}}. Eventually compaction will pick the SSTables up and move them into {{archive_data_directories}}, if applicable. [~stone] does raise an interesting point though about decoupling this from the compaction strategy and using a periodic background task that archives SSTables. I'm guessing in this case you would archive based on...SSTable metadata min/max timestamp? Or just the last-modified time of the SSTable files? It would be a YAML property, and if an SSTable's max timestamp is more than X days old, archive the SSTable? 
> Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson >Priority: Major > Labels: doc-impacting, dtcs > Fix For: 4.x > > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted.
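For the background-task alternative mentioned above, the age check could be based on the SSTable's max timestamp. A minimal sketch, assuming timestamps in microseconds (Cassandra's default cell timestamp resolution); the class and method names are hypothetical:

```java
// Sketch of the age check a periodic archiving task might perform, using
// sstable metadata max timestamp rather than compaction-strategy state.
// Hypothetical helper, not Cassandra code.
import java.util.concurrent.TimeUnit;

public class ArchiveCheck {
    /** True if every cell in the sstable is older than archiveDays. */
    public static boolean shouldArchive(long maxTimestampMicros, int archiveDays, long nowMicros) {
        long cutoff = nowMicros - TimeUnit.DAYS.toMicros(archiveDays);
        return maxTimestampMicros < cutoff;
    }
}
```

Using the metadata max timestamp rather than file mtime avoids misclassifying sstables whose files were rewritten (e.g. by compaction) long after their data was written.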
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15853810#comment-15853810 ] Pavel Trukhanov commented on CASSANDRA-8460: Any plans on that one? And any thoughts with regards to TWCS? [~bdeggleston] ? > Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Labels: doc-impacting, dtcs > Fix For: 3.x > > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted.
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362257#comment-15362257 ] stone commented on CASSANDRA-8460: -- A simple implementation: https://github.com/FS1360472174/cassandra/commit/a6b16962b6777c64d813e9d4420ac7b175efe007 > Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Labels: doc-impacting, dtcs > Fix For: 3.x > > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted.
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354399#comment-15354399 ] stone commented on CASSANDRA-8460: -- There are several questions about this issue:
1. From the application perspective, we rarely use this archived data, but when scaling the cluster up - adding a node or decommissioning a node - we will stream data between nodes. Since these archived sstables are still in the token ring, how do we deal with them? We need to access them, and it may take a long time to finish bootstrap when the archived data is too large.
2. Why not separate "archive sstable" from the compaction strategy? Archiving sstables is not a real-time task; we just need to execute the task periodically. I mean there is high coupling between compaction and archiving data. We could provide an sstable tool to archive data. Splitting sstables by date is the job of the compaction strategy; we don't care whether it is DTCS or TWCS.
3. In ArchivingDateTieredCompactionWriter.java we archive sstables with SSTableWriter. I wondered: why not use a symlink - move the sstable file, and create a symlink? Actually I'm not clear on how the sstable files are moved by the method SSTableWriter.switchWriter(). I just saw that Cassandra backs up data with hardlinks, so we could use symlinks to archive data. > Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Labels: doc-impacting, dtcs > Fix For: 3.x > > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted.
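The symlink idea in point 3 could look roughly like this with java.nio: move the component file to the archive directory and leave a symlink behind so the original path still resolves. A hypothetical utility sketch, not how SSTableWriter.switchWriter() actually works:

```java
// Sketch of symlink-based archiving: relocate a file to the archive
// directory, then create a symlink at its original location so readers
// that follow the hot-tier path are transparently redirected.
// Hypothetical utility; not Cassandra code.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SymlinkArchiver {
    /** Moves the file into archiveDir and returns its new location. */
    public static Path archive(Path sstableComponent, Path archiveDir) throws IOException {
        Path target = archiveDir.resolve(sstableComponent.getFileName());
        Files.move(sstableComponent, target);
        Files.createSymbolicLink(sstableComponent, target);
        return target;
    }
}
```

One caveat with this approach: tools that operate on the data directory (backups, snapshots, disk-usage accounting) would need to be symlink-aware, which is one reason rewriting via the compaction writer may be preferred.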
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972816#comment-14972816 ] Jeff Jirsa commented on CASSANDRA-8460: --- I should probably cancel patch-available. Will need significant rebase due to CASSANDRA-8671, and max_sstable_age_days probably isn't the right tuning knob to use assuming CASSANDRA-10280 makes it in. [~bdeggleston] - if there's a better way to implement since 8671, and you want to chat about how you'd like to see this implemented in IRC or email, I'll happily re-implement. > Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson >Assignee: Jeff Jirsa > Labels: dtcs > Fix For: 3.x > > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted.
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645490#comment-14645490 ] Jeff Jirsa commented on CASSANDRA-8460: --- Pushed to https://github.com/jeffjirsa/cassandra/tree/cassandra-8460-2.2
1) Removed the time component of archive cutoff (no more archive_sstable_age_days), and refactored it to archive at max_sstable_age_days to match your original intent rather than projecting my own intentions
2) Reworked to explicitly shortcut and return if no archive disk is present
3) Created unit test (and fixed unit tests that this patch broke, primarily in DirectoriesTest)
[~krummas] - Can you review at your convenience? > Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson >Assignee: Jeff Jirsa > Labels: dtcs > Fix For: 3.x > > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted.
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598730#comment-14598730 ] Jeff Jirsa commented on CASSANDRA-8460: --- Thanks for the feedback, [~krummas]! {quote} Can't we check before starting an archive compaction if there are any archive locations available? If there are none, we shouldn't compact, right? {quote} Yea. There's a few cases here, and I suppose that answer works for all of them:
- CF compaction strategy specifies archive tier, but no disk is configured on the node
- CF compaction strategy specifies archive tier, but there's no free space
- If we were to allow max_sstable_age_days > archive_sstables_age_days, there could be a use case where 2 sstables on archive storage would be eligible for compaction, but there may not be room for them to be combined. If we don't allow this, then the potential edge case goes away.
{quote} I guess it could be a problem if users increase max_sstable_age_days and we move the data back to the fast disks though, thoughts? {quote} Is that a problem? If the user wants to tune the parameter, we should support it. {quote} As in 2), I think we should never compact the sstables on the slow disks. {quote} I'll write it however you want it, but my assumption was that if the {{max_sstable_age_days}} parameter is set and greater than {{archive_sstables_age_days}}, we would still compact - it's just obviously slower. In my mind, it's a cost/performance tradeoff for operators - the slow disk may not be SUPER slow, it may just be 10k iops instead of 20k iops, so compaction may be OK, just not the best for the hottest data. If you're adamant about not allowing compaction on the archive tier, I'll add a check so that {{max_sstable_age_days}} cannot be set higher than {{archive_sstables_age_days}}. {quote} you should probably check absolute paths and use startsWith? {quote} Noted, I like that way better. Thanks again. 
I'll work on finishing this up and adding some tests. > Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson >Assignee: Jeff Jirsa > Labels: dtcs > Fix For: 3.x > > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted.
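The guard mentioned above - refusing configurations where {{max_sstable_age_days}} exceeds {{archive_sstables_age_days}}, so archived sstables are never compaction candidates - might look like this sketch. All names are hypothetical; the real patch may validate elsewhere:

```java
// Sketch of the option check: reject configurations where sstables would
// still be eligible for compaction after being moved to the archive tier.
// Hypothetical names, for illustration only.
public class ArchiveOptionsValidator {
    public static void validate(double maxSSTableAgeDays, double archiveSSTableAgeDays) {
        if (maxSSTableAgeDays > archiveSSTableAgeDays)
            throw new IllegalArgumentException(
                "max_sstable_age_days (" + maxSSTableAgeDays + ") must not exceed " +
                "archive_sstables_age_days (" + archiveSSTableAgeDays + "); " +
                "otherwise archived sstables would still be compaction candidates");
    }
}
```

Failing fast at ALTER TABLE time is cheaper than discovering mid-compaction that the archive tier would have to absorb compaction I/O it was meant to avoid.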
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597381#comment-14597381 ] Marcus Eriksson commented on CASSANDRA-8460: bq. 1) If compaction strategy calls for archive, but no archive disk is available (not defined or otherwise full), I'm falling back to standard disk. Agree? Can't we check before starting an archive compaction if there are any archive locations available? If there are none, we shouldn't compact, right? bq. 2) I originally planned to explicitly prohibit compaction of N files in archival disk, but I couldn't convince myself if that made sense. Instead, I'm allowing it if sstable_max_age_days allows it (if you set archive lower than max age, you could conceivably compact on archival disk tier). Agree? The way I originally envisioned this was that once an sstable hits max_sstable_age_days, we trigger a compaction that puts it on the slow disk, and then we never need to look at those sstables again (unless they eventually expire due to TTL). The idea behind max_sstable_age_days is that this is the point where we don't expect to do many reads anymore, so it would also be a good point to put them on slow disks. I guess it could be a problem if users increase max_sstable_age_days and we move the data back to the fast disks though, thoughts? bq. 3) In the case where archived sstables can still be compacted, it's possible in some windows to have them compacted with sstables on the faster standard disk. In those cases, I'm making a judgement call that if any of the source sstables were archived, the resulting sstable will also be archived. Agree? As in 2), I think we should never compact the sstables on the slow disks. bq. 4) Finally, I was trying to determine the right way to tell if an sstable was already archived. 
The logic I eventually used was simply parsing the path of the sstable and seeing if it was in the array of archive directories ( https://github.com/jeffjirsa/cassandra/commit/079b22136d178937b28b82326f132e33e96f6cad#diff-894e091348f28001de5b7fe88e65733fR1665 ) . I'm not convinced this is best, but I didn't know if it was appropriate to extend sstablemetadata or similar to avoid this. Thoughts? We do something similar in Directories.java: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/Directories.java#L242 - you should probably check absolute paths and use startsWith? > Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson >Assignee: Jeff Jirsa > Labels: dtcs > Fix For: 3.x > > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted.
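The suggestion above - compare absolute paths with startsWith rather than parsing strings - could be sketched like this. Note that java.nio's Path.startsWith compares whole path components, so /data10 is not treated as being under /data1, which is a pitfall of plain String.startsWith. The helper name is hypothetical:

```java
// Sketch of an "is this sstable on the archive tier?" check using
// component-wise path comparison. Hypothetical helper, not the patch code.
import java.io.File;
import java.nio.file.Path;
import java.util.List;

public class ArchiveLocation {
    public static boolean isArchived(File sstableDir, List<File> archiveDirs) {
        Path p = sstableDir.getAbsoluteFile().toPath();
        for (File dir : archiveDirs)
            if (p.startsWith(dir.getAbsoluteFile().toPath()))
                return true;
        return false;
    }
}
```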
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589084#comment-14589084 ] Jeff Jirsa commented on CASSANDRA-8460: --- Pushed a version, which I believe works as described. Would appreciate some feedback, and then if it looks promising, I'll finish it up with adding unit tests. https://github.com/jeffjirsa/cassandra/commit/079b22136d178937b28b82326f132e33e96f6cad A few explicit questions for [~krummas] and [~Bj0rn] : 1) If compaction strategy calls for archive, but no archive disk is available (not defined or otherwise full), I'm falling back to standard disk. Agree? https://github.com/jeffjirsa/cassandra/commit/079b22136d178937b28b82326f132e33e96f6cad#diff-2c2b50ecd5e8515531c5d041117c9b4fR371 2) I originally planned to explicitly prohibit compaction of N files in archival disk, but I couldn't convince myself if that made sense. Instead, I'm allowing it if sstable_max_age_days allows it (if you set archive lower than max age, you could conceivably compact on archival disk tier). Agree? 3) In the case where archived sstables can still be compacted, it's possible in some windows to have them compacted with sstables on the faster standard disk. In those cases, I'm making a judgement call that if any of the source sstables were archived, the resulting sstable will also be archived. Agree? https://github.com/jeffjirsa/cassandra/commit/079b22136d178937b28b82326f132e33e96f6cad#diff-7a9ada329d886c1871344b1d6fceec5cR56 4) Finally, I was trying to determine the right way to tell if an sstable was already archived. The logic I eventually used was simply parsing the path of the sstable and seeing if it was in the array of archive directories ( https://github.com/jeffjirsa/cassandra/commit/079b22136d178937b28b82326f132e33e96f6cad#diff-894e091348f28001de5b7fe88e65733fR1665 ) . 
I'm not convinced this is best, but I didn't know if it was appropriate to extend sstablemetadata or similar to avoid this. Thoughts? > Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson >Assignee: Jeff Jirsa > Labels: dtcs > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted.
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583935#comment-14583935 ] Marcus Eriksson commented on CASSANDRA-8460: bq. So my initial approach was to define a second config item, separate from data_file_directories yeah, let's keep it simple for now - add a new config variable like you suggest > Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Labels: dtcs > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted.
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583809#comment-14583809 ] Jeff Jirsa commented on CASSANDRA-8460: --- {quote}yes, I've been thinking maybe adding priorities or tags to the data directories, but that is probably not needed now. Adding a flag to each data_directory that states whether it is for archival storage or not is probably enough for now.{quote} Asking for clarification to make sure I don't go too far into pony land: So my initial approach was to define a second config item, separate from {{data_file_directories}} entirely, so that no other code needed to be aware of it except for classes explicitly wanting to use `archive` tier storage ( {{dd.getAllDataFileLocations()}} would not return the archive tier, but rather add a {{dd.getArchiveDataFileLocations()}} specifically for the slow class of storage). It sounds from your description like you're envisioning changing the list of data_file_locations to a list of maps {noformat} [tag1:location1,tag1:location2,tag3:location3] {noformat} or {noformat} tag1:[location1,location2],tag3:[location3] {noformat} In this case, we'd also need to maintain backwards compatibility, which seems fairly straightforward to do (check whether the provided {{data_files_directory}} is an old-format list rather than a map and apply some default tag?) The first approach is clean and isolated, unlikely to introduce surprises, but potentially limits us from being able to do more interesting work with tagged data file directories later (i.e. only store data for KS W in data directories tagged X, and KS Y in data directories tagged Z). Can you clarify which best fits your expectations? 
> Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Labels: dtcs > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted.
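The backwards-compatibility question discussed above - accepting both the old plain list and a tagged form - could be handled by a parse step like this sketch. The "tag:path" syntax and all names here are invented for illustration; the actual yaml shape was still under discussion:

```java
// Sketch: parse data directory entries that are either plain paths
// (old format) or "tag:/path" entries, grouping paths by tag and giving
// untagged entries a default tag. Hypothetical format, not cassandra.yaml.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DirectoryConfig {
    public static Map<String, List<String>> parse(List<String> entries) {
        Map<String, List<String>> byTag = new LinkedHashMap<>();
        for (String e : entries) {
            int i = e.indexOf(':');
            // An entry starting with '/' is an old-format absolute path,
            // even if it happens to contain a ':' later in the string.
            boolean tagged = i > 0 && !e.startsWith("/");
            String tag = tagged ? e.substring(0, i) : "default";
            String path = tagged ? e.substring(i + 1) : e;
            byTag.computeIfAbsent(tag, t -> new ArrayList<>()).add(path);
        }
        return byTag;
    }
}
```

With this shape, old configs parse to a single "default" tier, so existing nodes upgrade without yaml changes.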
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547604#comment-14547604 ] Marcus Eriksson commented on CASSANDRA-8460: bq. 1) Create a new notion of tiered storage configurable per node in yaml yes, I've been thinking maybe adding priorities or tags to the data directories, but that is probably not needed now. Adding a flag to each data_directory that states whether it is for archival storage or not is probably enough for now. bq. 2) Allow compaction strategies access to the various tiers with CASSANDRA-8671 yes, but CASSANDRA-8671 is mostly to give compaction strategies more control over flushing and streaming locations - with the CompactionAwareWriter interface added in 2.2 I think we get most of what we need for this ticket. bq. 3) Extend DTCS to take advantage of CASSANDRA-8671 + slow tier from step 1 as a compaction option sounds good to me > Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted.
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546975#comment-14546975 ] Jeff Jirsa commented on CASSANDRA-8460: --- [~krummas] Does it make sense to address this in a few parts? 1) Create a new notion of tiered storage configurable per node in yaml (either one default tier for hot data {{data_file_directories}} and one tier for cold data {{archive_file_directories}}, or some form of arbitrary named tiers? ) 2) Allow compaction strategies access to the various tiers with CASSANDRA-8671 ( tagging [~bdeggleston] for visibility ) 3) Extend DTCS to take advantage of CASSANDRA-8671 + slow tier from step 1 as a compaction option such as {{WITH compaction = {'class': 'DateTieredCompactionStrategy', 'timestamp_resolution':'', 'base_time_seconds':'3600', 'max_sstable_age_days':'7', 'max_sstable_age_disk_tier':'archive' }; }} ? > Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted.
[jira] [Commented] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389447#comment-14389447 ] Jim Plush commented on CASSANDRA-8460: -- We also have this use case... for the PB+ size clusters where 90% of the data is cold storage and rarely used, it would be nice to have some cheap spinning disks that could hold the data. Read latencies would be less of a concern due to the infrequency of reads. > Make it possible to move non-compacting sstables to slow/big storage in DTCS > > > Key: CASSANDRA-8460 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8460 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > > It would be nice if we could configure DTCS to have a set of extra data > directories where we move the sstables once they are older than > max_sstable_age_days. > This would enable users to have a quick, small SSD for hot, new data, and big > spinning disks for data that is rarely read and never compacted.