After posting this, Jon Haddad pinged me on chat and said (I'm paraphrasing):
Actually, this company I work with a lot burns through SSDs so fast it's
absurd; their write amp is gigantic.

That is a very good point, but it isn't what I would call typical, and a lot
is going to depend on the drive manufacturer and workload. In general, this
isn't an epidemic, which is what I was trying to emphasize. Keep spares
around; all drives fail, whether it's due to wear-out or some other factor.
If your cost of NAND/GB/time is too high, consider moving to a
higher-endurance drive to replace your next round of failed units.

Matt Kennedy
Partner Architect | +1.703.582.5017 | matt.kenn...@datastax.com

On Thu, Mar 10, 2016 at 10:48 AM, Matt Kennedy <mkenn...@datastax.com> wrote:

> TL;DR - Cassandra actually causes a ton of write amplification, but it
> doesn't freaking matter any more. Read on for details...
>
> That slide deck does have a lot of very good information in it, but
> unfortunately I think it has led to a fundamental misunderstanding about
> Cassandra and write amplification. In particular, slide 51 vastly
> oversimplifies the situation.
>
> The Wikipedia definition of write amplification looks at this from the
> perspective of the SSD controller:
> https://en.wikipedia.org/wiki/Write_amplification#Calculating_the_value
>
> In short: write amplification = data written to flash / data written by
> the host.
>
> So, if I write 1MB in my application, but the SSD has to write my 1MB
> plus rearrange another 1MB of data in order to make room for it, then
> I've written a total of 2MB and my write amplification is 2x.
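The 1MB example above can be put into a couple of lines of code. A minimal sketch of the controller-level formula, using the hypothetical numbers from the post rather than measurements from a real drive:

```python
# Controller-level write amplification, as defined above:
# WA = data written to flash / data written by the host.

def write_amplification(bytes_written_to_flash: int,
                        bytes_written_by_host: int) -> float:
    return bytes_written_to_flash / bytes_written_by_host

MB = 1024 * 1024

# The host writes 1MB; the controller writes that 1MB plus rearranges
# another 1MB to make room for it, so 2MB hit the flash in total.
wa = write_amplification(bytes_written_to_flash=2 * MB,
                         bytes_written_by_host=1 * MB)
print(wa)  # 2.0
```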
> In other words, it is measuring how much extra the SSD controller has to
> write in order to do its own housekeeping.
>
> However, the Wikipedia definition is a bit more constrained than how the
> term is used in the storage industry. The whole point of looking at write
> amplification is to understand the impact that a particular workload is
> going to have on the underlying NAND by virtue of the data written. So a
> definition of write amplification that is a little more relevant to the
> context of Cassandra is this:
>
> write amplification = data written to flash / data written to the database
>
> So, while the fact that we only sequentially write large immutable
> SSTables does in fact mean that controller-level write amplification is
> near zero, compaction comes along and completely destroys that tidy
> little story. Think about it: every time a compaction re-writes data that
> has already been written, we are creating a lot of application-level
> write amplification. The compaction strategy and the workload itself
> determine the real application-level write amp, but generally speaking
> LCS is the worst, STCS is next, and DTCS causes the least. To measure
> this, you can usually use smartctl (the mechanism may differ depending on
> the SSD manufacturer) to get the physical bytes written to your SSDs, and
> divide that by the data that you've actually logically written to
> Cassandra. I've measured (more than two years ago) LCS write amp as high
> as 50x on some workloads, which is significantly higher than the typical
> controller-level write amp on a b-tree style update-in-place data store.
> Also note that the new storage engine in general reduces a lot of
> inefficiency in Cassandra, thereby reducing the impact of write amp due
> to compactions.
>
> However, if you're a person who understands SSDs, at this point you're
> wondering why we aren't burning out SSDs right and left.
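The smartctl-based measurement described above boils down to one division. A minimal sketch, assuming the drive exposes a `Total_LBAs_Written` SMART attribute counted in 512-byte sectors (attribute names and units vary by manufacturer, so check your vendor's documentation) and that the logical-write figure comes from your own Cassandra-side metrics:

```python
# Application-level write amplification, as defined above:
# WA = data written to flash / data written to the database.

SECTOR_BYTES = 512  # assumption: Total_LBAs_Written uses 512-byte sectors


def app_level_write_amp(total_lbas_written: int,
                        bytes_written_to_cassandra: int,
                        sector_bytes: int = SECTOR_BYTES) -> float:
    # Physical bytes the drive reports having written to flash,
    # e.g. read off the smartctl attribute table.
    physical_bytes = total_lbas_written * sector_bytes
    return physical_bytes / bytes_written_to_cassandra


# Hypothetical example: the drive reports ~10 TB physically written
# while only 1 TB was logically written to Cassandra.
TB = 1024 ** 4
print(app_level_write_amp(total_lbas_written=10 * TB // 512,
                          bytes_written_to_cassandra=1 * TB))  # 10.0
```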
> The reality is that general SSD endurance has gotten so good that all
> this write amp isn't really a problem any more. If you're curious to read
> more about that, I recommend you start here:
>
> http://hothardware.com/news/google-data-center-ssd-research-report-offers-surprising-results-slc-not-more-reliable-than-mlc-flash
>
> and the paper that article mentions:
>
> http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/23105-fast16-papers-schroeder.pdf
>
> Hope this helps.
>
> Matt Kennedy
>
> On Thu, Mar 10, 2016 at 7:05 AM, Paulo Motta <pauloricard...@gmail.com>
> wrote:
>
>> This is a good source on Cassandra + write amplification:
>> http://www.slideshare.net/rbranson/cassandra-and-solid-state-drives
>>
>> 2016-03-10 9:57 GMT-03:00 Benjamin Lerer <benjamin.le...@datastax.com>:
>>
>>> Cassandra should not cause any write amplification. Write amplification
>>> happens only when you update data on SSDs. Cassandra does not update
>>> any data in place. Data can be rewritten during compaction, but it is
>>> never updated.
>>>
>>> Benjamin
>>>
>>> On Thu, Mar 10, 2016 at 12:42 PM, Alain RODRIGUEZ <arodr...@gmail.com>
>>> wrote:
>>>
>>> > Hi Dikang,
>>> >
>>> > I am not sure what you call "amplification", but since sizes highly
>>> > depend on the structure, I would give it a try using CCM
>>> > (https://github.com/pcmanus/ccm) or some test cluster with
>>> > 'production-like' settings and schema. You can write a row, flush it,
>>> > and see how big the data is cluster-wide / per node.
>>> >
>>> > Hope this will be of some help.
>>> >
>>> > C*heers,
>>> > -----------------------
>>> > Alain Rodriguez - al...@thelastpickle.com
>>> > France
>>> >
>>> > The Last Pickle - Apache Cassandra Consulting
>>> > http://www.thelastpickle.com
>>> >
>>> > 2016-03-10 7:18 GMT+01:00 Dikang Gu <dikan...@gmail.com>:
>>> >
>>> > > Hello there,
>>> > >
>>> > > I'm wondering: is there a good way to measure the write
>>> > > amplification of Cassandra?
>>> > >
>>> > > I'm thinking it could be calculated by (number of bytes written to
>>> > > the disk)/(size of mutations written to the node).
>>> > >
>>> > > Do we already have a metric for the "size of mutations written to
>>> > > the node"? I did not find it in the JMX metrics.
>>> > >
>>> > > Thanks
>>> > >
>>> > > --
>>> > > Dikang