After posting this, Jon Haddad pinged me on chat and said (I'm
paraphrasing):

Actually, this company I work with a lot burns through SSDs so fast it's
absurd; their write amp is gigantic.

This is a very good point; however, it isn't what I would call typical, and
a lot depends on the drive manufacturer and workload. In general, though,
this isn't an epidemic, which is what I was trying to emphasize.

Keep spares around; all drives fail, whether it's due to wear-out or some
other factor. If your cost of NAND per GB over time is too high, consider
moving to a higher-endurance drive when you replace your next round of
failed units.

Matt Kennedy

Partner Architect | +1.703.582.5017 | matt.kenn...@datastax.com

On Thu, Mar 10, 2016 at 10:48 AM, Matt Kennedy <mkenn...@datastax.com>
wrote:

> TL;DR - Cassandra actually causes a ton of write amplification but it
> doesn't freaking matter any more. Read on for details...
>
> That slide deck does have a lot of very good information on it, but
> unfortunately I think it has led to a fundamental misunderstanding about
> Cassandra and write amplification. In particular, slide 51 vastly
> oversimplifies the situation.
>
> The Wikipedia definition of write amplification looks at this from the
> perspective of the SSD controller:
> https://en.wikipedia.org/wiki/Write_amplification#Calculating_the_value
>
> In short: write amplification = (data written to flash) / (data written by
> the host)
>
> So, if I write 1MB in my application, but the SSD has to write my 1MB plus
> rearrange another 1MB of existing data to make room for it, then the drive
> has written a total of 2MB and my write amplification is 2x.
>
> In other words, it is measuring how much extra the SSD controller has to
> write in order to do its own housekeeping.
>
> However, the Wikipedia definition is a bit more constrained than how the
> term is used in the storage industry. The whole point of looking at write
> amplification is to understand the impact that a particular workload is
> going to have on the underlying NAND by virtue of the data written. So a
> definition of write amplification that is a little more relevant in the
> context of Cassandra is this:
>
> write amplification = (data written to flash) / (data written to the
> database)
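>
> To make the distinction between the two definitions concrete, here is a
> toy calculation (a minimal sketch in Python; all numbers are hypothetical):
>
> app_writes = 1.0                    # GB written by the application
> host_writes = app_writes * 5        # after internal re-writes (hypothetical)
> flash_writes = host_writes * 1.1    # controller housekeeping (hypothetical)
>
> controller_wa = flash_writes / host_writes  # SSD-level definition: 1.1x
> app_level_wa = flash_writes / app_writes    # database-level definition: 5.5x
> print(controller_wa, app_level_wa)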
>
> So, while the fact that we only sequentially write large immutable
> SSTables does in fact mean that controller-level write amplification is
> near zero, compaction comes along and completely destroys that tidy little
> story. Think about it: every time a compaction re-writes data that has
> already been written, we are creating a lot of application-level write
> amplification. Different compaction strategies and the workload itself
> determine the real application-level write amp, but generally speaking,
> LCS is the worst, followed by STCS, and DTCS will cause the least
> write-amp. To measure this, you can usually use smartctl (or another
> mechanism, depending on the SSD manufacturer) to get the physical bytes
> written to your SSDs, and divide that by the data you've actually
> logically written to Cassandra. I've measured (more than two years ago)
> LCS write amp as high as 50x on some workloads, which is significantly
> higher than the typical controller-level write amp on a b-tree style
> update-in-place data store. Also note that the new storage engine (in
> Cassandra 3.0) removes a lot of inefficiency in the storage layer,
> reducing the impact of write amp due to compactions.
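>
> For the curious, a minimal sketch of that measurement in Python (the SMART
> attribute name, its 512-byte unit, and the device path all vary by vendor,
> so treat these as hypothetical):
>
> import subprocess
>
> def physical_bytes_written(device):
>     # Read SMART attribute 241 (Total_LBAs_Written) via smartctl.
>     # Many drives report it in 512-byte units; check yours.
>     out = subprocess.check_output(["smartctl", "-A", device], text=True)
>     for line in out.splitlines():
>         if "Total_LBAs_Written" in line:
>             return int(line.split()[-1]) * 512  # RAW_VALUE is last column
>     raise RuntimeError("attribute not found; inspect smartctl output")
>
> # logical_bytes is whatever you've actually written to Cassandra, e.g.
> # rows * average mutation size as tracked by your load generator.
> def write_amplification(device, logical_bytes):
>     return physical_bytes_written(device) / logical_bytes
>
> print(write_amplification("/dev/sda", 250 * 1024**3))  # e.g. 250 GB written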
>
> However, if you're a person who understands SSDs, at this point you're
> wondering why we aren't burning out SSDs right and left. The reality is
> that general SSD endurance has gotten so good that all this write amp
> isn't really a problem any more. If you're curious to read more about
> that, I recommend you start here:
>
>
> http://hothardware.com/news/google-data-center-ssd-research-report-offers-surprising-results-slc-not-more-reliable-than-mlc-flash
>
> and the paper that article mentions:
>
> http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/23105-fast16-papers-schroeder.pdf
>
>
> Hope this helps.
>
>
> Matt Kennedy
>
>
>
> On Thu, Mar 10, 2016 at 7:05 AM, Paulo Motta <pauloricard...@gmail.com>
> wrote:
>
>> This is a good source on Cassandra + write amplification:
>> http://www.slideshare.net/rbranson/cassandra-and-solid-state-drives
>>
>> 2016-03-10 9:57 GMT-03:00 Benjamin Lerer <benjamin.le...@datastax.com>:
>>
>>> Cassandra should not cause any write amplification. Write amplification
>>> happens only when you update data on SSDs. Cassandra does not update any
>>> data in place. Data can be rewritten during compaction, but it is never
>>> updated.
>>>
>>> Benjamin
>>>
>>> On Thu, Mar 10, 2016 at 12:42 PM, Alain RODRIGUEZ <arodr...@gmail.com>
>>> wrote:
>>>
>>> > Hi Dikang,
>>> >
>>> > I am not sure about what you call "amplification", but as sizes highly
>>> > depend on the structure, I think I would give it a try using CCM (
>>> > https://github.com/pcmanus/ccm) or some test cluster with
>>> > 'production-like' settings and schema. You can write a row, flush it,
>>> > and see how big the data is cluster-wide / per node.
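>>> >
>>> > A minimal sketch of that experiment with CCM, driven from Python (the
>>> > Cassandra version, keyspace, and data paths here are hypothetical;
>>> > adjust to your setup):
>>> >
>>> > import subprocess
>>> >
>>> > CQL = """
>>> > CREATE KEYSPACE ks WITH replication =
>>> >   {'class': 'SimpleStrategy', 'replication_factor': 1};
>>> > CREATE TABLE ks.t (id int PRIMARY KEY, val text);
>>> > INSERT INTO ks.t (id, val) VALUES (1, 'hello');
>>> > """
>>> >
>>> > # one-node throwaway cluster; -s starts it after creation
>>> > subprocess.check_call("ccm create wamp-test -v 3.0.4 -n 1 -s", shell=True)
>>> > subprocess.run("ccm node1 cqlsh", shell=True,
>>> >                input=CQL.encode(), check=True)
>>> > subprocess.check_call("ccm node1 nodetool flush ks", shell=True)
>>> > # ccm keeps node data under ~/.ccm/<cluster>/ (layout varies by version)
>>> > subprocess.check_call("du -sh ~/.ccm/wamp-test/node1/data*", shell=True)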
>>> >
>>> > Hope this will be of some help.
>>> >
>>> > C*heers,
>>> > -----------------------
>>> > Alain Rodriguez - al...@thelastpickle.com
>>> > France
>>> >
>>> > The Last Pickle - Apache Cassandra Consulting
>>> > http://www.thelastpickle.com
>>> >
>>> > 2016-03-10 7:18 GMT+01:00 Dikang Gu <dikan...@gmail.com>:
>>> >
>>> > > Hello there,
>>> > >
>>> > > I'm wondering, is there a good way to measure the write
>>> > > amplification of Cassandra?
>>> > >
>>> > > I'm thinking it could be calculated by (number of bytes written to
>>> > > the disk) / (size of mutations written to the node).
>>> > >
>>> > > Do we already have a metric for "size of mutations written to the
>>> > > node"? I did not find it in the jmx metrics.
>>> > >
>>> > > Thanks
>>> > >
>>> > > --
>>> > > Dikang
>>> > >
>>> > >
>>> >
>>>
>>
>>
>
