Re: Btrfs/SSD
Am Tue, 16 May 2017 14:21:20 +0200 schrieb Tomasz Torcz:

> On Tue, May 16, 2017 at 03:58:41AM +0200, Kai Krakow wrote:
> > Am Mon, 15 May 2017 22:05:05 +0200 schrieb Tomasz Torcz:
> > > [...]
> > >
> > > Let me add my 2 cents. bcache-writearound does not cache writes
> > > on SSD, so there are fewer writes overall to flash. It is said
> > > to prolong the life of the flash drive.
> > > I've recently switched from bcache-writeback to
> > > bcache-writearound, because my SSD caching drive is at the edge
> > > of its lifetime. I'm using bcache in the following configuration:
> > > http://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg
> > > My SSD is a Samsung SSD 850 EVO 120GB, which I bought exactly 2
> > > years ago.
> > >
> > > Now, according to
> > > http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
> > > the 120GB and 250GB warranty only covers 75 TBW (terabytes written).
> >
> > According to your chart, all your data is written twice to bcache.
> > It may have been better to buy two drives, one per mirror. I don't
> > think that SSD firmwares do deduplication - so data is really
> > written twice.
>
> I'm aware of that, but 50 GB (I've got a 100GB caching partition)
> is still plenty to cache my ~, some media files, and two small VMs.
> On the other hand I don't want to overspend. This is just a home
> server.
>
> Nb. I'm still waiting for btrfs native SSD caching, which was
> planned for the 3.6 kernel 5 years ago :)
> ( https://oss.oracle.com/~mason/presentation/btrfs-jls-12/btrfs.html#/planned-3.6 )
>
> > > My drive has
> > >
> > > # smartctl -a /dev/sda | grep LBA
> > > 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 136025596053
> >
> > Doesn't this say "99%" remaining? The threshold is far from being
> > reached...
> >
> > I'm curious, what is Wear_Leveling_Count reporting?
> ID# ATTRIBUTE_NAME      FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>   9 Power_On_Hours      0x0032 096   096   000    Old_age  Always  -           18227
>  12 Power_Cycle_Count   0x0032 099   099   000    Old_age  Always  -           29
> 177 Wear_Leveling_Count 0x0013 001   001   000    Pre-fail Always  -           4916
>
> Does this 001 mean 1%? If so, SMART contradicts the datasheets. And I
> don't think I should see read errors for 1% wear.

It rather means 1% left, that is 99% wear... Most of these are counters
from 100 down to zero, with THRESH being the threshold at or below which
the attribute is considered failed or failing. Only a few values work
the other way around (like temperature). Be careful with interpreting
raw values: they may be very manufacturer-specific and not normalized.

According to Total_LBAs_Written, the manufacturer thinks the drive
could still take 100x more (only 1% used). But your wear level is
almost 100% (value = 001). I think that value isn't really designed
around the flash cell lifetime, but around intermediate components like
caches.

So you need to read most values "backwards": it's not a "used" counter,
but a "what's left" counter. What does it tell you about reserved
blocks usage? Note that it's a sort of double negation here: value 100
means 100% unused, or 0% used... ;-) Or just put a minus in front of
those values and think of them counting up to zero. So on a time axis
the drive starts at -100% of the total lifetime scale, and 0 is the
fail point (or whatever THRESH says).

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
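[Editorial aside] The "read it backwards" rule above can be sketched in one command. The attribute numbers are taken from the smartctl table quoted above; the "percent left" reading is an assumption that fits Samsung's Wear_Leveling_Count, not something the SMART specification guarantees for every vendor:

```shell
# Interpret a normalized SMART VALUE as a count-down toward THRESH.
# Wear_Leveling_Count here: VALUE=001, THRESH=000 (from the table above).
awk -v value=001 -v thresh=000 'BEGIN {
    printf "lifetime left: ~%d%%, worn: ~%d%%, considered failed at: %d\n",
           value, 100 - value, thresh
}'
# prints: lifetime left: ~1%, worn: ~99%, considered failed at: 0
```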
Re: Btrfs/SSD
On 2017-05-16 08:21, Tomasz Torcz wrote:
> On Tue, May 16, 2017 at 03:58:41AM +0200, Kai Krakow wrote:
> > Am Mon, 15 May 2017 22:05:05 +0200 schrieb Tomasz Torcz:
> > > My drive has
> > >
> > > # smartctl -a /dev/sda | grep LBA
> > > 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 136025596053
> >
> > Doesn't this say "99%" remaining? The threshold is far from being
> > reached...
> >
> > I'm curious, what is Wear_Leveling_Count reporting?
>
> ID# ATTRIBUTE_NAME      FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>   9 Power_On_Hours      0x0032 096   096   000    Old_age  Always  -           18227
>  12 Power_Cycle_Count   0x0032 099   099   000    Old_age  Always  -           29
> 177 Wear_Leveling_Count 0x0013 001   001   000    Pre-fail Always  -           4916
>
> Does this 001 mean 1%? If so, SMART contradicts the datasheets. And I
> don't think I should see read errors for 1% wear.

The 'normalized' values shown in the VALUE, WORST, and THRESH columns
usually count down to zero (with the notable exception of the thermal
attributes, which usually match the raw value). They exist as a way of
comparing devices without having to know the vendor or model, since the
raw values are (again with limited exceptions) technically
vendor-specific (the various *_Error_Rate counters on traditional HDDs
are good examples of this). VALUE is the current value, WORST is a
peak-detector type thing that tracks the worst it has been, and THRESH
is the point at which the device manufacturer considers that aspect
failed (which will usually cause the 'Overall Health Assessment' to
fail as well). Though I'm pretty sure that if THRESH is 000, the
firmware doesn't base its assessment for that attribute on the
normalized value at all.
Re: Btrfs/SSD
On Tue, May 16, 2017 at 03:58:41AM +0200, Kai Krakow wrote:
> Am Mon, 15 May 2017 22:05:05 +0200 schrieb Tomasz Torcz:
> > > Yes, I considered that, too. And when I tried, there was almost no
> > > perceivable performance difference between bcache-writearound and
> > > bcache-writeback. But the latency of performance improvement was
> > > much longer in writearound mode, so I stuck with writeback mode.
> > > Also, writing random data is faster because bcache will defer it to
> > > the background and do writeback in sector order. Sequential access
> > > is passed around bcache anyway; harddisks are already good at that.
> >
> > Let me add my 2 cents. bcache-writearound does not cache writes
> > on SSD, so there are fewer writes overall to flash. It is said
> > to prolong the life of the flash drive.
> > I've recently switched from bcache-writeback to bcache-writearound,
> > because my SSD caching drive is at the edge of its lifetime. I'm
> > using bcache in the following configuration:
> > http://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg
> > My SSD is a Samsung SSD 850 EVO 120GB, which I bought exactly 2
> > years ago.
> >
> > Now, according to
> > http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
> > the 120GB and 250GB warranty only covers 75 TBW (terabytes written).
>
> According to your chart, all your data is written twice to bcache. It
> may have been better to buy two drives, one per mirror. I don't think
> that SSD firmwares do deduplication - so data is really written twice.

I'm aware of that, but 50 GB (I've got a 100GB caching partition) is
still plenty to cache my ~, some media files, and two small VMs. On the
other hand I don't want to overspend. This is just a home server.

Nb. I'm still waiting for btrfs native SSD caching, which was planned
for the 3.6 kernel 5 years ago :)
( https://oss.oracle.com/~mason/presentation/btrfs-jls-12/btrfs.html#/planned-3.6 )

> > My drive has
> >
> > # smartctl -a /dev/sda | grep LBA
> > 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 136025596053
>
> Doesn't this say "99%" remaining? The threshold is far from being
> reached...
>
> I'm curious, what is Wear_Leveling_Count reporting?

ID# ATTRIBUTE_NAME      FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  9 Power_On_Hours      0x0032 096   096   000    Old_age  Always  -           18227
 12 Power_Cycle_Count   0x0032 099   099   000    Old_age  Always  -           29
177 Wear_Leveling_Count 0x0013 001   001   000    Pre-fail Always  -           4916

Does this 001 mean 1%? If so, SMART contradicts the datasheets. And I
don't think I should see read errors for 1% wear.

> > which multiplied by 512 bytes gives 69.6 TB. Close to 75TB? Well…

-- 
Tomasz Torcz     ,,(...) today's high-end is tomorrow's embedded processor.''
xmpp: zdzich...@chrome.pl                          -- Mitchell Blank on LKML
Re: Btrfs/SSD
On 2017-05-15 15:49, Kai Krakow wrote:
> Am Mon, 15 May 2017 08:03:48 -0400 schrieb "Austin S. Hemmelgarn":
> > > That's why I don't trust any of my data to them. But I still want
> > > the benefit of their speed. So I use SSDs mostly as frontend
> > > caches to HDDs. This gives me big storage with fast access.
> > > Indeed, I'm using bcache successfully for this. A warm cache is
> > > almost as fast as native SSD (at least it feels almost that fast;
> > > it will be slower if you threw benchmarks at it).
> >
> > That's to be expected though, most benchmarks don't replicate actual
> > usage patterns for client systems, and using SSD's for caching with
> > bcache or dm-cache for most server workloads except a file server
> > will usually get you a performance hit.
>
> You mean "performance boost"? Almost every read-mostly server workload
> should benefit... A file server may be the exact opposite...

In my experience, short of some types of file server and
non-interactive websites, read-mostly server workloads are rare.

> Also, I think dm-cache and bcache work very differently and are not
> directly comparable. Their benefit depends much on the applied
> workload.

The low-level framework is different, and much of the internals are
different, but based on most of the testing I've done, running them in
the same mode (write-back/write-through/etc.) will on average get you
roughly the same performance.

> If I remember right, dm-cache is more about keeping "hot data" in the
> flash storage while bcache is more about reducing seeking. So
> dm-cache optimizes for the bigger throughput of SSDs while bcache
> optimizes for the almost-zero seek overhead of SSDs. Depending on
> your underlying storage, one or the other may even give zero benefit
> or worsen performance. Which is what I'd call a "performance hit"...
>
> I didn't ever try dm-cache, tho. For reasons I don't remember
> exactly, I didn't like something about how it's implemented; I think
> it was related to crash recovery. I don't know if that still holds
> true with modern kernels. It may have changed, but I never looked
> back to revise that decision.

dm-cache is a bit easier to convert to or from in-place and is in my
experience a bit more flexible in data handling, but has the issue that
you can still see the FS on the back-end storage (because it has no
superblock or anything like that on the back-end storage), which means
it's almost useless with BTRFS, and it requires a separate cache device
for each back-end device (as well as an independent metadata device,
but that's usually tiny since it's largely just used as a bitmap to
track which blocks are clean in-cache). bcache is more complicated to
set up initially, and _requires_ a kernel with bcache support to access
the data even if you aren't doing any caching, but it masks the
back-end (so it's safe to use with BTRFS (recent versions of it are, at
least)), and it doesn't require a 1:1 mapping of cache devices to
back-end storage.

> > It's worth noting also that on average, COW filesystems like BTRFS
> > (or log-structured filesystems) will not benefit as much as
> > traditional filesystems from SSD caching unless the caching is
> > built into the filesystem itself, since they don't do in-place
> > rewrites (so any new write by definition has to drop other data
> > from the cache).
>
> Yes, I considered that, too. And when I tried, there was almost no
> perceivable performance difference between bcache-writearound and
> bcache-writeback. But the latency of performance improvement was much
> longer in writearound mode, so I stuck with writeback mode. Also,
> writing random data is faster because bcache will defer it to the
> background and do writeback in sector order. Sequential access is
> passed around bcache anyway; harddisks are already good at that.
>
> But of course, the COW nature of btrfs will lower the hit rate I can
> get on writes. That's why I see no benefit in using
> bcache-writethrough with btrfs.

Yeah, on average based on my own testing, write-through mode is
worthless for COW filesystems, and write-back is only worthwhile if you
have a large enough cache proportionate to your bandwidth requirements
(4G should be more than enough for a desktop or workstation, but
servers may need huge amounts of space), while write-around is only
worthwhile for stuff that needs read performance but doesn't really
care about latency.
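[Editorial aside] For reference, the bcache setup described above as "more complicated" looks roughly like this. Device names are examples only, and the sysfs paths follow the kernel's bcache documentation; treat it as an illustrative sketch, not a recipe (make-bcache destroys existing data on the devices it formats):

```shell
# Format the backing device and the cache device (bcache-tools required).
make-bcache -B /dev/sdb1        # back-end HDD partition
make-bcache -C /dev/sdc1        # SSD cache set

# Register both with the kernel (usually done automatically by udev).
echo /dev/sdb1 > /sys/fs/bcache/register
echo /dev/sdc1 > /sys/fs/bcache/register

# Attach the backing device to the cache set, using the set UUID that
# make-bcache printed.
echo "$CSET_UUID" > /sys/block/bcache0/bcache/attach

# The filesystem then goes on /dev/bcache0, which masks the back-end
# (this is why BTRFS is safe on top of it, as noted above).
mkfs.btrfs /dev/bcache0
```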
Re: Btrfs/SSD
Kai Krakow posted on Mon, 15 May 2017 21:12:06 +0200 as excerpted:

> Am Mon, 15 May 2017 14:09:20 +0100 schrieb Tomasz Kusmierz:
> >
> > Not true. When a HDD uses 10% (10% is just for an easy example) of
> > space as spares, then the alignment on disk is (US - used sector,
> > SS - spare sector, BS - bad sector):
> >
> > US US US US US US US US US SS
> > US US US US US US US US US SS
> > US US US US US US US US US SS
> > US US US US US US US US US SS
> > US US US US US US US US US SS
> > US US US US US US US US US SS
> > US US US US US US US US US SS
> >
> > If a failure occurs, the drive actually shifts sectors up:
> >
> > US US US US US US US US US SS
> > US US US BS BS BS US US US US
> > US US US US US US US US US US
> > US US US US US US US US US US
> > US US US US US US US US US SS
> > US US US BS US US US US US US
> > US US US US US US US US US SS
> > US US US US US US US US US SS
>
> This makes sense... "Reserve area" somehow implies it is continuous
> and as such located at one far end of the platter. But your image
> totally makes sense. Thanks Tomasz.

It makes a lot of sense indeed, and had I thought about it I think I
already "knew" it, but I simply hadn't stopped to think about it that
hard, so you disabused me of the vague idea of spares all at one end of
the disk, too. =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Btrfs/SSD
Am Mon, 15 May 2017 22:05:05 +0200 schrieb Tomasz Torcz:

> On Mon, May 15, 2017 at 09:49:38PM +0200, Kai Krakow wrote:
> > > It's worth noting also that on average, COW filesystems like BTRFS
> > > (or log-structured filesystems) will not benefit as much as
> > > traditional filesystems from SSD caching unless the caching is
> > > built into the filesystem itself, since they don't do in-place
> > > rewrites (so any new write by definition has to drop other data
> > > from the cache).
> >
> > Yes, I considered that, too. And when I tried, there was almost no
> > perceivable performance difference between bcache-writearound and
> > bcache-writeback. But the latency of performance improvement was
> > much longer in writearound mode, so I stuck with writeback mode.
> > Also, writing random data is faster because bcache will defer it to
> > the background and do writeback in sector order. Sequential access
> > is passed around bcache anyway; harddisks are already good at that.
>
> Let me add my 2 cents. bcache-writearound does not cache writes
> on SSD, so there are fewer writes overall to flash. It is said
> to prolong the life of the flash drive.
> I've recently switched from bcache-writeback to bcache-writearound,
> because my SSD caching drive is at the edge of its lifetime. I'm
> using bcache in the following configuration:
> http://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg
> My SSD is a Samsung SSD 850 EVO 120GB, which I bought exactly 2 years
> ago.
>
> Now, according to
> http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
> the 120GB and 250GB warranty only covers 75 TBW (terabytes written).

According to your chart, all your data is written twice to bcache. It
may have been better to buy two drives, one per mirror. I don't think
that SSD firmwares do deduplication - so data is really written twice.
They may do compression, but that won't be streaming compression but
per-block compression, so it won't help here as a deduplicator.

Also, due to the internal structure, compression would probably work
similarly to how zswap works: by combining compressed blocks into
"buddy blocks", so only compression above 2:1 will merge compressed
blocks into single blocks. For most of your data, this won't be true.
So effectively, this has no overall effect. For this reason, I doubt
that any firmware takes the chance on compression; the effects are
just too small vs. the management overhead and complexity it adds to
the already complicated FTL layer.

> My drive has
>
> # smartctl -a /dev/sda | grep LBA
> 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 136025596053

Doesn't this say "99%" remaining? The threshold is far from being
reached...

I'm curious, what is Wear_Leveling_Count reporting?

> which multiplied by 512 bytes gives 69.6 TB. Close to 75TB? Well…
>
> [35354.697513] sd 0:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [35354.697516] sd 0:0:0:0: [sda] tag#19 Sense Key : Medium Error [current]
> [35354.697518] sd 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
> [35354.697522] sd 0:0:0:0: [sda] tag#19 CDB: Read(10) 28 00 0c 30 82 9f 00 00 48 00
> [35354.697524] blk_update_request: I/O error, dev sda, sector 204505785
>
> The above started appearing recently. So, I was really surprised that:
> - this drive is only rated for 120 TBW
> - I went through this limit in only 2 years
>
> The workload is a lightly utilised home server / media center.

I think bcache is a real SSD killer for drives around 120GB in size or
below... I had similar life usage with my previous small SSD after just
one year. But I never had a sense error because I took it out of
service early. And I switched to writearound, too.

I think the write pattern of bcache cannot be handled well by the FTL.
It behaves like a log-structured file system, with new writes only
appended, and sometimes a garbage collection is done by freeing
complete erase blocks.

Maybe it could work better if btrfs could pass information about freed
blocks down to bcache. Btrfs has a lot of these due to its COW nature.
I wonder if this is already supported when turning on discard in
btrfs? Does anyone know?

-- 
Regards,
Kai

Replies to list-only preferred.
Re: Btrfs/SSD
On Mon, May 15, 2017 at 09:49:38PM +0200, Kai Krakow wrote:
> > It's worth noting also that on average, COW filesystems like BTRFS
> > (or log-structured filesystems) will not benefit as much as
> > traditional filesystems from SSD caching unless the caching is
> > built into the filesystem itself, since they don't do in-place
> > rewrites (so any new write by definition has to drop other data
> > from the cache).
>
> Yes, I considered that, too. And when I tried, there was almost no
> perceivable performance difference between bcache-writearound and
> bcache-writeback. But the latency of performance improvement was much
> longer in writearound mode, so I stuck with writeback mode. Also,
> writing random data is faster because bcache will defer it to the
> background and do writeback in sector order. Sequential access is
> passed around bcache anyway; harddisks are already good at that.

Let me add my 2 cents. bcache-writearound does not cache writes on SSD,
so there are fewer writes overall to flash. It is said to prolong the
life of the flash drive.
I've recently switched from bcache-writeback to bcache-writearound,
because my SSD caching drive is at the edge of its lifetime. I'm using
bcache in the following configuration:
https://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg
My SSD is a Samsung SSD 850 EVO 120GB, which I bought exactly 2 years
ago.

Now, according to
http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
the 120GB and 250GB warranty only covers 75 TBW (terabytes written).
My drive has

# smartctl -a /dev/sda | grep LBA
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 136025596053

which multiplied by 512 bytes gives 69.6 TB. Close to 75TB? Well…

[35354.697513] sd 0:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[35354.697516] sd 0:0:0:0: [sda] tag#19 Sense Key : Medium Error [current]
[35354.697518] sd 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
[35354.697522] sd 0:0:0:0: [sda] tag#19 CDB: Read(10) 28 00 0c 30 82 9f 00 00 48 00
[35354.697524] blk_update_request: I/O error, dev sda, sector 204505785

The above started appearing recently. So, I was really surprised that:
- this drive is only rated for 120 TBW
- I went through this limit in only 2 years

The workload is a lightly utilised home server / media center.

-- 
Tomasz Torcz               Only gods can safely risk perfection,
xmpp: zdzich...@chrome.pl  it's a dangerous thing for a man.  -- Alia
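[Editorial aside] The arithmetic above checks out. A quick way to verify it, assuming the drive counts 512-byte units in Total_LBAs_Written (the smartmontools convention for attribute 241, though not guaranteed by every firmware):

```shell
# 136025596053 LBAs * 512 bytes each, expressed in TB (10^12 bytes).
awk 'BEGIN { printf "%.1f TB\n", 136025596053 * 512 / 1e12 }'
# prints: 69.6 TB
```

So the drive sits at roughly 93% of its 75 TBW warranty rating, which is consistent with the wear indicators discussed elsewhere in the thread.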
Re: Btrfs/SSD
Am Mon, 15 May 2017 08:03:48 -0400 schrieb "Austin S. Hemmelgarn":

> > That's why I don't trust any of my data to them. But I still want
> > the benefit of their speed. So I use SSDs mostly as frontend caches
> > to HDDs. This gives me big storage with fast access. Indeed, I'm
> > using bcache successfully for this. A warm cache is almost as fast
> > as native SSD (at least it feels almost that fast; it will be
> > slower if you threw benchmarks at it).
>
> That's to be expected though, most benchmarks don't replicate actual
> usage patterns for client systems, and using SSD's for caching with
> bcache or dm-cache for most server workloads except a file server
> will usually get you a performance hit.

You mean "performance boost"? Almost every read-mostly server workload
should benefit... A file server may be the exact opposite...

Also, I think dm-cache and bcache work very differently and are not
directly comparable. Their benefit depends much on the applied
workload.

If I remember right, dm-cache is more about keeping "hot data" in the
flash storage while bcache is more about reducing seeking. So dm-cache
optimizes for the bigger throughput of SSDs while bcache optimizes for
the almost-zero seek overhead of SSDs. Depending on your underlying
storage, one or the other may even give zero benefit or worsen
performance. Which is what I'd call a "performance hit"...

I didn't ever try dm-cache, tho. For reasons I don't remember exactly,
I didn't like something about how it's implemented; I think it was
related to crash recovery. I don't know if that still holds true with
modern kernels. It may have changed, but I never looked back to revise
that decision.

> It's worth noting also that on average, COW filesystems like BTRFS
> (or log-structured filesystems) will not benefit as much as
> traditional filesystems from SSD caching unless the caching is built
> into the filesystem itself, since they don't do in-place rewrites (so
> any new write by definition has to drop other data from the cache).

Yes, I considered that, too. And when I tried, there was almost no
perceivable performance difference between bcache-writearound and
bcache-writeback. But the latency of performance improvement was much
longer in writearound mode, so I stuck with writeback mode. Also,
writing random data is faster because bcache will defer it to the
background and do writeback in sector order. Sequential access is
passed around bcache anyway; harddisks are already good at that.

But of course, the COW nature of btrfs will lower the hit rate I can
get on writes. That's why I see no benefit in using bcache-writethrough
with btrfs.

-- 
Regards,
Kai

Replies to list-only preferred.
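[Editorial aside] Switching between the modes compared here is done at runtime through sysfs, per the kernel's bcache documentation. A sketch, assuming the bcache device is bcache0 (adjust to your setup); this is a config fragment, not something to run blindly:

```shell
# Show the available cache modes; the active one is shown in [brackets],
# e.g.: writethrough [writeback] writearound none
cat /sys/block/bcache0/bcache/cache_mode

# Switch to writearound to spare the SSD from write traffic,
# or back to writeback for lower write latency.
echo writearound > /sys/block/bcache0/bcache/cache_mode
```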
Re: Btrfs/SSD
Am Mon, 15 May 2017 07:46:01 -0400 schrieb "Austin S. Hemmelgarn":

> On 2017-05-12 14:27, Kai Krakow wrote:
> > Am Tue, 18 Apr 2017 15:02:42 +0200 schrieb Imran Geriskovan:
> >
> >> On 4/17/17, Austin S. Hemmelgarn wrote:
> [...]
> >>
> >> I'm trying to have a proper understanding of what "fragmentation"
> >> really means for an SSD and its interrelation with wear-leveling.
> >>
> >> Before continuing, let's remember:
> >> Pages cannot be erased individually, only whole blocks can be
> >> erased. The size of a NAND-flash page can vary, and most drives
> >> have pages of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have
> >> blocks of 128 or 256 pages, which means that the size of a block
> >> can vary between 256 KB and 4 MB.
> >> codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/
> >>
> >> Let's continue:
> >> Since block sizes are between 256k-4MB, data smaller than this will
> >> "probably" not be fragmented on a reasonably empty and trimmed
> >> drive. And for a brand new SSD we may speak of contiguous series
> >> of blocks.
> >>
> >> However, as the drive is used more and more and as wear leveling
> >> kicks in (i.e. blocks are remapped), the meaning of "contiguous
> >> blocks" will erode. So any file bigger than a block size will be
> >> written to blocks physically apart, no matter what their block
> >> addresses say. But my guess is that accessing device blocks
> >> - contiguous or not - is a constant-time operation. So it would
> >> not contribute performance issues. Right? Comments?
> >>
> >> So your feeling about fragmentation/performance is probably
> >> related to whether the file is spread into fewer or more blocks.
> >> If the # of blocks used is higher than necessary (i.e. no empty
> >> blocks can be found; instead lots of partially empty blocks have
> >> to be used, increasing the total # of blocks involved), then we
> >> will notice a performance loss.
> >>
> >> Additionally, if the filesystem is going to try something to
> >> reduce the fragmentation of the blocks, it should precisely know
> >> where those blocks are located. Then how about SSD block
> >> information? Is it available and do filesystems use it?
> >>
> >> Anyway, if you can provide some more details about your
> >> experiences on this we can probably have a better view on the
> >> issue.
> >
> > What you really want for SSD is not defragmented files but
> > defragmented free space. That increases life time.
> >
> > So, defragmentation on SSD makes sense if it cares more about free
> > space and not the file data itself.
> >
> > But of course, over time, fragmentation of file data (be it meta
> > data or content data) may introduce overhead - and in btrfs it
> > probably really makes a difference, if I scan through some of the
> > past posts.
> >
> > I don't think it is important for the file system to know where the
> > SSD FTL located a data block. It's just important to keep
> > everything nicely aligned with erase block sizes, reduce rewrite
> > patterns, and free up complete erase blocks as well as possible.
> >
> > Maybe such a process should be called "compaction" and not
> > "defragmentation". In the end, the more continuous blocks of free
> > space there are, the better the chance for proper wear leveling.
>
> There is one other thing to consider though. From a practical
> perspective, performance on an SSD is a function of the number of
> requests and what else is happening in the background. The second
> aspect isn't easy to eliminate on most systems, but the first is
> pretty easy to mitigate by defragmenting data.
>
> Reiterating the example I made elsewhere in the thread:
> Assume you have an SSD and storage controller that can use DMA to
> transfer up to 16MB of data off of the disk in a single operation.
> If you need to load a 16MB file off of this disk and it's properly
> aligned (it usually will be with most modern filesystems if the
> partition is properly aligned) and defragmented, it will take exactly
> one operation (assuming that doesn't get interrupted). By contrast,
> if you have 16 fragments of 1MB each, that will take at minimum 2
> operations, and more likely 15-16 (depends on where everything is
> on-disk, and how smart the driver is about minimizing the number of
> required operations). Each request has some amount of overhead to
> set up and complete, so the first case (one single extent) will take
> less total time to transfer the data than the second one.
>
> This particular effect actually impacts almost any data transfer, not
> just pulling data off of an SSD (this is why jumbo frames are
> important for high-performance networking, and why a higher latency
> timer on the PCI bus will improve performance (but conversely
> increase latency)), even when fetching data from a traditional hard
> drive (but it's not very noticeable there unless your fragments are
> tightly grouped,
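[Editorial aside] The 16 MB example above can be put into numbers. This is only a back-of-the-envelope model: it ignores request merging by the I/O scheduler and assumes one request per fragment in the worst case:

```shell
# A controller that can DMA up to 16 MB per request:
# a contiguous 16 MB file vs. 16 scattered 1 MB fragments.
awk 'BEGIN {
    max_dma = 16; file_mb = 16; frags = 16
    contiguous = int((file_mb + max_dma - 1) / max_dma)  # ceiling division
    printf "contiguous: %d request(s)\n", contiguous
    printf "fragmented: up to %d requests\n", frags      # worst case
}'
# prints: contiguous: 1 request(s)
#         fragmented: up to 16 requests
```

Per-request setup/completion overhead then multiplies accordingly, which is the whole point of the example.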
Re: Btrfs/SSD
Am Mon, 15 May 2017 14:09:20 +0100 schrieb Tomasz Kusmierz:

> > Traditional hard drives usually do this too these days (they've
> > been under-provisioned since before SSD's existed), which is part
> > of why older disks tend to be noisier and slower (the reserved
> > space is usually at the far inside or outside of the platter, so
> > using sectors from there to replace stuff leads to long seeks).
>
> Not true. When a HDD uses 10% (10% is just for an easy example) of
> space as spares, then the alignment on disk is (US - used sector,
> SS - spare sector, BS - bad sector):
>
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
>
> If a failure occurs, the drive actually shifts sectors up:
>
> US US US US US US US US US SS
> US US US BS BS BS US US US US
> US US US US US US US US US US
> US US US US US US US US US US
> US US US US US US US US US SS
> US US US BS US US US US US US
> US US US US US US US US US SS
> US US US US US US US US US SS

This makes sense... "Reserve area" somehow implies it is continuous and
as such located at one far end of the platter. But your image totally
makes sense.

> That strategy is in place to actually mitigate the problem that
> you've described; actually it was in place since drives were using
> PATA :) So if your drive gets noisier over time it's either a broken
> bearing or a demagnetised arm magnet causing it to not aim properly -
> so the drive has to readjust its position multiple times before
> hitting the right track.

I can confirm that such drives usually do not get noisier unless
there's something broken other than just a few sectors. And a faulty
bearing in notebook drives is the most frequent scenario I see. I
always recommend replacing such drives early because they will usually
fail completely. Such notebooks are good candidates for SSD
replacements, btw. ;-)

The demagnetised arm magnet is an interesting error scenario - I didn't
think of it. Thanks for the pointer.

But still, there's one noise you can easily identify as bad sectors:
when the drive starts clicking for 30 or more seconds while trying to
read data, and usually also freezes the OS during that time. Such
drives can be "repaired" by rewriting the offending sectors (because
they will be moved to the reserve area then). But I guess it's best to
replace such a drive by that time anyway.

Early on, back in PATA times, I often had harddisks exposing seemingly
bad sectors when power was cut while the drive was writing data. I
usually used dd to rewrite such sectors and the drive was as good as
new again - except I lost some file data maybe. Luckily, modern drives
don't show such behavior. And SSDs have also learned to handle this...

-- 
Regards,
Kai

Replies to list-only preferred.
Re: Btrfs/SSD
On 2017-05-12 14:36, Kai Krakow wrote: On Fri, 12 May 2017 15:02:20 +0200, Imran Geriskovan wrote: On 5/12/17, Duncan <1i5t5.dun...@cox.net> wrote: FWIW, I'm in the market for SSDs ATM, and remembered this from a couple weeks ago so went back to find it. Thanks. =:^) (I'm currently still on quarter-TB generation ssds, plus spinning rust for the larger media partition and backups, and want to be rid of the spinning rust, so am looking at half-TB to TB, which seems to be the pricing sweet spot these days anyway.)

Since you are taking ssds to mainstream based on your experience, I guess your perception of data retention/reliability is better than that of spinning rust. Right? Can you elaborate? Or another criterion might be the physical constraints of spinning rust on notebooks, which dictate that you should handle the device with care when running. What was your primary motivation other than performance?

Personally, I don't really trust SSDs so much. They are much more robust when it comes to physical damage because there are no moving parts. That's absolutely not my concern; in this regard, I trust SSDs more than HDDs. My concern is with the failure scenarios of some SSDs which die unexpectedly and horribly. I found some reports of older Samsung SSDs which failed suddenly and unexpectedly, and in a way that the drive completely died: no more data access, everything gone. HDDs start with bad sectors, and there's a good chance I can recover most of the data except for a few sectors.

Older is the key here. Some early SSD's did indeed behave like that, but most modern ones do generally show signs that they will fail in the near future. There's also the fact that traditional hard drives _do_ fail like that sometimes, even without rough treatment.

When SSD blocks die, they are probably huge compared to a sector (256 kB to 4 MB usually, because those are erase block sizes). If this happens, the firmware may decide to either allow read-only access or completely deny access.
There's another situation, where dying storage chips may completely mess up the firmware so that there's no longer any access to the data.

I've yet to see an SSD that blocks user access to an erase block. Almost every one I've seen will instead rewrite the block (possibly with the corrupted data intact (that is, without mangling it further)) to one of the reserve blocks, and then just update its internal mapping so that the old block doesn't get used and the new one points to the right place. Some of the really good SSD's even use erasure coding in the FTL for data verification instead of CRC's, so they can actually reconstruct the missing bits when they do this.

Traditional hard drives usually do this too these days (they've been under-provisioned since before SSD's existed), which is part of why older disks tend to be noisier and slower (the reserved space is usually at the far inside or outside of the platter, so using sectors from there to replace stuff leads to long seeks).

That's why I don't trust any of my data to them. But I still want the benefit of their speed. So I use SSDs mostly as frontend caches to HDDs. This gives me big storage with fast access. Indeed, I'm using bcache successfully for this. A warm cache is almost as fast as native SSD (at least it feels almost that fast; it will be slower if you threw benchmarks at it).

That's to be expected though; most benchmarks don't replicate actual usage patterns for client systems, and using SSD's for caching with bcache or dm-cache for most server workloads except a file server will usually get you a performance hit. It's worth noting also that, on average, COW filesystems like BTRFS (or log-structured filesystems) will not benefit as much as traditional filesystems from SSD caching unless the caching is built into the filesystem itself, since they don't do in-place rewrites (so any new write by definition has to drop other data from the cache).
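The remap-on-failure behavior described above can be sketched with a toy FTL. This is an assumed, simplified model for illustration (the `TinyFTL` class and its layout are hypothetical, not any vendor's firmware): the logical-to-physical map is redirected to a reserve block, data is copied over as-is, and the old block is retired.

```python
# Simplified sketch of an FTL retiring a failing erase block (assumed
# model): copy the contents to a reserve block, update the logical ->
# physical map, and never hand out the old physical block again.

class TinyFTL:
    def __init__(self, blocks=8, reserve=2):
        self.phys = {i: bytearray(16) for i in range(blocks + reserve)}
        self.map = {i: i for i in range(blocks)}   # logical -> physical
        self.reserve = list(range(blocks, blocks + reserve))
        self.retired = set()

    def retire(self, logical):
        # move the (possibly degraded) contents to a reserve block,
        # without mangling the data further, then remap
        old = self.map[logical]
        new = self.reserve.pop(0)
        self.phys[new][:] = self.phys[old]         # copy data intact
        self.map[logical] = new
        self.retired.add(old)

ftl = TinyFTL()
ftl.phys[3][:4] = b"data"
ftl.retire(3)
print(ftl.map[3], bytes(ftl.phys[ftl.map[3]][:4]))  # remapped, data intact
```

The user-visible logical block number never changes, which matches the observation that the drive keeps serving reads rather than blocking access.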
Re: Btrfs/SSD
On 2017-05-12 14:27, Kai Krakow wrote: On Tue, 18 Apr 2017 15:02:42 +0200, Imran Geriskovan wrote: On 4/17/17, Austin S. Hemmelgarn wrote: Regarding BTRFS specifically: * Given my recently newfound understanding of what the 'ssd' mount option actually does, I'm inclined to recommend that people who are using high-end SSD's _NOT_ use it, as it will heavily increase fragmentation and will likely have near zero impact on actual device lifetime (but may _hurt_ performance). It will still probably help with mid and low-end SSD's.

I'm trying to get a proper understanding of what "fragmentation" really means for an ssd, and its interrelation with wear-leveling. Before continuing, let's remember: pages cannot be erased individually, only whole blocks can be erased. The size of a NAND-flash page can vary, and most drives have pages of 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256 pages, which means that the size of a block can vary between 256 KB and 4 MB. codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/

Let's continue: since block sizes are between 256k-4MB, data smaller than this will "probably" not be fragmented on a reasonably empty and trimmed drive. And for a brand new ssd we may speak of contiguous series of blocks. However, as the drive is used more and more and as wear leveling kicks in (i.e. blocks are remapped), the meaning of "contiguous blocks" will erode. So any file bigger than a block will be written to blocks physically apart, no matter what their block addresses say. But my guess is that accessing device blocks - contiguous or not - is a constant time operation, so it would not contribute performance issues. Right? Comments?

So your feeling about fragmentation/performance is probably related to whether the file is spread across fewer or more blocks. If the # of blocks used is higher than necessary (i.e. no empty blocks can be found.
Instead, lots of partially empty blocks have to be used, increasing the total # of blocks involved), then we will notice performance loss. Additionally, if the filesystem is going to try something to reduce the fragmentation of the blocks, it should know precisely where those blocks are located. Then how about ssd block information? Is it available, and do filesystems use it? Anyway, if you can provide some more details about your experiences on this, we can probably get a better view of the issue.

What you really want for an SSD is not defragmented files but defragmented free space. That increases lifetime. So defragmentation on an SSD makes sense if it cares more about free space than about the file data itself. But of course, over time, fragmentation of file data (be it metadata or content data) may introduce overhead - and in btrfs it probably really makes a difference, judging from some of the past posts.

I don't think it is important for the file system to know where the SSD FTL located a data block. It's just important to keep everything nicely aligned with erase block sizes, reduce rewrite patterns, and free up complete erase blocks as much as possible. Maybe such a process should be called "compaction" rather than "defragmentation". In the end, the more contiguous blocks of free space there are, the better the chance for proper wear leveling.

There is one other thing to consider though. From a practical perspective, performance on an SSD is a function of the number of requests and what else is happening in the background. The second aspect isn't easy to eliminate on most systems, but the first is pretty easy to mitigate by defragmenting data. Reiterating the example I made elsewhere in the thread: assume you have an SSD and storage controller that can use DMA to transfer up to 16MB of data off of the disk in a single operation.
If you need to load a 16MB file off of this disk and it's properly aligned (it usually will be with most modern filesystems if the partition is properly aligned) and defragmented, it will take exactly one operation (assuming that doesn't get interrupted). By contrast, if you have 16 fragments of 1MB each, that will take at minimum 2 operations, and more likely 15-16 (it depends on where everything is on-disk, and how smart the driver is about minimizing the number of required operations). Each request has some amount of overhead to set up and complete, so the first case (one single extent) will take less total time to transfer the data than the second one. This particular effect actually impacts almost any data transfer, not just pulling data off of an SSD (this is why jumbo frames are important for high-performance networking, and why a higher latency timer on the PCI bus will improve performance (but conversely increase latency)), even when fetching data from a traditional hard drive (but it's not very noticeable there unless your fragments are tightly grouped, because seek latency dominates).
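The per-request overhead argument above can be put into numbers with a toy model. The setup cost and bandwidth figures here are illustrative assumptions, not measurements of any real controller: total time is the number of requests times a fixed setup cost, plus the payload divided by bandwidth, and the payload term is identical in both cases.

```python
# Toy model of the per-request overhead argument (illustrative numbers,
# not measurements): only the request count differs between the
# contiguous and fragmented cases; the payload transfer time is the same.

def transfer_time_us(total_mb, n_requests, setup_us=20.0, mb_per_s=2000.0):
    # n_requests fixed setup costs, plus the raw payload transfer time
    return n_requests * setup_us + total_mb / mb_per_s * 1e6

contiguous = transfer_time_us(16, 1)    # one 16 MB extent, one operation
fragmented = transfer_time_us(16, 16)   # sixteen 1 MB fragments
print(contiguous, fragmented)           # fragmented pays 15 extra setups
```

With these (made-up) numbers the difference is small in absolute terms, but it scales linearly with the number of fragments, which is the point being made about minimizing request counts.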
Re: Btrfs/SSD
On 5/15/17, Tomasz Kusmierz wrote:
> Theoretically all sectors in over-provision are erased - practically they
> are either erased, waiting to be erased, or broken.
> The over-provisioned area does have more uses than that. For example, if you
> have a 1TB drive where you store 500GB of data that you never modify -> the SSD
> will copy part of that data to the over-provisioned area -> free sectors that
> were unwritten for a while -> free sectors that were continuously hammered by
> writes, and write the static data there. This mechanism is wear levelling - it
> means that the SSD internals make sure that sectors on the SSD get equal use
> over time. Despite some thinking that it's pointless, imagine a situation
> where you've got a 1TB drive with 1GB free and you keep writing and
> modifying data in this 1GB free ... those sectors will quickly die due to
> short flash life expectancy (some as short as 1k erases!).

Thanks for the info. It can be understood that the drive has a pool of erase blocks, from which some portion (say 90-95%) is provided as usable. Trimmed blocks are candidates for new allocations. If the drive is not trimmed, that allocatable pool becomes smaller than it could be, and new allocations under the wear levelling logic are made from a smaller group. This will probably increase data traffic on that "small group" of blocks, eating from their erase cycles.

However, this logic is valid only if the drive does NOT move data from occupied blocks to trimmed/available ones. Under some advanced wear leveling operations, the drive may decide to swap two blocks (one occupied, one vacant) if the cumulative erase cycles of the former are much lower than those of the latter, to provide some balancing effect. Theoretically, swapping may even occur when the flash tends to lose charge (and thus data), based on the age of the data and/or block health. But in any case I understand that trimming will provide an important degree of freedom and health to the drive.
Without trimming, the drive will continue to deal with worthless blocks simply because it doesn't know they are worthless...
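The "smaller allocation pool" effect described above can be sketched with some simple arithmetic. This is an assumed round-robin model for illustration (the function name and numbers are made up): the drive can only wear-level across blocks it believes are free, so untrimmed-but-deleted blocks concentrate the same write load onto fewer blocks.

```python
# Sketch of how untrimmed stale blocks shrink the wear-leveling pool
# (assumed simple model): writes are spread evenly over the blocks the
# drive believes are free, so each block's erase count depends on the
# size of that pool.

def erases_per_block(total_blocks, believed_live, writes):
    pool = total_blocks - believed_live
    return writes / pool

total = 1000
actually_live = 300     # blocks the filesystem still cares about
stale = 400             # deleted by the filesystem, but never trimmed

with_trim = erases_per_block(total, actually_live, writes=70_000)
without_trim = erases_per_block(total, actually_live + stale, writes=70_000)
print(with_trim, without_trim)   # wear per block is worse without TRIM
```

The same total write volume lands on 700 blocks in one case and 300 in the other, which is the "eating from their erase cycles" effect.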
Re: Btrfs/SSD
Theoretically all sectors in over-provision are erased - practically they are either erased, waiting to be erased, or broken. What you have to understand is that sectors on an SSD are not where you really think they are - they can swap places with sectors in the over-provisioning area, they can swap places with each other, etc. ... What you see as a disk from 0 to MAX does not have to be arranged in sequence on the SSD (and mostly never is).

If you never trim, then when your device is 100% full you need to start overwriting data to keep writing - this is where over-provisioning shines: the ssd fakes that you write to a sector while really you write to a sector in the over-provisioning area, and those magically swap places without you knowing -> the sector that was occupied ends up in the over-provisioning pool, and the SSD hardware performs a slow erase on it to make it free for the future. This mechanism is simple and transparent for users -> you don't know that it happens and the SSD does all the heavy lifting.

The over-provisioned area does have more uses than that. For example, if you have a 1TB drive where you store 500GB of data that you never modify -> the SSD will copy part of that data to the over-provisioned area -> free sectors that were unwritten for a while -> free sectors that were continuously hammered by writes, and write the static data there. This mechanism is wear levelling - it means that the SSD internals make sure that sectors on the SSD get equal use over time. Despite some thinking that it's pointless, imagine a situation where you've got a 1TB drive with 1GB free and you keep writing and modifying data in this 1GB free ... those sectors will quickly die due to short flash life expectancy (some as short as 1k erases!).
So again, buy good quality drives (not hardcore enterprise drives, just good consumer ones), leave the rest to the drive, use an OS that gives you trim, and you should be golden.

> On 15 May 2017, at 00:01, Imran Geriskovan wrote:
>
> On 5/14/17, Tomasz Kusmierz wrote:
>> In terms of over provisioning of SSD it's a give and take relationship ... on
>> good drives there is enough over provisioning to allow normal operation on
>> systems without TRIM ... now if you would use a 1TB drive daily without TRIM
>> and have only 30GB stored on it you will have fantastic performance, but if
>> you want to store 500GB, at roughly 200GB you will hit a brick wall and your
>> writes will slow down to megabytes/s ... this is a symptom of the drive running
>> out of over-provisioning space ...
>
> What exactly happens on a non-trimmed drive?
> Does it begin to forge certain erase-blocks? If so
> which are those? What happens when you never
> trim and continue dumping data on it?
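The transparent overwrite-swap described above can be sketched in a few lines. This is an assumed toy model (the variable names are made up, not firmware internals): an overwrite of a logical sector lands on a pre-erased block from the over-provision pool, the mapping swaps, and the old block is queued for background erase.

```python
# Sketch of the overwrite-swap with the over-provision pool (assumed
# model): the caller keeps using the same LBA; physically, the write
# lands on a pre-erased block and the old block goes back for erasing.

erased_pool = [100, 101, 102]   # over-provision blocks, already erased
mapping = {0: 10, 1: 11}        # logical sector -> physical block
dirty = []                      # old blocks queued for background erase

def overwrite(logical):
    fresh = erased_pool.pop(0)  # write lands on a pre-erased block
    dirty.append(mapping[logical])
    mapping[logical] = fresh    # transparent swap; same LBA for the caller

overwrite(0)
print(mapping[0], dirty)        # sector 0 remapped; old block queued
```

Once `erased_pool` runs empty faster than background erases refill it, writes have to wait on slow erases, which is the "brick wall" slowdown described in the quoted text.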
Re: Btrfs/SSD
On 5/14/17, Tomasz Kusmierz wrote:
> In terms of over provisioning of SSD it's a give and take relationship ... on
> good drives there is enough over provisioning to allow normal operation on
> systems without TRIM ... now if you would use a 1TB drive daily without TRIM
> and have only 30GB stored on it you will have fantastic performance, but if
> you want to store 500GB, at roughly 200GB you will hit a brick wall and your
> writes will slow down to megabytes/s ... this is a symptom of the drive running
> out of over-provisioning space ...

What exactly happens on a non-trimmed drive? Does it begin to forge certain erase-blocks? If so, which are those? What happens when you never trim and continue dumping data on it?
Re: Btrfs/SSD (my -o ssd "summary")
On 05/14/2017 08:01 PM, Tomasz Kusmierz wrote:
> All stuff that Chris wrote holds true, I just wanted to add flash-specific
> information (from my experience of writing low level code for operating
> flash)

Thanks!

> [... erase ...]
> In terms of over provisioning of SSD it's a give and take
> relationship ... on good drives there is enough over provisioning to
> allow normal operation on systems without TRIM ... now if you would
> use a 1TB drive daily without TRIM and have only 30GB stored on it
> you will have fantastic performance, but if you want to store
> 500GB, at roughly 200GB you will hit a brick wall and your writes will
> slow down to megabytes/s ... this is a symptom of the drive running out
> of over-provisioning space ... if you would run an OS that issues trim,
> this problem would not exist since the drive would know that the whole
> 970GB of space is free and it would be pre-emptively erased days before.

== ssd_spread ==

The worst case behaviour is the btrfs ssd_spread mount option in combination with not having discard enabled. It has a side effect of minimizing the reuse of free space previously written in.

== ssd ==

[And, since I didn't write a "summary post" about this issue yet, here is my version of it:]

The default mount options you get for an ssd ('ssd' mode enabled, 'discard' not enabled), in combination with writing and deleting many files that are not too big, also cause this pattern, ending up with the physical address space fully allocated and written to.

My favourite videos about this:

*) ssd (write pattern is small increments in /var/log/mail.log, a mail spool on /var/spool/postfix (lots of file adds and deletes), and mailman archives with a lot of little files): https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

*) The picture uses Hilbert curve ordering (see link below) and shows the four last created DATA block groups appended together (so a new chunk allocation pushes the others back in the picture).
https://github.com/knorrie/btrfs-heatmap/blob/master/doc/curves.md

* What the ssd mode does is simply set a lower boundary on the size of the free space fragments that are reused.
* In combination with always trying to walk forward inside a block group, never looking back at freed-up space, it fills up with a shotgun-blast pattern when you do writes and deletes all the time.
* When a write comes in that is bigger than any free space part left behind, a new chunk gets allocated, and the bad pattern continues in there.
* Because it keeps allocating more and more new chunks, and keeps circling around in the latest one until a big write is done, it leaves mostly empty ones behind.
* Without 'discard', the SSD will never learn that all the free space left behind is actually free.
* Eventually all raw disk space is allocated, and users run into problems with ENOSPC and balance etc. So, enabling this ssd mode actually means it starts choking itself to death here. When users see this effect, they start scheduling balance operations to compact free space, to bring the amount of allocated but unused space down a bit.
* But doing that just causes more and more writes to the ssd.
* Also, since balance takes a "usage" argument and not a "how badly fragmented" argument, it causes lots of unnecessary rewriting of data.
* And, with a decent number (like a few thousand) of subvolumes, all having a few snapshots of their own, the data:metadata ratio written during balance skyrockets, causing not only the data to be rewritten, but also pushing lots of metadata out to the ssd. (Example: on my backup server, rewriting 1GiB of data causes writing of >40GiB of metadata, where probably 99.99% of those writes are some kind of intermediary writes which are immediately invalidated during the next btrfs transaction.)

All in all, this reminds me of the series "Breaking Bad", where every step taken to try to fix things only made things worse.
At every bullet point above, this is also happening.

== nossd ==

nossd mode (even still without discard) allows a pattern of overwriting much more previously used space, causing many more implicit discards to happen because of the overwrite information the ssd gets. https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

> And the last part - the hard drive is not aware of filesystems and partitions
> ... so you could have 400GB of this 1TB drive left unpartitioned and
> still you would be cooked. Technically speaking, giving as much space as
> possible on an SSD to a FS and OS that support trim will give
> you the best performance, because the drive will be notified of as much
> disk space as possible that is actually free .....
>
> So, to summarize:
> - don't try to outsmart the built-in mechanics of the SSD (people that
> suggest that are just morons that want to have 5 minutes of fame).
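The forward-only allocation behavior in the bullet points above can be sketched with a grossly simplified toy allocator. This is an assumption for illustration only, not the real btrfs extent allocator: the write cursor only walks forward and never revisits freed space, so chunks keep piling up even when most of the previously written data has been deleted.

```python
# Grossly simplified sketch of the forward-only "ssd" allocation
# pattern described above (illustrative assumption, not btrfs code):
# writes only move the cursor forward; freed space behind it is never
# reused, so new chunks keep getting allocated.

CHUNK = 100   # toy chunk size in arbitrary units

class Chunk:
    def __init__(self):
        self.cursor = 0          # forward-only write position

chunks = [Chunk()]

def write(size):
    # never look back at freed space; allocate a new chunk when the
    # tail chunk cannot fit the write
    if chunks[-1].cursor + size > CHUNK:
        chunks.append(Chunk())
    chunks[-1].cursor += size

for _ in range(30):
    write(10)   # 300 units written over time...
# ...even if most of that data was deleted meanwhile, all the chunks
# stay allocated, because the allocator never walked backwards
print(len(chunks))
```

With deletes interleaved, a reuse-friendly allocator could keep this workload inside a single chunk; the forward-only one ends up with the ENOSPC-and-balance treadmill described above.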
Re: Btrfs/SSD
All stuff that Chris wrote holds true, I just wanted to add flash-specific information (from my experience of writing low level code for operating flash).

So with flash, to erase you have to erase a large allocation block; it used to be 128kB usually (plus some CRC data and stuff it comes to more than 128kB, but we are talking functional data storage space), on newer setups it can be megabytes ... really device dependent. To erase a block you need to supply the whole 128 x 8 bits with a voltage higher than is usually used for IO (it can be even 15V), so it requires an external supply or a built-in internal charge pump to provide that voltage to the block erasure circuitry. This process generates a lot of heat and requires a lot of energy, so the consensus back in the day was that you could erase one block at a time and this could take up to 200ms (0.2 second). After an erase you need to check whether all bits are set to 1 (charged state), and then the sector is marked as ready for storage.

Of course, flash memories are moving forward, and in more demanding environments there are solutions where blocks are grouped into groups which have separate eraser circuits, allowing erasure to be performed in parallel in multiple parts of the flash module; still, you are bound to one erase per group. Another problem is that the erasure procedure locally increases temperature; on flat flashes that's not much of a problem, but on emerging solutions like 3D flashes we might locally experience undesired temperature increases that would either degrade the life span of the flash or simply erase neighbouring blocks.
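The program/erase asymmetry behind all of this can be sketched in a few lines. This is an illustrative model of the semantics described above (the block size is a toy value): an erase charges a whole block to all-ones, and a program operation can only clear bits, never set them, which is why rewriting in place requires erasing the whole block first.

```python
# Sketch of NAND program/erase semantics as described above: erase is a
# whole-block operation that sets every bit to 1 (charged); programming
# ANDs bits in, so it can clear 1 -> 0 but never restore 0 -> 1.

BLOCK_BITS = 8   # toy block size; real erase blocks are 128 kB to MBs

def erase(block):
    return [1] * len(block)            # whole-block: all bits charged

def program(block, data):
    # programming can only clear bits, never set them
    return [b & d for b, d in zip(block, data)]

blk = erase([0] * BLOCK_BITS)
blk = program(blk, [1, 0, 1, 0, 1, 1, 1, 1])
rewrite = program(blk, [1, 1, 1, 1, 0, 0, 1, 1])  # try to flip bits back
print(rewrite)   # stuck zeros remain: a full block erase is required
```

The stuck zeros in the result are exactly why the FTL must relocate live pages and erase whole blocks in the background rather than overwriting in place.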
In terms of over provisioning of SSD it's a give and take relationship ... on good drives there is enough over provisioning to allow normal operation on systems without TRIM ... now if you would use a 1TB drive daily without TRIM and have only 30GB stored on it you will have fantastic performance, but if you want to store 500GB, at roughly 200GB you will hit a brick wall and your writes will slow down to megabytes/s ... this is a symptom of the drive running out of over-provisioning space ... if you would run an OS that issues trim, this problem would not exist, since the drive would know that the whole 970GB of space is free and it would be pre-emptively erased days before.

And the last part - the hard drive is not aware of filesystems and partitions ... so you could have 400GB of this 1TB drive left unpartitioned and still you would be cooked. Technically speaking, giving as much space as possible on an SSD to a FS and OS that support trim will give you the best performance, because the drive will be notified of as much disk space as possible that is actually free .....

So, to summarize:
- don't try to outsmart the built-in mechanics of the SSD (people that suggest that are just morons that want to have 5 minutes of fame).
- don't buy a crap SSD and expect it to behave like a good one if you use below a certain % of it ... it's stupid; buy a more reasonable but smaller SSD and store slow data on spinning rust.
- read more books and Wikipedia; not jumping down on you, but the internet is filled with people that provide false information, sometimes unknowingly, and swear by it (Dunning–Kruger effect :D), and some of them are very good at making all theories sexy and stuff ... you simply have to get used to it...
- if something is too good to be true, then it's not
- the promise of future performance gains is the domain of the "sleazy salesman"

> On 14 May 2017, at 17:21, Chris Murphy wrote:
>
> On Sat, May 13, 2017 at 3:39 AM, Duncan <1i5t5.dun...@cox.net> wrote:
>
>> When I was doing my ssd research the first time around, the going
>> recommendation was to keep 20-33% of the total space on the ssd entirely
>> unallocated, allowing it to use that space as an FTL erase-block
>> management pool.
>
> Any brand name SSD has its own reserve above its specified size to
> ensure that there's decent performance, even when there is no trim
> hinting supplied by the OS and the SSD can therefore only depend on LBA
> "overwrites" to know which blocks are to be freed up.
>
>> Anyway, that 20-33% left entirely unallocated/unpartitioned
>> recommendation still holds, right?
>
> Not that I'm aware of. I've never done this by literally walling off
> space that I won't use. A fairly large percentage of my partitions
> have free space, so it does effectively happen as far as the SSD is
> concerned. And I use the fstrim timer. Most of the file systems support
> trim.
>
> Anyway, I've stuffed a Samsung 840 EVO to 98% full with an OS/file
> system that would not issue trim commands on this drive, and it was
> doing full-performance writes through that point. Then I deleted maybe
> 5% of the files, refilled the drive to 98% again, and it was
> the same performance. So it must have had enough in reserve to permit
> full-performance "overwrites", which were in effect directed to reserve
> blocks as the freed-up blocks were being erased. Thus the erasure
> happening on the fly was not inhibiting performance on this SSD.
Re: Btrfs/SSD
On Sat, May 13, 2017 at 3:39 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> When I was doing my ssd research the first time around, the going
> recommendation was to keep 20-33% of the total space on the ssd entirely
> unallocated, allowing it to use that space as an FTL erase-block
> management pool.

Any brand name SSD has its own reserve above its specified size to ensure that there's decent performance, even when there is no trim hinting supplied by the OS and the SSD can therefore only depend on LBA "overwrites" to know which blocks are to be freed up.

> Anyway, that 20-33% left entirely unallocated/unpartitioned
> recommendation still holds, right?

Not that I'm aware of. I've never done this by literally walling off space that I won't use. A fairly large percentage of my partitions have free space, so it does effectively happen as far as the SSD is concerned. And I use the fstrim timer. Most of the file systems support trim.

Anyway, I've stuffed a Samsung 840 EVO to 98% full with an OS/file system that would not issue trim commands on this drive, and it was doing full-performance writes through that point. Then I deleted maybe 5% of the files, refilled the drive to 98% again, and it was the same performance. So it must have had enough in reserve to permit full-performance "overwrites", which were in effect directed to reserve blocks as the freed-up blocks were being erased. Thus the erasure happening on the fly was not inhibiting performance on this SSD.

Now had I gone to 99.9% full, then deleted say 1GiB, and then started doing a bunch of heavy small-file writes rather than sequential ones? I don't know what would have happened; it might have choked, because heavy IOPS plus erasure is a lot more work for the SSD to deal with. It will invariably be something that's very model and even firmware version specific.
> Am I correct in asserting that if one
> is following that, the FTL already has plenty of erase-blocks available
> for management, and the discussion about filesystem-level trim and free
> space management becomes much less urgent, tho of course it's still worth
> considering if it's convenient to do so?

Most file systems don't direct writes to new areas; they're fairly prone to overwriting. So the firmware is going to be notified fairly quickly, via either trim or an overwrite, which LBAs are stale. It's probably more important with Btrfs, which has more variable behavior: it can continue to direct new writes to recently allocated chunks before it will do overwrites in older chunks that have free space.

> And am I also correct in believing that while it's not really worth
> spending more to over-provision to the near 50% as I ended up doing, if
> things work out that way as they did with me because the difference in
> price between 30% overprovisioning and 50% overprovisioning ends up being
> trivial, there's really not much need to worry about active filesystem
> trim at all, because the FTL has effectively half the device left to play
> erase-block musical chairs with as it decides it needs to?

I think it's not worth overprovisioning by default, ever. Use all of that space until you have a problem. If you have a 256G drive, you paid to get the spec performance for 100% of those 256G. You did not pay that company to second-guess things and cut it slack by overprovisioning from the outset. I don't know how long it takes for erasure to happen, though, so I have no idea how much overprovisioning is really needed at the write rate of the drive, so that it can erase at the same rate as it writes, in order to avoid a slowdown.
I guess an even worse test would be one that intentionally fragments across erase-block boundaries, forcing the firmware to be unable to do erasures without first migrating partially full blocks in order to make them empty, so they can then be erased and used for new writes. That sort of shuffling is what will separate the good drives from the average ones, and is why the drives have multicore CPUs on them, as well as most now having on-the-fly, always-on encryption. Even completely empty, some of these drives have a short-term higher-speed write mode which falls back to a lower speed as the fast flash gets full. After some pause, that fast write capability is restored for future writes. I have no idea if this is a separate kind of flash on the drive, or if it's just a difference in how data is encoded onto the flash that makes it faster. Samsung has a drive that can "simulate" SLC NAND on 3D VNAND. That sounds like an encoding method; it's fast but inefficient and probably needs reencoding later. But that's the thing: the firmware is really complicated now. I kinda wonder if f2fs could be chopped down to become a modular allocator for the existing file systems; activate that allocation method with the "ssd" mount option rather than whatever overly smart thing it does today that's based on assumptions that are now likely outdated. -- Chris Murphy
Re: Btrfs/SSD
Imran Geriskovan posted on Fri, 12 May 2017 15:02:20 +0200 as excerpted: > On 5/12/17, Duncan <1i5t5.dun...@cox.net> wrote: >> FWIW, I'm in the market for SSDs ATM, and remembered this from a couple >> weeks ago so went back to find it. Thanks. =:^) >> >> (I'm currently still on quarter-TB generation ssds, plus spinning rust >> for the larger media partition and backups, and want to be rid of the >> spinning rust, so am looking at half-TB to TB, which seems to be the >> pricing sweet spot these days anyway.) > > Since you are taking ssds to mainstream based on your experience, > I guess your perception of data retention/reliability is better than > that of spinning rust. Right? Can you elaborate? > > Or another criterion might be the physical constraints of spinning rust on > notebooks, which dictate that you should handle the device with care > when running. > > What was your primary motivation other than performance? Well, the /immediate/ motivation is that the spinning rust is starting to hint that it's time to start thinking about rotating it out of service... It's my main workstation so wall-powered, but because it holds the media and secondary backup partitions, I don't have anything from it mounted most of the time, and because it /is/ spinning rust, I allow it to spin down. It spins right back up if I mount it, and reads seem to be fine, but if I let it sit a bit after mount, possibly due to it spinning down again, sometimes I get write errors, SATA resets, etc. Sometimes the write will then eventually appear to go thru, sometimes not, but once this happens, unmounting often times out, and upon a remount (which may or may not work until a clean reboot), the last writes may or may not still be there. And the smart info, while not bad, does indicate it's starting to age, tho not extremely so. Even a year ago I'd have likely played with it, adjusting timeouts, spindowns, etc, attempting to get it working normally again.
But they say that ssd performance spoils you and you don't want to go back, and while it's a media drive and performance isn't normally an issue, those secondary backups to it as spinning rust sure take a lot longer than the primary backups to other partitions on the same pair of ssds that the working copies (of everything but media) are on. Which means I don't like to do them... which means sometimes I put them off longer than I should. Basically, it's another application of my "don't make it so big it takes so long to maintain you don't do it as you should" rule, only here, it's not the size but rather that I've been spoiled by the performance of the ssds. So couple the aging spinning rust with the fact that I've really wanted to put media and the backups on ssd all along, only it couldn't be cost-justified a few years ago when I bought the original ssds, and I now have my excuse to get the now cheaper ssds I really wanted all along. =:^) As for reliability... For archival usage I still think spinning rust is more reliable, and certainly more cost effective. However, for me at least, with some real-world ssd experience under my belt now, including an early slow failure (more and more blocks going bad; I deliberately kept running it in btrfs raid1 mode with scrubs handling the bad blocks for quite some time, just to get the experience both with ssds and with btrfs) and replacement of one of the ssds with one I had originally bought for a different machine (my netbook, which went missing shortly thereafter), I now find ssds reliable enough for normal usage, certainly so if the data is valuable enough to have backups of it anyway; and if it's not valuable enough to be worth doing backups, then losing it is obviously not a big deal, because it's self-evidently worth less than the time, trouble and resources of doing that backup. Particularly so if the speed of ssds helpfully encourages you to keep the backups more current than you would otherwise.
=:^) But spinning rust remains appropriate for long-term archival usage, like that third-level last-resort backup I like to make, then keep on the shelf, or store with a friend, or in a safe deposit box, or whatever, and basically never use, but like to have just in case. IOW, that almost certainly write once, read-never, seldom update, last resort backup. If three years down the line there's a fire/flood/whatever, and all I can find in the ashes/mud or retrieve from that friend is that three year old backup, I'll be glad to still have it. Of course those who have multi-TB scale data needs may still find spinning rust useful as well, because while 4-TB ssds are available now, they're /horribly/ expensive. But with 3D-NAND, even that use-case looks like it may go ssd in the next five years or so, leaving multi-year to decade-plus archiving, and perhaps say 50-TB-plus, but that's going to take long enough to actually write or otherwise do anything with it's effectively
[OT] SSD performance patterns (was: Btrfs/SSD)
Am Sat, 13 May 2017 09:39:39 +0000 (UTC) schrieb Duncan <1i5t5.dun...@cox.net>: > Kai Krakow posted on Fri, 12 May 2017 20:27:56 +0200 as excerpted: > > > In the end, the more continuous blocks of free space there are, the > > better the chance for proper wear leveling. > > Talking about which... > > When I was doing my ssd research the first time around, the going > recommendation was to keep 20-33% of the total space on the ssd > entirely unallocated, allowing it to use that space as an FTL > erase-block management pool. > > At the time, I added up all my "performance matters" data dirs and > allowing for reasonable in-filesystem free-space, decided I could fit > it in 64 GB if I had to, tho 80 GB would be a more comfortable fit, > so allowing for the above entirely unpartitioned/unused slackspace > recommendations, had a target of 120-128 GB, with a reasonable range > depending on actual availability of 100-160 GB. > > It turned out, due to pricing and availability, I ended up spending > somewhat more and getting 256 GB (238.5 GiB). Of course that allowed > me much more flexibility than I had expected and I ended up with > basically everything but the media partition on the ssds, PLUS I > still left them at only just over 50% partitioned, (using the gdisk > figures, 51%- partitioned, 49%+ free). I put my ESP (for UEFI) onto the SSD and also played with putting swap onto it, dedicated to hibernation. But I discarded the hibernation idea and removed the swap because it didn't work well: It wasn't much faster than waking from HDD, and hibernation is not that reliable anyway. Also, hybrid hibernation is not yet integrated into KDE, so I stick to sleep mode currently. The rest of my SSD (also 500GB) is dedicated to bcache. This fits the complete working set of my daily work, with hit ratios going up to 90% and beyond. My system boots and feels like SSD, the HDDs are almost silent, and still my file system is 3TB on 3x 1TB HDD.
> Given that, I've not enabled btrfs trim/discard (which saved me from > the bugs with it a few kernel cycles ago), and while I do have a > weekly fstrim systemd timer setup, I've not had to be too concerned > about btrfs bugs (also now fixed, I believe) when fstrim on btrfs was > known not to be trimming everything it really should have been. This is a good recommendation, as TRIM is still a slow operation because Queued TRIM is not used for most drives due to buggy firmware. So you not only circumvent kernel and firmware bugs, but also get better performance that way. > Anyway, that 20-33% left entirely unallocated/unpartitioned > recommendation still holds, right? Am I correct in asserting that if > one is following that, the FTL already has plenty of erase-blocks > available for management and the discussion about filesystem level > trim and free space management becomes much less urgent, tho of > course it's still worth considering if it's convenient to do so? > > And am I also correct in believing that while it's not really worth > spending more to over-provision to the near 50% as I ended up doing, > if things work out that way as they did with me because the > difference in price between 30% overprovisioning and 50% > overprovisioning ends up being trivial, there's really not much need > to worry about active filesystem trim at all, because the FTL has > effectively half the device left to play erase-block musical chairs > with as it decides it needs to? I think things may have changed since long ago. See below. But it certainly depends on which drive manufacturer you chose, I guess. I can at least confirm that bigger drives wear through their write cycles much more slowly, even when filled up. My old 128GB Crucial drive was worn out after only 1 year (I swapped it early; I kept an eye on the SMART numbers). My 500GB Samsung drive is around 1 year old now, and I write a lot more data to it, but according to SMART it should work for at least 5 to 7 more years.
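For reference, the weekly-fstrim approach mentioned above is packaged on most systemd-based distros as util-linux's fstrim units; a minimal sketch, assuming such a distro:

```shell
# Enable the packaged weekly TRIM timer (util-linux) instead of
# mounting with the online 'discard' option:
systemctl enable --now fstrim.timer

# One-off batched trim of all mounted filesystems that support it,
# reporting how many bytes were trimmed per mount point:
fstrim --all --verbose
```

Batching the trims this way avoids issuing (unqueued, hence pipeline-stalling) TRIM commands in the hot write path.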
By that time, I'll probably have swapped it for a bigger drive anyway. So I guess you should maybe look at your SMART numbers and calculate the expected lifetime: Power_on_Hours(RAW) * WLC(VALUE) / (100-WLC(VALUE)), with WLC = Wear_Leveling_Count, should get you the expected remaining power-on hours. My drive is powered on 24/7 most of the time, but if you power your drive only 8 hours per day, you can easily triple the calendar lifetime compared to mine. ;-) There is also Total_LBAs_Written, but that, at least for me, usually gives much higher lifetime values, so I'd stick with the pessimistic ones. Even when WLC goes to zero, the drive should still have reserved blocks available. My drive sets the threshold for WLC to 0, which makes me think that it is not fatal when it hits 0, because the drive still has reserved blocks. And for reserved blocks, the threshold is 10%. Now combine that with your planning of getting a new drive, and you can optimize space efficiency vs. lifetime better. > Of course the higher per-GiB cost
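Kai's rule of thumb can be checked with plain shell arithmetic. Using the smartctl values posted earlier in this thread (Power_On_Hours raw = 18227, Wear_Leveling_Count normalized value = 1), and assuming the normalized value really counts down from 100:

```shell
# remaining hours ~= Power_On_Hours(RAW) * WLC(VALUE) / (100 - WLC(VALUE))
poh_raw=18227   # SMART attribute 9, raw value
wlc=1           # SMART attribute 177, normalized value
echo $(( poh_raw * wlc / (100 - wlc) ))
```

which prints 184 -- roughly a week of power-on time left by this estimate, consistent with that drive being described as at the edge of its lifetime.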
Re: Btrfs/SSD
> Anyway, that 20-33% left entirely unallocated/unpartitioned > recommendation still holds, right? I never liked that idea. And I really disliked how people considered it to be (and even passed it down as) some magical, absolute, stupid-proof fail-safe thing (because it's not). 1: Unless you reliably trim the whole LBA space (and/or run ata_secure_erase on the whole drive) before you (re-)partition the LBA space, you have zero guarantee that the drive's controller/firmware will treat the unallocated space as empty rather than keeping its content around as useful data (even if it's full of zeros, because zeros could be very useful data unless the space is specifically marked as "throwaway" by trim/erase). On the other hand, a trim-compatible filesystem should properly mark (trim) all (or at least most of) the free space as free (= free to erase internally at the controller's discretion). And even if trim isn't fail-proof either, those bugs should be temporary (and it's not like a sane SSD will die in a few weeks due to this kind of issue during sane usage, while crazy drives will often fail under crazy usage regardless of trim and spare space). 2: It's not some daemon-summoning, world-ending catastrophe if you occasionally happen to fill your SSD to ~100%. It probably won't like it (it will probably get slow by the end of the writes, and the internal write amplification might skyrocket at its peak), but nothing extraordinary will happen, and normal operation (high write speed, normal internal write amplification, etc) should resume soon after you make some room (for example, you delete your temporary files or move some old content to archive storage and properly trim that space). That space is there to be used; just don't leave it close to 100% all the time, and try never to leave it close to 100% when you plan to keep it busy with many small random writes.
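Point 1 above -- making sure the firmware actually knows the to-be-unallocated space is erasable -- can be done with util-linux's blkdiscard before repartitioning. A sketch; the device name is a placeholder and the command is destructive:

```shell
# DESTRUCTIVE: discards every LBA on the device, so the firmware may
# treat the whole drive as erased. Replace /dev/sdX with the actual
# device only after triple-checking it.
blkdiscard /dev/sdX

# Afterwards, (re-)partition and leave the desired share unallocated;
# the controller now knows that space holds no live data.
```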
3: Some drives have plenty of hidden internal spare space (especially the expensive kinds offered for datacenters or "enthusiast" consumers by big companies like Intel and such). Even some cheap drives might have plenty of erased space at 100% LBA allocation if they use compression internally (and you don't fill them up to 100% with incompressible content).
Re: Btrfs/SSD
Am Sat, 13 May 2017 14:52:47 +0500 schrieb Roman Mamedov: > On Fri, 12 May 2017 20:36:44 +0200 > Kai Krakow wrote: > > > My concern is with fail scenarios of some SSDs which die unexpectedly > > and horribly. I found some reports of older Samsung SSDs which > > failed suddenly and unexpectedly, and in a way that the drive > > completely died: No more data access, everything gone. HDDs start > > with bad sectors and there's a good chance I can recover most of > > the data except a few sectors. > > Just have your backups up-to-date, doesn't matter if it's SSD, HDD or > any sort of RAID. > > In a way it's even better, that SSDs [are said to] fail abruptly and > entirely. You can then just restore from backups and go on. Whereas a > failing HDD can leave you puzzled on e.g. whether it's a cable or > controller problem instead, and possibly can even cause some data > corruption which you won't notice until too late. My current backup strategy can handle this. I never back up files from the source again if they haven't changed by timestamp. That way, silent data corruption won't creep into the backup. Additionally, I keep a backlog of 5 years of file history. Even if a corrupted file creeps into the backup, there is enough time to get a good copy back. If it's older, it probably doesn't hurt so much anyway. -- Regards, Kai Replies to list-only preferred.
Re: Btrfs/SSD
On Fri, 12 May 2017 20:36:44 +0200 Kai Krakow wrote: > My concern is with fail scenarios of some SSDs which die unexpectedly and > horribly. I found some reports of older Samsung SSDs which failed > suddenly and unexpectedly, and in a way that the drive completely died: > No more data access, everything gone. HDDs start with bad sectors and > there's a good chance I can recover most of the data except a few > sectors. Just have your backups up-to-date, doesn't matter if it's SSD, HDD or any sort of RAID. In a way it's even better, that SSDs [are said to] fail abruptly and entirely. You can then just restore from backups and go on. Whereas a failing HDD can leave you puzzled on e.g. whether it's a cable or controller problem instead, and possibly can even cause some data corruption which you won't notice until too late. -- With respect, Roman
Re: Btrfs/SSD
Kai Krakow posted on Fri, 12 May 2017 20:27:56 +0200 as excerpted: > In the end, the more continuous blocks of free space there are, the > better the chance for proper wear leveling. Talking about which... When I was doing my ssd research the first time around, the going recommendation was to keep 20-33% of the total space on the ssd entirely unallocated, allowing it to use that space as an FTL erase-block management pool. At the time, I added up all my "performance matters" data dirs and allowing for reasonable in-filesystem free-space, decided I could fit it in 64 GB if I had to, tho 80 GB would be a more comfortable fit, so allowing for the above entirely unpartitioned/unused slackspace recommendations, had a target of 120-128 GB, with a reasonable range depending on actual availability of 100-160 GB. It turned out, due to pricing and availability, I ended up spending somewhat more and getting 256 GB (238.5 GiB). Of course that allowed me much more flexibility than I had expected and I ended up with basically everything but the media partition on the ssds, PLUS I still left them at only just over 50% partitioned, (using the gdisk figures, 51%- partitioned, 49%+ free). Given that, I've not enabled btrfs trim/discard (which saved me from the bugs with it a few kernel cycles ago), and while I do have a weekly fstrim systemd timer setup, I've not had to be too concerned about btrfs bugs (also now fixed, I believe) when fstrim on btrfs was known not to be trimming everything it really should have been. Anyway, that 20-33% left entirely unallocated/unpartitioned recommendation still holds, right? Am I correct in asserting that if one is following that, the FTL already has plenty of erase-blocks available for management and the discussion about filesystem level trim and free space management becomes much less urgent, tho of course it's still worth considering if it's convenient to do so? 
And am I also correct in believing that while it's not really worth spending more to over-provision to the near 50% as I ended up doing, if things work out that way as they did with me because the difference in price between 30% overprovisioning and 50% overprovisioning ends up being trivial, there's really not much need to worry about active filesystem trim at all, because the FTL has effectively half the device left to play erase-block musical chairs with as it decides it needs to? Of course the higher per-GiB cost of ssd as compared to spinning rust does mean that the above overprovisioning recommendation really does hurt, most of the time, driving per-usable-GB costs even higher, and as I recall that was definitely the case back then between 80 GiB and 160 GiB. It was basically an accident of timing, that I was buying just as the manufacturers flooded the market with newly cost-effective 256 GB devices, that meant they were only trivially more expensive than the 128 or 160 GB, AND unlike the smaller devices, actually /available/ in the 500-ish MB/sec performance range that (for SATA-based SSDs) is actually capped by SATA-600 bus speeds more than the chips themselves. (There were lower cost 128 GB devices, but they were lower speed than I wanted, too.) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: Btrfs/SSD
On 5/12/17, Kai Krakow wrote: > I don't think it is important for the file system to know where the SSD > FTL located a data block. It's just important to keep everything nicely > aligned with erase block sizes, reduce rewrite patterns, and free up > complete erase blocks as well as possible. Yeah. "Tight packing" of data into erase blocks will reduce fragmentation at the flash level, but not necessarily the fragmentation at the fs level. And unless we are writing in a continuous journaling style (as f2fs does?), we still need to have some info about the erase blocks. Of course, while all this is going on, there is also something like round-robin mapping or some kind of journaling happening at the low level of the flash, for wear leveling/bad block replacement, which is totally invisible to us. > Maybe such a process should be called "compaction" and not > "defragmentation". In the end, the more continuous blocks of free space > there are, the better the chance for proper wear leveling. Tight packing into erase blocks seems to be the dominant factor for ssd welfare. However, fs fragmentation may still be a thing to consider, because increased fs fragmentation will probably increase the # of erase blocks involved, affecting both read/write performance and wear. Keeping an eye on both is a tough job. Worse, there are "two" uncoordinated eyes, one watching the "fs" and the other watching the "flash", making the whole process suboptimal. I think the ultimate utopian combination would be an "absolutely dumb flash controller" providing direct access to physical bytes, and the ultimate "Flash FS" making use of every possible performance and wear-leveling trick. Clearly, we are far from it.
Re: Btrfs/SSD
Am Fri, 12 May 2017 15:02:20 +0200 schrieb Imran Geriskovan: > On 5/12/17, Duncan <1i5t5.dun...@cox.net> wrote: > > FWIW, I'm in the market for SSDs ATM, and remembered this from a > > couple weeks ago so went back to find it. Thanks. =:^) > > > > (I'm currently still on quarter-TB generation ssds, plus spinning > > rust for the larger media partition and backups, and want to be rid > > of the spinning rust, so am looking at half-TB to TB, which seems > > to be the pricing sweet spot these days anyway.) > > Since you are taking ssds to mainstream based on your experience, > I guess your perception of data retention/reliability is better than > that of spinning rust. Right? Can you elaborate? > > Or another criterion might be the physical constraints of spinning rust > on notebooks, which dictate that you should handle the device > with care when running. > > What was your primary motivation other than performance? Personally, I don't really trust SSDs so much. They are much more robust when it comes to physical damage because there are no moving parts. That's absolutely not my concern. Regarding this, I trust SSDs better than HDDs. My concern is with fail scenarios of some SSDs which die unexpectedly and horribly. I found some reports of older Samsung SSDs which failed suddenly and unexpectedly, and in a way that the drive completely died: No more data access, everything gone. HDDs start with bad sectors, and there's a good chance I can recover most of the data except a few sectors. When SSD blocks die, they are probably huge compared to a sector (256kB to 4MB usually, because those are typical erase block sizes). If this happens, the firmware may decide to either allow read-only access or completely deny access. There's another situation where dying storage chips may completely mess up the firmware so that there's no longer any access to data. That's why I don't trust any of my data to them. But I still want the benefit of their speed. So I use SSDs mostly as frontend caches to HDDs.
This gives me big storage with fast access. Indeed, I'm using bcache successfully for this. A warm cache is almost as fast as native SSD (at least it feels almost that fast; it will be slower if you throw benchmarks at it). -- Regards, Kai Replies to list-only preferred.
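For the curious, a bcache setup along those lines takes only a few commands with bcache-tools. This is a destructive sketch with placeholder device names and a made-up cache-set UUID:

```shell
# DESTRUCTIVE sketch (bcache-tools). /dev/sdX = HDD, /dev/sdY = SSD.
make-bcache -B /dev/sdX    # format the HDD as the backing device
make-bcache -C /dev/sdY    # format the SSD as the cache device

# Attach the cache set (UUID printed by 'make-bcache -C', also visible
# under /sys/fs/bcache/) and choose a caching mode via sysfs:
echo 0226553a-37cf-41d5-b3ce-8b1e944543a8 > /sys/block/bcache0/bcache/attach
echo writeback > /sys/block/bcache0/bcache/cache_mode
```

The filesystem then goes on /dev/bcache0 instead of the raw HDD; switching cache_mode to writearound later (as discussed at the top of the thread) is the same one-line sysfs write.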
Re: Btrfs/SSD
Am Tue, 18 Apr 2017 15:02:42 +0200 schrieb Imran Geriskovan: > On 4/17/17, Austin S. Hemmelgarn wrote: > > Regarding BTRFS specifically: > > * Given my recently newfound understanding of what the 'ssd' mount > > option actually does, I'm inclined to recommend that people who are > > using high-end SSD's _NOT_ use it as it will heavily increase > > fragmentation and will likely have near zero impact on actual device > > lifetime (but may _hurt_ performance). It will still probably help > > with mid and low-end SSD's. > > I'm trying to have a proper understanding of what "fragmentation" > really means for an ssd and its interrelation with wear-leveling. > > Before continuing let's remember: > Pages cannot be erased individually, only whole blocks can be erased. > The size of a NAND-flash page can vary, and most drives have pages > of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256 > pages, which means that the size of a block can vary between 256 KB > and 4 MB. > codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/ > > Let's continue: > Since block sizes are between 256k-4MB, data smaller than this will > "probably" not be fragmented in a reasonably empty and trimmed > drive. And for a brand new ssd we may speak of contiguous series > of blocks. > > However, as the drive is used more and more and as wear leveling kicks > in (i.e. blocks are remapped), the meaning of "contiguous blocks" will > erode. So any file bigger than a block size will be written to blocks > physically apart no matter what their block addresses say. But my > guess is that accessing device blocks - contiguous or not - is a > constant-time operation. So it would not contribute to performance > issues. Right? Comments? > > So your feeling about fragmentation/performance is probably > related to whether the file is spread across fewer or more blocks. If the # of > blocks used is higher than necessary (i.e. no empty blocks can be > found.
Instead lots of partially empty blocks have to be used, > increasing the total # of blocks involved) then we will notice > performance loss. > > Additionally, if the filesystem is going to try something to reduce > the fragmentation of the blocks, it should precisely know where > those blocks are located. Then how about ssd block information? > Is it available, and do filesystems use it? > > Anyway, if you can provide some more details about your experiences > on this, we can probably get a better view of the issue. What you really want for SSD is not defragmented files but defragmented free space. That increases life time. So, defragmentation on SSD makes sense if it cares more about free space than about the file data itself. But of course, over time, fragmentation of file data (be it meta data or content data) may introduce overhead - and in btrfs it probably really makes a difference, if I scan through some of the past posts. I don't think it is important for the file system to know where the SSD FTL located a data block. It's just important to keep everything nicely aligned with erase block sizes, reduce rewrite patterns, and free up complete erase blocks as well as possible. Maybe such a process should be called "compaction" and not "defragmentation". In the end, the more continuous blocks of free space there are, the better the chance for proper wear leveling. -- Regards, Kai Replies to list-only preferred.
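The quoted page and pages-per-block figures pin down the erase-block range Imran cites; a trivial arithmetic check (erase block = page size * pages per block):

```shell
# Smallest quoted combination: 2 KB pages * 128 pages per block.
echo $(( 2 * 1024 * 128 ))      # 262144 bytes = 256 KiB
# Largest quoted combination: 16 KB pages * 256 pages per block.
echo $(( 16 * 1024 * 256 ))     # 4194304 bytes = 4 MiB
```

matching the 256 KB - 4 MB erase-block range quoted above.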
Re: Btrfs/SSD
On 5/12/17, Duncan <1i5t5.dun...@cox.net> wrote: > FWIW, I'm in the market for SSDs ATM, and remembered this from a couple > weeks ago so went back to find it. Thanks. =:^) > > (I'm currently still on quarter-TB generation ssds, plus spinning rust > for the larger media partition and backups, and want to be rid of the > spinning rust, so am looking at half-TB to TB, which seems to be the > pricing sweet spot these days anyway.) Since you are taking ssds to mainstream based on your experience, I guess your perception of data retention/reliability is better than that of spinning rust. Right? Can you elaborate? Or another criterion might be the physical constraints of spinning rust on notebooks, which dictate that you should handle the device with care when running. What was your primary motivation other than performance?
Re: Btrfs/SSD
Austin S. Hemmelgarn posted on Mon, 17 Apr 2017 07:53:04 -0400 as excerpted: > * In my personal experience, Intel, Samsung, and Crucial appear to be > the best name brands (in relative order of quality). I have personally > had bad experiences with SanDisk and Kingston SSD's, but I don't have > anything beyond circumstantial evidence indicating that it was anything > but bad luck on both counts. FWIW, I'm in the market for SSDs ATM, and remembered this from a couple weeks ago so went back to find it. Thanks. =:^) (I'm currently still on quarter-TB generation ssds, plus spinning rust for the larger media partition and backups, and want to be rid of the spinning rust, so am looking at half-TB to TB, which seems to be the pricing sweet spot these days anyway.) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: Btrfs/SSD
On Mon, Apr 17, 2017 at 4:55 PM, Hans van Kranenburg <hans.van.kranenb...@mendix.com> wrote: > On 04/17/2017 09:22 PM, Imran Geriskovan wrote: >> [...] >> >> Going over the thread following questions come to my mind: >> >> - What exactly does btrfs ssd option does relative to plain mode? > > There's quite an amount of information in the the very recent threads: > - "About free space fragmentation, metadata write amplification and (no)ssd" > - "BTRFS as a GlusterFS storage back-end, and what I've learned from > using it as such." > - "btrfs filesystem keeps allocating new chunks for no apparent reason" > - ... and a few more > > I suspect there will be some "summary" mails at some point, but for now, > I'd recommend crawling through these threads first. > > And now for your instant satisfaction, a short visual guide to the > difference, which shows actual btrfs behaviour instead of our guesswork > around it (taken from the second mail thread just mentioned): > > -o ssd: > > https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4 > > -o nossd: > > https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4 I'm uncertain from these if the option affects both metadata and data writes, or just data. The latter makes some sense, if you think a given data write event contains related files and thus increases the chance when those files are deleted of having a mostly freed up erase block. That way wear leveling is doing less work. For metadata writes it makes less sense to me, and is inconsistent with what I've seen from metadata chunk allocation. Pretty much anything means dozens or more 16K nodes are being COWd. e.g. a 2KiB write to systemd journal, even preallocated, means adding an EXTENT DATA item, one of maybe 200 per node, which means that whole node must be COWd, and whatever its parent is must be written (ROOT ITEM I think) and then tree root, and then super block. 
I see generally 30 16K nodes modified in about 4 minutes with average logging. Even if it's 1 change per 4 minutes, and all 30 nodes get written to one 2MB block, and then that block isn't ever written to again, the metadata chunk would be growing and I don't see that. For weeks or months I see a 512MB metadata chunk and it doesn't ever get bigger than this. Anyway, I think the ssd mount option still sounds plausibly useful. What I'm skeptical of on SSD is defragmenting without compression, and also nocow. -- Chris Murphy
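The back-of-envelope arithmetic above can be made explicit. A minimal sketch, assuming the 2 MiB allocation granularity of -o ssd and the observed figures from this message (30 16K nodes COWed per roughly 4-minute window); the "worst case" assumes every commit lands in a fresh 2 MiB region that is never reused:

```python
# Worst-case metadata chunk growth if every commit consumed a fresh
# 2 MiB region that was never reused (illustrative figures from the text).
KIB = 1024
MIB = 1024 * KIB

node_size = 16 * KIB           # btrfs metadata node size
nodes_per_commit = 30          # observed COWed nodes per ~4-minute window
commit_interval_min = 4
region = 2 * MIB               # allocation granularity with -o ssd

commits_per_day = 24 * 60 // commit_interval_min          # 360 commits/day
worst_case_growth_per_day = commits_per_day * region      # bytes of chunk growth
payload_per_day = commits_per_day * nodes_per_commit * node_size

print(worst_case_growth_per_day // MIB)   # 720 MiB/day of chunk growth
print(payload_per_day // MIB)             # 168 MiB/day of actual node writes
```

Since the observed metadata chunk sits at 512 MB for months instead of growing by anything like 720 MiB/day, freed 2 MiB regions must be getting reused rather than forcing new chunk allocations, which matches the observation above.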
Re: Btrfs/SSD
On 2017-04-18 09:02, Imran Geriskovan wrote: On 4/17/17, Austin S. Hemmelgarn wrote: Regarding BTRFS specifically: * Given my recently newfound understanding of what the 'ssd' mount option actually does, I'm inclined to recommend that people who are using high-end SSD's _NOT_ use it as it will heavily increase fragmentation and will likely have near zero impact on actual device lifetime (but may _hurt_ performance). It will still probably help with mid and low-end SSD's. I'm trying to have a proper understanding of what "fragmentation" really means for an ssd and its interrelation with wear-leveling. Before continuing, let's remember: Pages cannot be erased individually, only whole blocks can be erased. The size of a NAND-flash page can vary, and most drives have pages of 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256 pages, which means that the size of a block can vary between 256 KB and 4 MB. codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/ Let's continue: Since block sizes are between 256k-4MB, data smaller than this will "probably" not be fragmented in a reasonably empty and trimmed drive. And for a brand new ssd we may speak of contiguous series of blocks. We're slightly talking past each other here. I'm referring to fragmentation on the filesystem level. This impacts performance on SSD's because it necessitates a larger number of IO operations to read the data off of the device (which is also the case on traditional HDD's, but it has near zero impact there compared to the seek latency). You appear to be referring to fragmentation at the level of the flash-translation layer (FTL), which is present in almost any SSD, and should have near zero impact on performance if the device has good firmware and a decent controller. However, as the drive is used more and more and wear leveling kicks in (i.e. blocks are remapped) the meaning of "contiguous blocks" will erode.
So any file bigger than a block size will be written to blocks physically apart no matter what their block addresses say. But my guess is that accessing device blocks - contiguous or not - is a constant-time operation. So it would not contribute to performance issues. Right? Comments? Correct. So your feeling about fragmentation/performance is probably related to whether the file is spread across fewer or more blocks. If the # of blocks used is higher than necessary (i.e. no empty blocks can be found; instead lots of partially empty blocks have to be used, increasing the total # of blocks involved) then we will notice performance loss. Kind of. As an example, consider a 16MB file on a device that can read up to 16MB of data in a single read operation (arbitrary numbers chosen to make math easier). If you copy that file onto the device while it's idle and has a block of free space 16MB in size, it will end up as one extent (in BTRFS at least, and probably also in most other extent-based filesystems). In that case, it will take 1 read operation to read the whole file into memory. If instead that file gets created with multiple extents that aren't right next to each other on disk, you will need a number of read operations equal to the number of extents to read the file into memory. The performance loss I'm referring to when talking about fragmentation is the result of the increased number of read operations required to read a file with a larger number of extents into memory. It actually has nothing to do with whether or not the device is an SSD, a HDD, a DVD, NVRAM, SPI NOR flash, an SD card, or any other storage device, it just has more impact on storage devices that have zero seek latency because the seek latency usually far exceeds the overhead of the extra read operations. Additionally, if the filesystem is going to try something to reduce the fragmentation for the blocks, it should precisely know where those blocks are located. Then how about ssd block information?
Are they available and do filesystems use it? Anyway if you can provide some more details about your experiences on this we can probably have a better view on the issue. * Files with NOCOW and filesystems with 'nodatacow' set will both hurt performance for BTRFS on SSD's, and appear to reduce the lifetime of the SSD. This and other experiences tell us it is still possible to "forge some blocks of ssd". How could this be possible if there is wear-leveling? Two alternatives come to mind: - If there are no empty (trimmed) blocks left on the ssd, it will have no chance other than forging the block. How about its reserve blocks? Are they exhausted too? Or are they only used as bad block replacements? - No proper wear-levelling is actually done by the drive.
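The 16MB-file example above can be sketched as a quick model of why extent count drives read cost. The MAX_READ figure and the extent layouts are illustrative assumptions taken from the example, not measured device parameters:

```python
import math

MAX_READ = 16 * 1024**2   # example: device handles up to 16 MiB per read request

def read_ops(extent_sizes):
    # Each extent is contiguous on disk, so reading it takes
    # ceil(size / MAX_READ) requests; separate extents can never be
    # merged into one request because they are not adjacent.
    return sum(math.ceil(size / MAX_READ) for size in extent_sizes)

file_size = 16 * 1024**2
print(read_ops([file_size]))            # 1 extent  -> 1 request
print(read_ops([file_size // 8] * 8))   # 8 extents -> 8 requests
```

The same total number of bytes is transferred either way; the fragmented layout just costs eight fixed per-request overheads instead of one, which is exactly the loss described above.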
Re: Btrfs/SSD
On 2017-04-17 15:22, Imran Geriskovan wrote: On 4/17/17, Roman Mamedov <r...@romanrm.net> wrote: "Austin S. Hemmelgarn" <ahferro...@gmail.com> wrote: * Compression should help performance and device lifetime most of the time, unless your CPU is fully utilized on a regular basis (in which case it will hurt performance, but still improve device lifetimes). Days are long gone since the end user had to ever think about device lifetimes with SSDs. Refer to endurance studies such as It has been demonstrated that all SSDs on the market tend to overshoot even their rated TBW by several times, as a result it will take any user literally dozens of years to wear out the flash no matter which filesystem or what settings used. And most certainly it's not worth it changing anything significant in your workflow (such as enabling compression if it's otherwise inconvenient or not needed) just to save the SSD lifetime. Going over the thread, the following questions come to my mind: - What exactly does the btrfs ssd option do relative to plain mode? Assuming I understand what it does correctly, it prioritizes writing into larger, 2MB aligned chunks of free-space, whereas normal mode goes for 64k alignment. - Most(all?) SSDs employ wear leveling. Isn't it? That is they are constantly remapping their blocks under the hood. So isn't it meaningless to speak of some kind of a block forging/fragmentation/etc. effect of any writing pattern? Because making one big I/O request to fetch a file is faster than a bunch of small ones. If your file is all in one extent in the filesystem, it takes less work to copy to memory than if you're pulling from a dozen places on the device. This doesn't have much impact on light workloads, but when you're looking at heavy server workloads, it's big. - If it is so, doesn't it mean that there is no better ssd usage strategy other than minimizing the total bytes written? That is whatever we do, if it contributes to this fact it is good, otherwise bad.
Are all other things beyond any user control? Is there a recommended setting? As a general strategy, yes, that appears to be the case. On a specific SSD, it may not be. For example, on the Crucial MX300's I have in most of my systems, the 'ssd' mount option actually makes things slower by anywhere from 2-10%. - How about "data retention" experiences? It is known that new ssds can hold data safely for a longer period. As they age that margin gets shorter. As an extreme case if I write into a new ssd and shelve it, can I get my data back after 5 years? How about a file written 5 years ago and never touched again although the rest of the ssd is in active use during that period? - Yes, maybe lifetimes are getting irrelevant. However, TBW still has a direct relation with data retention capability. Knowing that writing more data to an ssd can reduce the "life time of your data" is something strange. Explaining this and your comment above requires a bit of understanding of how flash memory actually works. The general structure of a single cell is that of a field-effect transistor (almost always a MOSFET) with a floating gate which consists of a bit of material electrically isolated from the rest of the transistor. Data is stored by trapping electrons on this floating gate, but getting them there requires a strong enough current to break through the insulating layer that keeps it isolated from the rest of the transistor. This process breaks down the insulating layer over time, making it easier for the electrons trapped in the floating gate to leak back into the rest of the transistor, thus losing data.
Aside from the write-based degradation of the insulating layer, there are other things that can cause it to break down or for the electrons to leak out, including very high temperatures (we're talking industrial temperatures here, not the type you're likely to see in most consumer electronics), strong electromagnetic fields (again, we're talking _really_ strong here, not stuff you're likely to see in most consumer electronics), cosmic background radiation, and even noise from other nearby cells being rewritten (known as a program disturb error, only an issue in NAND flash (but that's what all SSD's are these days)). - But someone can come and say: Hey don't worry about "data retention years". Because your ssd will already be dead before data retention becomes a problem for you... Which is relieving.. :)) Anyway what are your opinions? On this in particular, my opinion is that that claim is bogus unless you have an SSD designed to brick itself after a fixed period of time. That statement is about the same as saying that you don't need to worry about uncorrectable errors in ECC RAM because you'll lose entire chips before they ever happen. In both cases, you should indeed be worrying more about catastrophic failure, but that's because it will have a bigger impact.
Re: Btrfs/SSD
On Tue, Apr 18, 2017 at 07:31:34AM -0400, Austin S. Hemmelgarn wrote: > On 2017-04-17 15:39, Chris Murphy wrote: > >On Mon, Apr 17, 2017 at 1:26 PM, Austin S. Hemmelgarn > >wrote: > >>On 2017-04-17 14:34, Chris Murphy wrote: [...] > >It's almost like we need these things to not fsync at all, and just > >rely on the filesystem commit time... > > > Essentially yes, but that causes all kinds of other problems. > >>> > >>> > >>>Drat. > >>> > >>Admittedly most of the problems are use-case specific (you can't afford to > >>lose transactions in a financial database for example, so it functionally > >>has to call fsync after each transaction), but most of it stems from the > >>fact that BTRFS is doing a lot of the same stuff that much of the 'problem' > >>software is doing itself internally. > >> > > > >Seems like the old way of doing things, and the staleness of the > >internet, have colluded to create a lot of nervousness and misuse of > >fsync. The very fact Btrfs needs a log tree to deal with fsync's in a > >semi-sane way... > Except that BTRFS is somewhat unusual. Prior to this, the only > 'mainstream' filesystem that provided most of these features was > ZFS, and that does a good enough job that this doesn't matter. > > For something like a database though, where you need ACID > guarantees, you pretty much have to have COW semantics internally, > and you have to force things to stable storage after each > transaction that actually modifies data. Looking at it another way, > most database storage formats are essentially record-oriented > filesystems (as opposed to block-oriented filesystems that most > people think of). This is part of why you see such similar access > patterns in databases and VM disk images (even if the VM isn't > running database software), they are essentially doing the same > things at a low level. 
I remember thinking, when I was learning about the internals of btrfs, that it looked an awful lot like the high-level description of the internals of Oracle which I'd just been learning about. Most of the same pieces, doing mostly the same kinds of operations to achieve the same effective results. Hugo. -- Hugo Mills | Don't worry, he's not drunk. He's like that all the hugo@... carfax.org.uk | time. http://carfax.org.uk/ | PGP: E2AB1DE4 | A.H. Deakin
Re: Btrfs/SSD
On 2017-04-17 15:39, Chris Murphy wrote: On Mon, Apr 17, 2017 at 1:26 PM, Austin S. Hemmelgarn wrote: On 2017-04-17 14:34, Chris Murphy wrote: Nope. The first paragraph applies to NVMe machine with ssd mount option. Few fragments. The second paragraph applies to SD Card machine with ssd_spread mount option. Many fragments. Ah, apologies for my misunderstanding. These are different versions of systemd-journald so I can't completely rule out a difference in write behavior. There have only been a couple of changes in the write patterns that I know of, but I would double check that the values for Seal and Compress in the journald.conf file are the same, as I know for a fact that changing those does change the write patterns (not much, but they do change). Same, unchanged defaults on both systems. #Storage=auto #Compress=yes #Seal=yes #SplitMode=uid #SyncIntervalSec=5m #RateLimitIntervalSec=30s #RateLimitBurst=1000 The sync interval sec is curious. 5 minutes? Umm, I'm seeing nearly constant hits every 2-5 seconds on the journal file; using filefrag. I'm sure there's a better way to trace a single file being read/written to than this, but... AIUI, the sync interval is like BTRFS's commit interval, the journal file is guaranteed to be 100% consistent at least once per sync interval. As far as tracing, I think it's possible to do some kind of filtering with btrace so you just see a specific file, but I'm not certain. It's almost like we need these things to not fsync at all, and just rely on the filesystem commit time... Essentially yes, but that causes all kinds of other problems. Drat. Admittedly most of the problems are use-case specific (you can't afford to lose transactions in a financial database for example, so it functionally has to call fsync after each transaction), but most of it stems from the fact that BTRFS is doing a lot of the same stuff that much of the 'problem' software is doing itself internally.
Seems like the old way of doing things, and the staleness of the internet, have colluded to create a lot of nervousness and misuse of fsync. The very fact Btrfs needs a log tree to deal with fsync's in a semi-sane way... Except that BTRFS is somewhat unusual. Prior to this, the only 'mainstream' filesystem that provided most of these features was ZFS, and that does a good enough job that this doesn't matter. For something like a database though, where you need ACID guarantees, you pretty much have to have COW semantics internally, and you have to force things to stable storage after each transaction that actually modifies data. Looking at it another way, most database storage formats are essentially record-oriented filesystems (as opposed to block-oriented filesystems that most people think of). This is part of why you see such similar access patterns in databases and VM disk images (even if the VM isn't running database software), they are essentially doing the same things at a low level.
Re: Btrfs/SSD
On Tue, 18 Apr 2017 03:23:13 +0000 (UTC) Duncan <1i5t5.dun...@cox.net> wrote: > Without reading the links... > > Are you /sure/ it's /all/ ssds currently on the market? Or are you > thinking narrowly, those actually sold as ssds? > > Because all I've read (and I admit I may not actually be current, but...) > on for instance sd cards, certainly ssds by definition, says they're > still very write-cycle sensitive -- very simple FTL with little FTL > wear-leveling. > > And AFAIK, USB thumb drives tend to be in the middle, moderately complex > FTL with some, somewhat simplistic, wear-leveling. > If I have to clarify, yes, it's all about SATA and NVMe SSDs. SD cards may be SSDs "by definition", but nobody will think of an SD card when you say "I bought an SSD for my computer". And yes, SD card and USB flash sticks are commonly understood to be much simpler and more brittle devices than full blown desktop (not to mention server) SSDs. > While the stuff actually marketed as SSDs, generally SATA or direct > PCIE/NVME connected, may indeed match your argument, no real end-user concern > necessary any more as the FTLs are advanced enough that user or > filesystem level write-cycle concerns simply aren't necessary these days. > > > So does that claim that write-cycle concerns simply don't apply to modern > ssds, also apply to common thumb drives and sd cards? Because these are > certainly ssds both technically and by btrfs standards. > -- With respect, Roman
Re: Btrfs/SSD
Roman Mamedov posted on Mon, 17 Apr 2017 23:24:19 +0500 as excerpted: > Days are long gone since the end user had to ever think about device > lifetimes with SSDs. Refer to endurance studies such as > http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead > http://ssdendurancetest.com/ > https://3dnews.ru/938764/ > It has been demonstrated that all SSDs on the market tend to overshoot > even their rated TBW by several times, as a result it will take any user > literally dozens of years to wear out the flash no matter which > filesystem or what settings used Without reading the links... Are you /sure/ it's /all/ ssds currently on the market? Or are you thinking narrowly, those actually sold as ssds? Because all I've read (and I admit I may not actually be current, but...) on for instance sd cards, certainly ssds by definition, says they're still very write-cycle sensitive -- very simple FTL with little FTL wear-leveling. And AFAIK, USB thumb drives tend to be in the middle, moderately complex FTL with some, somewhat simplistic, wear-leveling. While the stuff actually marketed as SSDs, generally SATA or direct PCIE/NVME connected, may indeed match your argument, no real end-user concern necessary any more as the FTLs are advanced enough that user or filesystem level write-cycle concerns simply aren't necessary these days. So does that claim that write-cycle concerns simply don't apply to modern ssds, also apply to common thumb drives and sd cards? Because these are certainly ssds both technically and by btrfs standards. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: Btrfs/SSD
On 04/17/2017 09:22 PM, Imran Geriskovan wrote: > [...] > > Going over the thread, the following questions come to my mind: > > - What exactly does the btrfs ssd option do relative to plain mode? There's quite an amount of information in the very recent threads: - "About free space fragmentation, metadata write amplification and (no)ssd" - "BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such." - "btrfs filesystem keeps allocating new chunks for no apparent reason" - ... and a few more I suspect there will be some "summary" mails at some point, but for now, I'd recommend crawling through these threads first. And now for your instant satisfaction, a short visual guide to the difference, which shows actual btrfs behaviour instead of our guesswork around it (taken from the second mail thread just mentioned): -o ssd: https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4 -o nossd: https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4 -- Hans van Kranenburg
Re: Btrfs/SSD
On Mon, Apr 17, 2017 at 1:26 PM, Austin S. Hemmelgarn wrote: > On 2017-04-17 14:34, Chris Murphy wrote: >> Nope. The first paragraph applies to NVMe machine with ssd mount >> option. Few fragments. >> >> The second paragraph applies to SD Card machine with ssd_spread mount >> option. Many fragments. > > Ah, apologies for my misunderstanding. >> >> >> These are different versions of systemd-journald so I can't completely >> rule out a difference in write behavior. > > There have only been a couple of changes in the write patterns that I know > of, but I would double check that the values for Seal and Compress in the > journald.conf file are the same, as I know for a fact that changing those > does change the write patterns (not much, but they do change). Same, unchanged defaults on both systems. #Storage=auto #Compress=yes #Seal=yes #SplitMode=uid #SyncIntervalSec=5m #RateLimitIntervalSec=30s #RateLimitBurst=1000 The sync interval sec is curious. 5 minutes? Umm, I'm seeing nearly constant hits every 2-5 seconds on the journal file; using filefrag. I'm sure there's a better way to trace a single file being read/written to than this, but... It's almost like we need these things to not fsync at all, and just rely on the filesystem commit time... >>> >>> >>> Essentially yes, but that causes all kinds of other problems. >> >> >> Drat. >> > Admittedly most of the problems are use-case specific (you can't afford to > lose transactions in a financial database for example, so it functionally > has to call fsync after each transaction), but most of it stems from the > fact that BTRFS is doing a lot of the same stuff that much of the 'problem' > software is doing itself internally. > Seems like the old way of doing things, and the staleness of the internet, have colluded to create a lot of nervousness and misuse of fsync. The very fact Btrfs needs a log tree to deal with fsync's in a semi-sane way...
-- Chris Murphy
Re: Btrfs/SSD
On 2017-04-17 14:34, Chris Murphy wrote: On Mon, Apr 17, 2017 at 11:13 AM, Austin S. Hemmelgarn wrote: What is a high end SSD these days? Built-in NVMe? One with a good FTL in the firmware. At minimum, the good Samsung EVO drives, the high quality Intel ones, and the Crucial MX series, but probably some others. My choice of words here probably wasn't the best though. It's a confusing market that sorta defies figuring out what we've got. I have a Samsung EVO SATA SSD in one laptop, but then I have a Samsung EVO+ SD Card in an Intel NUC. They use that same EVO branding on an $11 SD Card. And then there's the Samsung Electronics Co Ltd NVMe SSD Controller SM951/PM951 in another laptop. What makes it even more confusing is that other than Samsung (who _only_ use their own flash and controllers), manufacturer does not map to controller choice consistently, and even two drives with the same controller may have different firmware (and thus different degrees of reliability, those OCZ drives that were such crap at data retention were the result of a firmware option that the controller manufacturer pretty much told them not to use on production devices). So long as this file is not reflinked or snapshot, filefrag shows a pile of mostly 4096 byte blocks, thousands. But as they're pretty much all contiguous, the file fragmentation (extent count) is usually never higher than 12. It meanders between 1 and 12 extents for its life. Except on the system using ssd_spread mount option. That one has a journal file that is +C, is not being snapshot, but has over 3000 extents per filefrag and btrfs-progs/debugfs. Really weird. Given how the 'ssd' mount option behaves and the frequency that most systemd instances write to their journals, that's actually reasonably expected.
We look for big chunks of free space to write into and then align to 2M regardless of the actual size of the write, which in turn means that files like the systemd journal which see lots of small (relatively speaking) writes will have way more extents than they should until you defragment them. Nope. The first paragraph applies to NVMe machine with ssd mount option. Few fragments. The second paragraph applies to SD Card machine with ssd_spread mount option. Many fragments. Ah, apologies for my misunderstanding. These are different versions of systemd-journald so I can't completely rule out a difference in write behavior. There have only been a couple of changes in the write patterns that I know of, but I would double check that the values for Seal and Compress in the journald.conf file are the same, as I know for a fact that changing those does change the write patterns (not much, but they do change). Now, systemd aside, there are databases that behave this same way where there's a small section constantly being overwritten, and one or more sections that grow the database file from within and at the end. If this is made cow, the file will absolutely fragment a ton. And especially if the changes are mostly 4KiB block sizes that then are fsync'd. It's almost like we need these things to not fsync at all, and just rely on the filesystem commit time... Essentially yes, but that causes all kinds of other problems. Drat. Admittedly most of the problems are use-case specific (you can't afford to lose transactions in a financial database for example, so it functionally has to call fsync after each transaction), but most of it stems from the fact that BTRFS is doing a lot of the same stuff that much of the 'problem' software is doing itself internally.
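The allocator behavior described above can be illustrated with a toy model. This is not btrfs code; it's a hypothetical simulation of the two placement policies being discussed: small fsync'd appends either continuing exactly where the previous write ended, or each landing at the next free 2 MiB-aligned offset. The 8 KiB flush size is an illustrative assumption:

```python
def count_extents(writes):
    # Count resulting extents: a write merges into the previous extent
    # only when it starts exactly where the previous write ended.
    extents = 0
    prev_end = None
    for off, size in writes:
        if off != prev_end:
            extents += 1
        prev_end = off + size
    return extents

MIB = 1024 * 1024
flush = 8 * 1024   # one small fsync'd journal append (illustrative size)

# -o ssd (modeled): every flush starts at the next free 2 MiB boundary
ssd_writes = [(i * 2 * MIB, flush) for i in range(100)]
# plain mode (modeled): each flush continues where the previous one ended
nossd_writes = [(i * flush, flush) for i in range(100)]

print(count_extents(ssd_writes))    # 100 extents, one per flush
print(count_extents(nossd_writes))  # 1 extent
```

Under this model, a file receiving many small aligned writes accumulates one extent per flush until defragmented, which is consistent with the journal-file extent counts reported in this thread.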
Re: Btrfs/SSD
On 4/17/17, Roman Mamedov <r...@romanrm.net> wrote: > "Austin S. Hemmelgarn" <ahferro...@gmail.com> wrote: >> * Compression should help performance and device lifetime most of the >> time, unless your CPU is fully utilized on a regular basis (in which >> case it will hurt performance, but still improve device lifetimes). > Days are long gone since the end user had to ever think about device lifetimes > with SSDs. Refer to endurance studies such as > It has been demonstrated that all SSDs on the market tend to overshoot even > their rated TBW by several times, as a result it will take any user literally > dozens of years to wear out the flash no matter which filesystem or what > settings used. And most certainly it's not worth it changing anything > significant in your workflow (such as enabling compression if it's > otherwise inconvenient or not needed) just to save the SSD lifetime. Going over the thread, the following questions come to my mind: - What exactly does the btrfs ssd option do relative to plain mode? - Most(all?) SSDs employ wear leveling. Isn't it? That is they are constantly remapping their blocks under the hood. So isn't it meaningless to speak of some kind of a block forging/fragmentation/etc. effect of any writing pattern? - If it is so, doesn't it mean that there is no better ssd usage strategy other than minimizing the total bytes written? That is whatever we do, if it contributes to this fact it is good, otherwise bad. Are all other things beyond any user control? Is there a recommended setting? - How about "data retention" experiences? It is known that new ssds can hold data safely for a longer period. As they age that margin gets shorter. As an extreme case if I write into a new ssd and shelve it, can I get my data back after 5 years? How about a file written 5 years ago and never touched again although the rest of the ssd is in active use during that period? - Yes, maybe lifetimes are getting irrelevant.
However, TBW still has a direct relation with data retention capability. Knowing that writing more data to an ssd can reduce the "life time of your data" is something strange. - But someone can come and say: Hey don't worry about "data retention years". Because your ssd will already be dead before data retention becomes a problem for you... Which is relieving.. :)) Anyway what are your opinions?
Re: Btrfs/SSD
On Mon, Apr 17, 2017 at 11:13 AM, Austin S. Hemmelgarn wrote: >> What is a high end SSD these days? Built-in NVMe? > > One with a good FTL in the firmware. At minimum, the good Samsung EVO > drives, the high quality Intel ones, and the Crucial MX series, but probably > some others. My choice of words here probably wasn't the best though. It's a confusing market that sorta defies figuring out what we've got. I have a Samsung EVO SATA SSD in one laptop, but then I have a Samsung EVO+ SD Card in an Intel NUC. They use that same EVO branding on an $11 SD Card. And then there's the Samsung Electronics Co Ltd NVMe SSD Controller SM951/PM951 in another laptop. >> So long as this file is not reflinked or snapshot, filefrag shows a >> pile of mostly 4096 byte blocks, thousands. But as they're pretty much >> all contiguous, the file fragmentation (extent count) is usually never >> higher than 12. It meanders between 1 and 12 extents for its life. >> >> Except on the system using ssd_spread mount option. That one has a >> journal file that is +C, is not being snapshot, but has over 3000 >> extents per filefrag and btrfs-progs/debugfs. Really weird. > > Given how the 'ssd' mount option behaves and the frequency that most systemd > instances write to their journals, that's actually reasonably expected. We > look for big chunks of free space to write into and then align to 2M > regardless of the actual size of the write, which in turn means that files > like the systemd journal which see lots of small (relatively speaking) > writes will have way more extents than they should until you defragment > them. Nope. The first paragraph applies to NVMe machine with ssd mount option. Few fragments. The second paragraph applies to SD Card machine with ssd_spread mount option. Many fragments. These are different versions of systemd-journald so I can't completely rule out a difference in write behavior.
>> Now, systemd aside, there are databases that behave this same way >> where there's a small section constantly being overwritten, and one or >> more sections that grow the database file from within and at the end. >> If this is made cow, the file will absolutely fragment a ton. And >> especially if the changes are mostly 4KiB block sizes that then are >> fsync'd. >> >> It's almost like we need these things to not fsync at all, and just >> rely on the filesystem commit time... > > Essentially yes, but that causes all kinds of other problems. Drat. -- Chris Murphy
Re: Btrfs/SSD
On Mon, 17 Apr 2017 07:53:04 -0400 "Austin S. Hemmelgarn" wrote:

> General info (not BTRFS specific):
> * Based on SMART attributes and other factors, current life expectancy
> for light usage (normal desktop usage) appears to be somewhere around
> 8-12 years depending on specifics of usage (assuming the same workload,
> F2FS is at the very top of the range, BTRFS and NILFS2 are on the upper
> end, XFS is roughly in the middle, ext4 and NTFS are on the low end
> (tested using Windows 7's NTFS driver), and FAT32 is an outlier at the
> bottom of the barrel).

Life expectancy for an SSD is defined not in years, but in TBW (terabytes written), and AFAICT that's not "from host", but "to flash" (some SSDs will show you both values in two separate SMART attributes out of the box; on some it can be unlocked). Filesystems come into play only through the amount of write amplification they cause (how much "to flash" is greater than "from host"). Do you have any test data to show that filesystems are ranked in that order by the WA they cause, or is it all about "general feel" and how they are branded (F2FS says so, so it must be the best)?

> * Queued DISCARD support is still missing in most consumer SATA SSD's,
> which in turn makes the trade-off on those between performance and
> lifetime much sharper.

My choice was to make a script to run from crontab, using "fstrim" on all mounted SSDs nightly, and aside from that all filesystems are mounted with "nodiscard". Best of both worlds, and no interference with actual IO operation.

> * Modern (2015 and newer) SSD's seem to have better handling in the FTL
> for the journaling behavior of filesystems like ext4 and XFS. I'm not
> sure if this is actually a result of the FTL being better, or some
> change in the hardware.

Again, what makes you think this? Did you observe the write amplification readings, and are those now demonstrably lower than on "2014 and older" SSDs? If so, by how much, and which models did you compare?
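For reference, the nightly-fstrim approach described above can be sketched roughly like this. This is a guess at the shape of such a cron script, not the poster's actual one; it only prints the commands it would run, and the sample mount table is invented:

```shell
#!/bin/sh
# Sketch of a nightly fstrim cron job for filesystems mounted with
# "nodiscard".  A real job would feed it /proc/mounts and execute the
# printed commands; the sample mount table below is made up.

trim_candidates() {
    # stdin: /proc/mounts-style lines "device mountpoint fstype options ..."
    # stdout: one "fstrim -v <mountpoint>" command per trimmable filesystem
    while read -r dev mnt fstype rest; do
        case "$fstype" in
            btrfs|ext4|xfs) echo "fstrim -v $mnt" ;;
        esac
    done
}

trim_candidates <<'EOF'
/dev/sda1 / btrfs rw,noatime,ssd,nodiscard 0 0
/dev/sda2 /home ext4 rw,noatime,nodiscard 0 0
proc /proc proc rw 0 0
EOF
```

Dropped into /etc/cron.daily/ (with the here-document replaced by `< /proc/mounts` and the `echo` removed), this batches all discards into one nightly pass, which is the trade-off described above.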
> * In my personal experience, Intel, Samsung, and Crucial appear to be
> the best name brands (in relative order of quality). I have personally
> had bad experiences with SanDisk and Kingston SSD's, but I don't have
> anything beyond circumstantial evidence indicating that it was anything
> but bad luck on both counts.

Why not think in terms of platforms rather than "name brands", i.e. a controller model + flash combination? For instance, Intel has been using some other companies' controllers in its SSDs. Kingston uses tons of various controllers (Sandforce/Phison/Marvell/more?) depending on the model and range.

> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt
> performance for BTRFS on SSD's, and appear to reduce the lifetime of the
> SSD.

"Appear to"? Just... what. So how many SSDs did you have fail under nocow? Or maybe can we get serious in a technical discussion? Did you by any chance mean they cause more writes to the SSD and more "to flash" writes (resulting in a higher WA)? If so, then by how much, and what was your test scenario comparing the same usage with and without nocow?

> * Compression should help performance and device lifetime most of the
> time, unless your CPU is fully utilized on a regular basis (in which
> case it will hurt performance, but still improve device lifetimes).

The days are long gone when the end user ever had to think about device lifetimes with SSDs. Refer to endurance studies such as

http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead
http://ssdendurancetest.com/
https://3dnews.ru/938764/

It has been demonstrated that all SSDs on the market tend to overshoot even their rated TBW by several times; as a result it will take any user literally dozens of years to wear out the flash no matter which filesystem or settings are used.
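As a rough sanity check on the "dozens of years" claim, the arithmetic is simple. Both inputs below are assumptions for illustration (a 75 TBW rating, typical of small consumer drives, and a desktop-ish 20 GB of host writes per day), not numbers measured in this thread:

```shell
# Years until a drive's rated TBW is reached at a steady write rate.
years_to_tbw() {
    # $1 = rated endurance in terabytes written (TBW)
    # $2 = average host writes in GB per day
    awk -v tbw="$1" -v gbd="$2" \
        'BEGIN { printf "%.1f\n", tbw * 1000 / gbd / 365 }'
}

years_to_tbw 75 20    # a 75 TBW drive at 20 GB/day: about 10.3 years
```

Since the endurance tests cited above show drives overshooting their rating several times over, the realistic figure would be correspondingly larger; only write amplification pulls it back down.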
And most certainly it's not worth changing anything significant in your workflow (such as enabling compression if it's otherwise inconvenient or not needed) just to save SSD lifetime.

On Mon, 17 Apr 2017 13:13:39 -0400 "Austin S. Hemmelgarn" wrote:

> > What is a high end SSD these days? Built-in NVMe?
> One with a good FTL in the firmware. At minimum, the good Samsung EVO
> drives, the high quality Intel ones

As opposed to bad Samsung EVO drives and low-quality Intel ones?

> and the Crucial MX series, but
> probably some others. My choice of words here probably wasn't the best
> though.

Again, which controller? Crucial does not manufacture SSD controllers on its own; it just packages and brands stuff manufactured by someone else. So if you meant Marvell-based SSDs, then that's many brands, not just Crucial.

> For a normal filesystem or BTRFS with nodatacow or NOCOW, the block gets
> rewritten in-place. This means that cheap FTL's will rewrite that erase
> block in-place (which won't hurt performance but will impact device
> lifetime), and good ones will rewrite into a free
Re: Btrfs/SSD
On 2017-04-17 12:58, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 5:53 AM, Austin S. Hemmelgarn wrote:
>> Regarding BTRFS specifically:
>> * Given my recently newfound understanding of what the 'ssd' mount
>> option actually does, I'm inclined to recommend that people who are
>> using high-end SSD's _NOT_ use it as it will heavily increase
>> fragmentation and will likely have near zero impact on actual device
>> lifetime (but may _hurt_ performance). It will still probably help
>> with mid and low-end SSD's.
>
> What is a high end SSD these days? Built-in NVMe?

One with a good FTL in the firmware. At minimum, the good Samsung EVO drives, the high quality Intel ones, and the Crucial MX series, but probably some others. My choice of words here probably wasn't the best though.

>> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt
>> performance for BTRFS on SSD's, and appear to reduce the lifetime of
>> the SSD.
>
> Can you elaborate. It's an interesting problem, on a small scale the
> systemd folks have journald set +C on /var/log/journal so that any new
> journals are nocow. There is an initial fallocate, but the write
> behavior is writing in the same place at the head and tail. But at the
> tail, the writes get pushed toward the middle. So the file is growing
> into its fallocated space from the tail. The header changes in the
> same location, it's an overwrite.

For a normal filesystem or BTRFS with nodatacow or NOCOW, the block gets rewritten in-place. This means that cheap FTL's will rewrite that erase block in-place (which won't hurt performance but will impact device lifetime), and good ones will rewrite into a free block somewhere else but may not free that original block for quite some time (which is bad for performance but slightly better for device lifetime). When BTRFS does a COW operation on a block however, it will guarantee that that block moves. Because of this, the old location will either: 1. Be discarded by the FS itself if the 'discard' mount option is set. 2.
Be caught by a scheduled call to 'fstrim'. 3. Lay dormant for at least a while.

The first case is ideal for most FTL's, because it lets them know immediately that that data isn't needed and the space can be reused. The second is close to ideal, but defers telling the FTL that the block is unused, which can be better on some SSD's (some have firmware that handles wear-leveling better in batches). The third is not ideal, but is still better than what happens with NOCOW or nodatacow set. Overall, this boils down to the fact that most FTL's get slower if they can't wear-level the device properly, and in-place rewrites make it harder for them to do proper wear-leveling.

> So long as this file is not reflinked or snapshot, filefrag shows a
> pile of mostly 4096 byte blocks, thousands. But as they're pretty much
> all contiguous, the file fragmentation (extent count) is usually never
> higher than 12. It meanders between 1 and 12 extents for its life.
>
> Except on the system using the ssd_spread mount option. That one has a
> journal file that is +C, is not being snapshot, but has over 3000
> extents per filefrag and btrfs-progs/debugfs. Really weird.

Given how the 'ssd' mount option behaves and the frequency that most systemd instances write to their journals, that's actually reasonably expected. We look for big chunks of free space to write into and then align to 2M regardless of the actual size of the write, which in turn means that files like the systemd journal which see lots of small (relatively speaking) writes will have way more extents than they should until you defragment them.

> Now, systemd aside, there are databases that behave this same way
> where there's a small section constantly being overwritten, and one or
> more sections that grow the database file from within and at the end.
> If this is made cow, the file will absolutely fragment a ton. And
> especially if the changes are mostly 4KiB block sizes that then are
> fsync'd.
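The 2M-alignment behaviour described here lines up with the ~3000-extent journal observation from earlier in the thread. A toy calculation (an illustration of the explanation given above, not btrfs's actual allocator logic):

```shell
# Toy model: if every separately-fsync'd small write is placed at the
# next 2 MiB-aligned spot instead of being packed next to the previous
# one, each write ends up as its own extent.  Numbers chosen to mirror
# the ~3000-extent journal file mentioned in this thread.
awk -v n=3000 -v write_kib=4 'BEGIN {
    data_mib = n * write_kib / 1024
    printf "%d fsynced %d KiB writes -> up to %d extents for only %.1f MiB of data\n",
           n, write_kib, n, data_mib
}'
```

In other words, a journal that has received ~3000 small synced writes can plausibly show ~3000 extents while holding barely a dozen MiB of data, which is why defragmenting it collapses the count so dramatically.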
> It's almost like we need these things to not fsync at all, and just
> rely on the filesystem commit time...

Essentially yes, but that causes all kinds of other problems.
Re: Btrfs/SSD
On Mon, Apr 17, 2017 at 5:53 AM, Austin S. Hemmelgarn wrote:
> Regarding BTRFS specifically:
> * Given my recently newfound understanding of what the 'ssd' mount option
> actually does, I'm inclined to recommend that people who are using high-end
> SSD's _NOT_ use it as it will heavily increase fragmentation and will likely
> have near zero impact on actual device lifetime (but may _hurt_
> performance). It will still probably help with mid and low-end SSD's.

What is a high end SSD these days? Built-in NVMe?

> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt
> performance for BTRFS on SSD's, and appear to reduce the lifetime of the
> SSD.

Can you elaborate. It's an interesting problem, on a small scale the systemd folks have journald set +C on /var/log/journal so that any new journals are nocow. There is an initial fallocate, but the write behavior is writing in the same place at the head and tail. But at the tail, the writes get pushed toward the middle. So the file is growing into its fallocated space from the tail. The header changes in the same location, it's an overwrite.

So long as this file is not reflinked or snapshot, filefrag shows a pile of mostly 4096 byte blocks, thousands. But as they're pretty much all contiguous, the file fragmentation (extent count) is usually never higher than 12. It meanders between 1 and 12 extents for its life.

Except on the system using the ssd_spread mount option. That one has a journal file that is +C, is not being snapshot, but has over 3000 extents per filefrag and btrfs-progs/debugfs. Really weird.

Now, systemd aside, there are databases that behave this same way where there's a small section constantly being overwritten, and one or more sections that grow the database file from within and at the end. If this is made cow, the file will absolutely fragment a ton. And especially if the changes are mostly 4KiB block sizes that then are fsync'd.
It's almost like we need these things to not fsync at all, and just rely on the filesystem commit time...

--
Chris Murphy
Re: Btrfs/SSD
On 2017-04-14 07:02, Imran Geriskovan wrote:
> Hi,
>
> Some time ago we had some discussion about SSDs. Within the limits of
> unknown/undocumented device info, we loosely had covered data retention
> capability/disk age/life time interrelations, (in?)effectiveness of
> btrfs dup on SSDs, etc.
>
> Now, as time passed and with some accumulated experience on SSDs, I
> think we again can have a status check/update on them if you can share
> your experiences and best practices. So if you have something to share
> about SSDs (it may or may not be directly related with btrfs) I'm sure
> everybody here will be happy to hear it.

General info (not BTRFS specific):
* Based on SMART attributes and other factors, current life expectancy for light usage (normal desktop usage) appears to be somewhere around 8-12 years depending on specifics of usage (assuming the same workload, F2FS is at the very top of the range, BTRFS and NILFS2 are on the upper end, XFS is roughly in the middle, ext4 and NTFS are on the low end (tested using Windows 7's NTFS driver), and FAT32 is an outlier at the bottom of the barrel).
* Queued DISCARD support is still missing in most consumer SATA SSD's, which in turn makes the trade-off on those between performance and lifetime much sharper.
* Modern (2015 and newer) SSD's seem to have better handling in the FTL for the journaling behavior of filesystems like ext4 and XFS. I'm not sure if this is actually a result of the FTL being better, or some change in the hardware.
* In my personal experience, Intel, Samsung, and Crucial appear to be the best name brands (in relative order of quality). I have personally had bad experiences with SanDisk and Kingston SSD's, but I don't have anything beyond circumstantial evidence indicating that it was anything but bad luck on both counts.
Regarding BTRFS specifically:
* Given my recently newfound understanding of what the 'ssd' mount option actually does, I'm inclined to recommend that people who are using high-end SSD's _NOT_ use it as it will heavily increase fragmentation and will likely have near zero impact on actual device lifetime (but may _hurt_ performance). It will still probably help with mid and low-end SSD's.
* Files with NOCOW and filesystems with 'nodatacow' set will both hurt performance for BTRFS on SSD's, and appear to reduce the lifetime of the SSD.
* Compression should help performance and device lifetime most of the time, unless your CPU is fully utilized on a regular basis (in which case it will hurt performance, but still improve device lifetimes).
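For reference, the compression recommendation above is a one-line mount-option change. A hypothetical /etc/fstab entry follows; the UUID and mount point are placeholders, and lzo is picked only because compress=lzo appears in the mount output elsewhere in this thread:

```shell
# /etc/fstab sketch: btrfs on an SSD with compression on, access-time
# updates off, and no "discard" option (trimming left to a scheduled
# fstrim instead).  UUID and mount point are placeholders.
# UUID=0123-PLACEHOLDER  /  btrfs  noatime,compress=lzo  0  0
```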
Btrfs/SSD
Hi,

Some time ago we had some discussion about SSDs. Within the limits of unknown/undocumented device info, we loosely had covered data retention capability/disk age/life time interrelations, (in?)effectiveness of btrfs dup on SSDs, etc.

Now, as time passed and with some accumulated experience on SSDs, I think we again can have a status check/update on them if you can share your experiences and best practices. So if you have something to share about SSDs (it may or may not be directly related with btrfs) I'm sure everybody here will be happy to hear it.

Regards,
Imran
Re: [PATCH] btrfs: SSD related mount option dependency rework.
Any new comments? Cc: Satoru

Sorry for the late reply. Unfortunately, it seems that your mail doesn't show up in my inbox, but only appears in Patchwork.

As for writing the dependency into btrfs.txt: since the patch is based on btrfs.txt, the file already shows the dependency, so I'd rather not repeat it there.

Thanks, Qu

Original Message
Subject: [PATCH] btrfs: SSD related mount option dependency rework.
From: Qu Wenruo quwen...@cn.fujitsu.com
To: linux-btrfs@vger.kernel.org
Date: 2014-08-01 11:27

According to Documentation/filesystems/btrfs.txt, ssd/ssd_spread/nossd have their own dependencies (see below), but only ssd_spread implying ssd is implemented.

ssd_spread implies ssd, conflicts with nossd.
ssd conflicts with nossd.
nossd conflicts with ssd and ssd_spread.

This patch adds the ssd{,_spread} conflict with nossd.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 fs/btrfs/super.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8e16bca..2508a16 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -515,19 +515,22 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 					   compress_type);
 		}
 		break;
-	case Opt_ssd:
-		btrfs_set_and_info(root, SSD,
-				   "use ssd allocation scheme");
-		break;
 	case Opt_ssd_spread:
 		btrfs_set_and_info(root, SSD_SPREAD,
 				   "use spread ssd allocation scheme");
+		/* suppress the ssd mount option log */
 		btrfs_set_opt(info->mount_opt, SSD);
+		/* fall through for other ssd routine */
+	case Opt_ssd:
+		btrfs_set_and_info(root, SSD,
+				   "use ssd allocation scheme");
+		btrfs_clear_opt(info->mount_opt, NOSSD);
 		break;
 	case Opt_nossd:
 		btrfs_set_and_info(root, NOSSD,
 				   "not using ssd allocation scheme");
 		btrfs_clear_opt(info->mount_opt, SSD);
+		btrfs_clear_opt(info->mount_opt, SSD_SPREAD);
 		break;
 	case Opt_barrier:
 		btrfs_clear_and_info(root, NOBARRIER,
Re: [PATCH] btrfs: SSD related mount option dependency rework.
Hi Qu,

(2014/08/01 12:27), Qu Wenruo wrote:
> According to Documentation/filesystems/btrfs.txt, ssd/ssd_spread/nossd
> have their own dependencies (see below), but only ssd_spread implying
> ssd is implemented.
>
> ssd_spread implies ssd, conflicts with nossd.
> ssd conflicts with nossd.
> nossd conflicts with ssd and ssd_spread.
>
> This patch adds the ssd{,_spread} conflict with nossd.

How about writing down the above-mentioned dependencies in Documentation/filesystems/btrfs.txt too?

Thanks,
Satoru

> Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
> ---
>  fs/btrfs/super.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 8e16bca..2508a16 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -515,19 +515,22 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
>  					   compress_type);
>  		}
>  		break;
> -	case Opt_ssd:
> -		btrfs_set_and_info(root, SSD,
> -				   "use ssd allocation scheme");
> -		break;
>  	case Opt_ssd_spread:
>  		btrfs_set_and_info(root, SSD_SPREAD,
>  				   "use spread ssd allocation scheme");
> +		/* suppress the ssd mount option log */
>  		btrfs_set_opt(info->mount_opt, SSD);
> +		/* fall through for other ssd routine */
> +	case Opt_ssd:
> +		btrfs_set_and_info(root, SSD,
> +				   "use ssd allocation scheme");
> +		btrfs_clear_opt(info->mount_opt, NOSSD);
>  		break;
>  	case Opt_nossd:
>  		btrfs_set_and_info(root, NOSSD,
>  				   "not using ssd allocation scheme");
>  		btrfs_clear_opt(info->mount_opt, SSD);
> +		btrfs_clear_opt(info->mount_opt, SSD_SPREAD);
>  		break;
>  	case Opt_barrier:
>  		btrfs_clear_and_info(root, NOBARRIER,
[PATCH] btrfs: SSD related mount option dependency rework.
According to Documentation/filesystems/btrfs.txt, ssd/ssd_spread/nossd have their own dependencies (see below), but only ssd_spread implying ssd is implemented.

ssd_spread implies ssd, conflicts with nossd.
ssd conflicts with nossd.
nossd conflicts with ssd and ssd_spread.

This patch adds the ssd{,_spread} conflict with nossd.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 fs/btrfs/super.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8e16bca..2508a16 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -515,19 +515,22 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 					   compress_type);
 		}
 		break;
-	case Opt_ssd:
-		btrfs_set_and_info(root, SSD,
-				   "use ssd allocation scheme");
-		break;
 	case Opt_ssd_spread:
 		btrfs_set_and_info(root, SSD_SPREAD,
 				   "use spread ssd allocation scheme");
+		/* suppress the ssd mount option log */
 		btrfs_set_opt(info->mount_opt, SSD);
+		/* fall through for other ssd routine */
+	case Opt_ssd:
+		btrfs_set_and_info(root, SSD,
+				   "use ssd allocation scheme");
+		btrfs_clear_opt(info->mount_opt, NOSSD);
 		break;
 	case Opt_nossd:
 		btrfs_set_and_info(root, NOSSD,
 				   "not using ssd allocation scheme");
 		btrfs_clear_opt(info->mount_opt, SSD);
+		btrfs_clear_opt(info->mount_opt, SSD_SPREAD);
 		break;
 	case Opt_barrier:
 		btrfs_clear_and_info(root, NOBARRIER,
--
2.0.3
BTRFS, SSD and single metadata
Hi,

I created a BTRFS filesystem over LVM over LUKS encryption on an SSD [yes, I know...], and I noticed that the FS got created with metadata in DUP mode, contrary to what man mkfs.btrfs says for SSDs - it would be supposed to be SINGLE...

Well, I don't know if my system didn't identify the SSD because of the LVM+LUKS stack (however it mounts well by itself with the ssd flag and accepts the discard option [yes, I know...]), or if the manpage is obsolete, or if this feature just doesn't work...? The SSD is a Micron RealSSD C400.

For both SSD preservation and data integrity, would it be advisable to change metadata to SINGLE using a rebalance, or would I better just leave things the way they are...?

TIA for any insight.

--
Swâmi Petaramesh sw...@petaramesh.org http://petaramesh.org PGP 9076E32E
"All of man's misfortune comes from the fact that he does not live in _the_ world, but in _his own_ world." -- Heraclitus.
Re: BTRFS, SSD and single metadata
On 2014-06-16 03:54, Swâmi Petaramesh wrote:
> Hi,
>
> I created a BTRFS filesystem over LVM over LUKS encryption on an SSD
> [yes, I know...], and I noticed that the FS got created with metadata
> in DUP mode, contrary to what man mkfs.btrfs says for SSDs - it would
> be supposed to be SINGLE...
>
> Well, I don't know if my system didn't identify the SSD because of the
> LVM+LUKS stack (however it mounts well by itself with the ssd flag and
> accepts the discard option [yes, I know...]), or if the manpage is
> obsolete, or if this feature just doesn't work...? The SSD is a Micron
> RealSSD C400.
>
> For both SSD preservation and data integrity, would it be advisable to
> change metadata to SINGLE using a rebalance, or would I better just
> leave things the way they are...?
>
> TIA for any insight.

What mkfs.btrfs looks at is /sys/block/whatever-device/queue/rotational; if that is 1, it knows that the device isn't an SSD. I believe that LVM passes through whatever the next lower layer's value is, but dmcrypt (and by extension LUKS) always forces it to 1 (possibly to prevent programs from using heuristics for enabling discard).
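The per-layer check Austin describes can be turned into a small diagnostic that walks a device-mapper stack from the top and prints the rotational flag at each layer, to see where a 0 (SSD) gets turned into a 1. This is a sketch, not a standard tool: the function name is made up, and SYSFS is overridable only so the sketch can be tried against a fake tree (on a real system it defaults to /sys/block):

```shell
#!/bin/sh
# Print the "rotational" queue flag for a block device and, recursively,
# for every lower layer listed in its slaves/ directory.
walk_rotational() {
    sysfs=${SYSFS:-/sys/block}
    echo "$1: $(cat "$sysfs/$1/queue/rotational")"
    for s in "$sysfs/$1/slaves"/*; do
        # recurse in a subshell so the loop variable isn't clobbered
        [ -e "$s" ] && ( walk_rotational "$(basename "$s")" )
    done
}

# Example on a real system (the device name is an assumption):
#   walk_rotational dm-1
```

For the LVM-over-LUKS case in this thread, the output would show the dm device reporting 1 while the underlying sd device reports 0, pinpointing the layer that loses the flag.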
Re: BTRFS, SSD and single metadata
Hi Austin, and thanks for your reply.

On Monday, 16 June 2014, 07:09:55, Austin S Hemmelgarn wrote:
> What mkfs.btrfs looks at is /sys/block/whatever-device/queue/rotational;
> if that is 1, it knows that the device isn't an SSD. I believe that LVM
> passes through whatever the next lower layer's value is, but dmcrypt
> (and by extension LUKS) always forces it to 1 (possibly to prevent
> programs from using heuristics for enabling discard).

In the current running condition, the system clearly sees this is *not* rotational, even through the LVM/dmcrypt stack:

# mount | grep btrfs
/dev/mapper/VG-LINUX on / type btrfs (rw,noatime,seclabel,compress=lzo,ssd,discard,space_cache,autodefrag)
# ll /dev/mapper/VG-LINUX
lrwxrwxrwx. 1 root root 7 16 juin 09:21 /dev/mapper/VG-LINUX -> ../dm-1
# cat /sys/block/dm-1/queue/rotational
0

...However, at mkfs.btrfs time, it might well not have seen it, as I made it from a live USB key in which both the lvm.conf and crypttab had not been tailored to allow trim commands...

However, now that the FS is created, I still wonder whether I should use a rebalance to change the metadata from DUP to SINGLE, or if I'd better stay with DUP...

Kind regards.

--
Swâmi Petaramesh sw...@petaramesh.org http://petaramesh.org PGP 9076E32E
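For completeness, the rebalance being weighed here is a single command. The sketch below only prints what would be run (remove the echo to actually do it), and assumes a btrfs-progs new enough to support balance filters; the "soft" modifier skips chunks that already have the target profile:

```shell
# Print the commands for converting btrfs metadata from DUP to single
# on a mounted filesystem, then checking the result.  Printed rather
# than executed so the sketch is safe to run anywhere.
convert_metadata_to_single() {
    mnt=$1
    echo "btrfs balance start -mconvert=single,soft $mnt"
    echo "btrfs filesystem df $mnt"
}

convert_metadata_to_single /
```

The same command with -mconvert=dup would go back the other way, so the choice discussed in this thread is reversible.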
Re: BTRFS, SSD and single metadata
On 2014-06-16 07:18, Swâmi Petaramesh wrote:
> Hi Austin, and thanks for your reply.
>
> On Monday, 16 June 2014, 07:09:55, Austin S Hemmelgarn wrote:
>> What mkfs.btrfs looks at is
>> /sys/block/whatever-device/queue/rotational; if that is 1, it knows
>> that the device isn't an SSD. I believe that LVM passes through
>> whatever the next lower layer's value is, but dmcrypt (and by
>> extension LUKS) always forces it to 1 (possibly to prevent programs
>> from using heuristics for enabling discard).
>
> In the current running condition, the system clearly sees this is *not*
> rotational, even through the LVM/dmcrypt stack:
>
> # mount | grep btrfs
> /dev/mapper/VG-LINUX on / type btrfs (rw,noatime,seclabel,compress=lzo,ssd,discard,space_cache,autodefrag)
> # ll /dev/mapper/VG-LINUX
> lrwxrwxrwx. 1 root root 7 16 juin 09:21 /dev/mapper/VG-LINUX -> ../dm-1
> # cat /sys/block/dm-1/queue/rotational
> 0
>
> ...However, at mkfs.btrfs time, it might well not have seen it, as I
> made it from a live USB key in which both the lvm.conf and crypttab had
> not been tailored to allow trim commands...
>
> However, now that the FS is created, I still wonder whether I should
> use a rebalance to change the metadata from DUP to SINGLE, or if I'd
> better stay with DUP...
>
> Kind regards.

I'd personally stay with the DUP profile, but then that's just me being paranoid. You will almost certainly get better performance using the SINGLE profile instead of DUP, but this is mostly due to it requiring fewer blocks to be encrypted by LUKS (which is almost certainly your primary bottleneck unless you have some high-end crypto-accelerator card).
Re: BTRFS, SSD and single metadata
On Mon, 16 Jun 2014 07:23:14 Austin S Hemmelgarn wrote:
> I'd personally stay with the DUP profile, but then that's just me being
> paranoid. You will almost certainly get better performance using the
> SINGLE profile instead of DUP, but this is mostly due to it requiring
> fewer blocks to be encrypted by LUKS (which is almost certainly your
> primary bottleneck unless you have some high-end crypto-accelerator
> card).

On my Q8400 workstation running BTRFS over LUKS on an Intel SSD, the primary bottleneck has always been BTRFS. The message I wrote earlier today about BTRFS fallocate() performance was on this system; I had BTRFS using kernel CPU time for periods of 10+ seconds without ANY disk IO - so LUKS wasn't a performance issue.

So far I've never seen LUKS be a performance bottleneck. When running LUKS on spinning media the disk seek performance will almost always be the bottleneck. The worst case for LUKS is transferring large amounts of data, such as contiguous reads. In a contiguous read test I'm seeing 120MB/s for LUKS on an SSD and 200MB/s for direct access to the same SSD. That is a reasonable difference, but it's not something I've been able to hit with any real-world use, while BTRFS metadata performance is often an issue.

--
My Main Blog: http://etbe.coker.com.au/
My Documents Blog: http://doc.coker.com.au/
Re: BTRFS, SSD and single metadata
Swâmi Petaramesh posted on Mon, 16 Jun 2014 09:54:01 +0200 as excerpted:

> I created a BTRFS filesystem over LVM over LUKS encryption on an SSD
> [yes, I know...], and I noticed that the FS got created with metadata
> in DUP mode, contrary to what man mkfs.btrfs says for SSDs - it would
> be supposed to be SINGLE...
>
> Well, I don't know if my system didn't identify the SSD because of the
> LVM+LUKS stack (however it mounts well by itself with the ssd flag and
> accepts the discard option [yes, I know...]), or if the manpage is
> obsolete, or if this feature just doesn't work...?

Does btrfs automatically add the ssd mount option or do you have to add it? If you have to add it, that means btrfs isn't detecting the ssd, which would explain why mkfs.btrfs didn't detect it either... as you said, very likely due to the LVM over LUKS stack.

I believe the detection is actually based on what the kernel reports. I may be mistaken, and I'm not running stacked devices ATM in order to check them, but check /sys/block/device/queue/rotational. On my ssds, the value in that file is 0. On my spinning rust, it's 1.

If that is indeed what it's looking at, you can verify that your hardware device is actually detected by the kernel as rotational 0, and then check the various layers of your stack and see where the rotational 0 (or the rotational file itself) gets dropped.

> The SSD is a Micron RealSSD C400.
>
> For both SSD preservation and data integrity, would it be advisable to
> change metadata to SINGLE using a rebalance, or would I better just
> leave things the way they are...? TIA for any insight.

That's a very good question, on which there has been some debate on this list.

The first question: Does your SSD firmware do compression and dedup, or not? IIRC the sandforce (I believe that's the firmware name) firmware does compression and dedup, but not all firmware does.
I know a bullet-point feature of my SSDs (several Corsair Neutron 256 GB, FWIW) is that they do NOT do this sort of compression and dedup -- what the system tells them to save is what they save, and if you tell them to save a hundred copies of the same thing, that's what they do. (These SSDs are targeted at commercial usage and this is billed as a performance and reliability feature.)

The reason originally given for defaulting to single mode metadata on SSDs was this possible dedup -- dup-mode metadata might actually end up single-copy-only due to the firmware compression and dedup in any case. Between that and their typically smaller size, I guess the decision was that single was the best default.

However, it occurs to me that with the LUKS encryption layer, I'm not entirely sure duplication at the btrfs level would end up as the same encrypted stream headed to the hardware in any case. If it would encrypt the two copies of the dup-mode metadata as different, then the hardware dedup/compression wouldn't work on them anyway. OTOH, if it encrypts them as the same stream headed to hardware, then again, it would matter.

Meanwhile... which is better, and should you rebalance to single? In terms of data integrity, dup mode is definitely better, since if there's damage to one copy such that it doesn't pass checksum verification, you still have the other copy to read from... and to rebuild the damaged copy from. OTOH, single does take less space, and performance should be slightly better. If you're keeping good backups anyway, or if the ssd's firmware might be mucking with things leaving you with only a single copy in any case, single mode could be a better choice.

FWIW, while most of my partitions are btrfs raid1 here, so the second copy is on a different physical device, /boot is an exception. I have a separate /boot on each device, pointed at by the grub loaded on each device, so I can use the BIOS boot selector to choose which one I boot.
That lets me keep a working /boot that I boot most of the time on one device, and a backup /boot on the other device, in case something goes wrong with the first. So those dedicated /boot partitions are an exception to my normal btrfs raid1. They're both (working and primary backup) 256 MiB mixed data/metadata mode as they're so small, which means data and metadata must both be the same mode, and they're both set to dup mode. Which means they effectively only hold (a bit under) 128 MiB worth of data, since it's dup mode for both data and metadata, but 128 MiB is fine for /boot, as long as I don't let too many kernels build up.

So I'd obviously recommend dup mode, unless you know your ssd's firmware is going to dedup it anyway. But that's just me. As I said, there has been some discussion about it on the list, and some people make other choices. You can of course dig in the list archives if you want to see previous threads on the subject, but ultimately, it's up to you. =:^)

--
Duncan - List replies preferred. No HTML msgs. Every
Re: BTRFS, SSD and single metadata
On Monday, 16 June 2014 at 12:16:33, Duncan wrote:

> Does btrfs automatically add the ssd mount option or do you have to add it? If you have to add it, that means btrfs isn't detecting the ssd.

The first time I mounted the freshly created filesystem, it actually added the ssd option by itself. So indeed, it could see this was an SSD...

> However, it occurs to me that with the LUKS encryption layer, I'm not entirely sure if duplication at the btrfs level would end up as the same encrypted stream headed to the hardware in any case. If it would encrypt the two copies of the dup-mode metadata as different, then the hardware dedup/compression wouldn't work on it anyway. OTOH, if it encrypts them as the same stream headed to hardware, then the firmware dedup could still collapse them.

That makes an excellent point. Two copies of the same binary data in different filesystem sectors, processed through LUKS, will produce entirely different binary ciphertext, so the two metadata copies will always differ on disk and cannot be deduped by the SSD firmware...

Kind regards. -- Swâmi Petaramesh sw...@petaramesh.org http://petaramesh.org PGP 9076E32E "Human beings, hemmed in by their needs, flail aimlessly like rabbits caught in a trap. So, monk, turn away from these needs and find freedom." -- Buddha Shakyamuni -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
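For the curious, the sector-tweak effect described above can be sketched in a few lines. This is NOT real LUKS/AES-XTS -- the "cipher" below is a toy keystream derived from the key and the sector number, invented purely to illustrate why identical plaintext written to two different sectors produces different ciphertext that firmware dedup cannot collapse:

```python
import hmac
import hashlib

def encrypt_sector(key: bytes, sector: int, data: bytes) -> bytes:
    # Toy stand-in for a sector-tweaked cipher: the keystream depends on
    # both the key and the sector number, as with the per-sector tweak in
    # real XTS-style disk encryption. XOR-based, so the same call decrypts.
    stream = b""
    counter = 0
    while len(stream) < len(data):
        msg = sector.to_bytes(8, "big") + counter.to_bytes(4, "big")
        stream += hmac.new(key, msg, hashlib.sha256).digest()
        counter += 1
    return bytes(p ^ s for p, s in zip(data, stream))

key = b"\x01" * 32
metadata = b"identical btrfs dup-mode metadata block"

c1 = encrypt_sector(key, 1000, metadata)  # first dup copy, at sector 1000
c2 = encrypt_sector(key, 9000, metadata)  # second dup copy, at sector 9000

print(c1 != c2)                                    # ciphertexts differ
print(encrypt_sector(key, 1000, c1) == metadata)   # same tweak decrypts
```

Because the sector number feeds the keystream, the two dup copies leave the encryption layer as unrelated byte strings, exactly as argued above.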
BTRFS SSD RAID 1: Does it trim on both devices? :)
Hi! For a few days now this ThinkPad T520 has had 780 GB of SSD capacity. The 300 GB of the Intel SSD 320 were almost full, and the 480 GB Crucial m500 mSATA SSD was cheap enough to just buy it. I created a new logical volume for big and not-often-changed files that sits just on the mSATA, and moved all the music and photo files to it. Then I thought I'd work in place: just shrink the /home volume on the Intel SSD 320, then add a similarly sized volume on the Crucial and have it rebalanced to metadata and data RAID 1. Unfortunately the shrinking hung at some point. I didn't check with a scrub afterward, but I rsync'd all data to a newly created volume on the Crucial and then rebalanced the other way around. Since the rsync didn't show any I/O errors, I suppose the BTRFS with the shrink failure was still okay. I'll report this shrink failure separately. So now I have a BTRFS RAID 1 for /home and /, and I wondered: does BTRFS trim all devices on issuing fstrim? Some details:

merkaba:~> btrfs fi show
Label: debian  uuid: […]
        Total devices 2  FS bytes used 15.89GiB
        devid 1 size 30.00GiB used 19.03GiB path /dev/mapper/sata-debian
        devid 2 size 30.00GiB used 19.01GiB path /dev/dm-2

Label: home  uuid: […]
        Total devices 2  FS bytes used 91.59GiB
        devid 1 size 150.00GiB used 102.00GiB path /dev/dm-0
        devid 2 size 150.00GiB used 102.00GiB path /dev/mapper/sata-home

Label: daten  uuid: […]
        Total devices 1  FS bytes used 148.36GiB
        devid 1 size 200.00GiB used 150.02GiB path /dev/mapper/msata-daten

merkaba:~> LANG=C df -hT -t btrfs
Filesystem              Type   Size  Used Avail Use% Mounted on
/dev/mapper/sata-debian btrfs   60G   32G   26G  56% /
/dev/dm-0               btrfs  300G  184G  113G  62% /home
/dev/mapper/sata-debian btrfs   60G   32G   26G  56% /mnt/debian-zeit
/dev/dm-0               btrfs  300G  184G  113G  62% /mnt/home-zeit
/dev/mapper/msata-daten btrfs  200G  149G   51G  75% /daten

(the zeit mounts show the root subvolume for snapshots)

merkaba:[…]> ./btrfs fi df /
Disk size:           60.00GB
Disk allocated:      38.04GB
Disk unallocated:    21.96GB
Used:                15.89GB
Free (Estimated):    14.13GB  (Max: 25.10GB, min: 14.12GB)
Data to disk ratio:      50 %

merkaba:[…]> ./btrfs fi df /home
Disk size:          300.00GB
Disk allocated:     204.00GB
Disk unallocated:    96.00GB
Used:                91.59GB
Free (Estimated):    58.41GB  (Max: 106.41GB, min: 58.41GB)
Data to disk ratio:      50 %

merkaba:[…]> ./btrfs device disk-usage /
/dev/dm-2                30.00GB
  Data,RAID1:            17.00GB
  Metadata,RAID1:         2.00GB
  System,RAID1:           8.00MB
  Unallocated:           10.99GB

/dev/mapper/sata-debian  30.00GB
  Data,Single:            8.00MB
  Data,RAID1:            17.00GB
  Metadata,Single:        8.00MB
  Metadata,RAID1:         2.00GB
  System,Single:          4.00MB
  System,RAID1:           8.00MB
  Unallocated:           10.97GB

merkaba:[…]> ./btrfs filesystem disk-usage -t /
                         Data    Data     Metadata  Metadata  System  System
                         Single  RAID1    Single    RAID1     Single  RAID1   Unallocated
/dev/dm-2                -       17.00GB  -         2.00GB    -       8.00MB  10.99GB
/dev/mapper/sata-debian  8.00MB  17.00GB  8.00MB    2.00GB    4.00MB  8.00MB  10.97GB
                         ======  =======  ======    ========  ======  ======  ===========
Total                    8.00MB  17.00GB  8.00MB    2.00GB    4.00MB  8.00MB  21.96GB
Used                     0.00    15.13GB  0.00      778.00MB  0.00    16.00KB

merkaba:[…]> ./btrfs device disk-usage /home
/dev/dm-0                150.00GB
  Data,RAID1:             98.00GB
  Metadata,RAID1:          4.00GB
  System,Single:           4.00MB
  Unallocated:            48.00GB

/dev/mapper/sata-home    150.00GB
  Data,RAID1:             98.00GB
  Metadata,RAID1:          4.00GB
  Unallocated:            48.00GB

merkaba:[…]> ./btrfs filesystem disk-usage -t /home
                       Data     Metadata  System
                       RAID1    RAID1     Single  Unallocated
/dev/dm-0              98.00GB  4.00GB    4.00MB  48.00GB
/dev/mapper/sata-home  98.00GB  4.00GB    -       48.00GB
                       =======  ========  ======  ===========
Total                  98.00GB  4.00GB    4.00MB  96.00GB
Used                   89.60GB  1.99GB    16.00KB

And some healthy overprovisioning:

merkaba:~> vgs
  VG    #PV #LV #SN Attr   VSize   VFree
  msata   1   3   0 wz--n- 446,64g 66,64g
  sata    1   3   0 wz--n- 278,99g 86,99g

Thanks, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
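As a side note on the overprovisioning figure: the unallocated share of each volume group can be read straight off the vgs output above. A quick sketch using the VSize/VFree numbers quoted there (nothing else assumed):

```python
# Fraction of each LVM volume group left unallocated, from the vgs output
# above (VSize, VFree in GiB; commas in the original are decimal points).
vgs = {"msata": (446.64, 66.64), "sata": (278.99, 86.99)}

for vg, (size_g, free_g) in vgs.items():
    print(f"{vg}: {free_g / size_g:.0%} of the VG left unallocated")
```

Roughly 15% on the msata VG and 31% on the sata VG stays unallocated, which the SSD firmware can use as extra spare area once trimmed.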
Re: BTRFS SSD
Yuehai Xu wrote (ao):

> So, is it a bottleneck in the case of SSD, since the cost of an overwrite is very high? For every write, I think the superblocks have to be overwritten; that might be much more frequent than for other common blocks on the SSD, even though the SSD does wear leveling internally via its FTL.

The FTL will make sure the write cycles are evenly divided among the physical blocks, regardless of how often you overwrite a single spot on the fs.

> What I currently know is that for the Intel X25-V SSD, the write throughput of BTRFS is almost 80% less than that of EXT3 in the case of PostMark. This really confuses me.

Can you show the script you use to test this, provide some info regarding your setup, and show the numbers you see? Sander -- Humilis IT Services and Solutions http://www.humilis.net
Re: BTRFS SSD
On 29/09/2010 23:31, Yuehai Xu wrote:
> On Wed, Sep 29, 2010 at 3:59 PM, Sean Bartell <wingedtachik...@gmail.com> wrote:
>> On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:
>>> On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell <wingedtachik...@gmail.com> wrote:
>>>> On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
>>>>> I know BTRFS is a kind of log-structured file system, which doesn't do overwrites. Here is my question: suppose file A is overwritten by A'. Instead of writing A' to the original place of A, a new place is selected to store it. However, we know that the address of a file should be recorded in its inode. In that case, the corresponding part of the inode of A is updated from the original place of A to the new place of A' -- is this actually a kind of overwrite? I think no matter how a log-structured FS is designed, a mapping table is always needed, such as an inode map, DAT, etc. When an update operation happens to this mapping table, is it actually a kind of overwrite? If it is, is it a bottleneck for write performance on SSDs?
>>>> In btrfs, this is solved by doing the same thing for the inode -- a new place for the leaf holding the inode is chosen. Then the parent of the leaf must point to the new position of the leaf, so the parent is moved, and the parent's parent, etc. This goes all the way up to the superblocks, which are actually overwritten, one at a time.
>>> You mean that there is no overwrite for the inode either: once the inode needs to be updated, it is actually written to a new place, and the only thing to do is to change its parent's pointer to this new place. However, for the last parent, the superblock, does it need to be overwritten?
>> Yes. The idea of copy-on-write, as used by btrfs, is that whenever *anything* is changed, it is simply written to a new location. This applies to data, inodes, and all of the B-trees used by the filesystem. However, it's necessary to have *something* in a fixed place on disk pointing to everything else. So the superblocks can't move, and they are overwritten instead.
> So, is it a bottleneck in the case of SSD, since the cost of an overwrite is very high? For every write, I think the superblocks have to be overwritten; that might be much more frequent than for other common blocks on the SSD, even though the SSD does wear leveling internally via its FTL.

SSDs already do copy-on-write. They can't change small parts of the data in a block, but have to re-write the block. While that could be done by reading the whole erase block to a RAM buffer, changing the data, erasing the flash block, then re-writing, this is not what happens in practice. To make efficient use of write blocks that are smaller than erase blocks, and to provide wear levelling, the flash disk will implement a small change to a block by writing a new copy of the modified block to a different part of the flash, then updating its block indirection tables. BTRFS just makes this process a bit more explicit (except for superblock writes).

> What I currently know is that for the Intel X25-V SSD, the write throughput of BTRFS is almost 80% less than that of EXT3 in the case of PostMark. This really confuses me.

Different file systems have different strengths and weaknesses. I haven't actually tested BTRFS much, but my understanding is that it will be significantly slower than EXT in certain cases, such as small modifications to large files (since copy-on-write means a lot of extra disk activity in such cases). But for other things it is faster. Also remember that BTRFS is under development - optimising for raw speed comes at a lower priority than correctness and safety of data, and implementation of BTRFS features. Once everyone is happy with the stability of the file system and its functionality and tools, you can expect the speed to improve somewhat over time.
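The remap-on-write behaviour described above can be modelled in a few lines. This is an invented toy model (a round-robin free list, no garbage-collection details), not any real firmware, but it shows why hammering one logical block -- say, a superblock -- still spreads program cycles evenly across the whole flash:

```python
class ToyFTL:
    """Toy flash translation layer: every logical write goes to a fresh
    physical block, and the logical->physical indirection table is updated.
    Purely illustrative -- real FTLs are far more complex."""

    def __init__(self, physical_blocks: int):
        self.free = list(range(physical_blocks))  # physical blocks ready to program
        self.l2p = {}                             # logical -> physical mapping
        self.writes = [0] * physical_blocks      # per-block program count

    def write(self, logical: int):
        phys = self.free.pop(0)        # always program a fresh block
        old = self.l2p.get(logical)
        if old is not None:
            self.free.append(old)      # old copy becomes garbage, reusable
        self.l2p[logical] = phys       # update the indirection table
        self.writes[phys] += 1

ftl = ToyFTL(physical_blocks=8)
for _ in range(800):
    ftl.write(logical=0)               # hammer one logical "superblock"

# Wear stays perfectly even despite 800 overwrites of one logical block:
print(max(ftl.writes) - min(ftl.writes))   # prints 0
```

Note that the filesystem's overwrite of a fixed logical location never translates into repeated programming of the same physical cells -- which is the point made above about superblock writes.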
Re: BTRFS SSD
On Thu, Sep 30, 2010 at 3:51 AM, David Brown <da...@westcontrol.com> wrote:
> [...]
> SSDs already do copy-on-write. [...] To make efficient use of write blocks that are smaller than erase blocks, and to provide wear levelling, the flash disk will implement a small change to a block by writing a new copy of the modified block to a different part of the flash, then updating its block indirection tables.

Yes, the FTL inside the SSD does that kind of job, and the overhead should be small as long as the block mapping is page-level. However, a full page-level mapping is too large to be stored entirely in the SRAM of the SSD, so many complicated algorithms have been developed to optimize this. In other words, SSDs might not always be smart enough to do wear leveling with small overhead. This is my subjective opinion.

> BTRFS just makes this process a bit more explicit (except for superblock writes).

As you have said, the superblocks have to be overwritten -- is that frequent? If it is, could it be a potential bottleneck for SSD throughput? After all, SSDs are not happy with overwrites. Of course, few people really know what the FTL algorithms are, and they determine the actual efficiency of the SSD.

> Different file systems have different strengths and weaknesses. [...] Once everyone is happy with the stability of the file system and its functionality and tools, you can expect the speed to improve somewhat over time.

My test case for PostMark is:

set file size 9216 15360 (file size from 9216 to 15360 bytes)
set number 5 (file number is 5)

Write throughput (MB/s) for different file systems on the Intel SSD X25-V:

EXT3: 28.09
NILFS2: 10
BTRFS: 17.35
EXT4: 31.04
XFS:
Re: BTRFS SSD
On Thu, Sep 30, 2010 at 3:15 AM, Sander <san...@humilis.net> wrote:
> Yuehai Xu wrote (ao):
>> So, is it a bottleneck in the case of SSD, since the cost of an overwrite is very high? For every write, I think the superblocks have to be overwritten; that might be much more frequent than for other common blocks on the SSD, even though the SSD does wear leveling internally via its FTL.
> The FTL will make sure the write cycles are evenly divided among the physical blocks, regardless of how often you overwrite a single spot on the fs.
>> What I currently know is that for the Intel X25-V SSD, the write throughput of BTRFS is almost 80% less than that of EXT3 in the case of PostMark. This really confuses me.
> Can you show the script you use to test this, provide some info regarding your setup, and show the numbers you see?

My test case for PostMark is:

set file size 9216 15360 (file size from 9216 to 15360 bytes)
set number 5 (file number is 5)

Write throughput (MB/s) for different file systems on the Intel SSD X25-V:

EXT3: 28.09
NILFS2: 10
BTRFS: 17.35
EXT4: 31.04
XFS: 11.56
REISERFS: 28.09
EXT2: 15.94

Thanks, Yuehai
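For reference, the relative gaps in those numbers can be computed directly. A quick sketch over the quoted figures (note that by these particular numbers btrfs is about 38% below ext3, not 80%, so the "almost 80% less" claim earlier in the thread may stem from a different run or configuration):

```python
# PostMark write throughput (MB/s) as quoted above, Intel SSD X25-V.
results = {"EXT3": 28.09, "NILFS2": 10.0, "BTRFS": 17.35,
           "EXT4": 31.04, "XFS": 11.56, "REISERFS": 28.09, "EXT2": 15.94}

ext3 = results["EXT3"]
for fs, mbps in sorted(results.items(), key=lambda kv: -kv[1]):
    # Percentage difference of each filesystem relative to EXT3.
    print(f"{fs:9s} {mbps:6.2f} MB/s  ({(mbps - ext3) / ext3:+.0%} vs EXT3)")
```

By this table EXT4 is the fastest and NILFS2 the slowest, with BTRFS sitting roughly 38% below EXT3 for this small-file workload.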
Re: BTRFS SSD
On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
> I know BTRFS is a kind of log-structured file system, which doesn't do overwrites. [...] When an update operation happens to this mapping table, is it actually a kind of overwrite? If it is, is it a bottleneck for write performance on SSDs?

In btrfs, this is solved by doing the same thing for the inode -- a new place for the leaf holding the inode is chosen. Then the parent of the leaf must point to the new position of the leaf, so the parent is moved, and the parent's parent, etc. This goes all the way up to the superblocks, which are actually overwritten, one at a time.

> What do you think is the major work BTRFS could do to improve performance on SSDs? I know the FTL has become smarter and smarter; the idea of a log-structured file system is always implemented inside the SSD by the FTL, so it sounds as if all the issues are solved no matter which FS sits in the upper stack. But the benchmark results on the internet show that performance differs quite a bit between file systems, such as NILFS2 and BTRFS.
Re: BTRFS SSD
Hi, On Wed, Sep 29, 2010 at 11:37 AM, Dipl.-Ing. Michael Niederle <mniede...@gmx.at> wrote:
> Hi Yuehai! I tested nilfs2 and btrfs for use with flash-based pen drives. nilfs2 performed incredibly well as long as there were enough free blocks, but the garbage collector of nilfs used too much IO bandwidth to be usable (with slow-write flash devices).

I also tested write performance on an Intel X25-V SSD with PostMark; the results are totally different from the results for the Intel X25-M (http://www.usenix.org/event/lsf08/tech/shin_SSD.pdf). In that test, the performance of NILFS2 was the best overall; in my test, ext3 is the best while NILFS2 is the worst, with almost 10 times less write throughput than ext3. So, what is the file system's role in handling this tricky storage? Different file systems get different throughput. The question is why nilfs2 and btrfs perform so well compared with ext3 (setting my results aside). Here I just talk about SSDs, since the FTL internally should always do the same thing as such a file system: redirect the write to a new place instead of writing to the original place. The throughput of different file systems should then be more or less the same.

> btrfs on the other side performed very well - a lot better than conventional file systems like ext2/3 or reiserfs. After switching the mount options to noatime I was able to run a complete Linux system from a (quite slow) pen drive without (much) problems. Performance on a fast pen drive is great. I've been using btrfs as the root file system on a daily basis since last Christmas without running into any problems.

Is the performance of a file system determined by the internal structure of the SSD? By the structure of the file system? Or by the coordination of both? Thanks very much for replying.

> Greetings, Michael

Thanks, Yuehai
Re: BTRFS SSD
On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell <wingedtachik...@gmail.com> wrote:
> On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
>> I know BTRFS is a kind of log-structured file system, which doesn't do overwrites. [...]
> In btrfs, this is solved by doing the same thing for the inode -- a new place for the leaf holding the inode is chosen. Then the parent of the leaf must point to the new position of the leaf, so the parent is moved, and the parent's parent, etc. This goes all the way up to the superblocks, which are actually overwritten, one at a time.

You mean that there is no overwrite for the inode either: once the inode needs to be updated, it is actually written to a new place, and the only thing to do is to change its parent's pointer to this new place. However, for the last parent, the superblock, does it need to be overwritten? I'm afraid I don't quite understand the meaning of your last sentence.

Thanks for replying, Yuehai

> What do you think is the major work BTRFS could do to improve performance on SSDs? [...] But the benchmark results on the internet show that performance differs quite a bit between file systems, such as NILFS2 and BTRFS.
Re: BTRFS SSD
On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell <wingedtachik...@gmail.com> wrote:
> In btrfs, this is solved by doing the same thing for the inode -- a new place for the leaf holding the inode is chosen. Then the parent of the leaf must point to the new position of the leaf, so the parent is moved, and the parent's parent, etc. This goes all the way up to the superblocks, which are actually overwritten, one at a time.

Sorry for the useless question, but just out of curiosity: doesn't this mean that btrfs has to do quite a lot more writes than ext4 for small file operations? E.g., if you append one block to a file, like a log file, then ext3 should have to do about three writes: data, metadata, and journal (and the latter is always sequential, so it's cheap). But btrfs will need to do more, rewriting parent nodes all the way up the line for both the data and metadata blocks. Why doesn't this hurt performance a lot?
Re: BTRFS SSD
On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:
> On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell <wingedtachik...@gmail.com> wrote:
>> [...]
> You mean that there is no overwrite for the inode either: once the inode needs to be updated, it is actually written to a new place, and the only thing to do is to change its parent's pointer to this new place. However, for the last parent, the superblock, does it need to be overwritten?

Yes. The idea of copy-on-write, as used by btrfs, is that whenever *anything* is changed, it is simply written to a new location. This applies to data, inodes, and all of the B-trees used by the filesystem. However, it's necessary to have *something* in a fixed place on disk pointing to everything else. So the superblocks can't move, and they are overwritten instead.
Re: BTRFS SSD
On Wed, Sep 29, 2010 at 03:39:07PM -0400, Aryeh Gregor wrote:
> On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell <wingedtachik...@gmail.com> wrote:
>> [...]
> Sorry for the useless question, but just out of curiosity: doesn't this mean that btrfs has to do quite a lot more writes than ext4 for small file operations? [...] Why doesn't this hurt performance a lot?

For a single change, it does write more. However, there are usually many changes to children being performed at once, which require only one change to the parent. Since it's moving everything to new places, btrfs also has much more control over where writes occur, so all the leaves and parents can be written sequentially. ext3 is a slave to the current locations on disk.
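The path-copying being discussed here can be illustrated with a toy persistent tree (an invented structure, not real btrfs B-trees): updating one leaf allocates a new copy of every node on the path to the root, while the rest of the old tree is shared rather than overwritten.

```python
copies = 0  # counts node allocations, i.e. blocks "written"

class Node:
    def __init__(self, left=None, right=None, value=None):
        global copies
        copies += 1
        self.left, self.right, self.value = left, right, value

def build(depth=3):
    # Full binary tree of the given depth; leaves hold values.
    if depth == 0:
        return Node(value=0)
    return Node(build(depth - 1), build(depth - 1))

def update(node, key, value, depth=3):
    # Copy-on-write update: returns a NEW node at every level of the
    # path, leaving the old tree untouched. Untouched subtrees are shared.
    if depth == 0:
        return Node(value=value)
    if key & (1 << (depth - 1)):
        return Node(node.left, update(node.right, key, value, depth - 1))
    return Node(update(node.left, key, value, depth - 1), node.right)

root = build()       # 15 nodes in total
copies = 0
root = update(root, 5, "A")
print(copies)        # 4 writes: the leaf plus 3 interior nodes on its path
```

One leaf update costs a root path (4 nodes here), not the whole tree, and -- as noted above -- a commit that batches many leaf updates shares the interior copies, so the per-change overhead shrinks further.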
Re: BTRFS SSD
On Wed, Sep 29, 2010 at 3:59 PM, Sean Bartell <wingedtachik...@gmail.com> wrote:
> [...]
> Yes. The idea of copy-on-write, as used by btrfs, is that whenever *anything* is changed, it is simply written to a new location. This applies to data, inodes, and all of the B-trees used by the filesystem. However, it's necessary to have *something* in a fixed place on disk pointing to everything else. So the superblocks can't move, and they are overwritten instead.

So, is it a bottleneck in the case of SSD, since the cost of an overwrite is very high? For every write, I think the superblocks have to be overwritten; that might be much more frequent than for other common blocks on the SSD, even though the SSD does wear leveling internally via its FTL. What I currently know is that for the Intel X25-V SSD, the write throughput of BTRFS is almost 80% less than that of EXT3 in the case of PostMark. This really confuses me.

Thanks, Yuehai
Btrfs SSD autodetection and mount options
Hello everyone, A quick update on the btrfs SSD modes. The pull request I just sent to Linus includes autodetection of SSD devices based on the queue's rotational flag. You can see this for your devices in /sys/block/xxx/queue/rotational. If all the devices in your FS have a 0 in the rotational flag, btrfs automatically enables ssd mode. You can turn this off with mount -o nossd. The default ssd flag tries to find rough groupings of blocks to allocate from, and will try to pack blocks into the free space available. So, if you have something like this (pretending this is a bitmap of free blocks):

free | free | used | free | used | free | free | used

Btrfs SSD mode will collect a large region of mostly free blocks and allocate from that. This works well on newer and high-end SSDs that prefer us to reuse blocks instead of spreading IO across the whole device. But low-end devices may have to do a read/modify/write cycle when we actually do IO in this case. I've added a new mount option for those devices: mount -o ssd_spread. This is not autodetected; you still need to pass it on the mount command line or in /etc/fstab. In ssd_spread mode, btrfs will try much harder to find a contiguous chunk of free blocks and hand those out. -chris
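The difference between the two modes can be sketched against the free-block bitmap above. The allocator logic below is invented purely for illustration (the real btrfs allocator is far more involved, and the 75% threshold is made up): "ssd" settles for a mostly free region, while "ssd_spread" holds out for a fully contiguous free run.

```python
def find_region(bitmap, need, spread=False):
    """Toy allocator sketch. bitmap: True = free block.
    spread=False ('ssd'):       accept a window that is mostly free.
    spread=True ('ssd_spread'): require a fully contiguous free run.
    Returns the start index of the chosen region, or None."""
    for start in range(len(bitmap) - need + 1):
        free = sum(bitmap[start:start + need])
        if spread and free == need:             # every block must be free
            return start
        if not spread and free >= need * 0.75:  # "mostly free" is enough
            return start
    return None

#          free  free  used   free  used   free  free  used
bitmap = [True, True, False, True, False, True, True, False]

print(find_region(bitmap, 4, spread=False))  # prints 0: 3 of 4 blocks free
print(find_region(bitmap, 4, spread=True))   # prints None: no 4-block run
```

With this fragmented bitmap, ssd mode finds a usable region immediately, while ssd_spread finds nothing and would have to look elsewhere -- the trade-off Chris describes for low-end devices that pay dearly for read/modify/write cycles.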