Re: Btrfs/SSD
Theoretically all sectors in the over-provisioning area are erased - practically they are either erased, waiting to be erased, or broken. What you have to understand is that sectors on an SSD are not where you think they are - they can swap places with sectors in the over-provisioning area, they can swap places with each other, etc. … the stuff you see as a disk from 0 to MAX does not have to be arranged in sequence on the SSD (and mostly never is).

If you never trim, then when your device is 100% full you need to start overwriting data to keep writing - this is where over provisioning shines: the SSD fakes that you write to a sector while you really write to a sector in the over-provisioning area, and the two magically swap places without you knowing -> the sector that was occupied ends up in the over-provisioning pool and the SSD hardware performs a slow erase on it to make it free for the future. This mechanism is simple and transparent for users -> you don't know that it happens and the SSD does all the heavy lifting.

The over-provisioned area has more uses than that. For example, if you have a 1TB drive where you store 500GB of data that you never modify -> the SSD will copy part of that data to the over-provisioned area -> free the sectors that were unwritten for a while -> free the sectors that were continuously hammered by writes and write the static data there. This mechanism is wear levelling - it means the SSD internals make sure that sectors on the SSD get equal use over time. Despite some thinking it's pointless, imagine a situation where you've got a 1TB drive with 1GB free and you keep writing and modifying data in that free 1GB … those sectors would quickly die due to the short flash life expectancy (some as short as 1k erases!).
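The sector remapping described above can be sketched as a toy flash translation layer. Everything here (class name, structure) is invented for illustration; real controllers are far more sophisticated, and in particular this toy omits the static wear leveling step (migrating never-modified data onto worn blocks) that the wear-levelling paragraph describes.

```python
# Toy FTL sketch: logical sectors are remapped to physical blocks, a write
# goes to the least-worn pre-erased spare block, and the old physical block
# returns to the erase pool.  Purely illustrative, not a real controller.

class ToyFTL:
    def __init__(self, physical_blocks, logical_sectors):
        # the surplus physical blocks are the over-provisioning pool
        assert physical_blocks > logical_sectors
        self.erase_counts = [0] * physical_blocks
        self.mapping = list(range(logical_sectors))      # logical -> physical
        self.erased_pool = list(range(logical_sectors, physical_blocks))

    def write(self, logical):
        # pick the least-worn block from the pre-erased pool
        target = min(self.erased_pool, key=lambda b: self.erase_counts[b])
        self.erased_pool.remove(target)
        old = self.mapping[logical]
        self.mapping[logical] = target    # the two "swap places" invisibly
        self.erase_counts[old] += 1       # background erase of the old block
        self.erased_pool.append(old)

ftl = ToyFTL(physical_blocks=8, logical_sectors=6)
for _ in range(1000):
    ftl.write(0)                          # hammer a single logical sector
# writes to one hot sector rotate through the spares instead of killing
# one physical block; cold blocks stay untouched (no static wear leveling)
print(ftl.erase_counts)
```

Blocks holding the never-written logical sectors keep an erase count of zero here, which is exactly why real firmware adds the static wear leveling pass described above.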
So again, buy good quality drives (not hardcore enterprise drives, just good consumer ones), leave the rest to the drive, use an OS that gives you trim, and you should be golden.

> On 15 May 2017, at 00:01, Imran Geriskovan wrote:
>
> On 5/14/17, Tomasz Kusmierz wrote:
>> In terms of over provisioning of SSD it’s a give and take relationship … on
>> good drive there is enough over provisioning to allow a normal operation on
>> systems without TRIM … now if you would use a 1TB drive daily without TRIM
>> and have only 30GB stored on it you will have fantastic performance but if
>> you will want to store 500GB at roughly 200GB you will hit a brick wall and
>> your writes will slow down to megabytes / s … this is symptom of drive running
>> out of over provisioning space …
>
> What exactly happens on a non-trimmed drive?
> Does it begin to forge certain erase-blocks? If so
> which are those? What happens when you never
> trim and continue dumping data on it?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: balancing every night broke balancing so now I can't balance anymore?
On Sun, May 14, 2017 at 09:21:11PM +, Hugo Mills wrote:
> > 2) balance -musage=0
> > 3) balance -musage=20
>
> In most cases, this is going to make ENOSPC problems worse, not
> better. The reason for doing this kind of balance is to recover unused
> space and allow it to be reallocated. The typical behaviour is that
> data gets overallocated, and it's metadata which runs out. So, the
> last thing you want to be doing is reducing the metadata allocation,
> because that's the scarce resource.
>
> Also, I'd usually recommend using limit=n, where n is approximately
> the amount of data overallocation (allocated space less used
> space). It's much more controllable than usage.

Thanks for that.
So, would you just remove the balance -musage=20 altogether?
As for limit=, I'm not sure it would be helpful since I run this
nightly. Anything that doesn't get done tonight due to limit would be
done tomorrow?

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: balancing every night broke balancing so now I can't balance anymore?
On 14/05/2017 at 23:30, Kai Krakow wrote:
> Am Sun, 14 May 2017 22:57:26 +0200
> schrieb Lionel Bouton:
>
>> I've coded one Ruby script which tries to balance between the cost of
>> reallocating group and the need for it.[...]
>> Given its current size, I should probably push it on github...
> Yes, please... ;-)

Most of our BTRFS filesystems are used by Ceph OSD, so here it is:
https://github.com/jtek/ceph-utils/blob/master/btrfs-auto-rebalance.rb

Best regards,

Lionel
Re: Btrfs/SSD
On 5/14/17, Tomasz Kusmierz wrote:
> In terms of over provisioning of SSD it’s a give and take relationship … on
> good drive there is enough over provisioning to allow a normal operation on
> systems without TRIM … now if you would use a 1TB drive daily without TRIM
> and have only 30GB stored on it you will have fantastic performance but if
> you will want to store 500GB at roughly 200GB you will hit a brick wall and
> your writes will slow down to megabytes / s … this is symptom of drive running
> out of over provisioning space …

What exactly happens on a non-trimmed drive?
Does it begin to forge certain erase-blocks? If so
which are those? What happens when you never
trim and continue dumping data on it?
Re: balancing every night broke balancing so now I can't balance anymore?
On Sun, 14 May 2017 22:57:26 +0200, Lionel Bouton wrote:

> I've coded one Ruby script which tries to balance between the cost of
> reallocating group and the need for it. The basic idea is that it
> tries to keep the proportion of free space "wasted" by being allocated
> although it isn't used below a threshold. It will bring this
> proportion down enough through balance that minor reallocation won't
> trigger a new balance right away. It should handle pathological
> conditions as well as possible and it won't spend more than 2 hours
> working on a single filesystem by default. We deploy this as a daily
> cron script through Puppet on all our systems and it works very well
> (I didn't have to use balance manually to manage free space since we
> did that). Note that by default it sleeps a random amount of time to
> avoid IO spikes on VMs running on the same host. You can either edit
> it or pass it "0" which will be used for the max amount of time to
> sleep, bypassing this precaution.
>
> Here is the latest version : https://pastebin.com/Rrw1GLtx
> Given its current size, I should probably push it on github...

Yes, please... ;-)

> I've seen other maintenance scripts mentioned on this list so you
> might find something simpler or more targeted to your needs by
> browsing through the list's history.

--
Regards,
Kai

Replies to list-only preferred.
Re: balancing every night broke balancing so now I can't balance anymore?
On Sun, May 14, 2017 at 01:15:09PM -0700, Marc MERLIN wrote:
> On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote:
> > On 05/13/2017 10:54 PM, Marc MERLIN wrote:
> > > Kernel 4.11, btrfs-progs v4.7.3
> > >
> > > I run scrub and balance every night, been doing this for 1.5 years on this
> > > filesystem.
> >
> > What are the exact commands you run every day?
>
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
> (at the bottom)
> every night:
> 1) scrub
> 2) balance -musage=0
> 3) balance -musage=20

In most cases, this is going to make ENOSPC problems worse, not
better. The reason for doing this kind of balance is to recover unused
space and allow it to be reallocated. The typical behaviour is that
data gets overallocated, and it's metadata which runs out. So, the
last thing you want to be doing is reducing the metadata allocation,
because that's the scarce resource.

Also, I'd usually recommend using limit=n, where n is approximately
the amount of data overallocation (allocated space less used
space). It's much more controllable than usage.

Hugo.

> 4) balance -dusage=0
> 5) balance -dusage=20
>
> > > How did I get into such a misbalanced state when I balance every night?
> >
> > I don't know, since I don't know what you do exactly. :)
>
> Now you do :)
>
> > > My filesystem is not full, I can write just fine, but I sure cannot
> > > rebalance now.
> >
> > Yes, because you have quite some allocated but unused space. If btrfs
> > cannot just allocate more chunks, it starts trying a bit harder to reuse
> > all the empty spots in the already existing chunks.
>
> Ok. Shouldn't balance fix problems just like this?
> I have 60GB-ish free, or in this case that's also >25%, that's a lot
>
> Speaking of unallocated, I have more now:
> Device unallocated: 993.00MiB
>
> This kind of just magically fixed itself during snapshot rotation and
> deletion I think.
> Sure enough, balance works again, but this feels pretty fragile.
> Looking again:
> Device size:        228.67GiB
> Device allocated:   227.70GiB
> Device unallocated: 993.00MiB
> Free (estimated):    58.53GiB (min: 58.53GiB)
>
> You're saying that I need unallocated space for new chunks to be
> created, which is required by balance.
> Should btrfs not take care of keeping some space for me?
> Shouldn't a nightly balance, which I'm already doing, help even more
> with this?
>
> > > Besides adding another device to add space, is there a way around this
> > > and more generally not getting into that state anymore considering that
> > > I already rebalance every night?
> >
> > Add monitoring and alerting on the amount of unallocated space.
> >
> > FWIW, this is what I use for that purpose:
> >
> > https://packages.debian.org/sid/munin-plugins-btrfs
> > https://packages.debian.org/sid/monitoring-plugins-btrfs
> >
> > And, of course the btrfs-heatmap program keeps being a fun tool to
> > create visual timelapses of your filesystem, so you can learn how your
> > usage pattern is resulting in allocation of space by btrfs, and so that
> > you can visually see what the effect of your btrfs balance attempts is:
>
> That's interesting, but ultimately, users shouldn't have to micromanage
> their filesystem to that level, even btrfs.
>
> a) What is wrong in my nightly script that I should fix/improve?
> b) How do I recover from my current state?
>
> Thanks,
> Marc

--
Hugo Mills             | You stay in the theatre because you're afraid of
hugo@... carfax.org.uk | having no money? There's irony...
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                              Slings and Arrows
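Hugo's rule of thumb above (limit=n, with n roughly the data overallocation) can be turned into a quick calculation from the numbers that `btrfs fi usage` reports. This is only a sketch of that arithmetic: it assumes 1GiB data chunks (the usual size), and the function name is invented.

```python
# Deriving a "limit=n" balance filter value per Hugo's rule of thumb:
# n is approximately the data overallocation (allocated minus used),
# expressed in chunks.  Assumes 1GiB data chunks; purely illustrative.

GIB = 1024 ** 3

def balance_limit(data_allocated_gib, data_used_gib, chunk_size=GIB):
    overallocated_bytes = (data_allocated_gib - data_used_gib) * GIB
    return max(0, round(overallocated_bytes / chunk_size))

# Marc's filesystem: Data,single: Size:221.60GiB, Used:166.28GiB
n = balance_limit(221.60, 166.28)
print(f"btrfs balance start -dlimit={n} /mnt/btrfs_pool1")
```

With limit= in a nightly job, a run that is cut short simply leaves less overallocation for the next night's run to compute a smaller n from.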
Re: balancing every night broke balancing so now I can't balance anymore?
On Sun, 14 May 2017 13:15:09 -0700, Marc MERLIN wrote:

> On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote:
> > On 05/13/2017 10:54 PM, Marc MERLIN wrote:
> > > Kernel 4.11, btrfs-progs v4.7.3
> > >
> > > I run scrub and balance every night, been doing this for 1.5
> > > years on this filesystem.
> >
> > What are the exact commands you run every day?
>
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
> (at the bottom)
> every night:
> 1) scrub
> 2) balance -musage=0
> 3) balance -musage=20
> 4) balance -dusage=0
> 5) balance -dusage=20
>
> > > How did I get into such a misbalanced state when I balance every
> > > night?
> >
> > I don't know, since I don't know what you do exactly. :)
>
> Now you do :)
>
> > > My filesystem is not full, I can write just fine, but I sure
> > > cannot rebalance now.
> >
> > Yes, because you have quite some allocated but unused space. If
> > btrfs cannot just allocate more chunks, it starts trying a bit
> > harder to reuse all the empty spots in the already existing
> > chunks.
>
> Ok. Shouldn't balance fix problems just like this?
> I have 60GB-ish free, or in this case that's also >25%, that's a lot
>
> Speaking of unallocated, I have more now:
> Device unallocated: 993.00MiB
>
> This kind of just magically fixed itself during snapshot rotation and
> deletion I think.
> Sure enough, balance works again, but this feels pretty fragile.
> Looking again:
> Device size:        228.67GiB
> Device allocated:   227.70GiB
> Device unallocated: 993.00MiB
> Free (estimated):    58.53GiB (min: 58.53GiB)
>
> You're saying that I need unallocated space for new chunks to be
> created, which is required by balance.
> Should btrfs not take care of keeping some space for me?
> Shouldn't a nightly balance, which I'm already doing, help even more
> with this?
>
> > > Besides adding another device to add space, is there a way around
> > > this and more generally not getting into that state anymore
> > > considering that I already rebalance every night?
> >
> > Add monitoring and alerting on the amount of unallocated space.
> >
> > FWIW, this is what I use for that purpose:
> >
> > https://packages.debian.org/sid/munin-plugins-btrfs
> > https://packages.debian.org/sid/monitoring-plugins-btrfs
> >
> > And, of course the btrfs-heatmap program keeps being a fun tool to
> > create visual timelapses of your filesystem, so you can learn how
> > your usage pattern is resulting in allocation of space by btrfs,
> > and so that you can visually see what the effect of your btrfs
> > balance attempts is:
>
> That's interesting, but ultimately, users shouldn't have to
> micromanage their filesystem to that level, even btrfs.
>
> a) What is wrong in my nightly script that I should fix/improve?

You may want to try
https://www.spinics.net/lists/linux-btrfs/msg52076.html

> b) How do I recover from my current state?

That script may work its way through.

--
Regards,
Kai

Replies to list-only preferred.
Re: balancing every night broke balancing so now I can't balance anymore?
On 14/05/2017 at 22:15, Marc MERLIN wrote:
> On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote:
>> On 05/13/2017 10:54 PM, Marc MERLIN wrote:
>>> Kernel 4.11, btrfs-progs v4.7.3
>>>
>>> I run scrub and balance every night, been doing this for 1.5 years on this
>>> filesystem.
>> What are the exact commands you run every day?
>
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
> (at the bottom)
> every night:
> 1) scrub
> 2) balance -musage=0
> 3) balance -musage=20
> 4) balance -dusage=0
> 5) balance -dusage=20

usage=20 is pretty low: it means you don't try to reallocate and regroup
block groups that are filled more than 20%. Constantly using this
setting has left lots of allocated block groups on your filesystem that
are mostly empty (a little more than 20% used).

The rebalance subject is a bit complex. With an empty filesystem you
almost don't need it, as group creation is sparse and it's OK to have
mostly empty groups. When your filesystem begins to fill up, you have to
raise the usage target to be able to reclaim space (as the fs fills up,
most of your groups do too) so that new block creation can happen.

I've coded one Ruby script which tries to balance between the cost of
reallocating a group and the need for it. The basic idea is that it
tries to keep the proportion of free space "wasted" by being allocated
although it isn't used below a threshold. It will bring this proportion
down enough through balance that minor reallocation won't trigger a new
balance right away. It should handle pathological conditions as well as
possible, and it won't spend more than 2 hours working on a single
filesystem by default. We deploy this as a daily cron script through
Puppet on all our systems and it works very well (I haven't had to use
balance manually to manage free space since we did that).

Note that by default it sleeps a random amount of time to avoid IO
spikes on VMs running on the same host. You can either edit it or pass
it "0", which will be used for the max amount of time to sleep,
bypassing this precaution.

Here is the latest version : https://pastebin.com/Rrw1GLtx
Given its current size, I should probably push it on github...

I've seen other maintenance scripts mentioned on this list, so you might
find something simpler or more targeted to your needs by browsing
through the list's history.

Best regards,

Lionel
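The core decision Lionel describes (only rebalance when the share of space that is allocated but unused crosses a threshold) can be sketched in a few lines. The function name and the 10% threshold here are invented for illustration; his actual Ruby script has more logic around cost and time limits.

```python
# Sketch of a "wasted space" balance trigger: only run a balance when the
# proportion of device space that is allocated to chunks but not actually
# used exceeds a threshold.  Threshold and name are invented placeholders.

def should_balance(device_size_gib, allocated_gib, used_gib,
                   wasted_threshold=0.10):
    # fraction of the whole device trapped in allocated-but-unused chunks
    wasted = (allocated_gib - used_gib) / device_size_gib
    return wasted > wasted_threshold

# Marc's numbers: 228.67GiB device, 227.70GiB allocated, 171.25GiB used
print(should_balance(228.67, 227.70, 171.25))
```

On Marc's filesystem roughly a quarter of the device is allocated but unused, so this kind of trigger would have fired long before the unallocated space dropped to 1MiB.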
Re: Btrfs/SSD (my -o ssd "summary")
On 05/14/2017 08:01 PM, Tomasz Kusmierz wrote:
> All stuff that Chris wrote holds true, I just wanted to add flash
> specific information (from my experience of writing low level code
> for operating flash)

Thanks!

> [... erase ...]

> In terms of over provisioning of SSD it’s a give and take
> relationship … on good drive there is enough over provisioning to
> allow a normal operation on systems without TRIM … now if you would
> use a 1TB drive daily without TRIM and have only 30GB stored on it
> you will have fantastic performance but if you will want to store
> 500GB at roughly 200GB you will hit a brick wall and your writes will
> slow down to megabytes / s … this is symptom of drive running out of
> over provisioning space … if you would run OS that issues trim, this
> problem would not exist since drive would know that whole 970GB of
> space is free and it would be pre-emptively erased days before.

== ssd_spread ==

The worst case behaviour is the btrfs ssd_spread mount option in
combination with not having discard enabled. It has a side effect of
minimizing the reuse of free space previously written in.

== ssd ==

[And, since I didn't write a "summary post" about this issue yet, here
is my version of it:]

The default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with writing and deleting many
files that are not too big, also cause this pattern, ending up with the
physical address space fully allocated and written to.

My favourite videos about this:

*) ssd (write pattern is small increments in /var/log/mail.log, a mail
spool on /var/spool/postfix (lots of file adds and deletes), and mailman
archives with a lot of little files):

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

*) The picture uses Hilbert Curve ordering (see link below) and shows
the four last created DATA block groups appended together (so a new
chunk allocation pushes the others back in the picture).
https://github.com/knorrie/btrfs-heatmap/blob/master/doc/curves.md

* What the ssd mode does, is simply setting a lower boundary to the size
of free space fragments that are reused.

* In combination with always trying to walk forward inside a block
group, not looking back at freed up space, it fills up with a shotgun
blast pattern when you do writes and deletes all the time.

* When a write comes in that is bigger than any free space part left
behind, a new chunk gets allocated, and the bad pattern continues in
there.

* Because it keeps allocating more and more new chunks, and keeps
circling around in the latest one until a big write is done, it leaves
mostly empty ones behind.

* Without 'discard', the SSD will never learn that all the free space
left behind is actually free.

* Eventually all raw disk space is allocated, and users run into
problems with ENOSPC and balance etc. So, enabling this ssd mode
actually means it starts choking itself to death here. When users see
this effect, they start scheduling balance operations, to compact free
space and bring the amount of allocated but unused space down a bit.

* But, doing that causes just more and more writes to the ssd.

* Also, since balance takes a "usage" argument and not a "how badly
fragmented" argument, it causes lots of unnecessary rewriting of data.

* And, with a decent amount (like a few thousand) of subvolumes, all
having a few snapshots of their own, the data:metadata ratio written
during balance is skyrocketing, not only causing the data to be
rewritten, but also pushing out lots of metadata to the ssd. (example:
on my backup server rewriting 1GiB of data causes writing of >40GiB of
metadata, where probably 99.99% of those writes are some kind of
intermediary writes which are immediately invalidated during the next
btrfs transaction that is done).

All in all, this reminds me of the series "Breaking Bad", where every
step taken to try to fix things only made things worse.
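The first bullet above ("a lower boundary to the size of free space fragments that are reused") has a simple consequence that can be shown numerically. The hole sizes and the minimum-reuse values below are invented for illustration, not btrfs's real constants:

```python
# Toy illustration: if the allocator ignores free-space fragments below a
# minimum size, the small holes that add/delete churn leaves behind are
# stranded forever, even though they add up to real free space.

def usable_free_space(free_extents, min_reuse):
    """Free space the allocator will actually consider, given a minimum
    extent size below which fragments are ignored."""
    return sum(length for length in free_extents if length >= min_reuse)

# free space in one chunk, shattered into holes by writes and deletes
# (sizes in arbitrary units, invented)
holes = [1, 3, 2, 1, 4, 2, 120, 1, 3]

print(usable_free_space(holes, min_reuse=1))    # every hole counts
print(usable_free_space(holes, min_reuse=16))   # only the one big hole
```

Once the big hole is consumed, a new write larger than any remaining usable fragment forces a fresh chunk allocation, which is the "shotgun blast" feedback loop the bullets describe.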
At every bullet point above, this is also happening.

== nossd ==

nossd mode (even still without discard) allows a pattern of overwriting
much more previously used space, causing many more implicit discards to
happen because of the overwrite information the ssd gets.

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

> And last part - hard drive is not aware of filesystem and partitions
> … so you could have 400GB on this 1TB drive left unpartitioned and
> still you would be cooked. Technically speaking using as much as
> possible space on a SSD to a FS and OS that supports trim will give
> you best performance because drive will be notified of as much as
> possible disk space that is actually free ….
>
> So, to summarize:
> - don’t try to outsmart built in mechanics of SSD (people that
> suggest that are just morons that want to have 5 minutes of fame).

This is exactly what the btrfs ssd options are trying to do. Still, I
don't think it's very nice to call Chris
Re: balancing every night broke balancing so now I can't balance anymore?
On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote:
> On 05/13/2017 10:54 PM, Marc MERLIN wrote:
> > Kernel 4.11, btrfs-progs v4.7.3
> >
> > I run scrub and balance every night, been doing this for 1.5 years on this
> > filesystem.
>
> What are the exact commands you run every day?

http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
(at the bottom)
every night:
1) scrub
2) balance -musage=0
3) balance -musage=20
4) balance -dusage=0
5) balance -dusage=20

> > How did I get into such a misbalanced state when I balance every night?
>
> I don't know, since I don't know what you do exactly. :)

Now you do :)

> > My filesystem is not full, I can write just fine, but I sure cannot
> > rebalance now.
>
> Yes, because you have quite some allocated but unused space. If btrfs
> cannot just allocate more chunks, it starts trying a bit harder to reuse
> all the empty spots in the already existing chunks.

Ok. Shouldn't balance fix problems just like this?
I have 60GB-ish free, or in this case that's also >25%, that's a lot.

Speaking of unallocated, I have more now:
Device unallocated: 993.00MiB

This kind of just magically fixed itself during snapshot rotation and
deletion, I think.
Sure enough, balance works again, but this feels pretty fragile.
Looking again:
Device size:        228.67GiB
Device allocated:   227.70GiB
Device unallocated: 993.00MiB
Free (estimated):    58.53GiB (min: 58.53GiB)

You're saying that I need unallocated space for new chunks to be
created, which is required by balance.
Should btrfs not take care of keeping some space for me?
Shouldn't a nightly balance, which I'm already doing, help even more
with this?

> > Besides adding another device to add space, is there a way around this
> > and more generally not getting into that state anymore considering that
> > I already rebalance every night?
>
> Add monitoring and alerting on the amount of unallocated space.
>
> FWIW, this is what I use for that purpose:
>
> https://packages.debian.org/sid/munin-plugins-btrfs
> https://packages.debian.org/sid/monitoring-plugins-btrfs
>
> And, of course the btrfs-heatmap program keeps being a fun tool to
> create visual timelapses of your filesystem, so you can learn how your
> usage pattern is resulting in allocation of space by btrfs, and so that
> you can visually see what the effect of your btrfs balance attempts is:

That's interesting, but ultimately, users shouldn't have to micromanage
their filesystem to that level, even btrfs.

a) What is wrong in my nightly script that I should fix/improve?
b) How do I recover from my current state?

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: balancing every night broke balancing so now I can't balance anymore?
On 05/13/2017 10:54 PM, Marc MERLIN wrote:
> Kernel 4.11, btrfs-progs v4.7.3
>
> I run scrub and balance every night, been doing this for 1.5 years on this
> filesystem.

What are the exact commands you run every day?

> But it has just started failing:
> [...]
> saruman:~# btrfs fi usage /mnt/btrfs_pool1/
> Overall:
> Device size:         228.67GiB
> Device allocated:    228.67GiB
> Device unallocated:    1.00MiB
> Device missing:          0.00B
> Used:                171.25GiB
> Free (estimated):     55.32GiB (min: 55.32GiB)
> Data ratio:               1.00
> Metadata ratio:           1.00
> Global reserve:      512.00MiB (used: 0.00B)
>
> Data,single: Size:221.60GiB, Used:166.28GiB
>    /dev/mapper/pool1 221.60GiB
>
> Metadata,single: Size:7.03GiB, Used:4.96GiB
>    /dev/mapper/pool1 7.03GiB
>
> System,single: Size:32.00MiB, Used:48.00KiB
>    /dev/mapper/pool1 32.00MiB
>
> Unallocated:
>    /dev/mapper/pool1 1.00MiB
>
> How did I get into such a misbalanced state when I balance every night?

I don't know, since I don't know what you do exactly. :)

> My filesystem is not full, I can write just fine, but I sure cannot
> rebalance now.

Yes, because you have quite some allocated but unused space. If btrfs
cannot just allocate more chunks, it starts trying a bit harder to reuse
all the empty spots in the already existing chunks.

> Besides adding another device to add space, is there a way around this
> and more generally not getting into that state anymore considering that
> I already rebalance every night?

Add monitoring and alerting on the amount of unallocated space.

FWIW, this is what I use for that purpose:

https://packages.debian.org/sid/munin-plugins-btrfs
https://packages.debian.org/sid/monitoring-plugins-btrfs

And, of course the btrfs-heatmap program keeps being a fun tool to
create visual timelapses of your filesystem, so you can learn how your
usage pattern is resulting in allocation of space by btrfs, and so that
you can visually see what the effect of your btrfs balance attempts is:

https://github.com/knorrie/btrfs-heatmap/
https://packages.debian.org/sid/btrfs-heatmap
https://apps.fedoraproject.org/packages/btrfs-heatmap
https://aur.archlinux.org/packages/python-btrfs-heatmap/

--
Hans van Kranenburg
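The monitoring suggestion above amounts to watching one number. A minimal sketch of such a check, parsing `btrfs fi usage`-style output in Python, is below; the regex, function name, and alert threshold are invented for illustration, and the linked munin/monitoring plugins do this properly:

```python
# Minimal sketch of alerting on unallocated space: extract the
# "Device unallocated" line from "btrfs fi usage" output and warn when it
# drops below a threshold.  Regex and threshold are illustrative only.

import re

def unallocated_gib(usage_output):
    m = re.search(r"Device unallocated:\s*([\d.]+)(GiB|MiB)", usage_output)
    value, unit = float(m.group(1)), m.group(2)
    return value if unit == "GiB" else value / 1024  # normalize to GiB

sample = """\
Overall:
Device size:        228.67GiB
Device allocated:   227.70GiB
Device unallocated: 993.00MiB
"""
free = unallocated_gib(sample)
if free < 2.0:  # alert threshold, made up
    print(f"WARNING: only {free:.2f}GiB unallocated")
```

In practice the output would come from running `btrfs fi usage` via a cron job, and the warning would go to whatever alerting channel is in use.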
Re: Btrfs/SSD
All stuff that Chris wrote holds true, I just wanted to add flash
specific information (from my experience of writing low level code for
operating flash).

So with flash, to erase you have to erase a large allocation block -
usually it used to be 128kB (plus some CRC data and stuff it's more than
128kB, but we are talking functional data storage space); on newer
setups it can be megabytes … device dependent, really. To erase a block
you need to provide the whole block of 128k x 8 bits with a voltage
higher than is usually used for IO (it can be even 15V), so it requires
an external supply or a built-in internal charge pump to provide that
voltage to the block erasure circuitry. This process generates a lot of
heat and requires a lot of energy, so the consensus back in the day was
that you could erase one block at a time, and this could take up to
200ms (0.2 second). After an erase you need to check whether all bits
are set to 1 (charged state), and then the sector is marked as ready for
storage.

Of course, flash memories are moving forward, and in more demanding
environments there are solutions where blocks are grouped into groups
which have separate eraser circuits, allowing erasure to be performed in
parallel in multiple parts of a flash module - still, you are bound to
one erase per group. Another problem is that the erasure procedure
locally increases temperature; on flat flashes it's not that much of a
problem, but on emerging solutions like 3D flash we might locally
experience undesired temperature increases that would either degrade the
life span of the flash or simply erase neighbouring blocks.
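The program/erase asymmetry described above can be sketched with a toy block model: programming a page can only clear bits (1 -> 0), so a clean rewrite requires erasing the whole block first. The class and sizes below are invented and drastically simplified (real blocks hold many pages, as the 128kB figure above suggests):

```python
# Toy NAND block model: programming can only clear bits (1 -> 0), and only
# a whole-block erase restores pages to all-ones.  Sizes are simplified.

PAGES_PER_BLOCK = 4   # real erase blocks hold far more data (128kB and up)

class FlashBlock:
    def __init__(self):
        self.pages = [0xFF] * PAGES_PER_BLOCK   # erased = all bits set
        self.erase_count = 0

    def program(self, page, value):
        # programming can only flip 1-bits to 0: AND with the old contents
        self.pages[page] &= value

    def erase(self):
        # the whole block returns to all-ones, wearing the cells
        self.pages = [0xFF] * PAGES_PER_BLOCK
        self.erase_count += 1

blk = FlashBlock()
blk.program(0, 0b10101010)
blk.program(0, 0b11001100)   # NOT a clean overwrite:
print(bin(blk.pages[0]))     # -> 0b10001000, bits from both writes cleared
blk.erase()                  # only a (slow, whole-block) erase recovers it
print(bin(blk.pages[0]), blk.erase_count)
```

This is why freed-but-unerased blocks must be erased before reuse, and why a drive with no pre-erased pool left (the no-TRIM case below) slows to a crawl.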
In terms of over-provisioning of SSDs it's a give-and-take relationship … on a good drive there is enough over-provisioning to allow normal operation on systems without TRIM. Now, if you used a 1TB drive daily without TRIM and had only 30GB stored on it, you would have fantastic performance; but if you wanted to store 500GB, then at roughly 200GB you would hit a brick wall and your writes would slow down to megabytes/s … this is a symptom of the drive running out of over-provisioning space. If you ran an OS that issues TRIM, this problem would not exist, since the drive would know that the whole 970GB of space is free and it would have been pre-emptively erased days before.

And the last part: the drive is not aware of filesystems and partitions … so you could leave 400GB of this 1TB drive unpartitioned and still you would be cooked. Technically speaking, giving as much of the SSD as possible to a filesystem, on an OS that supports TRIM, will give you the best performance, because the drive will be notified about as much as possible of the disk space that is actually free.

So, to summarize:
- don't try to outsmart the built-in mechanics of the SSD (people who suggest that are just morons who want their 5 minutes of fame).
- don't buy a crap SSD and expect it to behave like a good one as long as you stay below a certain % of it … it's stupid; buy a more reasonable but smaller SSD and store slow data on spinning rust.
- read more books and Wikipedia; not jumping down on you, but the internet is filled with people who provide false information, sometimes unknowingly, and swear by it (Dunning–Kruger effect :D), and some of them are very good at making their theories sound sexy … you simply have to get used to it…
- if something is too good to be true, then it isn't
- the promise of future performance gains is the domain of the "sleazy salesman"

> On 14 May 2017, at 17:21, Chris Murphy wrote:
>
> On Sat, May 13, 2017 at 3:39 AM, Duncan <1i5t5.dun...@cox.net> wrote:
>
>> When I was doing my ssd research the first time around, the going
>> recommendation was to keep 20-33% of the total space on the ssd entirely
>> unallocated, allowing it to use that space as an FTL erase-block
>> management pool.
>
> Any brand name SSD has its own reserve above its specified size to
> ensure that there's decent performance, even when there is no trim
> hinting supplied by the OS; and thereby the SSD can only depend on LBA
> "overwrites" to know what blocks are to be freed up.
>
>> Anyway, that 20-33% left entirely unallocated/unpartitioned
>> recommendation still holds, right?
>
> Not that I'm aware of. I've never done this by literally walling off
> space that I won't use. A fairly large percentage of my partitions
> have free space so it does effectively happen as far as the SSD is
> concerned. And I use the fstrim timer. Most of the file systems support
> trim.
>
> Anyway I've stuffed a Samsung 840 EVO to 98% full with an OS/file
> system that would not issue trim commands on this drive, and it was
> doing full performance writes through that point. Then deleted maybe
> 5% of the files, and then refilled the drive to 98% again, and it was
> the same performance. So it must have had enough in reserve to permit
> full performance "overwrites" which were in effect directed to reserve
> blocks as the freed up blocks were being erased. Thus the erasure
> happening on the fly
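The over-provisioning dance described above (overwrites redirected to pre-erased sectors, old sectors queued for a slow background erase, TRIM enlarging the pool of erasable sectors) can be sketched with a toy flash translation layer. This is my own illustrative model, not how any particular drive's firmware works:

```python
from collections import deque

class SimpleFTL:
    """Toy FTL: logical sectors map to physical sectors. Overwrites are
    redirected to pre-erased sectors from the free pool, and the old
    physical sector is queued for (slow) erase. TRIMmed sectors also
    enter the erase queue, which is what keeps the pool large."""

    def __init__(self, logical, overprovision):
        self.map = {}                                       # logical -> physical
        self.free = deque(range(logical + overprovision))   # pre-erased pool
        self.dirty = deque()                                # awaiting erase

    def write(self, lba):
        if not self.free:
            # Pool exhausted: must erase on demand. This is the slow path a
            # full, never-trimmed drive ends up on.
            self.free.append(self.dirty.popleft())
        if lba in self.map:
            self.dirty.append(self.map[lba])  # old copy queued for erase
        self.map[lba] = self.free.popleft()   # "magic" sector swap

    def trim(self, lba):
        if lba in self.map:
            self.dirty.append(self.map.pop(lba))
```

Without trim, once every logical sector has been written, only the over-provisioned sectors cycle through the free pool; with trim, freed space rejoins the pool too.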
Re: 4.11: da_remove called for id=16 which is not allocated.
My apologies, this was for the bcache list, sorry about this. On Sun, May 14, 2017 at 08:25:22AM -0700, Marc MERLIN wrote: > > gargamel:/sys/block/bcache16/bcache# echo 1 > stop > > bcache: bcache_device_free() bcache16 stopped > [ cut here ] > WARNING: CPU: 7 PID: 11051 at lib/idr.c:383 ida_remove+0xe8/0x10b > ida_remove called for id=16 which is not allocated. > Modules linked in: uas usb_storage veth ip6table_filter ip6_tables > ebtable_nat ebtables ppdev lp xt_addrtype br_netfilter bridge stp llc tun > autofs4 softdog binfmt_misc ftdi_sio nfsd auth_rpcgss nfs_acl nfs lockd grace > fscache sunrpc ipt_REJECT nf_reject_ipv4 xt_conntrack xt_mark xt_nat > xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG iptable_mangle iptable_filter lm85 > hwmon_vid pl2303 dm_snapshot dm_bufio iptable_nat ip_tables nf_conntrack_ipv4 > nf_defrag_ipv4 nf_nat_ipv4 nf_conntrack_ftp ipt_MASQUERADE > nf_nat_masquerade_ipv4 nf_nat nf_conntrack x_tables sg st snd_pcm_oss > snd_mixer_oss bcache kvm_intel kvm irqbypass snd_hda_codec_realtek > snd_hda_codec_generic snd_cmipci snd_hda_intel snd_mpu401_uart snd_opl3_lib > snd_hda_codec snd_rawmidi snd_hda_core rc_ati_x10 snd_hwdep snd_seq_device > ati_remote > snd_pcm eeepc_wmi asix snd_timer usbserial asus_wmi usbnet rc_core snd > sparse_keymap libphy rfkill hwmon lpc_ich soundcore mei_me parport_pc wmi > tpm_infineon parport tpm_tis i2c_i801 battery input_leds tpm_tis_core i915 > tpm pcspkr evdev e1000e ptp pps_core fuse raid456 multipath mmc_block > mmc_core lrw ablk_helper dm_crypt dm_mod async_raid6_recov async_pq async_xor > async_memcpy async_tx crc32c_intel blowfish_x86_64 blowfish_common pcbc > aesni_intel aes_x86_64 crypto_simd glue_helper cryptd xhci_pci ehci_pci > xhci_hcd ehci_hcd r8169 sata_sil24 mii usbcore thermal mvsas fan libsas > scsi_transport_sas [last unloaded: ftdi_sio] > CPU: 7 PID: 11051 Comm: kworker/7:13 Tainted: G U W > 4.11.0-amd64-preempt-sysrq-20170406 #2 > Hardware name: System manufacturer System Product Name/P8H67-M 
PRO, BIOS 3904 > 04/27/2013 > Workqueue: events cached_dev_free [bcache] > Call Trace: > dump_stack+0x61/0x7d > __warn+0xc2/0xdd > warn_slowpath_fmt+0x5a/0x76 > ida_remove+0xe8/0x10b > ida_simple_remove+0x2e/0x43 > bcache_device_free+0x8c/0xc4 [bcache] > cached_dev_free+0x6b/0xe1 [bcache] > process_one_work+0x193/0x2b0 > worker_thread+0x1e9/0x2c1 > ? rescuer_thread+0x2b1/0x2b1 > kthread+0xfb/0x100 > ? init_completion+0x24/0x24 > ret_from_fork+0x2c/0x40 > ---[ end trace 12586d8b165ff8f2 ]--- > > Prior to that: > cd /sys/fs/bcache/5bc072a8-ab17-446d-9744-e247949913c1 > echo 1 > stop > > I needed to complete stop and remove all traces of a bcache before I could > mdadm --stop the underlying array. > > Marc > -- > "A mouse is a device used to point at the xterm you want to type in" - A.S.R. > Microsoft is to operating systems > what McDonalds is to gourmet > cooking > Home page: http://marc.merlins.org/ | PGP > 1024R/763BE901 -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
btrfs fi usage crash when multiple device volume contains seed device
Hi,

Chris Murphy suggested we move the discussion in this bugzilla thread: https://bugzilla.kernel.org/show_bug.cgi?id=115851 to here, the mailing list. Going to quote him to give context:

"This might be better discussed on list to ensure there's congruence in dev and user expectations; and in particular I think this needs a design that accounts for the realistic long term goal, so that a short term scope doesn't interfere or make the long term more difficult.

The matrix of possibilities, most of which are not yet implemented in btrfs-progs:
1. 1 dev seed -> 1 dev sprout
2. 2+ dev seed -> 1 dev sprout
3. 1 dev seed -> 2+ dev sprout
4. 2+ dev seed -> 2+ dev sprout

Near as I can tell, 2, 3 and 4 are not implemented. It's an immediate problem whether and how the profile (single, raid0, raid1) is to be inherited from seed to sprout. If I have a 4 disk raid1 volume, to create a sprout must I add a minimum of two devices? Or is it valid to have raid1 profile seed chunks, where writes go to single profile sprout chunks? Anyway, the point is, it needs a design to answer these things.

Next, and even more importantly as it applies to the simple case of single to single, the way we do this right now is beyond confusing, because the remount ro to rw changes the volume UUID being mounted. The ro mount is the seed; the rw mount is the sprout. This is not really a remount, it's a umount of the seed and a mount of the sprout. But what if there's more than one sprout? This is asking for trouble, so I think the remount rw should be disallowed, making it clear the ro seed cannot be mounted rw. Instead it's necessary to umount it and explicitly mount the rw sprout, and say which sprout. Also part of the ambiguity is that 'btrfs dev add' is more like mkfs.btrfs in the context of seed-sprout. The new device isn't really added to the seed, because the seed is read only.
What's really happening is a mkfs.btrfs with a "backing device", which is the seed; in some sense it has more in common with the mkfs.btrfs --rootdir option. So I even wonder if 'btrfs dev add' is appropriate for creating sprouts, and whether instead it should be in mkfs.btrfs with a --seed option to specify the backing seed, whereby what we are making is a sprout, which has a new UUID, and possibly different chunk profiles than the seed."

Thanks,
Luis
Re: Btrfs/SSD
On Sat, May 13, 2017 at 3:39 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> When I was doing my ssd research the first time around, the going
> recommendation was to keep 20-33% of the total space on the ssd entirely
> unallocated, allowing it to use that space as an FTL erase-block
> management pool.

Any brand name SSD has its own reserve above its specified size to ensure that there's decent performance, even when there is no trim hinting supplied by the OS; and thereby the SSD can only depend on LBA "overwrites" to know what blocks are to be freed up.

> Anyway, that 20-33% left entirely unallocated/unpartitioned
> recommendation still holds, right?

Not that I'm aware of. I've never done this by literally walling off space that I won't use. A fairly large percentage of my partitions have free space, so it does effectively happen as far as the SSD is concerned. And I use the fstrim timer. Most of the file systems support trim.

Anyway, I've stuffed a Samsung 840 EVO to 98% full with an OS/file system that would not issue trim commands on this drive, and it was doing full performance writes through that point. Then I deleted maybe 5% of the files, refilled the drive to 98% again, and it was the same performance. So it must have had enough in reserve to permit full performance "overwrites", which were in effect directed to reserve blocks as the freed up blocks were being erased. Thus the erasure happening on the fly was not inhibiting performance on this SSD.

Now had I gone to 99.9% full, then deleted say 1GiB, and then started doing a bunch of heavy small file writes rather than sequential ones? I don't know what would have happened; it might have choked, because that is a lot more work for the SSD, dealing with heavy IOPS and erasure. It will invariably be something that's very model and even firmware version specific.
> Am I correct in asserting that if one
> is following that, the FTL already has plenty of erase-blocks available
> for management and the discussion about filesystem level trim and free
> space management becomes much less urgent, tho of course it's still worth
> considering if it's convenient to do so?

Most file systems don't direct writes to new areas; they're fairly prone to overwriting. So the firmware is going to get notified fairly quickly, with either trim or an overwrite, which LBAs are stale. It's probably more important with Btrfs, which has more variable behavior: it can continue to direct new writes to recently allocated chunks before it'll do overwrites in older chunks that have free space.

> And am I also correct in believing that while it's not really worth
> spending more to over-provision to the near 50% as I ended up doing, if
> things work out that way as they did with me because the difference in
> price between 30% overprovisioning and 50% overprovisioning ends up being
> trivial, there's really not much need to worry about active filesystem
> trim at all, because the FTL has effectively half the device left to play
> erase-block musical chairs with as it decides it needs to?

I think it's never worth overprovisioning by default. Use all of that space until you have a problem. If you have a 256G drive, you paid to get the spec performance for 100% of those 256G. You did not pay that company to second guess things and cut it slack by overprovisioning from the outset. I don't know how long it takes for erasure to happen, though, so I have no idea how much overprovisioning is really needed at the write rate of the drive, so that it can erase at the same rate as it writes, in order to avoid a slowdown.
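As a back-of-envelope on that last question, using only the illustrative numbers from the flash discussion earlier in these threads (128 kB erase blocks, up to 200 ms per erase on old single-circuit designs), not any real drive's specs: a single erase circuit reclaims well under 1 MB/s, which is why drives need many parallel erase groups plus a large pre-erased pool to keep up with modern write rates.

```python
# Illustrative arithmetic only; both numbers come from the earlier flash
# discussion in this thread, not from a datasheet.
block_size = 128 * 1024              # bytes per erase block
erase_time = 0.2                     # seconds per block erase (worst case)

# Bytes per second that one erase circuit can reclaim in the background:
erase_bw = block_size / erase_time   # 655360 bytes/s, i.e. ~0.6 MB/s
```

So to sustain, say, a 500 MB/s write stream purely from background erasure, a drive with these (old) numbers would need on the order of 800 erase groups working in parallel, or a deep pool of already-erased blocks to absorb bursts.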
I guess an even worse test would be one that intentionally fragments across erase block boundaries, forcing the firmware to be unable to do erasures without first migrating partially full blocks in order to make them empty, so they can then be erased and used for new writes. That sort of shuffling is what will separate the good drives from the average ones, and why the drives have multicore CPUs on them, as well as most now having on-the-fly, always-on encryption.

Even completely empty, some of these drives have a short-term higher-speed write mode which falls back to a lower speed as the fast flash gets full. After some pause that fast write capability is restored for future writes. I have no idea if this is a separate kind of flash on the drive, or if it's just a faster way of encoding data onto the flash. Samsung has a drive that can "simulate" SLC NAND on 3D VNAND. That sounds like an encoding method; it's fast but inefficient and probably needs reencoding. But that's the thing: the firmware is really complicated now.

I kinda wonder if f2fs could be chopped down to become a modular allocator for the existing file systems; activate that allocation method with the "ssd" mount option rather than whatever overly smart thing it does today that's based on assumptions that are now likely outdated.
-- 
Chris Murphy
4.11: da_remove called for id=16 which is not allocated.
gargamel:/sys/block/bcache16/bcache# echo 1 > stop

bcache: bcache_device_free() bcache16 stopped
[ cut here ]
WARNING: CPU: 7 PID: 11051 at lib/idr.c:383 ida_remove+0xe8/0x10b
ida_remove called for id=16 which is not allocated.
Modules linked in: uas usb_storage veth ip6table_filter ip6_tables ebtable_nat ebtables ppdev lp xt_addrtype br_netfilter bridge stp llc tun autofs4 softdog binfmt_misc ftdi_sio nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ipt_REJECT nf_reject_ipv4 xt_conntrack xt_mark xt_nat xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG iptable_mangle iptable_filter lm85 hwmon_vid pl2303 dm_snapshot dm_bufio iptable_nat ip_tables nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_conntrack_ftp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_nat nf_conntrack x_tables sg st snd_pcm_oss snd_mixer_oss bcache kvm_intel kvm irqbypass snd_hda_codec_realtek snd_hda_codec_generic snd_cmipci snd_hda_intel snd_mpu401_uart snd_opl3_lib snd_hda_codec snd_rawmidi snd_hda_core rc_ati_x10 snd_hwdep snd_seq_device ati_remote snd_pcm eeepc_wmi asix snd_timer usbserial asus_wmi usbnet rc_core snd sparse_keymap libphy rfkill hwmon lpc_ich soundcore mei_me parport_pc wmi tpm_infineon parport tpm_tis i2c_i801 battery input_leds tpm_tis_core i915 tpm pcspkr evdev e1000e ptp pps_core fuse raid456 multipath mmc_block mmc_core lrw ablk_helper dm_crypt dm_mod async_raid6_recov async_pq async_xor async_memcpy async_tx crc32c_intel blowfish_x86_64 blowfish_common pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd xhci_pci ehci_pci xhci_hcd ehci_hcd r8169 sata_sil24 mii usbcore thermal mvsas fan libsas scsi_transport_sas [last unloaded: ftdi_sio]
CPU: 7 PID: 11051 Comm: kworker/7:13 Tainted: G U W 4.11.0-amd64-preempt-sysrq-20170406 #2
Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
Workqueue: events cached_dev_free [bcache]
Call Trace:
 dump_stack+0x61/0x7d
 __warn+0xc2/0xdd
 warn_slowpath_fmt+0x5a/0x76
 ida_remove+0xe8/0x10b
 ida_simple_remove+0x2e/0x43
 bcache_device_free+0x8c/0xc4 [bcache]
 cached_dev_free+0x6b/0xe1 [bcache]
 process_one_work+0x193/0x2b0
 worker_thread+0x1e9/0x2c1
 ? rescuer_thread+0x2b1/0x2b1
 kthread+0xfb/0x100
 ? init_completion+0x24/0x24
 ret_from_fork+0x2c/0x40
---[ end trace 12586d8b165ff8f2 ]---

Prior to that:
cd /sys/fs/bcache/5bc072a8-ab17-446d-9744-e247949913c1
echo 1 > stop

I needed to completely stop and remove all traces of a bcache before I could mdadm --stop the underlying array.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl
On Sat, May 13, 2017 at 6:41 PM, Andreas Dilger wrote: > On May 10, 2017, at 11:10 PM, Eric Biggers wrote: >> >> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote: >>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang >>> out] >> Yes, PIDs have traditionally been global, but today we have PID namespaces, >> and >> many other isolation features such as mount namespaces. Nothing is perfect, >> of >> course, and containers are a lot worse than VMs, but it seems weird to use >> that >> as an excuse to knowingly make things worse... >> Indeed. Not only PID namespaces -- we have hidepid and we can simply unmount /proc. "There are other info leaks" is a poor excuse. >>> > Fortunately, the days of timesharing seem to be well behind us. For > those people who think that containers are as secure as VM's (hah, > hah, hah), it might be that the best way to handle this is to have a mount > option that requires root access to this functionality. For those > people who really care about this, they can disable access. >>> >>> Or use separate filesystems for each container so that exploitable bugs >>> that shut down the filesystem can't be used to kill the other >>> containers. You could use a torrent of metadata-heavy operations >>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS >>> the other containers. >>> What would be the reason for not putting this behind capable(CAP_SYS_ADMIN)? What possible legitimate function could this functionality serve to users who don't own your filesystem? >>> >>> As I've said before, it's to enable dedupe tools to decide, given a set >>> of files with shareable blocks, roughly how many other times each of >>> those shareable blocks are shared so that they can make better decisions >>> about which file keeps its shareable blocks, and which file gets >>> remapped. Dedupe is not a privileged operation, nor are any of the >>> tools. 
>>> So why does the ioctl need to return all extent mappings for the entire
>>> filesystem, instead of just the share count of each block in the file
>>> that the ioctl is called on?
>>
>> One possibility is that the ioctl() can return the mapping for all inodes
>> owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
>> or CAP_FOWNER is set), and return a "filesystem aggregate inode" (or more
>> than one if there is a reason to do so) with all the other allocated blocks
>> for inodes the user doesn't have permission to access?

Sounds like it could be reasonable. But you don't want "owned by the calling PID" precisely -- you also need to check kgid_has_mapping(current_user_ns(), inode->i_gid), I think.
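A rough sketch of the redaction policy being discussed, in plain Python rather than kernel code (the names, record shapes, and capability strings are invented for illustration): records for inodes the caller may access are reported as-is, and everything else is folded into one aggregate record.

```python
# Illustrative only: a userspace model of "report owned mappings, aggregate
# the rest". Real kernel code would check kuid/kgid mappings and capabilities.
ALLOW_ALL_CAPS = {"CAP_SYS_ADMIN", "CAP_DAC_OVERRIDE", "CAP_FOWNER"}

def visible(record_uid, caller_uid, caller_caps):
    # The caller sees a record if they own the inode or hold a bypass cap.
    return record_uid == caller_uid or bool(ALLOW_ALL_CAPS & caller_caps)

def redact(records, caller_uid, caller_caps):
    """records: list of (owner_uid, extent_length) tuples."""
    out, aggregate = [], 0
    for uid, length in records:
        if visible(uid, caller_uid, caller_caps):
            out.append((uid, length))
        else:
            aggregate += length  # blocks owned by inaccessible inodes
    if aggregate:
        out.append(("aggregate", aggregate))  # one lumped pseudo-inode record
    return out
```

An unprivileged dedupe tool would still learn how much of the filesystem is allocated overall (the aggregate), without seeing per-extent detail for other users' files.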
Re: Btrfs/SSD
Imran Geriskovan posted on Fri, 12 May 2017 15:02:20 +0200 as excerpted:

> On 5/12/17, Duncan <1i5t5.dun...@cox.net> wrote:
>> FWIW, I'm in the market for SSDs ATM, and remembered this from a couple
>> weeks ago so went back to find it. Thanks. =:^)
>>
>> (I'm currently still on quarter-TB generation ssds, plus spinning rust
>> for the larger media partition and backups, and want to be rid of the
>> spinning rust, so am looking at half-TB to TB, which seems to be the
>> pricing sweet spot these days anyway.)
>
> Since you are taking ssds mainstream based on your experience,
> I guess your perception of their data retention/reliability is better
> than that of spinning rust. Right? Can you elaborate?
>
> Or another criterion might be the physical constraints of spinning rust
> on notebooks, which dictate that you should handle the device with care
> when running.
>
> What was your primary motivation other than performance?

Well, the /immediate/ motivation is that the spinning rust is starting to hint that it's time to start thinking about rotating it out of service...

It's my main workstation so wall powered, but because it holds the media and secondary backups partitions, I don't have anything from it mounted most of the time, and because it /is/ spinning rust, I allow it to spin down. It spins right back up if I mount it, and reads seem to be fine, but if I let it sit a bit after mount, possibly due to it spinning down again, sometimes I get write errors, SATA resets, etc. Sometimes the write will then eventually appear to go thru, sometimes not, but once this happens, unmounting often times out, and upon a remount (which may or may not work until a clean reboot), the last writes may or may not still be there. And the smart info, while not bad, does indicate it's starting to age, tho not extremely so.

Now even a year ago I'd have likely played with it, adjusting timeouts, spindowns, etc, attempting to get it working normally again.
But they say that ssd performance spoils you and you don't want to go back, and while it's a media drive and performance isn't normally an issue, those secondary backups to it as spinning rust sure take a lot longer than the primary backups to other partitions on the same pair of ssds that the working copies (of everything but media) are on. Which means I don't like to do them... which means sometimes I put them off longer than I should. Basically, it's another application of my "don't make it so big it takes so long to maintain you don't do it as you should" rule, only here it's not the size, but rather that I've been spoiled by the performance of the ssds.

So couple the aging spinning rust with the fact that I've really wanted to put media and the backups on ssd all along, only it couldn't be cost-justified a few years ago when I bought the original ssds, and I now have my excuse to get the now cheaper ssds I really wanted all along. =:^)

As for reliability... For archival usage I still think spinning rust is more reliable, and certainly more cost effective. However, for me at least, with some real-world ssd experience under my belt now, including an early slow failure (more and more blocks going bad; I deliberately kept running it in btrfs raid1 mode with scrubs handling the bad blocks for quite some time, just to get the experience both with ssds and with btrfs) and replacement of one of the ssds with one I had originally bought for a different machine (my netbook, which went missing shortly thereafter), I now find ssds reliable enough for normal usage. Certainly so if the data is valuable enough to have backups of it anyway; and if it's not valuable enough to be worth doing backups, then losing it is obviously not a big deal, because it's self-evidently worth less than the time, trouble and resources of doing that backup. Particularly so if the speed of ssds helpfully encourages you to keep the backups more current than you would otherwise.
=:^) But spinning rust remains appropriate for long-term archival usage, like that third-level last-resort backup I like to make, then keep on the shelf, or store with a friend, or in a safe deposit box, or whatever, and basically never use, but like to have just in case. IOW, that almost certainly write once, read-never, seldom update, last resort backup. If three years down the line there's a fire/flood/whatever, and all I can find in the ashes/mud or retrieve from that friend is that three year old backup, I'll be glad to still have it. Of course those who have multi-TB scale data needs may still find spinning rust useful as well, because while 4-TB ssds are available now, they're /horribly/ expensive. But with 3D-NAND, even that use-case looks like it may go ssd in the next five years or so, leaving multi-year to decade-plus archiving, and perhaps say 50-TB-plus, but that's going to take long enough to actually write or otherwise do anything with it's effectively
Re: balancing every night broke balancing so now I can't balance anymore?
Marc MERLIN posted on Sat, 13 May 2017 13:54:31 -0700 as excerpted:

> Kernel 4.11, btrfs-progs v4.7.3
>
> I run scrub and balance every night, been doing this for 1.5 years on
> this filesystem. But it has just started failing:
>
> saruman:~# btrfs balance start -musage=0 /mnt/btrfs_pool1
> Done, had to relocate 0 out of 235 chunks
> saruman:~# btrfs balance start -dusage=0 /mnt/btrfs_pool1
> Done, had to relocate 0 out of 235 chunks

Those aren't failing (as you likely know, but to explain for others following along); there's nothing to do, as there are no entirely empty chunks. But...

> saruman:~# btrfs balance start -musage=1 /mnt/btrfs_pool1
> ERROR: error during balancing '/mnt/btrfs_pool1': No space left on device
> saruman:~# btrfs balance start -dusage=10 /mnt/btrfs_pool1
> Done, had to relocate 0 out of 235 chunks
> saruman:~# btrfs balance start -dusage=20 /mnt/btrfs_pool1
> ERROR: error during balancing '/mnt/btrfs_pool1': No space left on device

... Errors there. ENOSPC.

[from dmesg]
> BTRFS info (device dm-2): 1 enospc errors during balance
> BTRFS info (device dm-2): relocating block group 598566305792 flags data
> BTRFS info (device dm-2): 1 enospc errors during balance
> BTRFS info (device dm-2): 1 enospc errors during balance
> BTRFS info (device dm-2): relocating block group 598566305792 flags data
> BTRFS info (device dm-2): 1 enospc errors during balance

> saruman:~# btrfs fi show /mnt/btrfs_pool1/
> Label: 'btrfs_pool1' uuid: bc115001-a8d1-445c-9ec9-6050620efd0a
> Total devices 1 FS bytes used 169.73GiB
> devid 1 size 228.67GiB used 228.67GiB path /dev/mapper/pool1

> saruman:~# btrfs fi usage /mnt/btrfs_pool1/
> Overall:
>     Device size:         228.67GiB
>     Device allocated:    228.67GiB
>     Device unallocated:    1.00MiB
>     Device missing:          0.00B
>     Used:                171.25GiB
>     Free (estimated):     55.32GiB (min: 55.32GiB)
>     Data ratio:               1.00
>     Metadata ratio:           1.00
>     Global reserve:      512.00MiB (used: 0.00B)
>
> Data,single: Size:221.60GiB, Used:166.28GiB
>     /dev/mapper/pool1  221.60GiB
> Metadata,single: Size:7.03GiB, Used:4.96GiB
>     /dev/mapper/pool1    7.03GiB
>
> System,single: Size:32.00MiB, Used:48.00KiB
>     /dev/mapper/pool1   32.00MiB
>
> Unallocated:
>     /dev/mapper/pool1    1.00MiB

So we see it's fully chunk-allocated, no unallocated space, but gigs and gigs of empty space within the chunk allocations, data chunks in particular.

> How did I get into such a misbalanced state when I balance every night?
>
> My filesystem is not full, I can write just fine, but I sure cannot
> rebalance now.

Well, you can write just fine... for now. After accounting for the global reserve coming out of metadata's reported free space, there's about 1.5 GiB of space left in the metadata chunks, and about 55 GiB in the data chunks, so you should actually be able to write for some time before running out of either. You just can't rebalance to chunk-defrag and reclaim chunks to unallocated, so they can be used for the other chunk type if necessary. You're correct to be worried about this, but it's not immediately urgent.

> Besides adding another device to add space, is there a way around this
> and more generally not getting into that state anymore considering that
> I already rebalance every night?

What you /haven't/ yet said is what your presumably scheduled nightly rebalance command, with its -dusage and -musage values, actually is. How did you determine the usage amount to feed to the command? Was it dynamic, determined by some script and changing based on the amount of unutilized space trapped within the data chunks, or static, the same usage value given every night?

The other thing we don't have, and you might not have any idea either if it was simply scheduled and you hadn't been specifically checking, is a trendline: whether the post-balance unallocated space had been shrinking over time while the post-balance unutilized space within the data chunks was growing, or whether it happened all of a sudden.
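For reference, the "about 1.5 GiB" and "about 55 GiB" figures fall straight out of the usage report above; a quick check (subtracting the global reserve from metadata's free space, all figures in GiB):

```python
# Numbers copied from the 'btrfs fi usage' output in this thread.
data_size, data_used = 221.60, 166.28    # Data,single
meta_size, meta_used = 7.03, 4.96        # Metadata,single
global_reserve = 0.512                   # 512 MiB, carved out of metadata's free space

# Free space trapped inside already-allocated chunks:
slack_data = data_size - data_used                    # ~55.32 GiB in data chunks
slack_meta = meta_size - meta_used - global_reserve   # ~1.56 GiB usable in metadata
```

Those are the amounts still writable, but none of it can be returned to the unallocated pool without a successful balance, which is exactly what the ENOSPC is blocking.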
If you've been following current discussion threads here, you may already know one possible specific trigger, as discussed, and more generically there could be other specific triggers in the same general category. In that thread the specific culprit appeared to be btrfs behavior with the ssd mount option (autodetected based on the device's rotational value as reported by sysfs), in particular as it interacted with systemd's journal files, but it would apply to anything else with a similar write pattern. The overall btrfs usage pattern was problematic much as yours apparently was, except that he caught it before full allocation while you didn't: btrfs was continuing to allocate new chunks, even tho there was plenty of space left within existing chunks, none of which were entirely empty (so