Re: Btrfs/SSD

2017-05-14 Thread Tomasz Kusmierz
Theoretically all sectors in the over-provisioning area are erased - in practice they are 
either erased, waiting to be erased, or broken.

What you have to understand is that sectors on an SSD are not where you really 
think they are - they can swap places with sectors in the over-provisioning area, 
they can swap places with each other, etc. … the stuff you see as a disk from 0 to MAX 
does not have to be arranged in sequence on the SSD (and mostly never is).

If you never trim, then when your device is 100% full you need to start overwriting 
data to keep writing - this is where over-provisioning shines: the SSD fakes that 
you write to a sector while you really write to a sector in the over-provisioning 
area, and the two magically swap places without you knowing -> the sector that was 
occupied ends up in the over-provisioning pool and the SSD hardware performs a slow 
erase on it to make it free for the future. This mechanism is simple and 
transparent for users -> you don't know that it happens and the SSD does all the heavy 
lifting.

The over-provisioned area has more uses than that. For example, if you have a 
1TB drive where you store 500GB of data that you never modify -> the SSD will copy 
part of that data to the over-provisioned area -> free the sectors that went unwritten 
for a while -> then free the sectors that were continuously hammered by writes and put 
static data there. This mechanism is wear levelling - it means that the SSD 
internals make sure that sectors on the SSD see roughly equal use over time. For 
those who think it's pointless, imagine a situation where you've got a 1TB 
drive with 1GB free and you keep writing and modifying data in that 1GB of free space … 
those sectors will quickly die due to the short flash life expectancy (some as 
short as 1k erases!).
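
To make the remapping and wear-levelling ideas above concrete, here is a deliberately 
tiny toy model in Python - purely illustrative, with made-up numbers, and not how any 
real controller is implemented:

# Toy FTL (flash translation layer): 6 logical blocks visible to the host,
# 2 extra physical blocks acting as the over-provisioning pool.
FLASH_BLOCKS = 8
VISIBLE = 6

mapping = {lba: pba for lba, pba in enumerate(range(VISIBLE))}  # logical -> physical
free_pool = list(range(VISIBLE, FLASH_BLOCKS))                  # pre-erased spares
erase_count = [0] * FLASH_BLOCKS

def overwrite(lba):
    """Host overwrites a logical block: the write lands on a pre-erased spare,
    and the old physical block is queued for a (slow) background erase."""
    old = mapping[lba]
    mapping[lba] = free_pool.pop(0)   # fast: the spare is already erased
    erase_count[old] += 1             # slow erase happens in the background...
    free_pool.append(old)             # ...after which the block rejoins the pool

for _ in range(20):                   # hammer one single logical block
    overwrite(0)

print("logical->physical map:", mapping)
print("erase counts:", erase_count)   # wear still spreads over several physical blocks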

So again: buy good quality drives (not hardcore enterprise drives, just 
good consumer ones), leave the rest to the drive, use an OS that gives you trim, and 
you should be golden.

> On 15 May 2017, at 00:01, Imran Geriskovan  wrote:
> 
> On 5/14/17, Tomasz Kusmierz  wrote:
>> In terms of over provisioning of SSD it’s a give and take relationship … on
>> good drive there is enough over provisioning to allow a normal operation on
>> systems without TRIM … now if you would use a 1TB drive daily without TRIM
>> and have only 30GB stored on it you will have fantastic performance but if
>> you will want to store 500GB at roughly 200GB you will hit a brick wall and
>> you writes will slow dow to megabytes / s … this is symptom of drive running
>> out of over provisioning space …
> 
> What exactly happens on a non-trimmed drive?
> Does it begin to forge certain erase-blocks? If so
> which are those? What happens when you never
> trim and continue dumping data on it?



Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-14 Thread Marc MERLIN
On Sun, May 14, 2017 at 09:21:11PM +, Hugo Mills wrote:
> > 2) balance -musage=0
> > 3) balance -musage=20
> 
>In most cases, this is going to make ENOSPC problems worse, not
> better. The reason for doign this kind of balance is to recover unused
> space and allow it to be reallocated. The typical behaviour is that
> data gets overallocated, and it's metadata which runs out. So, the
> last thing you want to be doing is reducing the metadata allocation,
> because that's the scarce resource.
> 
>Also, I'd usually recommend using limit=n, where n is approximately
> the amount of data overallcation (allocated space less used
> space). It's much more controllable than usage.


Thanks for that.
So, would you just remove the balance -musage=20 altogether?

As for limit= I'm not sure if it would be helpful since I run this
nightly. Anything that doesn't get done tonight due to limit, would be
done tomorrow?
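
For illustration, a limit-based pass could look roughly like this (assuming ~1GiB data 
chunks; the parsing matches the "btrfs fi usage" output quoted elsewhere in this thread 
and may need adjusting for other btrfs-progs versions; this is only a sketch, not 
anyone's production script):

#!/usr/bin/env python3
# Sketch only: balance roughly as many data chunks as are over-allocated.
import re, subprocess, sys

mnt = sys.argv[1] if len(sys.argv) > 1 else "/mnt/btrfs_pool1"   # example path
out = subprocess.check_output(["btrfs", "fi", "usage", "-b", mnt], text=True)

m = re.search(r"^\s*Data,.*Size:(\d+), Used:(\d+)", out, re.MULTILINE)
size, used = int(m.group(1)), int(m.group(2))
chunks = max(1, (size - used) // (1024 ** 3))   # data chunks are normally ~1GiB

print(f"data over-allocation: {size - used} bytes; balancing up to {chunks} chunks")
subprocess.run(["btrfs", "balance", "start", f"-dlimit={chunks}", mnt], check=True)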

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-14 Thread Lionel Bouton
Le 14/05/2017 à 23:30, Kai Krakow a écrit :
> Am Sun, 14 May 2017 22:57:26 +0200
> schrieb Lionel Bouton :
>
>> I've coded one Ruby script which tries to balance between the cost of
>> reallocating group and the need for it.[...]
>> Given its current size, I should probably push it on github...
> Yes, please... ;-)

Most of our BTRFS filesystems are used by Ceph OSD, so here it is :

https://github.com/jtek/ceph-utils/blob/master/btrfs-auto-rebalance.rb

Best regards,

Lionel


Re: Btrfs/SSD

2017-05-14 Thread Imran Geriskovan
On 5/14/17, Tomasz Kusmierz  wrote:
> In terms of over provisioning of SSD it’s a give and take relationship … on
> good drive there is enough over provisioning to allow a normal operation on
> systems without TRIM … now if you would use a 1TB drive daily without TRIM
> and have only 30GB stored on it you will have fantastic performance but if
> you will want to store 500GB at roughly 200GB you will hit a brick wall and
> you writes will slow dow to megabytes / s … this is symptom of drive running
> out of over provisioning space …

What exactly happens on a non-trimmed drive?
Does it begin to forge certain erase-blocks? If so
which are those? What happens when you never
trim and continue dumping data on it?


Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-14 Thread Kai Krakow
Am Sun, 14 May 2017 22:57:26 +0200
schrieb Lionel Bouton :

> I've coded one Ruby script which tries to balance between the cost of
> reallocating group and the need for it. The basic idea is that it
> tries to keep the proportion of free space "wasted" by being allocated
> although it isn't used below a threshold. It will bring this
> proportion down enough through balance that minor reallocation won't
> trigger a new balance right away. It should handle pathological
> conditions as well as possible and it won't spend more than 2 hours
> working on a single filesystem by default. We deploy this as a daily
> cron script through Puppet on all our systems and it works very well
> (I didn't have to use balance manually to manage free space since we
> did that). Note that by default it sleeps a random amount of time to
> avoid IO spikes on VMs running on the same host. You can either edit
> it or pass it "0" which will be used for the max amount of time to
> sleep bypassing this precaution.
> 
> Here is the latest version : https://pastebin.com/Rrw1GLtx
> Given its current size, I should probably push it on github...

Yes, please... ;-)

> I've seen other maintenance scripts mentioned on this list so you
> might something simpler or more targeted to your needs by browsing
> through the list's history.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-14 Thread Hugo Mills
On Sun, May 14, 2017 at 01:15:09PM -0700, Marc MERLIN wrote:
> On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote:
> > On 05/13/2017 10:54 PM, Marc MERLIN wrote:
> > > Kernel 4.11, btrfs-progs v4.7.3
> > > 
> > > I run scrub and balance every night, been doing this for 1.5 years on this
> > > filesystem.
> > 
> > What are the exact commands you run every day?
>  
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
> (at the bottom)
> every night:
> 1) scrub
> 2) balance -musage=0
> 3) balance -musage=20

   In most cases, this is going to make ENOSPC problems worse, not
better. The reason for doing this kind of balance is to recover unused
space and allow it to be reallocated. The typical behaviour is that
data gets overallocated, and it's metadata which runs out. So, the
last thing you want to be doing is reducing the metadata allocation,
because that's the scarce resource.

   Also, I'd usually recommend using limit=n, where n is approximately
the amount of data overallocation (allocated space less used
space). It's much more controllable than usage.

   Hugo.

> 4) balance -dusage=0
> 5) balance -dusage=20
> 
> > > How did I get into such a misbalanced state when I balance every night?
> > 
> > I don't know, since I don't know what you do exactly. :)
>  
> Now you do :)
> 
> > > My filesystem is not full, I can write just fine, but I sure cannot
> > > rebalance now.
> > 
> > Yes, because you have quite some allocated but unused space. If btrfs
> > cannot just allocate more chunks, it starts trying a bit harder to reuse
> > all the empty spots in the already existing chunks.
> 
> Ok. shouldn't balance fix problems just like this?
> I have 60GB-ish free, or in this case that's also >25%, that's a lot
> 
> Speaking of unallocated, I have more now:
> Device unallocated:993.00MiB
> 
> This kind of just magically fixed itself during snapshot rotation and
> deletion I think.
> Sure enough, balance works again, but this feels pretty fragile.
> Looking again:
> Device size:   228.67GiB
> Device allocated:  227.70GiB
> Device unallocated:993.00MiB
> Free (estimated):   58.53GiB  (min: 58.53GiB)
> 
> You're saying that I need unallocated space for new chunks to be
> created, which is required by balance.
> Should btrfs not take care of keeping some space for me?
> Shoudln't a nigthly balance, which I'm already doing, help even more
> with this?
> 
> > > Besides adding another device to add space, is there a way around this
> > > and more generally not getting into that state anymore considering that
> > > I already rebalance every night?
> > 
> > Add monitoring and alerting on the amount of unallocated space.
> > 
> > FWIW, this is what I use for that purpose:
> > 
> > https://packages.debian.org/sid/munin-plugins-btrfs
> > https://packages.debian.org/sid/monitoring-plugins-btrfs
> > 
> > And, of course the btrfs-heatmap program keeps being a fun tool to
> > create visual timelapses of your filesystem, so you can learn how your
> > usage pattern is resulting in allocation of space by btrfs, and so that
> > you can visually see what the effect of your btrfs balance attempts is:
> 
> That's interesting, but ultimately, users shoudln't have to micromanage
> their filesystem to that level, even btrfs.
> 
> a) What is wrong in my nightly script that I should fix/improve?
> b) How do I recover from my current state?
> 
> Thanks,
> Marc

-- 
Hugo Mills | You stay in the theatre because you're afraid of
hugo@... carfax.org.uk | having no money? There's irony...
http://carfax.org.uk/  |
PGP: E2AB1DE4  | Slings and Arrows




Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-14 Thread Kai Krakow
Am Sun, 14 May 2017 13:15:09 -0700
schrieb Marc MERLIN :

> On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote:
> > On 05/13/2017 10:54 PM, Marc MERLIN wrote:  
> > > Kernel 4.11, btrfs-progs v4.7.3
> > > 
> > > I run scrub and balance every night, been doing this for 1.5
> > > years on this filesystem.  
> > 
> > What are the exact commands you run every day?  
>  
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
> (at the bottom)
> every night:
> 1) scrub
> 2) balance -musage=0
> 3) balance -musage=20
> 4) balance -dusage=0
> 5) balance -dusage=20
> 
> > > How did I get into such a misbalanced state when I balance every
> > > night?  
> > 
> > I don't know, since I don't know what you do exactly. :)  
>  
> Now you do :)
> 
> > > My filesystem is not full, I can write just fine, but I sure
> > > cannot rebalance now.  
> > 
> > Yes, because you have quite some allocated but unused space. If
> > btrfs cannot just allocate more chunks, it starts trying a bit
> > harder to reuse all the empty spots in the already existing
> > chunks.  
> 
> Ok. shouldn't balance fix problems just like this?
> I have 60GB-ish free, or in this case that's also >25%, that's a lot
> 
> Speaking of unallocated, I have more now:
> Device unallocated:993.00MiB
> 
> This kind of just magically fixed itself during snapshot rotation and
> deletion I think.
> Sure enough, balance works again, but this feels pretty fragile.
> Looking again:
> Device size:   228.67GiB
> Device allocated:  227.70GiB
> Device unallocated:993.00MiB
> Free (estimated):   58.53GiB  (min: 58.53GiB)
> 
> You're saying that I need unallocated space for new chunks to be
> created, which is required by balance.
> Should btrfs not take care of keeping some space for me?
> Shoudln't a nigthly balance, which I'm already doing, help even more
> with this?
> 
> > > Besides adding another device to add space, is there a way around
> > > this and more generally not getting into that state anymore
> > > considering that I already rebalance every night?  
> > 
> > Add monitoring and alerting on the amount of unallocated space.
> > 
> > FWIW, this is what I use for that purpose:
> > 
> > https://packages.debian.org/sid/munin-plugins-btrfs
> > https://packages.debian.org/sid/monitoring-plugins-btrfs
> > 
> > And, of course the btrfs-heatmap program keeps being a fun tool to
> > create visual timelapses of your filesystem, so you can learn how
> > your usage pattern is resulting in allocation of space by btrfs,
> > and so that you can visually see what the effect of your btrfs
> > balance attempts is:  
> 
> That's interesting, but ultimately, users shoudln't have to
> micromanage their filesystem to that level, even btrfs.
> 
> a) What is wrong in my nightly script that I should fix/improve?

You may want to try
https://www.spinics.net/lists/linux-btrfs/msg52076.html

> b) How do I recover from my current state?

That script may work its way through.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-14 Thread Lionel Bouton
Le 14/05/2017 à 22:15, Marc MERLIN a écrit :
> On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote:
>> On 05/13/2017 10:54 PM, Marc MERLIN wrote:
>>> Kernel 4.11, btrfs-progs v4.7.3
>>>
>>> I run scrub and balance every night, been doing this for 1.5 years on this
>>> filesystem.
>> What are the exact commands you run every day?
>  
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
> (at the bottom)
> every night:
> 1) scrub
> 2) balance -musage=0
> 3) balance -musage=20
> 4) balance -dusage=0
> 5) balance -dusage=20

usage=20 is pretty low: this means you don't try to reallocate and
regroup block groups that are filled more than 20%.
Constantly using this setting has left lots of allocated block groups
on your filesystem that are mostly empty (a little more than 20% used).

The rebalance subject is a bit complex. With an empty filesystem you
almost don't need it as group creation is sparse and it's OK to have
mostly empty groups. When your filesystem begins to fill up you have to
raise the usage target to be able to reclaim space (as the fs fills up
most of your groups do too) so that new block creation can happen.
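
As a very rough illustration of raising the usage target (the values are examples 
only, not recommendations, and this is not the script mentioned below):

#!/usr/bin/env python3
# Run increasingly aggressive data-balance passes; stop on the first failure
# (e.g. ENOSPC) so the situation can be looked at instead of made worse.
import subprocess, sys

mnt = sys.argv[1] if len(sys.argv) > 1 else "/mnt/btrfs_pool1"   # example path
for usage in (10, 25, 50, 75):                                   # example steps
    cmd = ["btrfs", "balance", "start", f"-dusage={usage}", mnt]
    print("running:", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"balance with -dusage={usage} failed, stopping here")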

I've coded a Ruby script which tries to balance the cost of reallocating
block groups against the need for it. The basic idea is that it tries
to keep the proportion of free space that is "wasted" (allocated but not
actually used) below a threshold. It will bring this proportion
down enough through balance that minor reallocation won't trigger a new
balance right away. It should handle pathological conditions as well as
possible, and it won't spend more than 2 hours working on a single
filesystem by default. We deploy this as a daily cron script through
Puppet on all our systems and it works very well (I haven't had to run
balance manually to manage free space since we started doing that).
Note that by default it sleeps a random amount of time to avoid IO
spikes on VMs running on the same host. You can either edit it or pass
it "0" as the maximum amount of time to sleep, bypassing
this precaution.

Here is the latest version : https://pastebin.com/Rrw1GLtx
Given its current size, I should probably push it on github...
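
For readers who only want the core decision rule, here is a stripped-down sketch of
the idea (this is NOT the script above; the threshold, mount point and output parsing
are just examples):

#!/usr/bin/env python3
# Only balance when the share of allocated-but-unused space exceeds a threshold.
import re, subprocess, sys

WASTE_THRESHOLD = 0.10        # example: tolerate 10% "wasted" space
mnt = sys.argv[1] if len(sys.argv) > 1 else "/mnt/btrfs_pool1"

out = subprocess.check_output(["btrfs", "fi", "usage", "-b", mnt], text=True)
size = int(re.search(r"Device size:\s*(\d+)", out).group(1))
allocated = int(re.search(r"Device allocated:\s*(\d+)", out).group(1))
used = int(re.search(r"^\s*Used:\s*(\d+)", out, re.MULTILINE).group(1))

wasted = (allocated - used) / size
print(f"allocated but unused: {wasted:.1%} of the device")
if wasted > WASTE_THRESHOLD:
    # start gently; a real tool (like the script linked above) escalates from here
    subprocess.run(["btrfs", "balance", "start", "-dusage=10", mnt], check=True)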

I've seen other maintenance scripts mentioned on this list, so you might
find something simpler or more targeted to your needs by browsing through the
list's history.

Best regards,

Lionel


Re: Btrfs/SSD (my -o ssd "summary")

2017-05-14 Thread Hans van Kranenburg
On 05/14/2017 08:01 PM, Tomasz Kusmierz wrote:
> All stuff that Chris wrote holds true, I just wanted to add flash
> specific information (from my experience of writing low level code
> for operating flash)

Thanks!

> [... erase ...]

> In terms of over provisioning of SSD it’s a give and take
> relationship … on good drive there is enough over provisioning to
> allow a normal operation on systems without TRIM … now if you would
> use a 1TB drive daily without TRIM and have only 30GB stored on it
> you will have fantastic performance but if you will want to store
> 500GB at roughly 200GB you will hit a brick wall and you writes will
> slow dow to megabytes / s … this is symptom of drive running out of
> over provisioning space … if you would run OS that issues trim, this
> problem would not exist since drive would know that whole 970GB of
> space is free and it would be pre-emptively erased days before.

== ssd_spread ==

The worst case behaviour is the btrfs ssd_spread mount option in
combination with not having discard enabled. It has a side effect of
minimizing the reuse of free space previously written in.

== ssd ==

[And, since I didn't write a "summary post" about this issue yet, here
is my version of it:]

The default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with writing and deleting many
files that are not too big, also cause this pattern, ending up with the
physical address space fully allocated and written to.

My favourite videos about this: *)

ssd (write pattern is small increments in /var/log/mail.log, a mail
spool on /var/spool/postfix (lots of file adds and deletes), and mailman
archives with a lot of little files):

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

*) The picture uses Hilbert Curve ordering (see link below) and shows
the four last created DATA block groups appended together. (so a new
chunk allocation pushes the others back in the picture).
https://github.com/knorrie/btrfs-heatmap/blob/master/doc/curves.md

 * What the ssd mode does is simply set a lower boundary on the
size of free space fragments that are reused (see the toy sketch after this list).
 * In combination with always trying to walk forward inside a block
group, not looking back at freed up space, it fills up with a shotgun
blast pattern when you do writes and deletes all the time.
 * When a write comes in that is bigger than any free space part left
behind, a new chunk gets allocated, and the bad pattern continues in there.
 * Because it keeps allocating more and more new chunks, and keeps
circling around in the latest one, until a big write is done, it leaves
mostly empty ones behind.
 * Without 'discard', the SSD will never learn that all the free space
left behind is actually free.
 * Eventually all raw disk space is allocated, and users run into
problems with ENOSPC and balance etc.
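
A toy model of the effect described in the bullets above (made-up numbers, and not
btrfs code - the real allocator is far more involved):

# Compare "reuse any free hole" with "only reuse holes above a cut-off size".
def place(writes, holes, min_hole):
    """Put each write in the first acceptable hole; otherwise pretend we
    allocate a brand new chunk. Returns the number of new chunks needed."""
    new_chunks = 0
    for w in writes:
        for i, h in enumerate(holes):
            if h >= w and h >= min_hole:
                holes[i] = h - w
                break
        else:
            new_chunks += 1
    return new_chunks

holes = [64, 128, 512, 960]      # free space fragments in KiB (example values)
writes = [100] * 20              # a stream of 100KiB writes

print("new chunks, any hole reused  :", place(writes, holes[:], min_hole=0))
print("new chunks, 2MiB lower bound :", place(writes, holes[:], min_hole=2048))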

So, enabling this ssd mode actually means it starts choking itself to
death here.

When users see this effect, they start scheduling balance operations, to
compact free space to bring the amount of allocated but unused space
down a bit.
 * But, doing that is causing just more and more writes to the ssd.
 * Also, since balance takes a "usage" argument and not a "how badly
fragmented" argument, it's causing lots of unnecessary rewriting of data.
 * And, with a decent amount (like a few thousand) subvolumes, all
having a few snapshots of their own, the ratio data:metadata written
during balance is skyrocketing, causing not only the data to be
rewritten, but also causing pushing out lots of metadata to the ssd.
(example: on my backup server rewriting 1GiB of data causes writing of
>40GiB of metadata, where probably 99.99% of those writes are some kind
of intermediary writes which are immediately invalidated during the next
btrfs transaction that is done).

All in all, this reminds me of the series "Breaking Bad", where every
step taken to try to fix things only made things worse. At every bullet
point above, this is also happening.

== nossd ==

nossd mode (even still without discard) allows a pattern of overwriting
much more previously used space, causing many more implicit discards to
happen because of the overwrite information the ssd gets.

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

> And last part - hard drive is not aware of filesystem and partitions
> … so you could have 400GB on this 1TB drive left unpartitioned and
> still you would be cooked. Technically speaking using as much as
> possible space on a SSD to a FS and OS that supports trim will give
> you best performance because drive will be notified of as much as
> possible disk space that is actually free …..
> 
> So, to summaries:

> - don’t try to outsmart built in mechanics of SSD (people that
> suggest that are just morons that want to have 5 minutes of fame).

This is exactly what the btrfs ssd options are trying to do.

Still, I don't think it's very nice to call Chris 

Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-14 Thread Marc MERLIN
On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote:
> On 05/13/2017 10:54 PM, Marc MERLIN wrote:
> > Kernel 4.11, btrfs-progs v4.7.3
> > 
> > I run scrub and balance every night, been doing this for 1.5 years on this
> > filesystem.
> 
> What are the exact commands you run every day?
 
http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
(at the bottom)
every night:
1) scrub
2) balance -musage=0
3) balance -musage=20
4) balance -dusage=0
5) balance -dusage=20

> > How did I get into such a misbalanced state when I balance every night?
> 
> I don't know, since I don't know what you do exactly. :)
 
Now you do :)

> > My filesystem is not full, I can write just fine, but I sure cannot
> > rebalance now.
> 
> Yes, because you have quite some allocated but unused space. If btrfs
> cannot just allocate more chunks, it starts trying a bit harder to reuse
> all the empty spots in the already existing chunks.

Ok. shouldn't balance fix problems just like this?
I have 60GB-ish free, or in this case that's also >25%, that's a lot

Speaking of unallocated, I have more now:
Device unallocated:  993.00MiB

This kind of just magically fixed itself during snapshot rotation and
deletion I think.
Sure enough, balance works again, but this feels pretty fragile.
Looking again:
Device size: 228.67GiB
Device allocated:227.70GiB
Device unallocated:  993.00MiB
Free (estimated): 58.53GiB  (min: 58.53GiB)

You're saying that I need unallocated space for new chunks to be
created, which is required by balance.
Should btrfs not take care of keeping some space for me?
Shouldn't a nightly balance, which I'm already doing, help even more
with this?

> > Besides adding another device to add space, is there a way around this
> > and more generally not getting into that state anymore considering that
> > I already rebalance every night?
> 
> Add monitoring and alerting on the amount of unallocated space.
> 
> FWIW, this is what I use for that purpose:
> 
> https://packages.debian.org/sid/munin-plugins-btrfs
> https://packages.debian.org/sid/monitoring-plugins-btrfs
> 
> And, of course the btrfs-heatmap program keeps being a fun tool to
> create visual timelapses of your filesystem, so you can learn how your
> usage pattern is resulting in allocation of space by btrfs, and so that
> you can visually see what the effect of your btrfs balance attempts is:

That's interesting, but ultimately, users shouldn't have to micromanage
their filesystem to that level, even btrfs.

a) What is wrong in my nightly script that I should fix/improve?
b) How do I recover from my current state?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-14 Thread Hans van Kranenburg
On 05/13/2017 10:54 PM, Marc MERLIN wrote:
> Kernel 4.11, btrfs-progs v4.7.3
> 
> I run scrub and balance every night, been doing this for 1.5 years on this
> filesystem.

What are the exact commands you run every day?

> But it has just started failing:
> [...]

> saruman:~# btrfs fi usage /mnt/btrfs_pool1/
> Overall:
> Device size:   228.67GiB
> Device allocated:  228.67GiB
> Device unallocated:  1.00MiB
> Device missing:0.00B
> Used:  171.25GiB
> Free (estimated):   55.32GiB  (min: 55.32GiB)
> Data ratio: 1.00
> Metadata ratio: 1.00
> Global reserve:512.00MiB  (used: 0.00B)
> 
> Data,single: Size:221.60GiB, Used:166.28GiB
>/dev/mapper/pool1   221.60GiB
> 
> Metadata,single: Size:7.03GiB, Used:4.96GiB
>/dev/mapper/pool1 7.03GiB
> 
> System,single: Size:32.00MiB, Used:48.00KiB
>/dev/mapper/pool132.00MiB
> 
> Unallocated:
>/dev/mapper/pool1 1.00MiB
> 
> How did I get into such a misbalanced state when I balance every night?

I don't know, since I don't know what you do exactly. :)

> My filesystem is not full, I can write just fine, but I sure cannot
> rebalance now.

Yes, because you have quite some allocated but unused space. If btrfs
cannot just allocate more chunks, it starts trying a bit harder to reuse
all the empty spots in the already existing chunks.

> Besides adding another device to add space, is there a way around this
> and more generally not getting into that state anymore considering that
> I already rebalance every night?

Add monitoring and alerting on the amount of unallocated space.

FWIW, this is what I use for that purpose:

https://packages.debian.org/sid/munin-plugins-btrfs
https://packages.debian.org/sid/monitoring-plugins-btrfs
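
A minimal stand-in for that idea, if you only want something cron-able (the plugins
above are the real thing; the threshold and output parsing here are just examples):

#!/usr/bin/env python3
# Exit non-zero when unallocated space drops below a chosen threshold.
import re, subprocess, sys

MNT = "/mnt/btrfs_pool1"              # example mount point
THRESHOLD = 2 * 1024 ** 3             # warn below 2GiB unallocated (arbitrary)

out = subprocess.check_output(["btrfs", "fi", "usage", "-b", MNT], text=True)
unallocated = int(re.search(r"Device unallocated:\s*(\d+)", out).group(1))
if unallocated < THRESHOLD:
    print(f"WARNING: only {unallocated} bytes unallocated on {MNT}", file=sys.stderr)
    sys.exit(1)
print(f"OK: {unallocated} bytes unallocated on {MNT}")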

And, of course the btrfs-heatmap program keeps being a fun tool to
create visual timelapses of your filesystem, so you can learn how your
usage pattern is resulting in allocation of space by btrfs, and so that
you can visually see what the effect of your btrfs balance attempts is:

https://github.com/knorrie/btrfs-heatmap/
https://packages.debian.org/sid/btrfs-heatmap
https://apps.fedoraproject.org/packages/btrfs-heatmap
https://aur.archlinux.org/packages/python-btrfs-heatmap/

-- 
Hans van Kranenburg


Re: Btrfs/SSD

2017-05-14 Thread Tomasz Kusmierz
All stuff that Chris wrote holds true, I just wanted to add flash specific 
information (from my experience of writing low level code for operating flash)

So with flash, to erase you have to erase a large allocation block: it usually 
used to be 128kB (plus some CRC data and other stuff that makes it more than 128kB, but we 
are talking functional data storage space); on newer setups it can be megabytes 
… device dependent, really.
To erase a block you need to drive the whole block (128k x 8 bits) with a voltage higher 
than the one usually used for IO (it can be even 15V), so it requires an external supply 
or a built-in charge pump to provide that voltage to the block erasure 
circuitry. This process generates a lot of heat and requires a lot of energy, 
so the consensus back in the day was that you could erase only one block at a time, and 
this could take up to 200ms (0.2 seconds). After an erase you need to check 
whether all bits are set to 1 (the charged state), and then the sector is marked as 
ready for storage.
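
A quick back-of-the-envelope calculation with the figures above (128kB blocks, up to
200ms per erase, one block at a time) shows why this matters:

# Rough arithmetic only, based on the example figures quoted above.
block_kib = 128          # erase block size in KiB
erase_s = 0.2            # worst-case time to erase one block, in seconds
blocks_per_gib = (1024 * 1024) // block_kib        # 8192 blocks per GiB
minutes_per_gib = blocks_per_gib * erase_s / 60
print(f"~{minutes_per_gib:.0f} minutes to erase 1GiB one block at a time")

That kind of number is a big part of why the parallel erase circuitry described next exists.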

Of course, flash memories are moving forward, and in more demanding environments 
there are solutions where blocks are grouped into groups with separate 
erase circuits, allowing erasure to be performed in parallel in 
multiple parts of the flash module; still, you are bound to one erase at a time per group.

Another problem is that the erase procedure locally increases temperature. On 
flat flash it's not that much of a problem, but on emerging solutions like 
3D flash we might locally experience undesired temperature increases that 
would either degrade the life span of the flash or simply erase neighbouring blocks.

In terms of over-provisioning of an SSD, it's a give and take relationship … on a 
good drive there is enough over-provisioning to allow normal operation on 
systems without TRIM … now, if you used a 1TB drive daily without TRIM and 
had only 30GB stored on it, you would have fantastic performance, but if you then 
wanted to store 500GB, at roughly 200GB you would hit a brick wall and your writes 
would slow down to megabytes/s … this is a symptom of the drive running out of 
over-provisioning space … if you ran an OS that issues trim, this problem would 
not exist, since the drive would know that the whole 970GB of space is free and it 
would have been pre-emptively erased days before.

And the last part - the drive is not aware of filesystems and partitions … so you 
could leave 400GB of this 1TB drive unpartitioned and you would still be 
cooked. Technically speaking, giving as much of the SSD's space as possible to a FS 
and OS that support trim will give you the best performance, because the drive will be 
notified about as much as possible of the disk space that is actually free …..

So, to summarise: 
- don't try to outsmart the built-in mechanics of the SSD (people who suggest that are 
just morons who want their 5 minutes of fame).
- don't buy a crap SSD and expect it to behave like a good one as long as you use below 
a certain % of it … it's stupid; buy a more reasonable but smaller SSD and store 
slow data on spinning rust.
- read more books and Wikipedia - not jumping down on you, but the internet is filled 
with people who provide false information, sometimes unknowingly, and swear by 
it (Dunning–Kruger effect :D), and some of them are very good at making all their 
theories sound sexy and stuff … you simply have to get used to it… 
- if something is too good to be true, then it's not
- promises of future performance gains are the domain of the "sleazy salesman"



> On 14 May 2017, at 17:21, Chris Murphy  wrote:
> 
> On Sat, May 13, 2017 at 3:39 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> 
>> When I was doing my ssd research the first time around, the going
>> recommendation was to keep 20-33% of the total space on the ssd entirely
>> unallocated, allowing it to use that space as an FTL erase-block
>> management pool.
> 
> Any brand name SSD has its own reserve above its specified size to
> ensure that there's decent performance, even when there is no trim
> hinting supplied by the OS; and thereby the SSD can only depend on LBA
> "overwrites" to know what blocks are to be freed up.
> 
> 
>> Anyway, that 20-33% left entirely unallocated/unpartitioned
>> recommendation still holds, right?
> 
> Not that I'm aware of. I've never done this by literally walling off
> space that I won't use. IA fairly large percentage of my partitions
> have free space so it does effectively happen as far as the SSD is
> concerned. And I use fstrim timer. Most of the file systems support
> trim.
> 
> Anyway I've stuffed a Samsung 840 EVO to 98% full with an OS/file
> system that would not issue trim commands on this drive, and it was
> doing full performance writes through that point. Then deleted maybe
> 5% of the files, and then refill the drive to 98% again, and it was
> the same performance.  So it must have had enough in reserve to permit
> full performance "overwrites" which were in effect directed to reserve
> blocks as the freed up blocks were being erased. Thus the erasure
> happening on the fly 

Re: 4.11: da_remove called for id=16 which is not allocated.

2017-05-14 Thread Marc MERLIN
My apologies, this was for the bcache list, sorry about this.

On Sun, May 14, 2017 at 08:25:22AM -0700, Marc MERLIN wrote:
> 
> gargamel:/sys/block/bcache16/bcache# echo 1 > stop
> 
> bcache: bcache_device_free() bcache16 stopped
> [ cut here ]
> WARNING: CPU: 7 PID: 11051 at lib/idr.c:383 ida_remove+0xe8/0x10b
> ida_remove called for id=16 which is not allocated.
> Modules linked in: uas usb_storage veth ip6table_filter ip6_tables 
> ebtable_nat ebtables ppdev lp xt_addrtype br_netfilter bridge stp llc tun 
> autofs4 softdog binfmt_misc ftdi_sio nfsd auth_rpcgss nfs_acl nfs lockd grace 
> fscache sunrpc ipt_REJECT nf_reject_ipv4 xt_conntrack xt_mark xt_nat 
> xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG iptable_mangle iptable_filter lm85 
> hwmon_vid pl2303 dm_snapshot dm_bufio iptable_nat ip_tables nf_conntrack_ipv4 
> nf_defrag_ipv4 nf_nat_ipv4 nf_conntrack_ftp ipt_MASQUERADE 
> nf_nat_masquerade_ipv4 nf_nat nf_conntrack x_tables sg st snd_pcm_oss 
> snd_mixer_oss bcache kvm_intel kvm irqbypass snd_hda_codec_realtek 
> snd_hda_codec_generic snd_cmipci snd_hda_intel snd_mpu401_uart snd_opl3_lib 
> snd_hda_codec snd_rawmidi snd_hda_core rc_ati_x10 snd_hwdep snd_seq_device 
> ati_remote
>  snd_pcm eeepc_wmi asix snd_timer usbserial asus_wmi usbnet rc_core snd 
> sparse_keymap libphy rfkill hwmon lpc_ich soundcore mei_me parport_pc wmi 
> tpm_infineon parport tpm_tis i2c_i801 battery input_leds tpm_tis_core i915 
> tpm pcspkr evdev e1000e ptp pps_core fuse raid456 multipath mmc_block 
> mmc_core lrw ablk_helper dm_crypt dm_mod async_raid6_recov async_pq async_xor 
> async_memcpy async_tx crc32c_intel blowfish_x86_64 blowfish_common pcbc 
> aesni_intel aes_x86_64 crypto_simd glue_helper cryptd xhci_pci ehci_pci 
> xhci_hcd ehci_hcd r8169 sata_sil24 mii usbcore thermal mvsas fan libsas 
> scsi_transport_sas [last unloaded: ftdi_sio]
> CPU: 7 PID: 11051 Comm: kworker/7:13 Tainted: G U  W   
> 4.11.0-amd64-preempt-sysrq-20170406 #2
> Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 
> 04/27/2013
> Workqueue: events cached_dev_free [bcache]
> Call Trace:
>  dump_stack+0x61/0x7d
>  __warn+0xc2/0xdd
>  warn_slowpath_fmt+0x5a/0x76
>  ida_remove+0xe8/0x10b
>  ida_simple_remove+0x2e/0x43
>  bcache_device_free+0x8c/0xc4 [bcache]
>  cached_dev_free+0x6b/0xe1 [bcache]
>  process_one_work+0x193/0x2b0
>  worker_thread+0x1e9/0x2c1
>  ? rescuer_thread+0x2b1/0x2b1
>  kthread+0xfb/0x100
>  ? init_completion+0x24/0x24
>  ret_from_fork+0x2c/0x40
> ---[ end trace 12586d8b165ff8f2 ]---
> 
> Prior to that: 
> cd /sys/fs/bcache/5bc072a8-ab17-446d-9744-e247949913c1
> echo 1 > stop
> 
> I needed to complete stop and remove all traces of a bcache before I could 
> mdadm --stop the underlying array.
> 
> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems 
>    what McDonalds is to gourmet 
> cooking
> Home page: http://marc.merlins.org/ | PGP 
> 1024R/763BE901

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


btrfs fi usage crash when multiple device volume contains seed device

2017-05-14 Thread Luis de Bethencourt
Hi,

Chris Murphy suggested we move the discussion in this bugzilla thread:
https://bugzilla.kernel.org/show_bug.cgi?id=115851

To here, the mailing list.

Going to quote him to give context:
"This might be better discussed on list to ensure there's congruence in
dev and user expectations; and in particular I think this needs a design
that accounts for the realistic long term goal, so that a short term
scope doesn't interfere or make the long term more difficult.

The matrix of possibilities, most of which are not yet implemented in
btrfs-progs:

1. 1 dev seed -> 1 dev sprout
2. 2+ dev seed -> 1 dev sprout
3. 1 dev seed -> 2+ dev sprout
4. 2+ dev seed -> 2+ dev sprout

Near as I can tell 2, 3, 4 are not implemented. It's an immediate
problem whether and how the profile (single, raid0, raid1) is to be
inherited from seed to sprout. If I have a 4 disk raid1 volume, to
create a sprout must I add a minimum of two devices? Or is it valid to
have raid1 profile seed chunks, where writes go to single profile sprout
chunks? Anyway point is, it needs a design to answer these things.

Next, and even more importantly as it applies to the simple case of
single to single, the way we do this right now is beyond confusing
because the remount ro to rw changes the volume UUID being mounted. The
ro mount is the seed, the rw mount is the sprout. This is not really a
remount, it's a umount of the seed, and a mount of the sprout. But what
if there's more than one sprout? This is asking for trouble so I think
the remount rw should be disallowed making it clear the ro seed cannot
be mounted rw. Instead it's necessary to umount it and explicitly mount
the rw sprout, and which sprout.

Also part of the ambiguity is that 'btrfs dev add' is more like
mkfs.btrfs in the context of seed-sprout. The new device isn't really
added to the seed, because the seed is read only. What's really
happening is a mkfs.btrfs with a "backing device" which is the seed; in
some sense it has more in common with the mkfs.btrfs --rootdir option. 

So I even wonder if 'btrfs dev add' is appropriate for creating sprouts,
and if instead it should be in mkfs.btrfs with a --seed option to specify
the backing seed, and thereby what we are making is a sprout, which has
a new UUID, and possibly different chunk profiles than the seed."

Thanks,
Luis


Re: Btrfs/SSD

2017-05-14 Thread Chris Murphy
On Sat, May 13, 2017 at 3:39 AM, Duncan <1i5t5.dun...@cox.net> wrote:

> When I was doing my ssd research the first time around, the going
> recommendation was to keep 20-33% of the total space on the ssd entirely
> unallocated, allowing it to use that space as an FTL erase-block
> management pool.

Any brand name SSD has its own reserve above its specified size to
ensure that there's decent performance, even when there is no trim
hinting supplied by the OS; and thereby the SSD can only depend on LBA
"overwrites" to know what blocks are to be freed up.


> Anyway, that 20-33% left entirely unallocated/unpartitioned
> recommendation still holds, right?

Not that I'm aware of. I've never done this by literally walling off
space that I won't use. A fairly large percentage of my partitions
have free space, so it does effectively happen as far as the SSD is
concerned. And I use the fstrim timer. Most of the file systems support
trim.

Anyway I've stuffed a Samsung 840 EVO to 98% full with an OS/file
system that would not issue trim commands on this drive, and it was
doing full performance writes through that point. Then deleted maybe
5% of the files, and then refilled the drive to 98% again, and it was
the same performance.  So it must have had enough in reserve to permit
full performance "overwrites", which were in effect directed to reserve
blocks as the freed up blocks were being erased. Thus the erasure
happening on the fly was not inhibiting performance on this SSD. Now
had I gone to 99.9% full, then deleted say 1GiB, and then started
doing a bunch of heavy small file writes rather than sequential ones? I
don't know what would have happened; it might have choked, because this is
a lot more work for the SSD, dealing with heavy IOPS and erasure.

It will invariably be something that's very model and even firmware
version specific.



>  Am I correct in asserting that if one
> is following that, the FTL already has plenty of erase-blocks available
> for management and the discussion about filesystem level trim and free
> space management becomes much less urgent, tho of course it's still worth
> considering if it's convenient to do so?

Most file systems don't direct writes to new areas, they're fairly
prone to overwriting. So the firmware is going to get notified fairly
quickly with either trim or an overwrite, which LBAs are stale. It's
probably more important with Btrfs which has more variable behavior,
it can continue to direct new writes to recently allocated chunks
before it'll do overwrites in older chunks that have free space.


> And am I also correct in believing that while it's not really worth
> spending more to over-provision to the near 50% as I ended up doing, if
> things work out that way as they did with me because the difference in
> price between 30% overprovisioning and 50% overprovisioning ends up being
> trivial, there's really not much need to worry about active filesystem
> trim at all, because the FTL has effectively half the device left to play
> erase-block musical chairs with as it decides it needs to?


I think it's not worth overprovisioning by default, ever. Use all of
that space until you have a problem. If you have a 256G drive, you
paid to get the spec performance for 100% of those 256G. You did not
pay that company to second-guess things and cut it slack by
overprovisioning from the outset.

I don't know how long it takes for erasure to happen though, so I have
no idea how much overprovisioning is really needed at the write rate
of the drive, so that it can erase at the same rate as writes, in
order to avoid a slow down.

I guess an even worse test would be one that intentionally fragments
across erase block boundaries, forcing the firmware to be unable to do
erasures without first migrating partially full blocks in order to
make them empty, so they can then be erased, and now be used for new
writes. That sort of shuffling is what will separate the good from
average drives, and why the drives have multicore CPUs on them, as
well as most now having on the fly always on encryption.

Even completely empty, some of these drives have a short term higher
speed write which falls back to a lower speed as the fast flash gets
full. After some pause that fast write capability is restored for
future writes. I have no idea if this is separate kind of flash on the
drive, or if it's just a difference in encoding data onto the flash
that's faster. Samsung has a drive that can "simulate" SLC NAND on 3D
VNAND. That sounds like an encoding method; it's fast but inefficient
and probably needs reencoding.

But that's the thing, the firmware is really complicated now.

I kinda wonder if f2fs could be chopped down to become a modular
allocator for the existing file systems; activate that allocation
method with "ssd" mount option rather than whatever overly smart thing
it does today that's based on assumptions that are now likely
outdated.

-- 
Chris Murphy

4.11: da_remove called for id=16 which is not allocated.

2017-05-14 Thread Marc MERLIN

gargamel:/sys/block/bcache16/bcache# echo 1 > stop

bcache: bcache_device_free() bcache16 stopped
[ cut here ]
WARNING: CPU: 7 PID: 11051 at lib/idr.c:383 ida_remove+0xe8/0x10b
ida_remove called for id=16 which is not allocated.
Modules linked in: uas usb_storage veth ip6table_filter ip6_tables ebtable_nat 
ebtables ppdev lp xt_addrtype br_netfilter bridge stp llc tun autofs4 softdog 
binfmt_misc ftdi_sio nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc 
ipt_REJECT nf_reject_ipv4 xt_conntrack xt_mark xt_nat xt_tcpudp nf_log_ipv4 
nf_log_common xt_LOG iptable_mangle iptable_filter lm85 hwmon_vid pl2303 
dm_snapshot dm_bufio iptable_nat ip_tables nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_conntrack_ftp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_nat 
nf_conntrack x_tables sg st snd_pcm_oss snd_mixer_oss bcache kvm_intel kvm 
irqbypass snd_hda_codec_realtek snd_hda_codec_generic snd_cmipci snd_hda_intel 
snd_mpu401_uart snd_opl3_lib snd_hda_codec snd_rawmidi snd_hda_core rc_ati_x10 
snd_hwdep snd_seq_device ati_remote
 snd_pcm eeepc_wmi asix snd_timer usbserial asus_wmi usbnet rc_core snd 
sparse_keymap libphy rfkill hwmon lpc_ich soundcore mei_me parport_pc wmi 
tpm_infineon parport tpm_tis i2c_i801 battery input_leds tpm_tis_core i915 tpm 
pcspkr evdev e1000e ptp pps_core fuse raid456 multipath mmc_block mmc_core lrw 
ablk_helper dm_crypt dm_mod async_raid6_recov async_pq async_xor async_memcpy 
async_tx crc32c_intel blowfish_x86_64 blowfish_common pcbc aesni_intel 
aes_x86_64 crypto_simd glue_helper cryptd xhci_pci ehci_pci xhci_hcd ehci_hcd 
r8169 sata_sil24 mii usbcore thermal mvsas fan libsas scsi_transport_sas [last 
unloaded: ftdi_sio]
CPU: 7 PID: 11051 Comm: kworker/7:13 Tainted: G U  W   
4.11.0-amd64-preempt-sysrq-20170406 #2
Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 
04/27/2013
Workqueue: events cached_dev_free [bcache]
Call Trace:
 dump_stack+0x61/0x7d
 __warn+0xc2/0xdd
 warn_slowpath_fmt+0x5a/0x76
 ida_remove+0xe8/0x10b
 ida_simple_remove+0x2e/0x43
 bcache_device_free+0x8c/0xc4 [bcache]
 cached_dev_free+0x6b/0xe1 [bcache]
 process_one_work+0x193/0x2b0
 worker_thread+0x1e9/0x2c1
 ? rescuer_thread+0x2b1/0x2b1
 kthread+0xfb/0x100
 ? init_completion+0x24/0x24
 ret_from_fork+0x2c/0x40
---[ end trace 12586d8b165ff8f2 ]---

Prior to that: 
cd /sys/fs/bcache/5bc072a8-ab17-446d-9744-e247949913c1
echo 1 > stop

I needed to complete stop and remove all traces of a bcache before I could 
mdadm --stop the underlying array.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl

2017-05-14 Thread Andy Lutomirski
On Sat, May 13, 2017 at 6:41 PM, Andreas Dilger  wrote:
> On May 10, 2017, at 11:10 PM, Eric Biggers  wrote:
>>
>> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
>>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang 
>>> out]

>> Yes, PIDs have traditionally been global, but today we have PID namespaces, 
>> and
>> many other isolation features such as mount namespaces.  Nothing is perfect, 
>> of
>> course, and containers are a lot worse than VMs, but it seems weird to use 
>> that
>> as an excuse to knowingly make things worse...
>>

Indeed.  Not only PID namespaces -- we have hidepid and we can simply
unmount /proc.  "There are other info leaks" is a poor excuse.

>>>
> Fortunately, the days of timesharing seem to well behind us.  For
> those people who think that containers are as secure as VM's (hah,
> hah, hah), it might be that best way to handle this is to have a mount
> option that requires root access to this functionality.  For those
> people who really care about this, they can disable access.
>>>
>>> Or use separate filesystems for each container so that exploitable bugs
>>> that shut down the filesystem can't be used to kill the other
>>> containers.  You could use a torrent of metadata-heavy operations
>>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
>>> the other containers.
>>>
 What would be the reason for not putting this behind
 capable(CAP_SYS_ADMIN)?

 What possible legitimate function could this functionality serve to
 users who don't own your filesystem?
>>>
>>> As I've said before, it's to enable dedupe tools to decide, given a set
>>> of files with shareable blocks, roughly how many other times each of
>>> those shareable blocks are shared so that they can make better decisions
>>> about which file keeps its shareable blocks, and which file gets
>>> remapped.  Dedupe is not a privileged operation, nor are any of the
>>> tools.
>>>
>>
>> So why does the ioctl need to return all extent mappings for the entire
>> filesystem, instead of just the share count of each block in the file that 
>> the
>> ioctl is called on?
>
> One possibility is that the ioctl() can return the mapping for all inodes
> owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
> or CAP_FOWNER is set), and return an "filesystem aggregate inode" (or more
> than one if there is a reason to do so) with all the other allocated blocks
> for inodes the user doesn't have permission to access?

Sounds like it could be reasonable.  But you don't want "owned by the
calling PID" precisely -- you also need to check
kgid_has_mapping(current_user_ns(), inode->i_gid), I think.


Re: Btrfs/SSD

2017-05-14 Thread Duncan
Imran Geriskovan posted on Fri, 12 May 2017 15:02:20 +0200 as excerpted:

> On 5/12/17, Duncan <1i5t5.dun...@cox.net> wrote:
>> FWIW, I'm in the market for SSDs ATM, and remembered this from a couple
>> weeks ago so went back to find it.  Thanks. =:^)
>>
>> (I'm currently still on quarter-TB generation ssds, plus spinning rust
>> for the larger media partition and backups, and want to be rid of the
>> spinning rust, so am looking at half-TB to TB, which seems to be the
>> pricing sweet spot these days anyway.)
> 
> Since you are taking ssds to mainstream based on your experience,
> I guess your perception of data retension/reliability is better than
> that of spinning rust. Right? Can you eloborate?
> 
> Or an other criteria might be physical constraints of spinning rust on
> notebooks which dictates that you should handle the device with care
> when running.
> 
> What was your primary motivation other than performance?

Well, the /immediate/ motivation is that the spinning rust is starting to 
hint that it's time to start thinking about rotating it out of service...

It's my main workstation so wall powered, but because it's the media and 
secondary backups partitions, I don't have anything from it mounted most 
of the time and because it /is/ spinning rust, I allow it to spin down.  
It spins right back up if I mount it, and reads seem to be fine, but if I 
let it set a bit after mount, possibly due to it spinning down again, 
sometimes I get write errors, SATA resets, etc.  Sometimes the write will 
then eventually appear to go thru, sometimes not, but once this happens, 
unmounting often times out, and upon a remount (which may or may not work 
until a clean reboot), the last writes may or may not still be there.

And the smart info, while not bad, does indicate it's starting to age, 
tho not extremely so.

Now even a year ago I'd have likely played with it, adjusting timeouts, 
spindowns, etc, attempting to get it working normally again.

But they say that ssd performance spoils you and you don't want to go 
back, and while it's a media drive and performance isn't normally an 
issue, those secondary backups to it as spinning rust sure take a lot 
longer than the primary backups to other partitions on the same pair of 
ssds that the working copies (of everything but media) are on.

Which means I don't like to do them... which means sometimes I put them 
off longer than I should.  Basically, it's another application of my 
"don't make it so big it takes so long to maintain you don't do it as you 
should" rule, only here, it's not the size but rather because I've been 
spoiled by the performance of the ssds.


So couple the aging spinning rust with the fact that I've really wanted 
to put media and the backups on ssd all along, only it couldn't be cost-
justified a few years ago when I bought the original ssds, and I now have 
my excuse to get the now cheaper ssds I really wanted all along. =:^)


As for reliability...  For archival usage I still think spinning rust is 
more reliable, and certainly more cost effective.

However, for me at least, with some real-world ssd experience under my 
belt now, including an early slow failure (more and more blocks going 
bad, I deliberately kept running it in btrfs raid1 mode with scrubs 
handling the bad blocks for quite some time, just to get the experience 
both with ssds and with btrfs) and replacement of one of the ssds with 
one I had originally bought for a different machine (my netbook, which 
went missing shortly thereafter), I now find ssds reliable enough for 
normal usage, certainly so if the data is valuable enough to have backups 
of it anyway, and if it's not valuable enough to be worth doing backups, 
then losing it is obviously not a big deal, because it's self-evidently 
worth less than the time, trouble and resources of doing that backup.

Particularly so if the speed of ssds helpfully encourages you to keep the 
backups more current than you would otherwise. =:^)

But spinning rust remains appropriate for long-term archival usage, like 
that third-level last-resort backup I like to make, then keep on the 
shelf, or store with a friend, or in a safe deposit box, or whatever, and 
basically never use, but like to have just in case.  IOW, that almost 
certainly write once, read-never, seldom update, last resort backup.  If 
three years down the line there's a fire/flood/whatever, and all I can 
find in the ashes/mud or retrieve from that friend is that three year old 
backup, I'll be glad to still have it.

Of course those who have multi-TB scale data needs may still find 
spinning rust useful as well, because while 4-TB ssds are available now, 
they're /horribly/ expensive.  But with 3D-NAND, even that use-case looks 
like it may go ssd in the next five years or so, leaving multi-year to 
decade-plus archiving, and perhaps say 50-TB-plus, but that's going to 
take long enough to actually write or otherwise do anything with it's 
effectively 

Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-14 Thread Duncan
Marc MERLIN posted on Sat, 13 May 2017 13:54:31 -0700 as excerpted:

> Kernel 4.11, btrfs-progs v4.7.3
> 
> I run scrub and balance every night, been doing this for 1.5 years on
> this filesystem.
> But it has just started failing:

> saruman:~# btrfs balance start -musage=0  /mnt/btrfs_pool1
> Done, had to relocate 0 out of 235 chunks

> saruman:~# btrfs balance start -dusage=0  /mnt/btrfs_pool1
> Done, had to relocate 0 out of 235 chunks

Those aren't failing (as you likely know, but to explain for others 
following along); there's simply nothing to do, as there are no entirely empty 
chunks.

But...

> saruman:~# btrfs balance start -musage=1  /mnt/btrfs_pool1
> ERROR: error during balancing '/mnt/btrfs_pool1':
> No space left on device

> saruman:~# btrfs balance start -dusage=10  /mnt/btrfs_pool1
> Done, had to relocate 0 out of 235 chunks

> saruman:~# btrfs balance start -dusage=20  /mnt/btrfs_pool1
> ERROR: error during balancing '/mnt/btrfs_pool1':
> No space left on device

... Errors there.  ENOSPC

[from dmesg]
> BTRFS info (device dm-2): 1 enospc errors during balance
> BTRFS info (device dm-2): relocating block group 598566305792 flags data
> BTRFS info (device dm-2): 1 enospc errors during balance
> BTRFS info (device dm-2): 1 enospc errors during balance
> BTRFS info (device dm-2): relocating block group 598566305792 flags data
> BTRFS info (device dm-2): 1 enospc errors during balance

> saruman:~# btrfs fi show /mnt/btrfs_pool1/
> Label: 'btrfs_pool1'  uuid: bc115001-a8d1-445c-9ec9-6050620efd0a
>   Total devices 1 FS bytes used 169.73GiB
>   devid1 size 228.67GiB used 228.67GiB path /dev/mapper/pool1

> saruman:~# btrfs fi usage /mnt/btrfs_pool1/
> Overall:
> Device size:   228.67GiB
> Device allocated:  228.67GiB
> Device unallocated:  1.00MiB
> Device missing:0.00B
> Used:  171.25GiB
> Free (estimated):   55.32GiB  (min: 55.32GiB)
> Data ratio: 1.00
> Metadata ratio: 1.00
> Global reserve:512.00MiB  (used: 0.00B)
> 
> Data,single: Size:221.60GiB, Used:166.28GiB
>/dev/mapper/pool1   221.60GiB
> 
> Metadata,single: Size:7.03GiB, Used:4.96GiB
>/dev/mapper/pool1 7.03GiB
> 
> System,single: Size:32.00MiB, Used:48.00KiB
>/dev/mapper/pool132.00MiB
> 
> Unallocated:
>/dev/mapper/pool1 1.00MiB

So we see it's fully chunk-allocated, no unallocated space, but gigs and 
gigs of empty space withing the chunk allocations, data chunks in 
particular.

> How did I get into such a misbalanced state when I balance every night?
> 
> My filesystem is not full, I can write just fine, but I sure cannot
> rebalance now.

Well, you can write just fine... for now.

After accounting for the global reserve coming out of metadata's reported 
free, there's about 1.5 GiB space in the metadata, and about 55 GiB of 
space in the data, so you should actually be able to write for some time 
before running out of either.

You just can't rebalance to chunk-defrag and reclaim chunks to 
unallocated, so they can be used for the other chunk type if necessary.
You're correct to be worried about this, but it's not immediately urgent.

> Besides adding another device to add space, is there a way around this
> and more generally not getting into that state anymore considering that
> I already rebalance every night?

What you /haven't/ yet said is what your nightly rebalance command, 
presumably scheduled, with -dusage and -musage, actually is.  How did you 
determine the usage amount to feed to the command, and was it dynamic, 
presumably determined by some script and changing based on the amount of 
unutilized space trapped within the data chunks, or static, the same 
usage command given every nite?

The other thing we don't have, and you might not have any idea either if 
it was simply scheduled and you hadn't been specifically checking, is a 
trendline of whether the post-balance unallocated space has been reducing 
over time, while the post-balance unutilized space within the data chunks 
was growing, or whether it happened all of a sudden.


If you've been following current discussion threads here, you may already 
know one possible specific trigger, as discussed, and more generically, 
there could be other specific triggers in the same general category.

In that thread the specific culprit appeared to be btrfs behavior with 
the (autodetected based on device rotational value as reported by sysfs) 
ssd mount option, in particular as it interacted with systemd's journal 
files, but it would apply to anything else with a similar write pattern.

The overall btrfs usage pattern was problematic as much like you 
apparently were getting but didn't catch before full allocation while he 
did, btrfs was continuing to allocate new chunks, even tho there was 
plenty of space left within existing chunks, none of which were entirely 
empty (so