Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Austin S. Hemmelgarn - 17.08.18, 14:55:
> On 2018-08-17 08:28, Martin Steigerwald wrote:
> > Thanks for your detailed answer.
> > 
> > Austin S. Hemmelgarn - 17.08.18, 13:58:
> >> On 2018-08-17 05:08, Martin Steigerwald wrote:
[…]
> >>> Anyway, creating a new filesystem may have been better here
> >>> anyway,
> >>> cause it replaced a BTRFS that aged over several years with a new
> >>> one. Due to the increased capacity and due to me thinking that
> >>> Samsung 860 Pro compresses itself, I removed LZO compression. This
> >>> would also give larger extents on files that are not fragmented or
> >>> only slightly fragmented. I think that Intel SSD 320 did not
> >>> compress, but Crucial m500 mSATA SSD does. That has been the
> >>> secondary SSD that still had all the data after the outage of the
> >>> Intel SSD 320.
> >> 
> >> First off, keep in mind that the SSD firmware doing compression
> >> only
> >> really helps with wear-leveling.  Doing it in the filesystem will
> >> help not only with that, but will also give you more space to work
> >> with.
> > While also reducing the ability of the SSD to wear-level. The more
> > data I fit on the SSD, the less it can wear-level. And the better I
> > compress that data, the less it can wear-level.
> 
> No, the better you compress the data, the _less_ data you are
> physically putting on the SSD, just like compressing a file makes it
> take up less space.  This actually makes it easier for the firmware
> to do wear-leveling.  Wear-leveling is entirely about picking where
> to put data, and by reducing the total amount of data you are writing
> to the SSD, you're making that decision easier for the firmware, and
> also reducing the number of blocks of flash memory needed (which also
> helps with SSD life expectancy because it translates to fewer erase
> cycles).

On one hand I can go with this, but:

If I fill the SSD to 99% with already compressed data, then, in case it 
compresses data itself for wear leveling, it has less chance to wear 
level than with 99% of not yet compressed data that it could still 
compress itself.

That was the point I was trying to make.

Sure, with a fill rate of only about 46% for home, compression would 
help the wear leveling. And if the controller does not compress at all, 
filesystem compression would help as well.

Hmmm, maybe I should enable "zstd", but on the other hand I save CPU 
cycles by not enabling it.
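
A minimal sketch of how that could look, for reference; the device and 
mount point are placeholders, and zstd needs kernel 4.14 or newer:

  # remount an existing filesystem so new writes get compressed
  mount -o remount,compress=zstd /home

  # or persistently in /etc/fstab
  /dev/mapper/vg-home  /home  btrfs  defaults,compress=zstd  0  0

Existing data only gets compressed when it is rewritten, for example 
with "btrfs filesystem defragment -r -czstd /home".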

> > However… I am not all that convinced that it would benefit me as
> > long as I have enough space. That SSD replacement more than doubled
> > capacity from about 680 GB to 1480 GB. I have a ton of free space in
> > the filesystems – usage of /home is only 46% for example – and
> > there are 96 GiB completely unused in LVM on the Crucial SSD and
> > even more than 183 GiB completely unused on Samsung SSD. The system
> > is doing weekly "fstrim" on all filesystems. I think that this is
> > more than is needed for the longevity of the SSDs, but well
> > actually I just don´t need the space, so…
> > 
> > Of course, in case I manage to fill up all that space, I consider
> > using compression. Until then, I am not all that convinced that I´d
> > benefit from it.
> > 
> > Of course it may increase read speeds and in case of nicely
> > compressible data also write speeds, I am not sure whether it even
> > matters. Also it uses up some CPU cycles on a dual core (+
> > hyperthreading) Sandybridge mobile i5. While I am not sure about
> > it, I bet also having larger possible extent sizes may help a bit.
> > As well as no compression may also help a bit with fragmentation.
> 
> It generally does actually. Less data physically on the device means
> lower chances of fragmentation.  In your case, it may not improve

I thought "no compression" may help with fragmentation, but I think you 
think that "compression" helps with fragmentation and misunderstood what 
I wrote.

> speed much though (your i5 _probably_ can't compress data much faster
> than it can access your SSD's, which means you likely won't see much
> performance benefit other than reducing fragmentation).
> 
> > Well putting this to a (non-scientific) test:
> > 
> > […]/.local/share/akonadi/db_data/akonadi> du -sh * | sort -rh | head
> > -5 3,1Gparttable.ibd
> > 
> > […]/.local/share/akonadi/db_data/akonadi> filefrag parttable.ibd
> > parttable.ibd: 11583 extents found
> > 
> > Hmmm, already quite many extents after just about one week with the
> > new filesystem. On the old filesystem I had somewhat around
> > 4-5 extents on that file.
> 
> Filefrag doesn't properly handle compressed files on BTRFS.  It treats
> each 128KiB compression block as a separate extent, even though they
> may be contiguous as part of one BTRFS extent.  That one file by
> itself should have reported as about 25396 extents on the old volume
> (assuming it was entirely compressed), so your numbers seem to match
> up realistically.

Oh, thanks. I did not know that filefrag does not correctly report 
extents for compressed files on BTRFS.
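
A more accurate picture for compressed files can be had with the 
compsize tool (package btrfs-compsize), assuming it is installed; it 
reports the real number of btrfs extents as well as the compression 
ratio, e.g. run inside the akonadi db_data directory from above:

  compsize parttable.ibd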

> > Well 

Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Hi Roman.

Now with proper CC.

Roman Mamedov - 17.08.18, 14:50:
> On Fri, 17 Aug 2018 14:28:25 +0200
> 
> Martin Steigerwald  wrote:
> > > First off, keep in mind that the SSD firmware doing compression
> > > only
> > > really helps with wear-leveling.  Doing it in the filesystem will
> > > help not only with that, but will also give you more space to
> > > work with.
> > While also reducing the ability of the SSD to wear-level. The more
> > data I fit on the SSD, the less it can wear-level. And the better I
> > compress that data, the less it can wear-level.
> 
> Do not consider SSD "compression" as a factor in any of your
> calculations or planning. Modern controllers do not do it anymore,
> the last ones that did are SandForce, and that's 2010 era stuff. You
> can check for yourself by comparing write speeds of compressible vs
> incompressible data, it should be the same. At most, the modern ones
> know to recognize a stream of binary zeroes and have a special case
> for that.

Interesting. Do you have anything to back up that claim?

> As for general comment on this thread, always try to save the exact
> messages you get when troubleshooting or getting failures from your
> system. Saying just "was not able to add" or "btrfs replace not
> working" without any exact details isn't really helpful as a bug
> report or even as a general "experiences" story, as we don't know
> what was the exact cause of those, could that have been avoided or
> worked around, not to mention what was your FS state at the time (as
> in "btrfs fi show" and "fi df").

I had a screen.log, but I put it on the filesystem after the 
backup was made, so it was lost.

Anyway, the reason for not being able to add the device was the 
read-only state of the BTRFS, as I wrote. The same goes for replace. I 
was able to read the error message just fine. AFAIR the exact wording 
was "read-only filesystem".
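
For context, the usual recovery sequence would look roughly like this 
(device names and the devid are placeholders; all of it fails once the 
degraded filesystem can only be mounted read-only):

  mount -o degraded /dev/vg/btrfs-raid1 /mnt
  btrfs device add /dev/mapper/new-ssd-lv /mnt
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

  # or, replacing the missing device directly by its devid:
  btrfs replace start 1 /dev/mapper/new-ssd-lv /mnt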

In any case: It was an experience report, not a request for help, so I 
don´t see why exact error messages are absolutely needed. If I had a 
support inquiry, that would be different, I agree.

Thanks,
-- 
Martin




Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Austin S. Hemmelgarn - 17.08.18, 15:01:
> On 2018-08-17 08:50, Roman Mamedov wrote:
> > On Fri, 17 Aug 2018 14:28:25 +0200
> > 
> > Martin Steigerwald  wrote:
> >>> First off, keep in mind that the SSD firmware doing compression
> >>> only
> >>> really helps with wear-leveling.  Doing it in the filesystem will
> >>> help not only with that, but will also give you more space to
> >> work with.
> >> While also reducing the ability of the SSD to wear-level. The more
> >> data I fit on the SSD, the less it can wear-level. And the better
> >> I compress that data, the less it can wear-level.
> > 
> > Do not consider SSD "compression" as a factor in any of your
> > calculations or planning. Modern controllers do not do it anymore,
> > the last ones that did are SandForce, and that's 2010 era stuff.
> > You can check for yourself by comparing write speeds of
> > compressible vs incompressible data, it should be the same. At
> > most, the modern ones know to recognize a stream of binary zeroes
> > and have a special case for that.
> 
> All that testing write speeds for compressible versus incompressible
> data tells you is if the SSD is doing real-time compression of data,
> not if they are doing any compression at all.  Also, this test only
> works if you turn the write-cache on the device off.

As the data still needs to be transferred to the SSD, at least when the 
SATA connection is maxed out I bet you won´t see any difference in write 
speed whether the SSD compresses in real time or not.
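
A rough sketch of such a comparison (the device name is a placeholder, 
this overwrites the device, and the drive's own write cache can be 
disabled first as noted):

  # optionally disable the drive write cache for the duration of the test
  hdparm -W0 /dev/sdX

  # incompressible data
  dd if=/dev/urandom of=/tmp/random.img bs=1M count=1024
  dd if=/tmp/random.img of=/dev/sdX bs=1M oflag=direct conv=fdatasync

  # highly compressible data
  dd if=/dev/zero of=/dev/sdX bs=1M count=1024 oflag=direct conv=fdatasync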

> Besides, you can't prove 100% for certain that any manufacturer who
> does not sell their controller chips isn't doing this, which means
> there are a few manufacturers that may still be doing it.

Who really knows what SSD controller manufacturers are doing? I have not 
seen any Open Channel SSD stuff for laptops so far.

Thanks,
-- 
Martin




Hang after growing file system (4.14.48)

2018-08-17 Thread Martin Raiber
Hi,

After growing a single btrfs file system (on a loop device, with btrfs
fi resize max /fs) it hangs later (sometimes much later). Symptoms:

* An unkillable btrfs process using 100% (of one) CPU in R state (no
kernel trace, cannot attach with strace or gdb, or run Linux perf)
* Other processes with the following stack trace:

[Fri Aug 17 16:21:06 2018] INFO: task python3:46794 blocked for more than
120 seconds.
[Fri Aug 17 16:21:06 2018]   Not tainted 4.14.48 #2
[Fri Aug 17 16:21:06 2018] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Aug 17 16:21:06 2018] python3 D    0 46794  46702 0x
[Fri Aug 17 16:21:06 2018] Call Trace:
[Fri Aug 17 16:21:06 2018]  ? __schedule+0x2de/0x7b0
[Fri Aug 17 16:21:06 2018]  schedule+0x32/0x80
[Fri Aug 17 16:21:06 2018]  schedule_preempt_disabled+0xa/0x10
[Fri Aug 17 16:21:06 2018]  __mutex_lock.isra.1+0x295/0x4c0
[Fri Aug 17 16:21:06 2018]  ? btrfs_show_devname+0x25/0xd0
[Fri Aug 17 16:21:06 2018]  btrfs_show_devname+0x25/0xd0
[Fri Aug 17 16:21:06 2018]  show_vfsmnt+0x44/0x150
[Fri Aug 17 16:21:06 2018]  seq_read+0x314/0x3d0
[Fri Aug 17 16:21:06 2018]  __vfs_read+0x26/0x130
[Fri Aug 17 16:21:06 2018]  vfs_read+0x91/0x130
[Fri Aug 17 16:21:06 2018]  SyS_read+0x42/0x90
[Fri Aug 17 16:21:06 2018]  do_syscall_64+0x6e/0x120
[Fri Aug 17 16:21:06 2018]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Fri Aug 17 16:21:06 2018] RIP: 0033:0x7f67fd41b6d0
[Fri Aug 17 16:21:06 2018] RSP: 002b:7ffd80be2678 EFLAGS: 0246
ORIG_RAX: 
[Fri Aug 17 16:21:06 2018] RAX: ffda RBX: 56521bf7bb00
RCX: 7f67fd41b6d0
[Fri Aug 17 16:21:06 2018] RDX: 0400 RSI: 56521bf7bd30
RDI: 0004
[Fri Aug 17 16:21:06 2018] RBP: 0d68 R08: 7f67fe655700
R09: 0101
[Fri Aug 17 16:21:06 2018] R10: 56521bf7c0cc R11: 0246
R12: 7f67fd6d6440
[Fri Aug 17 16:21:06 2018] R13: 7f67fd6d5900 R14: 0064
R15: 

Regards,
Martin Raiber



Re: [PATCH v2] btrfs: btrfs_shrink_device should call commit transaction

2018-08-17 Thread David Sterba
On Mon, Aug 06, 2018 at 06:12:37PM +0800, Anand Jain wrote:
> test case btrfs/164 reported a UAF.
> 
> [ 6712.084324] general protection fault:  [#1] PREEMPT SMP
> ::
> [ 6712.195423]  btrfs_update_commit_device_size+0x75/0xf0 [btrfs]
> [ 6712.201424]  btrfs_commit_transaction+0x57d/0xa90 [btrfs]
> [ 6712.206999]  btrfs_rm_device+0x627/0x850 [btrfs]
> [ 6712.211800]  btrfs_ioctl+0x2b03/0x3120 [btrfs]
> ::
> 
> reason for this is that btrfs_shrink_device() adds the device resized to
> the fs_devices::resized_devices after it has called the last commit
> transaction.
> So the list fs_devices::resized_devices is not empty when
> btrfs_shrink_device() returns.
> Now the parent function btrfs_rm_device() calls
> btrfs_close_bdev(device);
> call_rcu(&device->rcu, free_device_rcu);
> and then does the commit transaction.
> The commit transaction goes through the fs_devices::resized_devices
> in btrfs_update_commit_device_size() and leads to UAF.
> 
> Fix this by making sure btrfs_shrink_device() calls the last needed
> btrfs_commit_transaction() before the return.
> 
> Reported-by: Lu Fengqi 
> Signed-off-by: Anand Jain 
> Tested-by: Lu Fengqi 

The transaction commit makes more sense here, and is also consistent with
what grow does.

Reviewed-by: David Sterba 


Re: [PATCH 4/5] btrfs: remove a useless return statement in btrfs_block_rsv_add

2018-08-17 Thread David Sterba
On Sat, Aug 04, 2018 at 09:10:56PM +0800, Lu Fengqi wrote:
> Since ret must be 0 here, we don't have to return separately in advance.
> 
> No functional change.
> 
> Signed-off-by: Lu Fengqi 

Added to misc-next too. The change is fairly trivial, so I was pondering
the code readability impacts, but it's ok.


Re: [PATCH v3] btrfs: fix qgroup_free wrong num_bytes in btrfs_subvolume_reserve_metadata()

2018-08-17 Thread David Sterba
On Thu, Aug 09, 2018 at 09:46:04AM +0800, Lu Fengqi wrote:
> After btrfs_qgroup_reserve_meta_prealloc(), num_bytes will be assigned
> again by btrfs_calc_trans_metadata_size(). Once block_rsv fails, we can't
> properly free the num_bytes of the previous qgroup_reserve. Use a separate
> variable to store the num_bytes of the qgroup_reserve.
> 
> Delete the comment for the qgroup_reserved that does not exist and add a
> comment about use_global_rsv.
> 
> Fixes: c4c129db5da8 ("btrfs: drop unused parameter qgroup_reserved")
> Signed-off-by: Lu Fengqi 

Reviewed-by: David Sterba 


Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Austin S. Hemmelgarn

On 2018-08-17 08:50, Roman Mamedov wrote:

On Fri, 17 Aug 2018 14:28:25 +0200
Martin Steigerwald  wrote:


First off, keep in mind that the SSD firmware doing compression only
really helps with wear-leveling.  Doing it in the filesystem will help
not only with that, but will also give you more space to work with.


While also reducing the ability of the SSD to wear-level. The more data
I fit on the SSD, the less it can wear-level. And the better I compress
that data, the less it can wear-level.


Do not consider SSD "compression" as a factor in any of your calculations or
planning. Modern controllers do not do it anymore, the last ones that did are
SandForce, and that's 2010 era stuff. You can check for yourself by comparing
write speeds of compressible vs incompressible data, it should be the same. At
most, the modern ones know to recognize a stream of binary zeroes and have a
special case for that.
All that testing write speeds for compressible versus incompressible 
data tells you is if the SSD is doing real-time compression of data, not 
if they are doing any compression at all.  Also, this test only works 
if you turn the write-cache on the device off.


Besides, you can't prove 100% for certain that any manufacturer who does 
not sell their controller chips isn't doing this, which means there are 
a few manufacturers that may still be doing it.


Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Austin S. Hemmelgarn

On 2018-08-17 08:28, Martin Steigerwald wrote:

Thanks for your detailed answer.

Austin S. Hemmelgarn - 17.08.18, 13:58:

On 2018-08-17 05:08, Martin Steigerwald wrote:

[…]

I have seen a discussion about the limitation in point 2. That
allowing to add a device and make it into RAID 1 again might be
dangerous, cause of system chunk and probably other reasons. I did
not completely read and understand it though.

So I still don´t get it, cause:

Either it is a RAID 1, then, one disk may fail and I still have
*all*
data. Also for the system chunk, which according to btrfs fi df /
btrfs fi sh was indeed RAID 1. If so, then period. Then I don´t see
why it would need to disallow me to make it into an RAID 1 again
after one device has been lost.

Or it is no RAID 1 and then what is the point to begin with? As I
was
able to copy all data off the degraded mount, I´d say it was a
RAID 1.

(I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just
does two copies regardless of how many drives you use.)


So, what's happening here is a bit complicated.  The issue is entirely
with older kernels that are missing a couple of specific patches, but
it appears that not all distributions have their kernels updated to
include those patches yet.

In short, when you have a volume consisting of _exactly_ two devices
using raid1 profiles that is missing one device, and you mount it
writable and degraded on such a kernel, newly created chunks will be
single-profile chunks instead of raid1 chunks with one half missing.
Any write has the potential to trigger allocation of a new chunk, and
more importantly any _read_ has the potential to trigger allocation of
a new chunk if you don't use the `noatime` mount option (because a
read will trigger an atime update, which results in a write).

When older kernels then go and try to mount that volume a second time,
they see that there are single-profile chunks (which can't tolerate
_any_ device failures), and refuse to mount at all (because they
can't guarantee that metadata is intact).  Newer kernels fix this
part by checking per-chunk if a chunk is degraded/complete/missing,
which avoids this because all the single chunks are on the remaining
device.


How new does the kernel need to be for that to happen?

Do I get this right that it would be the kernel used for recovery, i.e.
the one on the live distro that needs to be new enough? The one on this
laptop is meanwhile already 4.18.1.
Yes, the kernel used for recovery is the important one here.  I don't 
remember for certain when the patches went in, but I'm pretty sure it's 
been no earlier than 4.14.  FWIW, I'm pretty sure SystemRescueCD has a 
new enough kernel, but they still (sadly) lack zstd support.


I used the latest GRML stable release 2017.05, which has a 4.9 kernel.
While I don't know exactly when the patches went in, I'm fairly certain 
that 4.9 never got them.



As far as avoiding this in the future:


I hope that with the new Samsung Pro 860 together with the existing
Crucial m500 I am spared from this for years to come. According to its
SMART lifetime-used status, that Crucial SSD still has quite some time
to go.
Yes, hopefully.  And the SMART status on that Crucial is probably right, 
they tend to do a very good job in my experience with accurately 
measuring life expectancy (that or they're just _really_ good at 
predicting failures, I've never had a Crucial SSD that did not indicate 
correctly in the SMART status that it would fail in the near future).



* If you're just pulling data off the device, mark the device
read-only in the _block layer_, not the filesystem, before you mount
it.  If you're using LVM, just mark the LV read-only using LVM
commands.  This will make 100% certain that nothing gets written to
the device, and thus makes sure that you won't accidentally cause
issues like this.



* If you're going to convert to a single device,
just do it and don't stop it part way through.  In particular, make
sure that your system will not lose power.



* Otherwise, don't mount the volume unless you know you're going to
repair it.


Thanks for those. Good to keep in mind.
The last one is actually good advice in general, not just for BTRFS.  I 
can't count how many stories I've heard of people who tried to run half 
an array simply to avoid downtime, and ended up making things far worse 
than they were as a result.



For this laptop it was not all that important but I wonder about
BTRFS RAID 1 in enterprise environment, cause restoring from backup
adds a significantly higher downtime.

Anyway, creating a new filesystem may have been better here anyway,
cause it replaced a BTRFS that aged over several years with a new
one. Due to the increased capacity and due to me thinking that
Samsung 860 Pro compresses itself, I removed LZO compression. This
would also give larger extents on files that are not fragmented or
only slightly fragmented. I think that Intel SSD 320 did not
compress, but Crucial m500 mSATA SSD does. 

Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Roman Mamedov
On Fri, 17 Aug 2018 14:28:25 +0200
Martin Steigerwald  wrote:

> > First off, keep in mind that the SSD firmware doing compression only
> > really helps with wear-leveling.  Doing it in the filesystem will help
> > not only with that, but will also give you more space to work with.
> 
> While also reducing the ability of the SSD to wear-level. The more data 
> I fit on the SSD, the less it can wear-level. And the better I compress 
> that data, the less it can wear-level.

Do not consider SSD "compression" as a factor in any of your calculations or
planning. Modern controllers do not do it anymore, the last ones that did are
SandForce, and that's 2010 era stuff. You can check for yourself by comparing
write speeds of compressible vs incompressible data, it should be the same. At
most, the modern ones know to recognize a stream of binary zeroes and have a
special case for that.

As for general comment on this thread, always try to save the exact messages
you get when troubleshooting or getting failures from your system. Saying just
"was not able to add" or "btrfs replace not working" without any exact details
isn't really helpful as a bug report or even as a general "experiences" story,
as we don't know what was the exact cause of those, could that have been
avoided or worked around, not to mention what was your FS state at the time
(as in "btrfs fi show" and "fi df").

-- 
With respect,
Roman


Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Thanks for your detailed answer.  

Austin S. Hemmelgarn - 17.08.18, 13:58:
> On 2018-08-17 05:08, Martin Steigerwald wrote:
[…]
> > I have seen a discussion about the limitation in point 2. That
> > allowing to add a device and make it into RAID 1 again might be
> > dangerous, cause of system chunk and probably other reasons. I did
> > not completely read and understand it though.
> > 
> > So I still don´t get it, cause:
> > 
> > Either it is a RAID 1, then, one disk may fail and I still have
> > *all*
> > data. Also for the system chunk, which according to btrfs fi df /
> > btrfs fi sh was indeed RAID 1. If so, then period. Then I don´t see
> > why it would need to disallow me to make it into an RAID 1 again
> > after one device has been lost.
> > 
> > Or it is no RAID 1 and then what is the point to begin with? As I
> > was
> > able to copy all data off the degraded mount, I´d say it was a
> > RAID 1.
> > 
> > (I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just
> > does two copies regardless of how many drives you use.)
> 
> So, what's happening here is a bit complicated.  The issue is entirely
> with older kernels that are missing a couple of specific patches, but
> it appears that not all distributions have their kernels updated to
> include those patches yet.
> 
> In short, when you have a volume consisting of _exactly_ two devices
> using raid1 profiles that is missing one device, and you mount it
> writable and degraded on such a kernel, newly created chunks will be
> single-profile chunks instead of raid1 chunks with one half missing.
> Any write has the potential to trigger allocation of a new chunk, and
> more importantly any _read_ has the potential to trigger allocation of
> a new chunk if you don't use the `noatime` mount option (because a
> read will trigger an atime update, which results in a write).
> 
> When older kernels then go and try to mount that volume a second time,
> they see that there are single-profile chunks (which can't tolerate
> _any_ device failures), and refuse to mount at all (because they
> can't guarantee that metadata is intact).  Newer kernels fix this
> part by checking per-chunk if a chunk is degraded/complete/missing,
> which avoids this because all the single chunks are on the remaining
> device.

How new does the kernel need to be for that to happen?

Do I get this right that it would be the kernel used for recovery, i.e. 
the one on the live distro that needs to be new enough? The one on this 
laptop is meanwhile already 4.18.1.

I used the latest GRML stable release 2017.05, which has a 4.9 kernel.

> As far as avoiding this in the future:

I hope that with the new Samsung Pro 860 together with the existing 
Crucial m500 I am spared from this for years to come. According to its 
SMART lifetime-used status, that Crucial SSD still has quite some time 
to go.

> * If you're just pulling data off the device, mark the device
> read-only in the _block layer_, not the filesystem, before you mount
> it.  If you're using LVM, just mark the LV read-only using LVM
> commands.  This will make 100% certain that nothing gets written to
> the device, and thus makes sure that you won't accidentally cause
> issues like this.

> * If you're going to convert to a single device,
> just do it and don't stop it part way through.  In particular, make
> sure that your system will not lose power.

> * Otherwise, don't mount the volume unless you know you're going to
> repair it.

Thanks for those. Good to keep in mind.
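
A minimal sketch of the first suggestion (device, volume group and 
logical volume names are placeholders):

  # plain block device
  blockdev --setro /dev/sdb

  # LVM logical volume
  lvchange --permission r /dev/vg0/btrfs-lv

  # then mount it degraded and read-only
  mount -o ro,degraded /dev/vg0/btrfs-lv /mnt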

> > For this laptop it was not all that important but I wonder about
> > BTRFS RAID 1 in enterprise environment, cause restoring from backup
> > adds a significantly higher downtime.
> > 
> > Anyway, creating a new filesystem may have been better here anyway,
> > cause it replaced a BTRFS that aged over several years with a new
> > one. Due to the increased capacity and due to me thinking that
> > Samsung 860 Pro compresses itself, I removed LZO compression. This
> > would also give larger extents on files that are not fragmented or
> > only slightly fragmented. I think that Intel SSD 320 did not
> > compress, but Crucial m500 mSATA SSD does. That has been the
> > secondary SSD that still had all the data after the outage of the
> > Intel SSD 320.
> 
> First off, keep in mind that the SSD firmware doing compression only
> really helps with wear-leveling.  Doing it in the filesystem will help
> not only with that, but will also give you more space to work with.

While also reducing the ability of the SSD to wear-level. The more data 
I fit on the SSD, the less it can wear-level. And the better I compress 
that data, the less it can wear-level.

> Secondarily, keep in mind that most SSD's use compression algorithms
> that are fast, but don't generally get particularly amazing
> compression ratios (think LZ4 or Snappy for examples of this).  In
> comparison, BTRFS provides a couple of options that are slower, but
> get far better ratios most of the time (zlib, 

Re: [PATCH] Btrfs: remove always true if branch in btrfs_get_extent

2018-08-17 Thread David Sterba
On Fri, Aug 17, 2018 at 05:05:28AM +0800, Liu Bo wrote:
> @path is always NULL when it comes to the if branch.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 


Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Austin S. Hemmelgarn

On 2018-08-17 05:08, Martin Steigerwald wrote:

Hi!

This happened about two weeks ago. I already dealt with it and all is
well.

Linux hung on suspend so I switched off this ThinkPad T520 forcefully.
After that it did not boot the operating system anymore. Intel SSD 320,
latest firmware, which should patch this bug, but apparently does not,
is only 8 MiB big. Those 8 MiB just contain zeros.

Access via GRML and "mount -fo degraded" worked. I initially was even
able to write onto this degraded filesystem. First I copied all data to
a backup drive.

I even started a balance to "single" so that it would work with one SSD.

But later I learned that secure erase may recover the Intel SSD 320 and
since I had no other SSD at hand, did that. And yes, it did. So I
canceled the balance.

I partitioned the Intel SSD 320 and put LVM on it, just as I had it. But
at that time I was not able to mount the degraded BTRFS on the other SSD
as writable anymore, not even with "-f" "I know what I am doing". Thus I
was not able to add a device to it and btrfs balance it to RAID 1. Even
"btrfs replace" was not working.

I thus formatted a new BTRFS RAID 1 and restored.

A week later I migrated the Intel SSD 320 to a Samsung 860 Pro. Again
via one full backup and restore cycle. However, this time I was able to
copy most of the data of the Intel SSD 320 with "mount -fo degraded" via
eSATA and thus the copy operation was way faster.

So conclusion:

1. Pro: BTRFS RAID 1 really protected my data against a complete SSD
outage.

Glad to hear I'm not the only one!


2. Con:  It does not allow me to add a device and balance to RAID 1 or
replace one device that is already missing at this time.
See below where you comment about this more, I've replied regarding it 
there.


3. I keep using BTRFS RAID 1 on two SSDs for often changed, critical
data.

4. And yes, I know it does not replace a backup. As it was holidays and
I was lazy backup was two weeks old already, so I was happy to have all
my data still on the other SSD.

5. The error messages in kernel when mounting without "-o degraded" are
less than helpful. They indicate a corrupted filesystem instead of just
telling that one device is missing and "-o degraded" would help here.
Agreed, the kernel error messages need significant improvement, not just 
for this case, but in general (I would _love_ to make sure that there 
are exactly zero exit paths for open_ctree that don't involve a proper 
error message being printed beyond the ubiquitous `open_ctree failed` 
message you get when it fails).



I have seen a discussion about the limitation in point 2. That allowing
to add a device and make it into RAID 1 again might be dangerous, cause
of system chunk and probably other reasons. I did not completely read
and understand it though.

So I still don´t get it, cause:

Either it is a RAID 1, then, one disk may fail and I still have *all*
data. Also for the system chunk, which according to btrfs fi df / btrfs
fi sh was indeed RAID 1. If so, then period. Then I don´t see why it
would need to disallow me to make it into an RAID 1 again after one
device has been lost.

Or it is no RAID 1 and then what is the point to begin with? As I was
able to copy all data off the degraded mount, I´d say it was a RAID 1.

(I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just does
two copies regardless of how many drives you use.)
So, what's happening here is a bit complicated.  The issue is entirely 
with older kernels that are missing a couple of specific patches, but it 
appears that not all distributions have their kernels updated to include 
those patches yet.


In short, when you have a volume consisting of _exactly_ two devices 
using raid1 profiles that is missing one device, and you mount it 
writable and degraded on such a kernel, newly created chunks will be 
single-profile chunks instead of raid1 chunks with one half missing. 
Any write has the potential to trigger allocation of a new chunk, and 
more importantly any _read_ has the potential to trigger allocation of a 
new chunk if you don't use the `noatime` mount option (because a read 
will trigger an atime update, which results in a write).


When older kernels then go and try to mount that volume a second time, 
they see that there are single-profile chunks (which can't tolerate 
_any_ device failures), and refuse to mount at all (because they can't 
guarantee that metadata is intact).  Newer kernels fix this part by 
checking per-chunk if a chunk is degraded/complete/missing, which avoids 
this because all the single chunks are on the remaining device.
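
A quick way to check whether such single-profile chunks have shown up 
after a degraded writable mount (the mount point is a placeholder):

  btrfs filesystem df /mnt
  # or, broken down per device:
  btrfs filesystem usage -T /mnt

Any "Data, single" or "Metadata, single" lines next to the raid1 ones 
indicate that this has happened.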


As far as avoiding this in the future:

* If you're just pulling data off the device, mark the device read-only 
in the _block layer_, not the filesystem, before you mount it.  If 
you're using LVM, just mark the LV read-only using LVM commands.  This 
will make 100% certain that nothing gets written to the device, and thus 
makes sure that you won't accidentally cause issues 

Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Hi!

This happened about two weeks ago. I already dealt with it and all is 
well.

Linux hung on suspend so I switched off this ThinkPad T520 forcefully. 
After that it did not boot the operating system anymore. Intel SSD 320, 
latest firmware, which should patch this bug, but apparently does not, 
is only 8 MiB big. Those 8 MiB just contain zeros.

Access via GRML and "mount -fo degraded" worked. I initially was even 
able to write onto this degraded filesystem. First I copied all data to 
a backup drive.

I even started a balance to "single" so that it would work with one SSD.

But later I learned that secure erase may recover the Intel SSD 320 and 
since I had no other SSD at hand, did that. And yes, it did. So I 
canceled the balance.
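
For reference, the secure erase itself is roughly the following with 
hdparm (the device name is a placeholder, the drive must not be in the 
"frozen" security state, and this wipes the whole SSD):

  hdparm -I /dev/sdX | grep -i frozen
  hdparm --user-master u --security-set-pass p /dev/sdX
  hdparm --user-master u --security-erase p /dev/sdX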

I partitioned the Intel SSD 320 and put LVM on it, just as I had it. But 
at that time I was not able to mount the degraded BTRFS on the other SSD 
as writable anymore, not even with "-f" "I know what I am doing". Thus I 
was not able to add a device to it and btrfs balance it to RAID 1. Even 
"btrfs replace" was not working.

I thus formatted a new BTRFS RAID 1 and restored.

A week later I migrated the Intel SSD 320 to a Samsung 860 Pro. Again 
via one full backup and restore cycle. However, this time I was able to 
copy most of the data of the Intel SSD 320 with "mount -fo degraded" via 
eSATA and thus the copy operation was way faster.

So conclusion:

1. Pro: BTRFS RAID 1 really protected my data against a complete SSD 
outage.

2. Con:  It does not allow me to add a device and balance to RAID 1 or 
replace one device that is already missing at this time.

3. I keep using BTRFS RAID 1 on two SSDs for often changed, critical 
data.

4. And yes, I know it does not replace a backup. As it was holidays and 
I was lazy backup was two weeks old already, so I was happy to have all 
my data still on the other SSD.

5. The error messages in kernel when mounting without "-o degraded" are 
less than helpful. They indicate a corrupted filesystem instead of just 
telling that one device is missing and "-o degraded" would help here.


I have seen a discussion about the limitation in point 2. That allowing 
to add a device and make it into RAID 1 again might be dangerous, cause 
of system chunk and probably other reasons. I did not completely read 
and understand it though.

So I still don´t get it, cause:

Either it is a RAID 1, then, one disk may fail and I still have *all* 
data. Also for the system chunk, which according to btrfs fi df / btrfs 
fi sh was indeed RAID 1. If so, then period. Then I don´t see why it 
would need to disallow me to make it into an RAID 1 again after one 
device has been lost.

Or it is no RAID 1 and then what is the point to begin with? As I was 
able to copy all data off the degraded mount, I´d say it was a RAID 1.

(I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just does 
two copies regardless of how many drives you use.)


For this laptop it was not all that important but I wonder about BTRFS 
RAID 1 in enterprise environment, cause restoring from backup adds a 
significantly higher downtime.

Anyway, creating a new filesystem may have been better here anyway, 
cause it replaced a BTRFS that aged over several years with a new one. 
Due to the increased capacity and due to me thinking that Samsung 860 
Pro compresses itself, I removed LZO compression. This would also give 
larger extents on files that are not fragmented or only slightly 
fragmented. I think that Intel SSD 320 did not compress, but Crucial 
m500 mSATA SSD does. That has been the secondary SSD that still had all 
the data after the outage of the Intel SSD 320.


Overall I am happy, cause BTRFS RAID 1 gave me access to the data after 
the SSD outage. That is the most important thing about it for me.

Thanks,
-- 
Martin




[PATCH] generic: test for deduplication between different files

2018-08-17 Thread fdmanana
From: Filipe Manana 

Test that deduplication of an entire file that has a size that is not
aligned to the filesystem's block size into a different file does not
corrupt the destination's file data.

This test is motivated by a bug found in Btrfs which is fixed by the
following patch for the linux kernel:

  "Btrfs: fix data corruption when deduplicating between different files"

XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly
with the same corruption as in Btrfs - some bytes of a block get replaced
with zeroes after the deduplication.

Signed-off-by: Filipe Manana 
---
 tests/generic/505 | 84 +++
 tests/generic/505.out | 33 
 tests/generic/group   |  1 +
 3 files changed, 118 insertions(+)
 create mode 100755 tests/generic/505
 create mode 100644 tests/generic/505.out

diff --git a/tests/generic/505 b/tests/generic/505
new file mode 100755
index ..5ee232a2
--- /dev/null
+++ b/tests/generic/505
@@ -0,0 +1,84 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# FS QA Test No. 505
+#
+# Test that deduplication of an entire file that has a size that is not aligned
+# to the filesystem's block size into a different file does not corrupt the
+# destination's file data.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/reflink
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch_dedupe
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+# The first byte with a value of 0xae starts at an offset (2518890) which is
+# not a multiple of the block size.
+$XFS_IO_PROG -f \
+   -c "pwrite -S 0x6b 0 2518890" \
+   -c "pwrite -S 0xae 2518890 102398" \
+   $SCRATCH_MNT/foo | _filter_xfs_io
+
+# Create a second file with a length not aligned to the block size, whose bytes
+# all have the value 0x6b, so that its extent(s) can be deduplicated with the
+# first file.
+$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | _filter_xfs_io
+
+# The file is filled with bytes having the value 0x6b from offset 0 to offset
+# 2518889 and with the value 0xae from offset 2518890 to offset 2621287.
+echo "File content before deduplication:"
+od -t x1 $SCRATCH_MNT/foo
+
+# Now deduplicate the entire second file into a range of the first file that
+# also has all bytes with the value 0x6b. The destination range's end offset
+# must not be aligned to the block size and must be less than the offset of
+# the first byte with the value 0xae (byte at offset 2518890).
+$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" $SCRATCH_MNT/foo \
+   | _filter_xfs_io
+
+# The bytes in the range starting at offset 2515659 (end of the deduplication
+# range) and ending at offset 2519040 (start offset rounded up to the block
+# size) must all have the value 0xae (and not replaced with 0x00 values).
+# In other words, we should have exactly the same data we had before we asked
+# for deduplication.
+echo "File content after deduplication and before unmounting:"
+od -t x1 $SCRATCH_MNT/foo
+
+# Unmount the filesystem and mount it again. This guarantees any file data in
+# the page cache is dropped.
+_scratch_cycle_mount
+
+# The bytes in the range starting at offset 2515659 (end of the deduplication
+# range) and ending at offset 2519040 (start offset rounded up to the block
+# size) must all have the value 0xae (and not replaced with 0x00 values).
+# In other words, we should have exactly the same data we had before we asked
+# for deduplication.
+echo "File content after unmounting:"
+od -t x1 $SCRATCH_MNT/foo
+
+status=0
+exit
diff --git a/tests/generic/505.out b/tests/generic/505.out
new file mode 100644
index ..7556b9fb
--- /dev/null
+++ b/tests/generic/505.out
@@ -0,0 +1,33 @@
+QA output created by 505
+wrote 2518890/2518890 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 102398/102398 bytes at offset 2518890
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 557771/557771 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+File content before deduplication:
+000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
+*
+11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
+11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
+*
+11777540 ae ae ae ae ae ae ae ae
+11777550
+deduped 557771/557771 bytes at offset 1957888
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+File content after deduplication and before unmounting:
+000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
+*
+11467540 6b 6b 6b 6b 

[PATCH] Btrfs: fix data corruption when deduplicating between different files

2018-08-17 Thread fdmanana
From: Filipe Manana 

If we deduplicate extents between two different files we can end up
corrupting data if the source range ends at the size of the source file,
the source file's size is not aligned to the filesystem's block size
and the destination range does not go past the size of the destination
file size.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ xfs_io -f -c "pwrite -S 0x6b 0 2518890" /mnt/foo
  # The first byte with a value of 0xae starts at an offset (2518890)
  # which is not a multiple of the sector size.
  $ xfs_io -c "pwrite -S 0xae 2518890 102398" /mnt/foo

  # Confirm the file content is full of bytes with values 0x6b and 0xae.
  $ od -t x1 /mnt/foo
  000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
  *
  11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
  11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
  *
  11777540 ae ae ae ae ae ae ae ae
  11777550

  # Create a second file with a length not aligned to the sector size,
  # whose bytes all have the value 0x6b, so that its extent(s) can be
  # deduplicated with the first file.
  $ xfs_io -f -c "pwrite -S 0x6b 0 557771" /mnt/bar

  # Now deduplicate the entire second file into a range of the first file
  # that also has all bytes with the value 0x6b. The destination range's
  # end offset must not be aligned to the sector size and must be less
  # then the offset of the first byte with the value 0xae (byte at offset
  # 2518890).
  $ xfs_io -c "dedupe /mnt/bar 0 1957888 557771" /mnt/foo

  # The bytes in the range starting at offset 2515659 (end of the
  # deduplication range) and ending at offset 2519040 (start offset
  # rounded up to the block size) must all have the value 0xae (and not
  # replaced with 0x00 values). In other words, we should have exactly
  # the same data we had before we asked for deduplication.
  $ od -t x1 /mnt/foo
  000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
  *
  11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
  11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
  *
  11777540 ae ae ae ae ae ae ae ae
  11777550

  # Unmount the filesystem and mount it again. This guarantees any file
  # data in the page cache is dropped.
  $ umount /dev/sdb
  $ mount /dev/sdb /mnt

  $ od -t x1 /mnt/foo
  000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
  *
  11461300 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00
  11461320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  1147 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
  *
  11777540 ae ae ae ae ae ae ae ae
  11777550

  # The bytes in range 2515659 to 2519040 have a value of 0x00 and not a
  # value of 0xae, data corruption happened due to the deduplication
  # operation.

So fix this by rounding down, to the sector size, the length used for the
deduplication when the following conditions are met:

  1) Source file's range ends at its i_size;
  2) Source file's i_size is not aligned to the sector size;
  3) Destination range does not cross the i_size of the destination file.
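
Illustrated with the reproducer's numbers, assuming the default 4096
byte sector size:

  loff = 0, olen = i_size_read(src) = 557771, dst_loff = 1957888
  len  = round_down(557771, 4096) - 0 = 557056

so the deduplication now ends at dst_loff + len = 2514944 in the
destination, and the bytes between 2514944 and 2519040 are simply left
untouched instead of being replaced with zeroes.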

Fixes: e1d227a42ea2 ("btrfs: Handle unaligned length in extent_same")
CC: sta...@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana 
---
 fs/btrfs/ioctl.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 331b495d2db9..230644d1e439 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3479,6 +3479,25 @@ static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 olen,
 
same_lock_start = min_t(u64, loff, dst_loff);
same_lock_len = max_t(u64, loff, dst_loff) + len - same_lock_start;
+   } else {
+   /*
+* If the source and destination inodes are different, the
+* source's range end offset matches the source's i_size, that
+* i_size is not a multiple of the sector size, and the
+* destination range does not go past the destination's i_size,
+* we must round down the length to the nearest sector size
+* multiple. If we don't do this adjustment we end up replacing
+* with zeroes the bytes in the range that starts at the
+* deduplication range's end offset and ends at the next sector
+* size multiple.
+*/
+   if (loff + olen == i_size_read(src) &&
+   dst_loff + len < i_size_read(dst)) {
+   const u64 sz = BTRFS_I(src)->root->fs_info->sectorsize;
+
+   len = round_down(i_size_read(src), sz) - loff;
+   olen = len;
+   }
}
 
 again:
-- 
2.11.0



Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-17 Thread Menion
Ok, but I cannot guarantee that I won't need to cancel the scrub during the
process. As said, this is domestic storage, and when a scrub is running the
performance hit is big enough to prevent smooth streaming of HD and 4k
movies.
On Thu, 16 Aug 2018 at 21:38,  wrote:
>
> Could you show scrub status -d, then start a new scrub (all drives) and show 
> scrub status -d again? This may help us diagnose the problem.
>
> On 15-Aug-2018 09:27:40 +0200, men...@gmail.com wrote:
> > I needed to resume scrub two times after an unclean shutdown (I was
> > cooking and using too much electricity) and two times after a manual
> > cancel, because I wanted to watch a 4k movie and the array
> > performance was not enough with scrub active.
> > Each time I resumed it, I also checked the status, and the total
> > amount of data scrubbed kept counting up (it never started from zero).
> > On Wed, 15 Aug 2018 at 05:33, Zygo Blaxell
> >  wrote:
> > >
> > > On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> > > > Hi
> > > > Well, I think it is worth giving more details on the array.
> > > > The array is built with 5x8TB HDDs in an external USB3.0 to SATAIII 
> > > > enclosure
> > > > The enclosure is a cheap JMicron based chinese stuff (from Orico).
> > > > There is one USB3.0 link for all the 5 HDD with a SATAIII 3.0Gb
> > > > multiplexer behind it. So you cannot expect peak performance, which is
> > > > not the goal of this array (domestic data storage).
> > > > Also the USB to SATA firmware is buggy, so UAS operations are not
> > > > stable, it run in BOT mode.
> > > > Having said so, the scrub has been started (and resumed) on the array
> > > > mount point:
> > > >
> > > > sudo btrfs scrub start(resume) /media/storage/das1
> > >
> > > So is 2.59TB the amount scrubbed _since resume_? If you run a complete
> > > scrub end to end without cancelling or rebooting in between, what is
> > > the size on all disks (btrfs scrub status -d)?
> > >
> > > > even if, from reading the documentation, I understand that invoking it
> > > > on the mount point or on one of the HDDs in the array is the same.
> > > > In the end, especially for a RAID5 array, does it really make sense to
> > > > scrub only one disk in the array???
> > >
> > > You would set up a shell for-loop and scrub each disk of the array
> > > in turn. Each scrub would correct errors on a single device.
> > >
> > > There was a bug in btrfs scrub where scrubbing the filesystem would
> > > create one thread for each disk, and the threads would issue commands
> > > to all disks and compete with each other for IO, resulting in terrible
> > > performance on most non-SSD hardware. By scrubbing disks one at a time,
> > > there are no competing threads, so the scrub runs many times faster.
> > > With this bug the total time to scrub all disks individually is usually
> > > less than the time to scrub the entire filesystem at once, especially
> > > on HDD (and even if it's not faster, one-at-a-time disk scrubs are
> > > much kinder to any other process trying to use the filesystem at the
> > > same time).
> > >
> > > It appears this bug is not fixed, based on some timing results I am
> > > getting from a test array. iostat shows 10x more reads than writes on
> > > all disks even when all blocks on one disk are corrupted and the scrub
> > > is given only a single disk to process (that should result in roughly
> > > equal reads on all disks slightly above the number of writes on the
> > > corrupted disk).
> > >
> > > This is where my earlier caveat about performance comes from. Many parts
> > > of btrfs raid5 are somewhere between slower and *much* slower than
> > > comparable software raid5 implementations. Some of that is by design:
> > > btrfs must be at least 1% slower than mdadm because btrfs needs to read
> > > metadata to verify data block csums in scrub, and the difference would
> > > be much larger in practice due to HDD seek times, but 500%-900% overhead
> > > still seems high especially when compared to btrfs raid1 that has the
> > > same metadata csum reading issue without the huge performance gap.
> > >
> > > It seems like btrfs raid5 could still use a thorough profiling to figure
> > > out where it's spending all its IO.
> > >
> > > > Regarding the data usage, here you have the current figures:
> > > >
> > > > menion@Menionubuntu:~$ sudo btrfs fi show
> > > > [sudo] password for menion:
> > > > Label: none uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f
> > > > Total devices 1 FS bytes used 11.44GiB
> > > > devid 1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3
> > > >
> > > > Label: none uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > > > Total devices 5 FS bytes used 6.57TiB
> > > > devid 1 size 7.28TiB used 1.64TiB path /dev/sda
> > > > devid 2 size 7.28TiB used 1.64TiB path /dev/sdb
> > > > devid 3 size 7.28TiB used 1.64TiB path /dev/sdc
> > > > devid 4 size 7.28TiB used 1.64TiB path /dev/sdd
> > > > devid 5 size 7.28TiB used 1.64TiB path /dev/sde
> > > >
> 
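
A minimal sketch of the one-disk-at-a-time scrub loop Zygo describes
above, using the device names and mount point from this array as
examples:

  for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
      btrfs scrub start -B -d "$dev"
  done
  btrfs scrub status -d /media/storage/das1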

Re: [PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking

2018-08-17 Thread Nikolay Borisov



On 17.08.2018 00:07, Liu Bo wrote:
> Btrfs's btree locking has two modes, spinning mode and blocking mode,
> while searching btree, locking is always acquired in spinning mode and
> then converted to blocking mode if necessary, and in some hot paths we may
> switch the locking back to spinning mode by btrfs_clear_path_blocking().
> 
> When acquiring locks, both of reader and writer need to wait for blocking
> readers and writers to complete before doing read_lock()/write_lock().
> 
> The problem is that btrfs_clear_path_blocking() needs to switch nodes
> in the path to blocking mode at first (by btrfs_set_path_blocking) to
> make lockdep happy before doing its actual clearing blocking job.
> 
> When switching to blocking mode from spinning mode, it consists of
> 
> step 1) bumping up blocking readers counter and
> step 2) read_unlock()/write_unlock(),
> 
> this has caused a serious ping-pong effect if there's a great amount of
> concurrent readers/writers, as waiters will be woken up and go to
> sleep immediately.

I think this ping-pong needs to be explained in a bit more detail.

Looking at the code it seems the issue happens when we have a path
locked in spinning mode (via btrfs_tree_lock) - in this case we have
blocking_readers/writes == 0 and write_lock acquired. Then we could
potentially have multiple requests to lock the tree (in this case the
root) and since the current holder is spinning then btrfs_tree_lock
callers will block on the write_lock. Subsequently when the holder
switches to blocking he calls
btrfs_set_path_blocking->btrfs_set_lock_blocking_rw which of course
increments the blocking reader/writes and calls the unlock at which
point a "thundering herd" problem occurs on the lock. Am I correct?

Furthermore I think the problem occurs not in btrfs_clear_path_blocking
but rather in the initial call to btrfs_set_path_blocking.

The way I see it the following sequence of operations occur:

1. Call btrfs_set_path_blocking: increments
blocking_writers/blocking_readers, calls unlock: ping-pong happens. Now
the lock types for the nodes in the path are
BTRFS_READ_LOCK_BLOCKING/BTRFS_WRITE_LOCK_BLOCKING

2. Some time later we call btrfs_clear_path_blocking
which also calls btrfs_set_path_blocking, however this time this
function doesn't do anything because the lock types in path struct are
already blocking and none of the 2 conditions inside
btrfs_set_lock_blocking_rw will match hence this code won't be executed.

Did I miss something ?

> 
> 1) Killing this kind of ping-pong results in a big improvement in my 1600k
> files creation script,
> 
> MNT=/mnt/btrfs
> mkfs.btrfs -f /dev/sdf
> mount /dev/sdf $MNT
> time fsmark  -D  1  -S0  -n  10  -s  0  -L  1 -l /tmp/fs_log.txt \
> -d  $MNT/0  -d  $MNT/1 \
> -d  $MNT/2  -d  $MNT/3 \
> -d  $MNT/4  -d  $MNT/5 \
> -d  $MNT/6  -d  $MNT/7 \
> -d  $MNT/8  -d  $MNT/9 \
> -d  $MNT/10  -d  $MNT/11 \
> -d  $MNT/12  -d  $MNT/13 \
> -d  $MNT/14  -d  $MNT/15
> 
> w/o patch:
> real2m27.307s
> user0m12.839s
> sys 13m42.831s
> 
> w/ patch:
> real1m2.273s
> user0m15.802s
> sys 8m16.495s
> 
> 2) dbench also proves the improvement,
> dbench -t 120 -D /mnt/btrfs 16
> 
> w/o patch:
> Throughput 158.363 MB/sec
> 
> w/ patch:
> Throughput 449.52 MB/sec
> 
> 3) xfstests didn't show any additional failures.
> 
> One thing to note is that callers may set leave_spinning to have all
> nodes in the path stay in spinning mode, which means callers are ready
> to not sleep before releasing the path, but it won't cause problems if
> they don't want to sleep in blocking mode, IOW, we can just get rid of
> leave_spinning.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/ctree.c | 57 
> 
>  fs/btrfs/ctree.h |  2 --
>  fs/btrfs/delayed-inode.c |  3 ---
>  3 files changed, 4 insertions(+), 58 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index d436fb4c002e..8b31caa60b0a 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -52,42 +52,6 @@ noinline void btrfs_set_path_blocking(struct btrfs_path *p)
>   }
>  }
>  
> -/*
> - * reset all the locked nodes in the patch to spinning locks.
> - *
> - * held is used to keep lockdep happy, when lockdep is enabled
> - * we set held to a blocking lock before we go around and
> - * retake all the spinlocks in the path.  You can safely use NULL
> - * for held
> - */
> -noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
> - struct extent_buffer *held, int held_rw)
> -{
> - int i;
> -
> - if (held) {
> - btrfs_set_lock_blocking_rw(held, held_rw);
> - if (held_rw == BTRFS_WRITE_LOCK)
> - held_rw = BTRFS_WRITE_LOCK_BLOCKING;
> - else if (held_rw == BTRFS_READ_LOCK)
> - held_rw = BTRFS_READ_LOCK_BLOCKING;
> - }
> -