Re: Filesystem mounts fine but hangs on access

2018-11-04 Thread Sebastian Ochmann

On 04.11.18 19:31, Duncan wrote:

[This mail was also posted to gmane.comp.file-systems.btrfs.]

Sebastian Ochmann posted on Sun, 04 Nov 2018 14:15:55 +0100 as
excerpted:


Hello,

I have a btrfs filesystem on a single encrypted (LUKS) 10 TB drive
which stopped working correctly.



Kernel 4.18.16 (Arch Linux)


I see upgrading to 4.19 seems to have solved your problem, but this is
more about something I saw in the trace that has me wondering...


[  368.267315]  touch_atime+0xc0/0xe0


Do you have any atime-related mount options set?


That's an interesting point. On some machines I have explicitly set 
"noatime", but on that particular system I did not, so it was using the 
default "relatime" option. Since I'm not using mutt or anything else 
(that I'm aware of) that relies on atimes, I will set noatime there as 
well.



FWIW, noatime is strongly recommended on btrfs.

Now I'm not a dev, just a btrfs user and list regular, and I don't know
if that function is called and just does nothing when noatime is set,
so you may well already have it set and this is "much ado about
nothing", but the chance that it's relevant, if not for you, perhaps
for others that may read it, begs for this post...

The problem with atime, access time, is that it turns most otherwise
read-only operations into read-and-write operations in order to
update the access time.  And on copy-on-write (COW) based filesystems
such as btrfs, that can be a big problem, because updating that tiny
bit of metadata will trigger a rewrite of the entire metadata block
containing it, which will trigger an update of the metadata for /that/
block in the parent metadata tier... all the way up the metadata tree,
ultimately to its root, the filesystem root and the superblocks, at the
next commit (normally every 30 seconds or less).

Not only is that a bunch of otherwise unnecessary work for a bit of
metadata barely anything actually uses, but forcing most read
operations to read-write obviously compounds the risk for all of those
would-be read-only operations when a filesystem already has problems.

Additionally, if your use-case includes regular snapshotting with
atime on, on mostly-read workloads with few writes (other than atime
updates), it may well be that most of the changes captured in a
snapshot are actually atime updates, making recurring snapshots far
larger than they'd be otherwise.

Now a few years ago the kernel did change the default to relatime,
which only updates the atime if it is older than the mtime/ctime or
more than a day old.  That helps quite a bit, and on traditional
filesystems it's arguably a reasonably sane default, but COW makes
atime tracking sufficiently more expensive that setting noatime is
still strongly recommended on btrfs, particularly if you're doing
regular snapshotting.

So do consider adding noatime to your mount options if you haven't done
so already.  AFAIK, the only /semi-common/ app that actually uses
atimes these days is mutt (for new-mail detection, and then only for
mbox, not maildir), so you should be safe to at least test turning it off.
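
To illustrate, setting it is just one extra mount option, either in
fstab or via a remount (the device and mountpoint below are
placeholders, not your actual setup):

/dev/mapper/backup  /mnt/backup  btrfs  noatime,compress=zstd  0  0

# mount -o remount,noatime /mnt/backup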

And YMMV, but if you do use mutt or something else that uses atimes,
I'd go so far as to recommend finding an alternative, replacing either
btrfs (because as I said, relatime is arguably enough on a traditional
non-COW filesystem) or whatever it is that uses atimes, your call,
because IMO it really is that big a deal.

Meanwhile, particularly after seeing that in the trace, if the 4.19
update hadn't already fixed it, I'd have suggested trying a read-only
mount, both as a test, and assuming it worked, at least allowing you to
access the data without the lockup, which would have then been related
to the write due to the atime update, not the actual read.


It would be nice to have a 1:1 image of the filesystem (or rather the 
raw block device) for more testing, but unfortunately I don't have 
another 10 TB drive lying around. :) I didn't really expect the 4.19 
upgrade to (apparently) fix the problem right away, so I also couldn't 
test the mentioned patch, but yeah... If it happens again (which I 
obviously hope it won't), I'll try your suggestion.



Actually, a read-only mount test is always a good troubleshooting step
when the trouble is a filesystem that either won't mount normally, or
will, but then locks up when you try to access something.  It's far
less risky than a normal writable mount, and at minimum it provides you
the additional test data of whether it worked or not, plus if it does,
a chance to access the data and make sure your backups are current,
before actually trying to do any repairs.
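
For the record, such a test is nothing more than (device and mountpoint
here are placeholders):

# mount -o ro,noatime /dev/mapper/backup /mnt/backup

A read-only mount never takes the atime-update write path, so if it
stays responsive while a normal mount hangs, that's one more hint the
problem is on the write side rather than the read itself.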



Re: Filesystem mounts fine but hangs on access

2018-11-04 Thread Sebastian Ochmann

Thank you very much for the quick reply.

On 04.11.18 14:37, Qu Wenruo wrote:



On 2018/11/4 9:15 PM, Sebastian Ochmann wrote:

Hello,

I have a btrfs filesystem on a single encrypted (LUKS) 10 TB drive which
stopped working correctly. The drive is used as a backup drive with zstd
compression to which I regularly rsync and make daily snapshots. After I
routinely removed a bunch of snapshots (about 20), I noticed later that
the machine would hang when trying to unmount the filesystem. The
current state is that I'm able to mount the filesystem without errors
and I can view (ls) files in the root level, but trying to view contents
of directories contained therein hangs just like when trying to unmount
the filesystem. I have not yet tried to run check, repair, etc. Do you
have any advice what I should try next?


Could you please run "btrfs check" on the unmounted fs?


I ran btrfs check on the unmounted fs and it reported no errors.
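
(For completeness, the check boils down to something like the following;
the mapper name is just a placeholder for the opened LUKS device:)

# umount /mnt/backup
# btrfs check /dev/mapper/backup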



A notable hardware change I did a few days before the problem is a
switch from an Intel Xeon platform to AMD Threadripper. However, I
haven't seen problems with the rest of the btrfs filesystems (in
particular, a RAID-1 consisting of three HDDs), which I also migrated to
the new platform, yet. I just want to mention it in case there are known
issues in that direction.

Kernel 4.18.16 (Arch Linux)
btrfs-progs 4.17.1

Kernel log after trying to "ls" a directory contained in the
filesystem's root directory:

[   79.279349] BTRFS info (device dm-5): use zstd compression, level 0
[   79.279351] BTRFS info (device dm-5): disk space caching is enabled
[   79.279352] BTRFS info (device dm-5): has skinny extents
[  135.202344] kauditd_printk_skb: 2 callbacks suppressed
[  135.202347] audit: type=1130 audit(1541335770.667:45): pid=1 uid=0
auid=4294967295 ses=4294967295 msg='unit=polkit comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  135.364850] audit: type=1130 audit(1541335770.831:46): pid=1 uid=0
auid=4294967295 ses=4294967295 msg='unit=udisks2 comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  135.589255] audit: type=1130 audit(1541335771.054:47): pid=1 uid=0
auid=4294967295 ses=4294967295 msg='unit=rtkit-daemon comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  368.266653] INFO: task kworker/u256:1:728 blocked for more than 120
seconds.
[  368.266657]   Tainted: P   OE 4.18.16-arch1-1-ARCH #1
[  368.266658] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  368.20] kworker/u256:1  D    0   728  2 0x8080
[  368.266680] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
[btrfs]
[  368.266681] Call Trace:
[  368.266687]  ? __schedule+0x29b/0x8b0
[  368.266690]  ? preempt_count_add+0x68/0xa0
[  368.266692]  schedule+0x32/0x90
[  368.266707]  btrfs_tree_read_lock+0x7d/0x110 [btrfs]
[  368.266710]  ? wait_woken+0x80/0x80
[  368.266719]  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
[  368.266729]  btrfs_search_slot+0xf6/0xa00 [btrfs]
[  368.266732]  ? _raw_spin_unlock+0x16/0x30
[  368.266734]  ? inode_insert5+0x105/0x1a0
[  368.266746]  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
[  368.266749]  ? kmem_cache_alloc+0x179/0x1d0
[  368.266762]  btrfs_iget+0x113/0x690 [btrfs]
[  368.266764]  ? _raw_spin_unlock+0x16/0x30
[  368.266778]  __lookup_free_space_inode+0xd8/0x150 [btrfs]
[  368.266792]  lookup_free_space_inode+0x63/0xc0 [btrfs]
[  368.266806]  load_free_space_cache+0x6e/0x190 [btrfs]
[  368.266808]  ? kmem_cache_alloc_trace+0x181/0x1d0
[  368.266817]  ? cache_block_group+0x73/0x3e0 [btrfs]
[  368.266827]  cache_block_group+0x1c1/0x3e0 [btrfs]


This thread is trying to get the tree root lock to create the free space
cache, while someone else has already locked the tree root.


[  368.266829]  ? wait_woken+0x80/0x80
[  368.266839]  find_free_extent+0x872/0x10e0 [btrfs]
[  368.266851]  btrfs_reserve_extent+0x9b/0x180 [btrfs]
[  368.266862]  btrfs_alloc_tree_block+0x1b3/0x4d0 [btrfs]
[  368.266872]  __btrfs_cow_block+0x11d/0x500 [btrfs]
[  368.266882]  btrfs_cow_block+0xdc/0x1a0 [btrfs]
[  368.266891]  btrfs_search_slot+0x282/0xa00 [btrfs]
[  368.266893]  ? _raw_spin_unlock+0x16/0x30
[  368.266903]  btrfs_insert_empty_items+0x67/0xc0 [btrfs]
[  368.266913]  __btrfs_run_delayed_refs+0x8ef/0x10a0 [btrfs]
[  368.266915]  ? preempt_count_add+0x68/0xa0
[  368.266926]  btrfs_run_delayed_refs+0x72/0x180 [btrfs]
[  368.266937]  delayed_ref_async_start+0x81/0x90 [btrfs]
[  368.266950]  normal_work_helper+0xbd/0x350 [btrfs]
[  368.266953]  process_one_work+0x1eb/0x3c0
[  368.266955]  worker_thread+0x2d/0x3d0
[  368.266956]  ? process_one_work+0x3c0/0x3c0
[  368.266958]  kthread+0x112/0x130
[  368.266960]  ? kthread_flush_work_fn+0x10/0x10
[  368.266961]  ret_from_fork+0x22/0x40
[  368.266978] INFO: task btrfs-cleaner:1196 blocked for more than 120

Filesystem mounts fine but hangs on access

2018-11-04 Thread Sebastian Ochmann

Hello,

I have a btrfs filesystem on a single encrypted (LUKS) 10 TB drive which 
stopped working correctly. The drive is used as a backup drive with zstd 
compression to which I regularly rsync and make daily snapshots. After I 
routinely removed a bunch of snapshots (about 20), I noticed later that 
the machine would hang when trying to unmount the filesystem. The 
current state is that I'm able to mount the filesystem without errors 
and I can view (ls) files in the root level, but trying to view contents 
of directories contained therein hangs just like when trying to unmount 
the filesystem. I have not yet tried to run check, repair, etc. Do you 
have any advice what I should try next?


A notable hardware change I did a few days before the problem is a 
switch from an Intel Xeon platform to AMD Threadripper. However, I 
haven't seen problems with the rest of the btrfs filesystems (in 
particular, a RAID-1 consisting of three HDDs), which I also migrated to 
the new platform, yet. I just want to mention it in case there are known 
issues in that direction.


Kernel 4.18.16 (Arch Linux)
btrfs-progs 4.17.1

Kernel log after trying to "ls" a directory contained in the 
filesystem's root directory:


[   79.279349] BTRFS info (device dm-5): use zstd compression, level 0
[   79.279351] BTRFS info (device dm-5): disk space caching is enabled
[   79.279352] BTRFS info (device dm-5): has skinny extents
[  135.202344] kauditd_printk_skb: 2 callbacks suppressed
[  135.202347] audit: type=1130 audit(1541335770.667:45): pid=1 uid=0 
auid=4294967295 ses=4294967295 msg='unit=polkit comm="systemd" 
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  135.364850] audit: type=1130 audit(1541335770.831:46): pid=1 uid=0 
auid=4294967295 ses=4294967295 msg='unit=udisks2 comm="systemd" 
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  135.589255] audit: type=1130 audit(1541335771.054:47): pid=1 uid=0 
auid=4294967295 ses=4294967295 msg='unit=rtkit-daemon comm="systemd" 
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  368.266653] INFO: task kworker/u256:1:728 blocked for more than 120 
seconds.

[  368.266657]   Tainted: P   OE 4.18.16-arch1-1-ARCH #1
[  368.266658] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

[  368.20] kworker/u256:1  D0   728  2 0x8080
[  368.266680] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[  368.266681] Call Trace:
[  368.266687]  ? __schedule+0x29b/0x8b0
[  368.266690]  ? preempt_count_add+0x68/0xa0
[  368.266692]  schedule+0x32/0x90
[  368.266707]  btrfs_tree_read_lock+0x7d/0x110 [btrfs]
[  368.266710]  ? wait_woken+0x80/0x80
[  368.266719]  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
[  368.266729]  btrfs_search_slot+0xf6/0xa00 [btrfs]
[  368.266732]  ? _raw_spin_unlock+0x16/0x30
[  368.266734]  ? inode_insert5+0x105/0x1a0
[  368.266746]  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
[  368.266749]  ? kmem_cache_alloc+0x179/0x1d0
[  368.266762]  btrfs_iget+0x113/0x690 [btrfs]
[  368.266764]  ? _raw_spin_unlock+0x16/0x30
[  368.266778]  __lookup_free_space_inode+0xd8/0x150 [btrfs]
[  368.266792]  lookup_free_space_inode+0x63/0xc0 [btrfs]
[  368.266806]  load_free_space_cache+0x6e/0x190 [btrfs]
[  368.266808]  ? kmem_cache_alloc_trace+0x181/0x1d0
[  368.266817]  ? cache_block_group+0x73/0x3e0 [btrfs]
[  368.266827]  cache_block_group+0x1c1/0x3e0 [btrfs]
[  368.266829]  ? wait_woken+0x80/0x80
[  368.266839]  find_free_extent+0x872/0x10e0 [btrfs]
[  368.266851]  btrfs_reserve_extent+0x9b/0x180 [btrfs]
[  368.266862]  btrfs_alloc_tree_block+0x1b3/0x4d0 [btrfs]
[  368.266872]  __btrfs_cow_block+0x11d/0x500 [btrfs]
[  368.266882]  btrfs_cow_block+0xdc/0x1a0 [btrfs]
[  368.266891]  btrfs_search_slot+0x282/0xa00 [btrfs]
[  368.266893]  ? _raw_spin_unlock+0x16/0x30
[  368.266903]  btrfs_insert_empty_items+0x67/0xc0 [btrfs]
[  368.266913]  __btrfs_run_delayed_refs+0x8ef/0x10a0 [btrfs]
[  368.266915]  ? preempt_count_add+0x68/0xa0
[  368.266926]  btrfs_run_delayed_refs+0x72/0x180 [btrfs]
[  368.266937]  delayed_ref_async_start+0x81/0x90 [btrfs]
[  368.266950]  normal_work_helper+0xbd/0x350 [btrfs]
[  368.266953]  process_one_work+0x1eb/0x3c0
[  368.266955]  worker_thread+0x2d/0x3d0
[  368.266956]  ? process_one_work+0x3c0/0x3c0
[  368.266958]  kthread+0x112/0x130
[  368.266960]  ? kthread_flush_work_fn+0x10/0x10
[  368.266961]  ret_from_fork+0x22/0x40
[  368.266978] INFO: task btrfs-cleaner:1196 blocked for more than 120 
seconds.

[  368.266980]   Tainted: P   OE 4.18.16-arch1-1-ARCH #1
[  368.266981] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

[  368.266982] btrfs-cleaner   D0  1196  2 0x8080
[  368.266983] Call Trace:
[  368.266985]  ? __schedule+0x29b/0x8b0
[  368.266987]  schedule+0x32/0x90
[  368.266997]  cache_block_group+0x148/0x3e0 [btrfs]
[  368.266998]  ? wait_woken+0x80/0x80
[  

Re: Periodic frame losses when recording to btrfs volume with OBS

2018-01-22 Thread Sebastian Ochmann

Hello,

I attached to the ffmpeg-mux process for a little while and pasted the 
result here:


https://pastebin.com/XHaMLX8z

Can you help me with interpreting this result? If you'd like me to run 
strace with specific options, please let me know. This is a level of 
debugging I'm not dealing with on a daily basis. :)


Best regards
Sebastian


On 22.01.2018 20:08, Chris Mason wrote:

On 01/22/2018 01:33 PM, Sebastian Ochmann wrote:

[ skipping to the traces ;) ]


2866 ffmpeg-mux D
[] btrfs_start_ordered_extent+0x101/0x130 [btrfs]
[] lock_and_cleanup_extent_if_need+0x340/0x380 [btrfs]
[] __btrfs_buffered_write+0x261/0x740 [btrfs]
[] btrfs_file_write_iter+0x20f/0x650 [btrfs]
[] __vfs_write+0xf9/0x170
[] vfs_write+0xad/0x1a0
[] SyS_write+0x52/0xc0
[] entry_SYSCALL_64_fastpath+0x1a/0x7d
[] 0x


This is where we wait for writes that are already in flight before we're 
allowed to redirty those pages in the file.  It'll happen when we either 
overwrite a page in the file that we've already written, or when we're 
trickling down writes slowly in non-4K aligned writes.


You can probably figure out pretty quickly which is the case by stracing 
ffmpeg-mux.  Since lower dirty ratios made it happen more often for you, 
my guess is the app is sending down unaligned writes.


-chris
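
For anyone wanting to check the same thing, a minimal way to watch the
write sizes is something along these lines (a sketch; it assumes a
single ffmpeg-mux process that pgrep can find):

$ strace -f -e trace=write -p $(pgrep -f ffmpeg-mux)

Each write(fd, buf, count) line then shows the count, which makes it
easy to spot whether the writes are 4K multiples or odd sizes.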





Re: Periodic frame losses when recording to btrfs volume with OBS

2018-01-22 Thread Sebastian Ochmann
First off, thank you for all the responses! Let me reply to multiple 
suggestions at once in this mail.


On 22.01.2018 01:39, Qu Wenruo wrote:

Either such a mount option has a bug, or it's some unrelated problem.

As you mentioned the output is about 10~50MiB/s, so 30s means 300~1500MiB.
Maybe it's related to the dirty data amount?

Would you please verify whether a lower or higher profile (resulting in a
much larger or smaller data stream) affects this?


A much lower rate seems to mitigate the problem somewhat, however I'm 
talking about low single-digit MB/s when the problem seems to vanish. 
But even with low, but more realistic, amounts of data the drops still 
happen.



Despite that, I'll dig to see if commit= option has any bug.

And you could also try the nospace_cache mount option provided by Chris
Murphy, which may also help.


I tried the nospace_cache option but it doesn't seem to make a 
difference to me.


On 22.01.2018 15:27, Chris Mason wrote:
> This could be a few different things, trying without the space cache was
> already suggested, and that's a top suspect.
>
> How does the application do the writes?  Are they always 4K aligned or
> does it send them out in odd sizes?
>
> The easiest way to nail it down is to use offcputime from the iovisor
> project:
>
>
> https://github.com/iovisor/bcc/blob/master/tools/offcputime.py
>
> If you haven't already configured this it may take a little while, but
> it's the perfect tool for this problem.
>
> Otherwise, if the stalls are long enough you can try to catch it with
> /proc/<pid>/stack.  I've attached a helper script I often use to dump
> the stack trace of all the tasks in D state.
>
> Just run walker.py and it should give you something useful.  You can use
> walker.py -a to see all the tasks instead of just D state.  This just
> walks /proc/<pid>/stack, so you'll need to run it as someone with
> permissions to see the stack traces of the procs you care about.
>
> -chris

I tried the walker.py script and was able to catch stack traces when the 
lag happens. I'm pasting two traces at the end of this mail - one when 
it happened using a USB-connected HDD and one when it happened on a SATA 
SSD. The latter is encrypted, hence the dmcrypt_write process. Note 
however that my original problem appeared on a SSD that was not encrypted.


In reply to the mail by Duncan:

64 GB RAM...

Do you know about the /proc/sys/vm/dirty_* files and how to use/tweak 
them?  If not, read $KERNDIR/Documentation/sysctl/vm.txt, focusing on 
these files.


At least I have never tweaked those settings yet. I certainly didn't 
know about the foreground/background distinction, that is really 
interesting. Thank you for the very extensive info and guide btw!


So try setting something a bit more reasonable and see if it helps.  That 
1% ratio at 16 GiB RAM for ~160 MB was fine for me, but I'm not doing 
critical streaming, and at 64 GiB you're looking at ~640 MB per 1%, which, 
as I said, is too chunky.  For streaming, I'd suggest something 
approaching your per-second IO bandwidth for the background value: we're 
assuming 100 MB/sec here, so 100 MiB, but let's round that up to a nice 
binary 128 MiB.  For foreground, use perhaps half a GiB, about 5 seconds 
worth of writeback time, or 4 times the background value.  So:


vm.dirty_background_bytes = 134217728   # 128*1024*1024, 128 MiB
vm.dirty_bytes = 536870912  # 512*1024*1024, 512 MiB
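
(For reference, values like these can be applied at runtime with sysctl
and persisted via a sysctl.d drop-in; the file name below is arbitrary:)

# sysctl -w vm.dirty_background_bytes=134217728
# sysctl -w vm.dirty_bytes=536870912
# printf 'vm.dirty_background_bytes = 134217728\nvm.dirty_bytes = 536870912\n' \
    > /etc/sysctl.d/90-dirty.conf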


Now I have good and bad news. The good news is that setting these 
tunables to different values does change something. The bad news is that 
lowering these values only seems to make the lag and frame drops happen 
sooner and more frequently. I have also tried lowering the background 
bytes to, say, 128 MB while setting the non-background (foreground) bytes 
to 1 or 2 GB, but even the background task alone seems to have a bad 
enough effect to start dropping frames. :( When writing to the SSD, the 
effect seems to be mitigated a little bit, but frame drops still occur 
quickly, which is unacceptable given that the system is generally able to 
do better.


By the way, as you can see from the stack traces, in the SSD case blk_mq 
is in use.


But I know less about that stuff and it's googlable, should you decide to 
try playing with it too.  I know what the dirty_* stuff does from 
personal experience. =:^)


"I know what the dirty_* stuff does from personal experience. =:^)" 
sounds quite interesting... :D



Best regards and thanks again
Sebastian


First stack trace:

690 usb-storage D
[] usb_sg_wait+0xf4/0x150 [usbcore]
[] usb_stor_bulk_transfer_sglist.part.1+0x63/0xb0 
[usb_storage]

[] usb_stor_bulk_srb+0x49/0x80 [usb_storage]
[] usb_stor_Bulk_transport+0x163/0x3d0 [usb_storage]
[] usb_stor_invoke_transport+0x37/0x4c0 [usb_storage]
[] usb_stor_control_thread+0x1d8/0x2c0 [usb_storage]
[] kthread+0x118/0x130
[] ret_from_fork+0x1f/0x30
[] 0x

2505 kworker/u16:2 D
[] io_schedule+0x12/0x40
[] wbt_wait+0x1b8/0x340
[] blk_mq_make_request+0xe6/0x6e0
[] 

Re: Periodic frame losses when recording to btrfs volume with OBS

2018-01-21 Thread Sebastian Ochmann

On 21.01.2018 23:04, Chris Murphy wrote:

On Sun, Jan 21, 2018 at 8:27 AM, Sebastian Ochmann
<ochm...@cs.uni-bonn.de> wrote:

On 21.01.2018 11:04, Qu Wenruo wrote:



The output of "mount" after setting 10 seconds commit interval:

/dev/sdc1 on /mnt/rec type btrfs
(rw,relatime,space_cache,commit=10,subvolid=5,subvol=/)


I wonder if it gets stuck updating v1 space cache. Instead of trying
v2, you could try nospace_cache mount option and see if there's a
change in behavior.


I tried disabling space_cache, also on a newly formatted volume when 
first mounting it. However, it doesn't seem to make a difference. 
Basically the same lags in the same interval, sorry.


Best regards
Sebastian


Re: Periodic frame losses when recording to btrfs volume with OBS

2018-01-21 Thread Sebastian Ochmann

On 21.01.2018 11:04, Qu Wenruo wrote:



On 2018-01-20 18:47, Sebastian Ochmann wrote:

Hello,

I would like to describe a real-world use case where btrfs does not
perform well for me. I'm recording 60 fps, larger-than-1080p video using
OBS Studio [1] where it is important that the video stream is encoded
and written out to disk in real-time for a prolonged period of time (2-5
hours). The result is a H264 video encoded on the GPU with a data rate
ranging from approximately 10-50 MB/s.




The hardware used is powerful enough to handle this task. When I use a
XFS volume for recording, no matter whether it's a SSD or HDD, the
recording is smooth and no frame drops are reported (OBS has a nice
Stats window where it shows the number of frames dropped due to encoding
lag which seemingly also includes writing the data out to disk).

However, when using a btrfs volume I quickly observe severe, periodic
frame drops. It's not single frames but larger chunks of frames that are
dropped at a time. I tried mounting the volume with nobarrier but to no
avail.


What's the drop interval? Something near 30s?
If so, try the mount option commit=300 to see if it helps.


Thank you for your reply. I observed the interval more closely and it 
shows that the first, quite small drop occurs about 10 seconds after 
starting the recording (some initial metadata being written?). After 
that, the interval is indeed about 30 seconds with large drops each time.


Thus I tried setting the commit option to different values. I confirmed 
that the setting was activated by looking at the options "mount" shows 
(see below). However, no matter whether I set the commit interval to 
300, 60 or 10 seconds, the results were always similar. About every 30 
seconds the drive shows activity for a few seconds and the drop occurs 
shortly thereafter. It almost seems like the commit setting doesn't have 
any effect. By the way, the machine I'm currently testing on has 64 GB 
of RAM so it should have plenty of room for caching.
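
(For reference, changing and checking the interval on the mounted volume
amounts to something like this; the mountpoint is the one from my setup:)

# mount -o remount,commit=10 /mnt/rec
$ findmnt -no OPTIONS /mnt/rec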




Of course, the simple fix is to use a FS that works for me(TM). However
I thought since this is a common real-world use case I'd describe the
symptoms here in case anyone is interested in analyzing this behavior.
It's not immediately obvious that the FS makes such a difference. Also,
if anyone has an idea what I could try to mitigate this issue (mount or
mkfs options?) I can try that.


Mkfs options can help, but only marginally AFAIK.

You could try mkfs with -n 4K (minimal supported nodesize), to reduce
the tree lock critical region by a little, at the cost of more metadata
fragmentation.

And are there any special features enabled, like quota?
Or a scheduled balance running in the background?
Both are known to dramatically impact the performance of transaction
commits, so it's recommended to disable quota/scheduled balance first.


Another recommendation is to use the nodatacow mount option to reduce the
CoW metadata overhead, but I have doubts about its effectiveness.


I tried the -n 4K and nodatacow options, but they don't seem to make a 
big difference, if any. No quota or auto-balance is active. It's 
basically using Arch Linux default options.
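
(For completeness, the variants I tried looked roughly like the
following; the device name is simply the one from my setup:)

# mkfs.btrfs -f -n 4096 /dev/sdc1
# mount -o nodatacow,commit=10 /dev/sdc1 /mnt/rec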


The output of "mount" after setting 10 seconds commit interval:

/dev/sdc1 on /mnt/rec type btrfs 
(rw,relatime,space_cache,commit=10,subvolid=5,subvol=/)


Also tried noatime, but didn't make a difference either.

Best regards
Sebastian


Thanks,
Qu

I saw this behavior on two different machines with kernels 4.14.13 and
4.14.5, both Arch Linux. btrfs-progs 4.14, OBS 20.1.3-241-gf5c3af1b
built from git.

Best regards
Sebastian

[1] https://github.com/jp9000/obs-studio





Periodic frame losses when recording to btrfs volume with OBS

2018-01-20 Thread Sebastian Ochmann

Hello,

I would like to describe a real-world use case where btrfs does not 
perform well for me. I'm recording 60 fps, larger-than-1080p video using 
OBS Studio [1] where it is important that the video stream is encoded 
and written out to disk in real-time for a prolonged period of time (2-5 
hours). The result is a H264 video encoded on the GPU with a data rate 
ranging from approximately 10-50 MB/s.


The hardware used is powerful enough to handle this task. When I use a 
XFS volume for recording, no matter whether it's a SSD or HDD, the 
recording is smooth and no frame drops are reported (OBS has a nice 
Stats window where it shows the number of frames dropped due to encoding 
lag which seemingly also includes writing the data out to disk).


However, when using a btrfs volume I quickly observe severe, periodic 
frame drops. It's not single frames but larger chunks of frames that are 
dropped at a time. I tried mounting the volume with nobarrier but to no 
avail.


Of course, the simple fix is to use a FS that works for me(TM). However 
I thought since this is a common real-world use case I'd describe the 
symptoms here in case anyone is interested in analyzing this behavior. 
It's not immediately obvious that the FS makes such a difference. Also, 
if anyone has an idea what I could try to mitigate this issue (mount or 
mkfs options?) I can try that.


I saw this behavior on two different machines with kernels 4.14.13 and 
4.14.5, both Arch Linux. btrfs-progs 4.14, OBS 20.1.3-241-gf5c3af1b 
built from git.


Best regards
Sebastian

[1] https://github.com/jp9000/obs-studio


Re: btrfs-freespace ever doing anything?

2017-07-31 Thread Sebastian Ochmann

On 31.07.2017 14:08, Austin S. Hemmelgarn wrote:

On 2017-07-31 06:51, Sebastian Ochmann wrote:

Hello,

I have a quite simple and possibly stupid question. Since I'm 
occasionally seeing warnings about failed loading of free space cache, 
I wanted to clear and rebuild space cache. So I mounted the 
filesystem(s) with -o clear_cache and subsequently with my regular 
options which includes space_cache. Indeed, dmesg tells me:


[   60.285190] BTRFS info (device dm-1): force clearing of disk cache

and then

[  137.151845] BTRFS info (device dm-1): use ssd allocation scheme
[  137.151850] BTRFS info (device dm-1): disk space caching is enabled
[  137.151852] BTRFS info (device dm-1): has skinny extents

To my understanding, btrfs-freespace should then start working to 
rebuild the free space cache. However, I can't remember that I have 
ever seen btrfs work hard after clearing the space cache. The drives 
aren't working much, and the btrfs-freespace processes (which are 
indeed there) don't do anything either.


So simple question: Can anyone try to clear their space cache and 
confirm that btrfs actually does something after doing so? Is there 
anything I could do to confirm that something is happening?
Based on my (limited) understanding of that code, assuming you're using 
the original free space cache (which I think is the case, since you said 
you're using the regular 'space_cache' option instead of 
'space_cache=v2'), there's not _much_ work that needs to be done unless 
free space is heavily fragmented and the disk is reasonably full.  The 
original free space cache is pretty similar to an allocation bitmap, and 
computing that is not hard to do (you just figure out which blocks are 
actually used).


Based on my own experience, you'll see almost zero activity most of the 
time when rebuilding the free space cache regardless of which you are 
using (the original, or the new version), although the newer free space 
tree code appears to do a bit more work.


Ah, that's interesting, I have to admit I wasn't even aware of 
space_cache v2. That said, the btrfs wiki doesn't mention it on the 
mount options page; it's only mentioned at the bottom of the Status 
page, where clearing the space cache for v2 using "btrfs check" is 
explained.


The man page of "btrfs-check" states something interesting regarding the 
clearing of space cache using the "clear_cache" mount option when using v1:


"For free space cache v1, the clear_cache kernel mount option only 
rebuilds the free space cache for block groups that are modified while 
the filesystem is mounted with that option."


So "clear_cache" is, to my understanding, pretty much a misnomer. Only 
for v2, it actually clears the whole cache. I now used "btrfs check 
--clear-space-cache v1" on one of the devices and it took a while to 
clear the cache (way longer than when using the clear_cache mount 
option) (rebuilding still seems to be quick though).
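
(For anyone wanting to do the same, the full invocation on the unmounted
device is along these lines; the device name is just a placeholder:)

# umount /mnt/data
# btrfs check --clear-space-cache v1 /dev/sdb1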


The explanation of the space_cache and clear_cache options in the wiki 
should be updated. The mount options page doesn't mention space_cache v2, 
and the clear_cache option supposedly clears "all the free space caches" 
according to the wiki, which contradicts the btrfs-check manpage.


The drives in question are a SSD and a HDD, both in the range of 1-2 
TB in size.


I'm on Arch Linux, kernel 4.12.3, btrfs-progs 4.11.1





btrfs-freespace ever doing anything?

2017-07-31 Thread Sebastian Ochmann

Hello,

I have a quite simple and possibly stupid question. Since I'm 
occasionally seeing warnings about failed loading of free space cache, I 
wanted to clear and rebuild space cache. So I mounted the filesystem(s) 
with -o clear_cache and subsequently with my regular options which 
includes space_cache. Indeed, dmesg tells me:


[   60.285190] BTRFS info (device dm-1): force clearing of disk cache

and then

[  137.151845] BTRFS info (device dm-1): use ssd allocation scheme
[  137.151850] BTRFS info (device dm-1): disk space caching is enabled
[  137.151852] BTRFS info (device dm-1): has skinny extents

To my understanding, btrfs-freespace should then start working to 
rebuild the free space cache. However, I can't remember that I have ever 
seen btrfs work hard after clearing the space cache. The drives aren't 
working much, and the btrfs-freespace processes (which are indeed there) 
don't do anything either.


So simple question: Can anyone try to clear their space cache and 
confirm that btrfs actually does something after doing so? Is there 
anything I could do to confirm that something is happening?


The drives in question are a SSD and a HDD, both in the range of 1-2 TB 
in size.


I'm on Arch Linux, kernel 4.12.3, btrfs-progs 4.11.1

Best regards
Sebastian


Mounting RAID1 degraded+rw only works once

2016-03-19 Thread Sebastian Ochmann

Hello,

I'm trying to understand how to correctly handle and recover from 
degraded RAID1 setups in btrfs. In particular, I don't understand a 
behavior I'm seeing which, for me, takes away part of the advantage of 
having a RAID in the first place.


The main issue I have is as follows. I can mount a RAID1 with missing 
devices using the "degraded" option. So far so good. I know I should add 
another device at that point, but let's say I don't have a device ready 
but need to keep my system running. The problem is that once I write 
some data to the device and unmount it, I cannot mount it degraded+rw 
again but only degraded+ro. And at that point I can't make changes such 
as adding new devices and rebalancing as far as I see, rendering the 
degraded RAID useless for me for keeping the system running.


Example:

- Create RAID1 (data+metadata) with 2 devices

- mount the filesystem

- btrfs fi df /something:

Data, RAID1: total=112.00MiB, used=40.76MiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=32.00MiB, used=768.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

- unmount

- Destroy one of the devices (dd over it or whatever)

- mount with "-o degraded"

- Write some data to the device (not sure if this is strictly necessary).

- unmount the filesystem

- Try to mount with "-o degraded" again. It doesn't allow you to do so 
(if it does, try to unmount and mount again, sometimes it needs two 
tries). dmesg says "missing devices(1) exceeds the limit(0), writable 
mount is not allowed"


- mounting with "-o degraded,ro" works.

- btrfs fi df [dev]:

Data, RAID1: total=112.00MiB, used=40.76MiB
System, RAID1: total=8.00MiB, used=16.00KiB
System, single: total=8.00MiB, used=0.00B
Metadata, RAID1: total=32.00MiB, used=768.00KiB
Metadata, single: total=28.00MiB, used=80.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

Now there is some RAID1 and some single data, which could make sense 
since writes in degraded mode could only go to the single remaining 
device. Still, I cannot make changes to the filesystem now and can 
"only" recover the data from it, but that's not really the idea of a 
RAID1 in my opinion.
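
For context, what I would have expected to be able to do from a second
writable degraded mount is roughly the following (device names are
placeholders): add a replacement device and convert everything back to
RAID1.

# mount -o degraded /dev/sdx1 /mnt
# btrfs device add /dev/sdy1 /mnt
# btrfs device delete missing /mnt
# btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt

It's exactly this path that the ro-only second mount blocks.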


Any advice?

Versions:
- Kernel 4.4.5-1-ARCH
- btrfs-progs 4.4.1

Best regards
Sebastian


Re: Is it safe to mount subvolumes of already-mounted volumes (even with different options)?

2014-07-17 Thread Sebastian Ochmann

Hello,

I need to clarify, I'm _not_ sharing a drive between multiple computers 
at the _same_ time. It's a portable device which I use at different 
locations with different computers. I just wanted to give a rationale 
for mounting the whole drive to some mountpoint and then also part of 
that drive (a subvolume) to the respective computer's /home mountpoint. 
So it's controlled by the same kernel in the same computer, it's just 
that part of the filesystem is mounted at multiple mountpoints, much 
like a bind-mount, but I'm interested in mounting a subvolume of the 
already-mounted volume to some other mountpoint. Sorry for the confusion.


Best regards
Sebastian


On 17.07.2014 01:18, Chris Murphy wrote:


On Jul 16, 2014, at 4:18 PM, Sebastian Ochmann ochm...@informatik.uni-bonn.de 
wrote:


Hello,

I'm sharing a btrfs-formatted drive between multiple computers and each of the 
machines has a separate home directory on that drive.


2+ computers writing to the same block device? I don't see how this is safe. 
Seems possibly a bug that the 1st mount event isn't setting some metadata so 
that another kernel instance knows not to allow another mount.


Chris Murphy




Re: Why does btrfs defrag work worse than making a copy of a file?

2014-07-16 Thread Sebastian Ochmann

On 16.07.2014 09:53, Liu Bo wrote:

On Tue, Jul 15, 2014 at 11:17:26PM +0200, Sebastian Ochmann wrote:

Hello,

I have a VirtualBox hard drive image which is quite fragmented even
after very light use; it is 1.6 GB in size and has around 5000
fragments (I'm using filefrag to determine the number of
fragments). Doing a btrfs fi defrag -f image.vdi reduced the
number of fragments to 3749. Even doing a btrfs fi defrag -f -t 1
image.vdi which should make sure every extent is rewritten
(according to the btrfs-progs 3.14.2 manpage) does not yield any
better result and seems to return immediately. Copying the file,
however, yields a copy which has only 5 fragments (simply doing a cp
image.vdi image2.vdi; sync; filefrag image2.vdi).

What do I have to do to defrag the file to the minimal number of
fragments possible? Am I missing something?


So usually btrfs thinks of an extent whose size is bigger than 256K as a big
enough extent.

Another possible reason is that there is something wrong with btrfs_fiemap,
which gives 'filefrag' wrong output.

Would you please show us the 'filefrag -v' output?


Sure, I have pasted the output of filefrag -v here:

http://pastebin.com/kcZhVhkc

However, I think the problem is merely in the documentation (the manpage 
of btrfs-filesystem). The description of the -t option differs between 
two locations and doesn't quite make sense in general. It is first 
described as follows:


Any extent bigger than threshold given by -t option, will be considered 
already defragged. Use 0 to take the kernel default, and use 1 to say 
every single extent must be rewritten.


So I used -t 1 because I thought it would defrag as much as possible. 
However, when thinking about it, any extent at least 1 byte (or 2 bytes?) 
in size will be ignored this way, am I correct?


Further below, the -t option is described as follows:

-t size  defragment only files at least size bytes big

Here, the option suddenly refers to the file size. In any case, doing a 
btrfs fi defrag -f -t 10G image.vdi defragged my file to the 5 extents 
I also get by simply copying the file. I think the documentation should 
be updated to reflect what the -t option actually does.


Best regards
Sebastian



thanks,
-liubo



Kernel version 3.15.5, btrfs progs 3.14.2, Arch Linux.

Best regards,
Sebastian


Is it safe to mount subvolumes of already-mounted volumes (even with different options)?

2014-07-16 Thread Sebastian Ochmann

Hello,

I'm sharing a btrfs-formatted drive between multiple computers and each 
of the machines has a separate home directory on that drive. The root of 
the drive is mounted at /mnt/tray and the home directory for machine 
{hostname} is under /mnt/tray/Homes/{hostname}. Up until now, I have 
mounted /mnt/tray like a normal volume and then did an additional 
bind-mount of /mnt/tray/Homes/{hostname} to /home.


Now I have a new drive and wanted to do things a bit more advanced by 
creating subvolumes for each of the machines' home directories so that I 
can also do independent snapshotting. I guess I could use the bind-mount 
method like before but my question is if it is considered safe to do an 
additional, regular mount of one of the subvolumes to /home instead, like


mount /dev/sdxN /mnt/tray
mount -o subvol=/Homes/{hostname} /dev/sdxN /home

When I experimented with such additional mounts of subvolumes of 
already-mounted volumes, I noticed that the mount options of the 
additional subvolume mount might differ from the original mount. For 
instance, the root volume might be mounted with noatime while the 
subvolume mount may have relatime.


So my questions are: Is mounting a subvolume of an already mounted 
volume considered safe and are there any combinations of possibly 
conflicting mount options one should be aware of (compression, 
autodefrag, cache clearing)? Is it advisable to use the same mount 
options for all mounts pointing to the same physical device?
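
For concreteness, the kind of fstab entries I have in mind would look
roughly like this (UUID and hostname are placeholders), with the options
deliberately kept identical for both mounts:

UUID=xxxxxxxx  /mnt/tray  btrfs  defaults,noatime                        0  0
UUID=xxxxxxxx  /home      btrfs  defaults,noatime,subvol=/Homes/myhost   0  0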


Best regards,
Sebastian


Why does btrfs defrag work worse than making a copy of a file?

2014-07-15 Thread Sebastian Ochmann

Hello,

I have a VirtualBox hard drive image which is quite fragmented even 
after very light use; it is 1.6 GB in size and has around 5000 fragments 
(I'm using filefrag to determine the number of fragments). Doing a 
btrfs fi defrag -f image.vdi reduced the number of fragments to 3749. 
Even doing a btrfs fi defrag -f -t 1 image.vdi which should make sure 
every extent is rewritten (according to the btrfs-progs 3.14.2 manpage) 
does not yield any better result and seems to return immediately. 
Copying the file, however, yields a copy which has only 5 fragments 
(simply doing a cp image.vdi image2.vdi; sync; filefrag image2.vdi).


What do I have to do to defrag the file to the minimal number of 
fragments possible? Am I missing something?


Kernel version 3.15.5, btrfs progs 3.14.2, Arch Linux.

Best regards,
Sebastian


Re: Meaning of "no_csum" field when scrubbing with -R option

2014-02-20 Thread Sebastian Ochmann

Hello,


Sebastian Ochmann posted on Wed, 19 Feb 2014 13:58:17 +0100 as excerpted:


So my question is, why does scrub show a high (i.e. non-zero) value for
no_csum? I never enabled nodatasum or a similar option.

Did you enable the nodatacow option? If nodatacow is enabled,
data checksums will also be disabled at the same time.


No, never, not even on single files. Some additional info: The 
filesystem is only a few weeks old (even though I see similar results on 
an older filesystem as well), it's my root filesystem, and as mount 
options I use rw,noatime,ssd,discard,space_cache (it's on a SSD). 
Kernel version is 3.12.9.
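
(For what it's worth, one way to double-check whether any files were
created with the NOCOW attribute, which also disables data checksums, is
lsattr; a sketch, the path is arbitrary:)

$ lsattr -R /some/directory | grep C

The grep is crude (it also matches any path containing a capital C), but
files showing 'C' in the attribute field are nodatacow and therefore have
no checksums for scrub to verify.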


Best regards,
Sebastian


Meaning of no_csum field when scrubbing with -R option

2014-02-19 Thread Sebastian Ochmann

Hello everyone,

I have a question: What exactly does the value for no_csum mean when 
doing a scrub with the -R option? Example output:



 sudo btrfs scrub start -BR /

scrub done for ...
  ...
  csum_errors: 0
  verify_errors: 0
  no_csum: 70517
  csum_discards: 87381
  super_errors: 0
  ...


In the btrfs header, I found the following comment for the no_csum 
field of the btrfs_scrub_progress struct:



# of 4k data block for which no csum is present, probably the result of 
data written with nodatasum



So my question is, why does scrub show a high (i.e. non-zero) value for 
no_csum? I never enabled nodatasum or a similar option.


Best regards
Sebastian


Re: [PATCH v4 1/2] Btrfs: fix wrong super generation mismatch when scrubbing supers

2013-12-04 Thread Sebastian Ochmann

Hello,

seems to be working for me (tested only with both parts of the patch 
applied); I wasn't able to trigger the errors after almost an hour of 
stress-testing.


Best regards,
Sebastian

On 04.12.2013 14:15, Wang Shilong wrote:

We came across a race condition when scrubbing superblocks; the story is:

In committing a transaction, we update @last_trans_commited after
writing the superblocks. If a scrubber starts after the superblocks are
written and before @last_trans_commited is updated, a generation
mismatch happens!

We fix this by checking @scrub_pause_req, and we won't start a scrubber
until the transaction commit is finished (after btrfs_scrub_continue()
has finished).

Reported-by: Sebastian Ochmann ochm...@informatik.uni-bonn.de
Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
Reviewed-by: Miao Xie mi...@cn.fujitsu.com
---
v3->v4:
by checking @scrub_pause_req, block a scrubber
if we are committing a transaction (thanks to Miao and Liu)
---
  fs/btrfs/scrub.c | 45 ++---
  1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 2544805..d27f95e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -257,6 +257,7 @@ static int copy_nocow_pages_for_inode(u64 inum, u64 offset, u64 root,
 static int copy_nocow_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 			    int mirror_num, u64 physical_for_dev_replace);
 static void copy_nocow_pages_worker(struct btrfs_work *work);
+static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info);


 static void scrub_pending_bio_inc(struct scrub_ctx *sctx)
@@ -270,6 +271,16 @@ static void scrub_pending_bio_dec(struct scrub_ctx *sctx)
 	wake_up(&sctx->list_wait);
 }

+static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
+{
+	while (atomic_read(&fs_info->scrub_pause_req)) {
+		mutex_unlock(&fs_info->scrub_lock);
+		wait_event(fs_info->scrub_pause_wait,
+			   atomic_read(&fs_info->scrub_pause_req) == 0);
+		mutex_lock(&fs_info->scrub_lock);
+	}
+}
+
 /*
  * used for workers that require transaction commits (i.e., for the
  * NOCOW case)
@@ -2330,14 +2341,10 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	btrfs_reada_wait(reada2);

 	mutex_lock(&fs_info->scrub_lock);
-	while (atomic_read(&fs_info->scrub_pause_req)) {
-		mutex_unlock(&fs_info->scrub_lock);
-		wait_event(fs_info->scrub_pause_wait,
-			   atomic_read(&fs_info->scrub_pause_req) == 0);
-		mutex_lock(&fs_info->scrub_lock);
-	}
+	scrub_blocked_if_needed(fs_info);
 	atomic_dec(&fs_info->scrubs_paused);
 	mutex_unlock(&fs_info->scrub_lock);
+
 	wake_up(&fs_info->scrub_pause_wait);

 	/*
@@ -2377,15 +2384,12 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	atomic_set(&sctx->wr_ctx.flush_all_writes, 0);
 	atomic_inc(&fs_info->scrubs_paused);
 	wake_up(&fs_info->scrub_pause_wait);
+
 	mutex_lock(&fs_info->scrub_lock);
-	while (atomic_read(&fs_info->scrub_pause_req)) {
-		mutex_unlock(&fs_info->scrub_lock);
-		wait_event(fs_info->scrub_pause_wait,
-			   atomic_read(&fs_info->scrub_pause_req) == 0);
-		mutex_lock(&fs_info->scrub_lock);
-	}
+	scrub_blocked_if_needed(fs_info);
 	atomic_dec(&fs_info->scrubs_paused);
 	mutex_unlock(&fs_info->scrub_lock);
+
 	wake_up(&fs_info->scrub_pause_wait);
 	}

@@ -2707,14 +2711,10 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		   atomic_read(&sctx->workers_pending) == 0);

 	mutex_lock(&fs_info->scrub_lock);
-	while (atomic_read(&fs_info->scrub_pause_req)) {
-		mutex_unlock(&fs_info->scrub_lock);
-		wait_event(fs_info->scrub_pause_wait,
-			   atomic_read(&fs_info->scrub_pause_req) == 0);
-		mutex_lock(&fs_info->scrub_lock);
-	}
+	scrub_blocked_if_needed(fs_info);
 	atomic_dec(&fs_info->scrubs_paused);
 	mutex_unlock(&fs_info->scrub_lock);
+
 	wake_up(&fs_info->scrub_pause_wait);

 	btrfs_put_block_group(cache);
@@ -2926,7 +2926,13 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	}
 	sctx->readonly = readonly;
 	dev->scrub_device = sctx;
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);

+	/*
+	 * checking @scrub_pause_req here, we can avoid
+	 * race between committing transaction and scrubbing.
+	 */
+	scrub_blocked_if_needed(fs_info

Re: 2 errors when scrubbing - but I don't know what they mean

2013-12-01 Thread Sebastian Ochmann

Hello,

 However, if you find such superblock checksum mismatches very often
 during scrub, it may be
 that there is something wrong with the disk!

I'm sorry, but I don't think there's a problem with my disks because I 
was able to trigger the errors that increment the gen error counter 
during scrub on a completely different machine and drive today. I 
basically performed some I/O operations on a drive and scrubbed at the 
same time over and over again until I actually saw super errors during 
scrub. But the error is really hard to trigger. It seems to me like a 
race condition somewhere.


So I went a step further and tried to create a repro for this. It seems 
like I can trigger the errors now once every few minutes with the method 
described below, but sometimes it really takes a long time until the 
error pops up, so be patient when trying this...


For the repro:

I'm using a btrfs image in RAM for this for two reasons: I can scrub 
quickly over and over again and I can rule out hard drive errors. My 
machine has 32 GB of RAM, so that comes in handy here - if you try this 
on a physical drive, make sure to adjust some parameters, if necessary.


Create a tmpfs and a testing image, format as btrfs:

$ mkdir btrfstest
$ cd btrfstest/
$ mkdir tmp
$ mount -t tmpfs -o size=20G none tmp
$ dd if=/dev/zero of=tmp/vol bs=1G count=19
$ mkfs.btrfs tmp/vol
$ mkdir mnt
$ mount -o commit=1 tmp/vol mnt

Note the commit=1 mount option. It's not strictly necessary, but I 
have the feeling it helps with triggering the problem...


So now we have a 19 GB btrfs filesystem in RAM, mounted in mnt. What I 
did for performing some artificial I/O operations is to rm and cp a 
linux source tree over and over again. Suppose you have an unpacked 
linux source tree available in the /somewhere/linux directory (and 
you're using bash). We'll spawn some loops that keep the filesystem busy:


$ while true; do rm -fr mnt/a; sleep 1.0; cp -R /somewhere/linux mnt/a; 
sleep 1.0; done
$ while true; do rm -fr mnt/b; sleep 1.1; cp -R /somewhere/linux mnt/b; 
sleep 1.1; done
$ while true; do rm -fr mnt/c; sleep 1.2; cp -R /somewhere/linux mnt/c; 
sleep 1.2; done


Now that the filesystem is busy, we'll also scrub it repeatedly (without 
backgrounding, -B):


$ while true; do btrfs scrub start -B mnt; sleep 0.5; done

On my machine and in RAM, each scrub takes 0-1 second and the total 
bytes scrubbed should fluctuate (seems to be especially true with 
commit=1, but not sure). Get a beverage of your choice and wait.


(about 10 minutes later)

When I was writing this repro it took about 10 minutes until scrub said:

  total bytes scrubbed: 1.20GB with 2 errors
  error details: super=2
  corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

and in dmesg:

  [15282.155170] btrfs: bdev /dev/loop0 errs: wr 0, rd 0, flush 0, 
corrupt 0, gen 1
  [15282.155176] btrfs: bdev /dev/loop0 errs: wr 0, rd 0, flush 0, 
corrupt 0, gen 2


After that, scrub is happy again and will continue normally until the 
same errors happen again after a few hundred scrubs or so.


So all in all, the error can be triggered using normal I/O operations 
and scrubbing at the right moments, it seems. Even with a btrfs image in 
RAM, so no hard drive error is possible.


Hope anyone can reproduce this and maybe debug it.

Best regards
Sebastian


Re: 2 errors when scrubbing - but I don't know what they mean

2013-11-30 Thread Sebastian Ochmann

Hello,

thank you for your input. I didn't know that btrfs keeps the error 
counters over mounts/reboots, but that's nice.


I'm still trying to figure out how such a generation error may occur in 
the first place. One thing I noticed looking at the btrfs code is that 
the generation error counter will only get incremented in the actual 
scrubbing code (either in scrub_checksum_super or in 
scrub_handle_errored_block, both in scrub.c - please correct me if I'm 
wrong, I'm not a btrfs dev). Also, the dmesg errors I saw were not there 
at boot time, but appeared about 10 minutes after boot, which was about 
the time when I started the scrub, so I'm pretty sure it was the scrub 
that detected the errors.


The question remains what can cause superblock/gen errors. Sure it could 
be some read error, but I'd really like to make sure that it's not a 
systematic error. I wasn't able to reproduce it yet though.


Best
Sebastian


2 errors when scrubbing - but I don't know what they mean

2013-11-28 Thread Sebastian Ochmann

Hello everyone,

when I scrubbed one of my btrfs volumes today, the result of the scrub was:

total bytes scrubbed: 1.27TB with 2 errors
error details: super=2
corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

and dmesg said:

btrfs: bdev /dev/mapper/tray errs: wr 0, rd 0, flush 0, corrupt 0, gen 1
btrfs: bdev /dev/mapper/tray errs: wr 0, rd 0, flush 0, corrupt 0, gen 2

Can someone please enlighten me as to what these errors mean (especially 
the super and gen values)? As additional info: the drive is sometimes 
used in a machine with kernel 3.11.6 and sometimes with 3.12.0; could 
this swapping explain the problem somehow?


Best regards
Sebastian