from:"Holger Hoffstätte"

Re: [PATCH] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-10-26 Thread Holger Hoffstätte


On 10/26/18 13:13, Nikolay Borisov wrote:


+   if (page_start >= offset && page_end <= (offset + bytes - 1) {


fs/btrfs/inode.c: In function 'btrfs_cleanup_ordered_extents':
fs/btrfs/inode.c:140:62: error: expected ')' before '{' token
  if (page_start >= offset && page_end <= (offset + bytes - 1) {
 ~^~
  )
You're welcome :)

Holger

Re: Scrub aborts due to corrupt leaf

2018-10-10 Thread Holger Hoffstätte


On 10/10/18 19:44, Chris Murphy wrote:

On Wed, Oct 10, 2018 at 10:04 AM, Holger Hoffstätte
 wrote:

On 10/10/18 17:44, Larkin Lowrey wrote:
(..)


About once a week, or so, I'm running into the above situation where
FS seems to deadlock. All IO to the FS blocks, there is no IO
activity at all. I have to hard reboot the system to recover. There
are no error indications except for the following which occurs well
before the FS freezes up:

BTRFS warning (device dm-3): block group 78691883286528 has wrong amount
of free space
BTRFS warning (device dm-3): failed to load free space cache for block
group 78691883286528, rebuilding it now

Do I have any options other the nuking the FS and starting over?



Unmount cleanly & mount again with -o space_cache=v2.


I'm pretty sure you have to umount, and then clear the space_cache
with 'btrfs check --clear-space-cache=v1' and then do a one time mount
with -o space_cache=v2.

But anyway, to me that seems premature because we don't even know
what's causing the problem.


Space cache writeout not honoring errors from the depths below
is not unusual, I think there were some fixes recently which Larkin
likely doesn't have yet. But yeah, I forgot to mention that cache-v2
alone won't really fix the _underlying_ symptoms. It is, however,
vastly more reliable in general.


a. Freezing means there's a kernel bug. Hands down.
b. Is it freezing on the rebuild? Or something else?
c. I think the devs would like to see the output from btrfs-progs
v4.17.1, 'btrfs check --mode=lowmem' and see if it finds anything, in
particular something not related to free space cache.


Apart from performance implications, if only the free space cache
inodes/blocks are borked then the rest will (should) work just fine
and/or be replaced/overwritten eventually.

Well, at least that was the idea. :}

-h

Re: Scrub aborts due to corrupt leaf

2018-10-10 Thread Holger Hoffstätte


On 10/10/18 19:25, Larkin Lowrey wrote:

On 10/10/2018 12:04 PM, Holger Hoffstätte wrote:

On 10/10/18 17:44, Larkin Lowrey wrote:
(..)

About once a week, or so, I'm running into the above situation where
FS seems to deadlock. All IO to the FS blocks, there is no IO
activity at all. I have to hard reboot the system to recover. There
are no error indications except for the following which occurs well
before the FS freezes up:

BTRFS warning (device dm-3): block group 78691883286528 has wrong amount of 
free space
BTRFS warning (device dm-3): failed to load free space cache for block group 
78691883286528, rebuilding it now

Do I have any options other the nuking the FS and starting over?


Unmount cleanly & mount again with -o space_cache=v2.


It froze while unmounting. The attached zip is a stack dump captured
via 'echo t > /proc/sysrq-trigger'. A second attempt after a hard
reboot worked.


Trace says freespace cache writeout failed midway while the scsi device
was resetting itself and then went rrrghh. Probably managed to hit
different blocks on the second attempt. So chances are your controller,
disk or something else is broken, dying, or both.
When things have settled and you have verified that r/o mounting works
and is stable, try rescuing the data (when necessary) before scrubbing,
dm-device-checking or whatever you have set up.

-h

Re: Scrub aborts due to corrupt leaf

2018-10-10 Thread Holger Hoffstätte


On 10/10/18 17:44, Larkin Lowrey wrote:
(..)

About once a week, or so, I'm running into the above situation where
FS seems to deadlock. All IO to the FS blocks, there is no IO
activity at all. I have to hard reboot the system to recover. There
are no error indications except for the following which occurs well
before the FS freezes up:

BTRFS warning (device dm-3): block group 78691883286528 has wrong amount of 
free space
BTRFS warning (device dm-3): failed to load free space cache for block group 
78691883286528, rebuilding it now

Do I have any options other the nuking the FS and starting over?


Unmount cleanly & mount again with -o space_cache=v2.

-h

Re: Curious problem: btrfs device stats & unpriviliged access

2018-10-08 Thread Holger Hoffstätte


On 10/08/18 17:46, Hans van Kranenburg wrote:


fs.devices() also looks for dev_items in the chunk tree:

https://github.com/knorrie/python-btrfs/blob/master/btrfs/ctree.py#L481

So, BOOM! you need root.

Or just start a 0, ignore errors and start trying all devids until you
found num_devices amount of them that work, yolo.


Since I need to walk /sys/fs/btrfs/ anyway I *think* I can just look
at the entries in /sys/fs/btrfs//devices/ and query them all
directly.

The skeleton btrfs_exporter already responds to http requests and
returns dummy metrics, using the official python client library.
I've found a nice little python sysfs scraper and now only need to
figure out how best to map the btrfs info in sysfs to useful metrics.
The privileged access issue would only have come into play much later,
but it seems I can avoid it after all, which is great.
I'm also (re-)learning python in the process, so that's the actual
thing slowing me down..

-h

Re: Curious problem: btrfs device stats & unpriviliged access

2018-10-08 Thread Holger Hoffstätte


On 10/08/18 16:40, Hans van Kranenburg wrote:

Looking at the kernel side of things in fs/btrfs/ioctl.c I see both
BTRFS_IOC_TREE_SEARCH[_V2} unconditionally require CAP_SYS_ADMIN.


That's the tree search ioctl, for reading arbitrary metadata.

The device stats ioctl is IOC_GET_DEV_STATS...


Yeah..OK, that is clear and gave me the hint to ask: why is it
calling this in the first place? And as it turns out [1] is where
it seems to go wrong, as is_block_device() returns 0 (as it should),
fi_args.num_devices is never set (remains 0) and it proceeds to call
everything below, eventually calling the BTRFS_IOC_FS_INFO ioctl in
#1712. And that works fine:

1711 if (fi_args->num_devices != 1) {
(gdb) s
1712ret = ioctl(fd, BTRFS_IOC_FS_INFO, fi_args);
(gdb) s
1713if (ret < 0) {
(gdb) p ret
$28 = 0
(gdb) p *fi_args
$30 = {
  max_id = 1,
  num_devices = 1,
  fsid = "z%:\371\315\033A\203\267.\200\255;FH\221",
  nodesize = 16384,
  sectorsize = 4096,
  clone_alignment = 4096,
  reserved32 = 0,
  reserved = {0 }
}

It's only when it goes into search_chunk_tree_for_fs_info()
where things fail due to CAP_SYS_ADMIN.

And all this explains the actual bug: when I call btrfs device stats
not on the mountpoint (as I've been trying all this time) but rather
on the device, it works properly right away as regular user:

(gdb) set args device stats /dev/loop0
(gdb) r
Breakpoint 1, get_fs_info (path=path@entry=0x7fffe527 "/dev/loop0", 
fi_args=fi_args@entry=0x7fffd400,
di_ret=di_ret@entry=0x7fffd3f0) at utils.c:1652
1652{
(gdb) c
Continuing.
[/dev/loop0].write_io_errs0
[/dev/loop0].read_io_errs 0
[/dev/loop0].flush_io_errs0
[/dev/loop0].corruption_errs  0
[/dev/loop0].generation_errs  0
[Inferior 1 (process 2805) exited normally]

So this is simply a discrepancy in handling a device vs. the device(s)
for a mountpoint.


I can do the device stats ioctl as normal user:

import btrfs
fs = btrfs.FileSystem('/')
btrfs.utils.pretty_print(fs.dev_stats(1))


devid: 1
nr_items: 5
flags: 0
write_errs: 0
read_errs: 0
flush_errs: 0
generation_errs: 0
corruption_errs: 0


That works for me too, and that's actually the important part. \o/
Glad we talked about it. :}

-h

[1] 
https://github.com/kdave/btrfs-progs/blob/7faaca0d9f78f7162ae603231f693dd8e1af2a41/utils.c#L1666

Curious problem: btrfs device stats & unpriviliged access

2018-10-08 Thread Holger Hoffstätte


(moving the discussion here from GH [1])

Apparently there is something weird going on with the device stats
ioctls. I cannot get them to work as regular user, while they work
for David. A friend confirms the same issue on his system - no access
as non-root.

So I made a new empty fs, mounted it, built btrfs-progs-4.17.1 with
debug symbols and stepped into search_chunk_tree_for_fs_info().
Everything is fine, all args are correct, right until:

(gdb) s
1614ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, _args);
(gdb) s
1615if (ret < 0)
(gdb) p ret
$4 = -1
(gdb) p search_args
$5 = {key = {tree_id = 3, min_objectid = 1, max_objectid = 1, min_offset = 1,
max_offset = 18446744073709551615, min_transid = 0, max_transid = 
18446744073709551615,
min_type = 216, max_type = 216, nr_items = 30, unused = 0, unused1 = 0, unused2 
= 0,
unused3 = 0, unused4 = 0}, buf = '\000' }

Looking at the kernel side of things in fs/btrfs/ioctl.c I see both
BTRFS_IOC_TREE_SEARCH[_V2} unconditionally require CAP_SYS_ADMIN.

So why can Dave get his dev stats as unprivileged user?
Does this work for anybody else? And why? :)

cheers
Holger

[1] 
https://github.com/prometheus/node_exporter/issues/1100#issuecomment-427823190

Monitoring btrfs with Prometheus (and soon OpenMonitoring)

2018-10-07 Thread Holger Hoffstätte




The Prometheus statistics collection/aggregation/monitoring/alerting system
[1] is quite popular, easy to use and will probably be the basis for the
upcoming OpenMetrics "standard" [2].

Prometheus collects metrics by polling host-local "exporters" that respond
to http requests; many such exporters exist, from the generic node_exporter
for OS metrics to all sorts of application-/service-specific varieties.

Since btrfs already exposes quite a lot of monitorable and - more
importantly - actionable runtime information in sysfs it only makes sense
to expose these metrics for visualization & alerting. I noodled over the
idea some time ago but got sidetracked, besides not being thrilled at all
by the idea of doing this in golang (which I *really* dislike).

However, exporters can be written in any language as long as they speak
the standard response protocol, so an alternative would be to use one
of the other official exporter clients. These provide language-native
"mini-frameworks" where one only has to fill in the blanks (see [3]
for examples).

Since the issue just came up in the node_exporter bugtracker [3] I
figured I ask if anyone here is interested in helping build a proper
standalone btrfs_exporter in C++? :D

..just kidding, I'd probably use python (which I kind of don't really
know either :) and build on Hans' python-btrfs library for anything
not covered by sysfs.

Anybody interested in helping? Apparently there are also golang libs
for btrfs [5] but I don't know anything about them (if you do, please
comment on the bug), and the idea of adding even more stuff into the
monolithic, already creaky and somewhat bloated node_exporter is not
appealing to me.

Potential problems wrt. btrfs are access to root-only information,
like e.g. the btrfs device stats/errors in the aforementioned bug,
since exporters are really supposed to run unprivileged due to network
exposure. The S.M.A.R.T. exporter [6] solves this with dual-process
contortions; obviously it would be better if all relevant metrics were
accessible directly in sysfs and not require privileged access, but
forking a tiny privileged process every polling interval is probably
not that bad.

All ideas welcome!

cheers,
Holger

[1] https://www.prometheus.io/
[2] https://openmetrics.io/
[3] https://github.com/prometheus/client_python,
https://github.com/prometheus/client_ruby
[4] https://github.com/prometheus/node_exporter/issues/1100
[5] 
https://github.com/prometheus/node_exporter/issues/1100#issuecomment-427651028
[6] https://github.com/cloudandheat/prometheus_smart_exporter

Re: [PATCH 0/5] rb_first to rb_first_cached conversion

2018-08-23 Thread Holger Hoffstätte


On 08/22/18 21:51, Liu Bo wrote:

Several structs in btrfs are using rb_first() in a while loop, it'd be
more efficient to do this with rb_first_cached() which has the O(1)
complexity.

This patch set updates five structs which may have a large rb tree in
practice


Great idea, works for me with no bad side effects so far. So:

Tested-by: Holger Hoffstätte 

cheers
Holger

Re: [PATCH 0/2] address lock contention of btree root

2018-08-21 Thread Holger Hoffstätte


On 08/21/18 20:15, Liu Bo wrote:

I just realize that patch 2 can result in softlockup as
btrfs_search_slot() may return a path with all nodes being in spinning
lock, and if the callers want to sleep, we're in trouble.  I've
removed patch 2 and am re-running the test (xfstests, fsmark and
dbench).


You mean like this, when trying to balance? :)
Got it only once so far, subsequent attempts worked. Otherwise everything
seems fine.

-h

kernel: BTRFS info (device sdc1): relocating block group 4128424067072 flags 
data
kernel: BTRFS info (device sdc1): found 1706 extents
kernel: INFO: rcu_sched self-detected stall on CPU
kernel: ^I3-: (17999 ticks this GP) idle=f5e/1/4611686018427387906 
softirq=269430/269430 fqs=5999
kernel: ^I (t=18000 jiffies g=232869 c=232868 q=4365)
kernel: NMI backtrace for cpu 3
kernel: CPU: 3 PID: 4287 Comm: kworker/u8:0 Not tainted 4.18.3 #1
kernel: Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 
4105 07/01/2013
kernel: Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
kernel: Call Trace:
kernel:  
kernel:  dump_stack+0x46/0x60
kernel:  nmi_cpu_backtrace.cold.0+0x13/0x57
kernel:  ? lapic_can_unplug_cpu.cold.5+0x34/0x34
kernel:  nmi_trigger_cpumask_backtrace+0x8f/0x91
kernel:  rcu_dump_cpu_stacks+0x87/0xb2
kernel:  rcu_check_callbacks.cold.59+0x2ac/0x430
kernel:  ? tick_sched_handle.isra.6+0x40/0x40
kernel:  update_process_times+0x28/0x60
kernel:  tick_sched_handle.isra.6+0x35/0x40
kernel:  tick_sched_timer+0x3b/0x80
kernel:  __hrtimer_run_queues+0xfe/0x270
kernel:  hrtimer_interrupt+0xf4/0x210
kernel:  smp_apic_timer_interrupt+0x56/0x110
kernel:  apic_timer_interrupt+0xf/0x20
kernel:  
kernel: RIP: 0010:queued_write_lock_slowpath+0x4a/0x80
kernel: Code: ff 00 00 00 f0 0f b1 13 85 c0 74 32 f0 81 03 00 01 00 00 ba ff 00 00 00 
b9 00 01 00 00 8b 03 3d 00 01 00 00 74 0b f3 90 8b 03 <3d> 00 01 00 00 75 f5 89 
c8 f0 0f b1 13 3d 00 01 00 00 75 df c6 43
kernel: RSP: 0018:c940fc40 EFLAGS: 0206 ORIG_RAX: ff13
kernel: RAX: 0300 RBX: 88071abf86f0 RCX: 0100
kernel: RDX: 00ff RSI: 8800 RDI: 88071abf86f0
kernel: RBP:  R08: 88071abf8690 R09: 880724ebeb58
kernel: R10: 880724ebeb80 R11:  R12: 0001
kernel: R13: 8803d0db5f54 R14: 1600 R15: 0006
kernel:  btrfs_try_tree_write_lock+0x23/0x60 [btrfs]
kernel:  btrfs_search_slot+0x2df/0x970 [btrfs]
kernel:  btrfs_mark_extent_written+0xb0/0xac0 [btrfs]
kernel:  ? kmem_cache_alloc+0x1a5/0x1b0
kernel:  btrfs_finish_ordered_io+0x2e2/0x7a0 [btrfs]
kernel:  normal_work_helper+0xad/0x2c0 [btrfs]
kernel:  process_one_work+0x1e3/0x390
kernel:  worker_thread+0x2d/0x3c0
kernel:  ? process_one_work+0x390/0x390
kernel:  kthread+0x111/0x130
kernel:  ? kthread_flush_work_fn+0x10/0x10
kernel:  ret_from_fork+0x1f/0x30

Re: Regression with crc32c selection?

2018-07-23 Thread Holger Hoffstätte


On 07/23/18 18:50, David Sterba wrote:

On Mon, Jul 23, 2018 at 04:13:26PM +0200, Holger Hoffstätte wrote:

While backporting a bunch of fixes to my own 4.16.x tree
(4.17 had a few too many bugs for my taste) I also ended up merging:


Curious, bugs in btrfs or the whole 4.17 kernel? And if bugs, real
breakage or backported fixes?


Overall. I don't remember specifics but skimming lkml at the time
didn't inspire a lot of confidence, and since I already had a large
number of hand-picked & backported patches from 4.17/4.18/4.19 :) for
btrfs, xfs, net, blk-mq & drivers - just the stuff I care about - I skipped
it instead of upgrading & rebasing everything. Might well be that the latest
4.17-stable works reliably, but 4.18 is already around the corner, so..
no really good reason. :)

cheers
Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Regression with crc32c selection? (solved - pilot error)

2018-07-23 Thread Holger Hoffstätte


On 07/23/18 16:39, Patrik Lundquist wrote:

$ uname -a
Linux nas 4.17.0-1-amd64 #1 SMP Debian 4.17.8-1 (2018-07-20) x86_64 GNU/Linux

$ dmesg | grep Btrfs
[8.168408] Btrfs loaded, crc32c=crc32c-intel

$ lsmod | grep crc32
crc32_pclmul   16384  0
libcrc32c  16384  1 btrfs
crc32c_generic 16384  0
crc32c_intel   24576  2


Ooohh..thanks for that. I wouldn't be surprised it's because my libcrc
is built-in (probably because of built-in xfs) and at initialization time
doesn't see any modules, so it always selects the generic impl.
Since btrfs is only ever loaded last, the previous btrfs init code could
properly detect/load/use the crc32c-intel module.

I switched the crc32c impls to built-in and what do you know:

Jul 23 17:04:26 ragnarok kernel: Btrfs loaded, crc32c=crc32c-intel

\o/

Thanks Patrick!

Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

python-btrfs & btrfs-heatmap ebuilds now available for Gentoo

2018-07-23 Thread Holger Hoffstätte




I wanted to migrate my collection of questionable shell scripts for btrfs
maintenance/inspection to a more stable foundation and therefore created
Gentoo ebuilds for Hans van Kranenburg's excellent python-btrfs [1] and
btrfs-heatmap [2] packages.

They can be found in my overlay at:

https://github.com/hhoffstaette/portage/tree/master/sys-fs/python-btrfs
https://github.com/hhoffstaette/portage/tree/master/sys-fs/btrfs-heatmap

Both are curently only available for python-3.6 since a change in 3.7
broke the existing python API [3]. As soon as that is fixed I'll add 3.7
to the PYTHON_COMPAT entries.

btrfs-heatmap properly uses the python-exec wrapper and therefore works
regardless of currently selected default python version.  :)

I hope this is useful to someone.

cheers,
Holger

[1] https://github.com/knorrie/python-btrfs
[2] https://github.com/knorrie/btrfs-heatmap
[3] https://github.com/knorrie/python-btrfs/issues/13
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Regression with crc32c selection?

2018-07-23 Thread Holger Hoffstätte


Hi,

While backporting a bunch of fixes to my own 4.16.x tree
(4.17 had a few too many bugs for my taste) I also ended up merging:

df91f56adce1f: libcrc32c: Add crc32c_impl function
9678c54388b6a: btrfs: Remove custom crc32c init code

..which AFAIK went into 4.17 and seemed harmless enough; after fixing up
a trivial context conflict it builds, runs, all good..except that btrfs
(apprently?) no longer uses the preferred crc32c-intel module, but the
crc32c-generic one instead.

In order to rule out any mistakes on my part I built 4.18.0-rc6 and it
seems to have the same problem:

Jul 23 15:55:09 ragnarok kernel: raid6: sse2x1   gen() 11267 MB/s
Jul 23 15:55:09 ragnarok kernel: raid6: sse2x1   xor()  8110 MB/s
Jul 23 15:55:09 ragnarok kernel: raid6: sse2x2   gen() 13409 MB/s
Jul 23 15:55:09 ragnarok kernel: raid6: sse2x2   xor()  9137 MB/s
Jul 23 15:55:09 ragnarok kernel: raid6: sse2x4   gen() 15884 MB/s
Jul 23 15:55:09 ragnarok kernel: raid6: sse2x4   xor() 10579 MB/s
Jul 23 15:55:09 ragnarok kernel: raid6: using algorithm sse2x4 gen() 15884 MB/s
Jul 23 15:55:09 ragnarok kernel: raid6:  xor() 10579 MB/s, rmw enabled
Jul 23 15:55:09 ragnarok kernel: raid6: using ssse3x2 recovery algorithm
Jul 23 15:55:09 ragnarok kernel: xor: automatically using best checksumming 
function   avx
Jul 23 15:55:09 ragnarok kernel: Btrfs loaded, crc32c=crc32c-generic

I understand that the new crc32c_impl() function changed from
crypto_tfm_alg_driver_name() to crypto_shash_driver_name() - could this
be the reason? The module is loaded just fine, but apprently not used:

$lsmod | grep crc32
crc32_pclmul   16384  0
crc32c_intel   24576  0

In other words, is this supposed to happen or is my kernel config somehow
no longer right? It worked before and doesn't look too wrong:

$grep CRC /etc/kernels/kernel-config-x86_64-4.18.0-rc6
# CONFIG_PCIE_ECRC is not set
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_CRC32C_INTEL=m
CONFIG_CRYPTO_CRC32=m
CONFIG_CRYPTO_CRC32_PCLMUL=m
# CONFIG_CRYPTO_CRCT10DIF is not set
CONFIG_CRC_CCITT=m
CONFIG_CRC16=y
# CONFIG_CRC_T10DIF is not set
CONFIG_CRC_ITU_T=y
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC4 is not set
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
# CONFIG_CRC8 is not set

Ultimately btrfs (and everything else) works, but the process of how
the kernel selects a crc32c implementation seems rather mysterious to me. :/

Any insights welcome. If it's a regression I can gladly test fixes.

cheers
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Issue on BTRFS/copy of really huge files

2018-07-06 Thread Holger Hoffstätte


On 07/06/18 10:37, Juergen Sauer wrote:
..

Moving a virtual machine from ssd/raid1 subvolume (nocow) into the
rotational big store (noocow) fails.
After filling up the cachememory (ram) the data flow cuts down to zero
0 kb/sec.
In fatal result the copy of an huge file hangs does not proceed any
more, load raises infinite, iops falling to zero. In kernel log I find:

[  491.151952] INFO: task kworker/u16:28:1027 blocked for more than 120
seconds.
[  491.151953]   Tainted: P   O  4.17.3-1-ARCH #1
[  491.151953] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  491.151953] kworker/u16:28  D0  1027  2 0x8000
[  491.151965] Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper
[btrfs]
[  491.151965] Call Trace:
[  491.151967]  ? __schedule+0x282/0x890
[  491.151969]  schedule+0x32/0x90
[  491.151970]  io_schedule+0x12/0x40
[  491.151971]  blk_mq_get_tag+0x146/0x2a0


This has nothing to do with btrfs and is simply one of the remaining
(but already fixed upstream) bugs in the blk-mq stack, probably related
to sbitmap concurrency and or "tag starvation".
I could give you a list of patches from 4.18+ that help (reliably)
but I suppose you're not into kernel patching, so the easiest way for
you would be to to switch to the old block layer (e.g. by booting
with kernel flag scsi_mod.use_blk_mq=0) and use deadline/cfq as before.

This should all be fixed & work reliable with 4.18+; it looks that by
4.19 blk-mq will also be enabled by default.

cheers
Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BTRFS losing SE Linux labels on power failure or "reboot -nffd"

2018-06-04 Thread Holger Hoffstätte


On 06/04/18 15:29, Hans van Kranenburg wrote:

Hi,

On 06/04/2018 03:14 PM, Russell Coker wrote:

The command "reboot -nffd" (kernel reboot without flushing kernel buffers or
writing status) when run on a BTRFS system with SE Linux will often result in
/var/log/audit/audit.log being unlabeled.


This recent fix might be what you're looking for:

https://www.spinics.net/lists/linux-btrfs/msg77927.html


..which has been in 4.16 since 4.16.11. :)

-h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] btrfs: Speedup btrfs_read_block_groups()

2018-02-22 Thread Holger Hoffstätte

On 02/22/18 05:52, Qu Wenruo wrote:
> btrfs_read_block_groups() is used to build up the block group cache for
> all block groups, so it will iterate all block group items in extent
> tree.
> 
> For large filesystem (TB level), it will search for BLOCK_GROUP_ITEM
> thousands times, which is the most time consuming part of mounting
> btrfs.
> 
> So this patch will try to speed it up by:
> 
> 1) Avoid unnecessary readahead
>We were using READA_FORWARD to search for block group item.
>However block group items are in fact scattered across quite a lot of
>leaves. Doing readahead will just waste our IO (especially important
>for HDD).
> 
>In real world case, for a filesystem with 3T used space, it would
>have about 50K extent tree leaves, but only have 3K block group
>items. Meaning we need to iterate 16 leaves to meet one block group
>on average.
> 
>So readahead won't help but waste slow HDD seeks.
> 
> 2) Use chunk mapping to locate block group items
>Since one block group item always has one corresponding chunk item,
>we could use chunk mapping to get the block group item size.
> 
>With block group item size, we can do a pinpoint tree search, instead
>of searching with some uncertain value and do forward search.
> 
>In some case, like next BLOCK_GROUP_ITEM is in the next leaf of
>current path, we could save such unnecessary tree block read.
> 
> Cc: Ellis H. Wilson III 
> Signed-off-by: Qu Wenruo 
> ---
> Since all my TB level storage is all occupied by my NAS, any feedback
> (especially for the real world mount speed change) is welcome.

(sorry for the previous mail without results..finger salad)

Decided to give this a try & got some nice results!
Probably not on the same scale & nonlinear behaviour as Ellis will
provide since I just don't have that much storage, but interesting
nevertheless.

$btrfs filesystem df /mnt/backup 
Data, single: total=1.10TiB, used=1.09TiB
System, DUP: total=32.00MiB, used=144.00KiB
Metadata, DUP: total=4.00GiB, used=2.23GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

$btrfs-debug-tree -t chunk /dev/sdc1 | grep CHUNK_ITEM | wc -l
1137

current kernel (4.14++ with most of blk-mq+BFQ from 4.16):
mount /mnt/backup  0.00s user 0.02s system 1% cpu 1.211 total
mount /mnt/backup  0.00s user 0.02s system 2% cpu 1.122 total
mount /mnt/backup  0.00s user 0.02s system 2% cpu 1.236 total

patched:
mount /mnt/backup  0.00s user 0.02s system 1% cpu 1.070 total
mount /mnt/backup  0.00s user 0.02s system 1% cpu 1.056 total
mount /mnt/backup  0.00s user 0.02s system 1% cpu 1.058 total

That's not overwhelming, but still measurable and nice to have!

While I was at it I decided to fill up the drive to almost-max
~3.7TB and see how much slower it woulöd get...you won't believe
what happened next. :-)

$btrfs-debug-tree -t chunk /dev/sdc1 | grep CHUNK_ITEM | wc -l
3719

mount /mnt/backup  0.00s user 0.02s system 2% cpu 1.328 total
mount /mnt/backup  0.00s user 0.03s system 2% cpu 1.361 total
mount /mnt/backup  0.00s user 0.03s system 2% cpu 1.368 total

Over three times the data, almost the same mount time as before?
Yes please!

Overall this looks like a really nice improvement. Glad to see
that my suspicion about the (non)usefulness of the readhead turned
out to be true. :)

cheers,
Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] btrfs: Speedup btrfs_read_block_groups()

2018-02-22 Thread Holger Hoffstätte

On 02/22/18 05:52, Qu Wenruo wrote:
> btrfs_read_block_groups() is used to build up the block group cache for
> all block groups, so it will iterate all block group items in extent
> tree.
> 
> For large filesystem (TB level), it will search for BLOCK_GROUP_ITEM
> thousands times, which is the most time consuming part of mounting
> btrfs.
> 
> So this patch will try to speed it up by:
> 
> 1) Avoid unnecessary readahead
>We were using READA_FORWARD to search for block group item.
>However block group items are in fact scattered across quite a lot of
>leaves. Doing readahead will just waste our IO (especially important
>for HDD).
> 
>In real world case, for a filesystem with 3T used space, it would
>have about 50K extent tree leaves, but only have 3K block group
>items. Meaning we need to iterate 16 leaves to meet one block group
>on average.
> 
>So readahead won't help but waste slow HDD seeks.
> 
> 2) Use chunk mapping to locate block group items
>Since one block group item always has one corresponding chunk item,
>we could use chunk mapping to get the block group item size.
> 
>With block group item size, we can do a pinpoint tree search, instead
>of searching with some uncertain value and do forward search.
> 
>In some case, like next BLOCK_GROUP_ITEM is in the next leaf of
>current path, we could save such unnecessary tree block read.
> 
> Cc: Ellis H. Wilson III 
> Signed-off-by: Qu Wenruo 

Decided to give this a try & got interesting results!

The scene of the crime:
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2018-02-13 Thread Holger Hoffstätte

On 02/13/18 13:54, Qu Wenruo wrote:
> On 2018年02月13日 20:26, Holger Hoffstätte wrote:
>> On 02/13/18 12:40, Qu Wenruo wrote:
>>>>> The problem is not about how much space it takes, but how many extents
>>>>> are here in the filesystem.
>>
>> I have no idea why btrfs' mount even needs to touch all block groups to
>> get going (which seems to be the root of the problem), but here's a
>> not so crazy idea for more "mechanical sympathy". Feel free to mock
>> me if this is terribly wrong or not possible. ;)
>>
>> Mounting of even large filesystems (with many extents) seems to be fine
>> on SSDS, but not so fine on rotational storage. We've heard that from
>> several people with large (multi-TB) filesystems, and obviously it's
>> even more terrible on 5400RPM drives because their seeks are sooo sloow.
>>
>> If the problem is that the bgs are touched/iterated in "tree order",
>> would it then not be possible to sort the block groups in physical order
>> before trying to load whatever mount needs to load?
> 
> This is in fact a good idea.
> Make block group into its own tree.

Well, that's not what I was thinking about at all..yet. :)
(keep in mind I'm not really that familiar with the internals).

Out of curiosity I ran a bit of perf on my own mount process, which is
fast (~700 ms) despite being a ~1.1TB fs, mixture of lots of large and
small files. Unfortunately it's also very fresh since I recreated it just
this weekend, so everything is neatly packed together and fast.

In contrast a friend's fs is ~800 GB, but has 11 GB metadata and is pretty
old and fragmented (but running an up-to-date kernel). His fs mounts in ~5s.

My perf run shows that the only interesting part responsible for mount time
is the nested loop in btrfs_read_block_groups calling find_first_block_group
(which got inlined & is not in the perf callgraph) over and over again,
accounting for 75% of time spent.

I now understand your comment why the real solution to this problem
is to move bgs into their own tree, and agree: both kitchens and databases
have figured out a long time ago that the key to fast scan and lookup
performance is to not put different things in the same storage container;
in the case of analytical DBMS this is columnar storage. :)

But what I originally meant was something much simpler and more
brute-force-ish. I see that btrfs_read_block_groups adds readahead
(is that actually effective?) but what I was looking for was the equivalent
of a DBMS' sequential scan. Right now finding (and loading) a bg seems to
involve a nested loop of tree lookups. It seems easier to rip through the
entire tree in nice 8MB chunks and discard what you don't need instead
of seeking around trying to find all the right bits in scattered order.

Could we alleviate cold mounts by starting more readaheads in
btrfs_read_block_groups, so that the extent tree is scanned more linearly?

cheers,
Holger

signature.asc
Description: OpenPGP digital signature

Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2018-02-13 Thread Holger Hoffstätte

On 02/13/18 12:40, Qu Wenruo wrote:
>>> The problem is not about how much space it takes, but how many extents
>>> are here in the filesystem.

I have no idea why btrfs' mount even needs to touch all block groups to
get going (which seems to be the root of the problem), but here's a
not so crazy idea for more "mechanical sympathy". Feel free to mock
me if this is terribly wrong or not possible. ;)

Mounting of even large filesystems (with many extents) seems to be fine
on SSDS, but not so fine on rotational storage. We've heard that from
several people with large (multi-TB) filesystems, and obviously it's
even more terrible on 5400RPM drives because their seeks are sooo sloow.

If the problem is that the bgs are touched/iterated in "tree order",
would it then not be possible to sort the block groups in physical order
before trying to load whatever mount needs to load? That way the entire
process would involve less seeking (no backward seeks for one) and the
drive could very likely get more done during a rotation before stepping
further.

cheers,
Holger

signature.asc
Description: OpenPGP digital signature

Re: Can rsync use reflink to synchronize directories?

2018-01-17 Thread Holger Hoffstätte

On 01/17/18 18:35, MegaBrutal wrote:
> Does rsync support copying files with reflink on local btrfs file
> system? Of course it could only work if the necessary conditions for
> reflinking are met, but it would be very useful.

Not yet: https://bugzilla.samba.org/show_bug.cgi?id=10170

-h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/8] blk-mq: protect completion path with RCU

2018-01-08 Thread Holger Hoffstätte

On 01/09/18 00:27, Holger Hoffstätte wrote:
> On 01/08/18 23:55, Jens Axboe wrote:
>> the good old
>>
>> int srcu_idx = srcu_idx;
>>
>> should get the job done.
> 
> (Narrator: It didn't.)

Narrator: we retract our previous statement and apologize for the
confusion. It works fine when you correctly replace all occurrences,
not just one. <:)

Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/8] blk-mq: protect completion path with RCU

2018-01-08 Thread Holger Hoffstätte

On 01/08/18 20:15, Tejun Heo wrote:
> Currently, blk-mq protects only the issue path with RCU.  This patch
> puts the completion path under the same RCU protection.  This will be
> used to synchronize issue/completion against timeout by later patches,
> which will also add the comments.
> 
> Signed-off-by: Tejun Heo 
> ---
>  block/blk-mq.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ddc9261..6741c3e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -584,11 +584,16 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int 
> *srcu_idx)
>  void blk_mq_complete_request(struct request *rq)
>  {
>   struct request_queue *q = rq->q;
> + struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> + int srcu_idx;
>  
>   if (unlikely(blk_should_fake_timeout(q)))
>   return;
> +
> + hctx_lock(hctx, _idx);
>   if (!blk_mark_rq_complete(rq))
>   __blk_mq_complete_request(rq);
> + hctx_unlock(hctx, srcu_idx);
>  }
>  EXPORT_SYMBOL(blk_mq_complete_request);

So I've had v3 running fine with 4.14++ and when I first tried Jens'
additional helpers on top, I got a bunch of warnings which I didn't
investigate further at the time. Now they are back since the helpers
moved into patch #1 and #2 correctly says:

..
block/blk-mq.c: In function ‘blk_mq_complete_request’:
./include/linux/srcu.h:175:2: warning: ‘srcu_idx’ may be used uninitialized in 
this function [-Wmaybe-uninitialized]
  __srcu_read_unlock(sp, idx);
  ^~~
block/blk-mq.c:587:6: note: ‘srcu_idx’ was declared here
  int srcu_idx;
  ^~~~
..etc.

This is with gcc 7.2.0.

I understand that this is a somewhat-false positive since the lock always
precedes the unlock & writes to the value, but can we properly initialize
or annotate this?

cheers
Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: WARNING: CPU: 1 PID: 3016 at fs/btrfs/ctree.h:1564 btrfs_update_device+0x189/0x190 [btrfs]

2017-12-29 Thread Holger Hoffstätte


Apply the patch from https://patchwork.kernel.org/patch/9960893/
and follow the logged instructions re. device resizing (or see
https://bugzilla.kernel.org/show_bug.cgi?id=196949 for examples).

The patch is unfortunately not yet merged into 4.15rc, otherwise it
could be sent to 4.14-stable.

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: kernel hangs during balance

2017-12-20 Thread Holger Hoffstätte

On 12/20/17 20:02, Chris Murphy wrote:
> I don't know if it's the sending MUA or the list server, but the line
> wrapping makes this much harder to follow. I suggest putting it in a
> text file and attaching the text file. It's definitely not on the
> receiving side, I see it here also:
> https://www.spinics.net/lists/linux-btrfs/msg72872.html

You can see enough to suggest that blk-mq is hanging, which is
"unsurprising" (being kind here) with such an old kernel.
Rich, build your kernel with CONFIG_SCSI_MQ_DEFAULT=n or
boot with scsi_mod.use_blk_mq=n as kernel prarameter.

-h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: btrfs-transacti hammering the system

2017-12-01 Thread Holger Hoffstätte

On 12/01/17 18:34, Matt McKinnon wrote:
> Thanks, I'll give space_cache=v2 a shot.

Yes, very much recommended.

> My mount options are: rw,relatime,space_cache,autodefrag,subvolid=5,subvol=/

Turn autodefrag off and use noatime instead of relatime.

Your filesystem also seems very full, that's bad with every filesystem but
*especially* with btrfs because the allocator has to work really hard to find
free space for COWing. Really consider deleting stuff or adding more space.

-h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2.1 2/2] btrfs: Ensure btrfs_trim_fs can trim the whole fs

2017-11-29 Thread Holger Hoffstätte

On Tue, 28 Nov 2017 15:20:39 +0800, Qu Wenruo wrote:

> [BUG]
> fstrim on some btrfs only trims the unallocated space, not trimming any
> space in existing block groups.
> 
> [CAUSE]
> Before fstrim_range passed to btrfs_trim_fs(), it get truncated to
> range [0, super->total_bytes).
> So later btrfs_trim_fs() will only be able to trim block groups in range
> [0, super->total_bytes).
> 
> While for btrfs, any bytenr aligned to sector size is valid, since btrfs use
> its logical address space, there is nothing limiting the location where
> we put block groups.
> 
> For btrfs with routine balance, it's quite easy to relocate all
> block groups and bytenr of block groups will start beyond super->total_bytes.
> 
> In that case, btrfs will not trim existing block groups.
> 
> [FIX]
> Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
> can get the unmodified range, which is normally set to [0, U64_MAX].
> 
> Reported-by: Chris Murphy 
> Fixes: f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in fitrim 
> ioctl")
> Cc:  # v4.0+
> Signed-off-by: Qu Wenruo 
> ---
> changelog:
> v2:
>   Locate the root cause in btrfs_ioctl_fitrim(), remove the truncation
>   so we can still allow user to trim custom range.
> v2.1:
>   Include the missing change in btrfs_trim_fs() and update the commit
>   message to reflect this.
> ---
>  fs/btrfs/extent-tree.c |  9 +
>  fs/btrfs/ioctl.c   | 11 +++
>  2 files changed, 8 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index f830aa91ac3d..f74958e11008 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10972,14 +10972,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
> struct fstrim_range *range)
>   int dev_ret = 0;
>   int ret = 0;
>  
> - /*
> -  * try to trim all FS space, our block group may start from non-zero.
> -  */
> - if (range->len == total_bytes)
> - cache = btrfs_lookup_first_block_group(fs_info, range->start);
> - else
> - cache = btrfs_lookup_block_group(fs_info, range->start);
> -
> + cache = btrfs_lookup_first_block_group(fs_info, range->start);
>   for (; cache; cache = next_block_group(fs_info, cache)) {
>   if (cache->key.objectid >= (range->start + range->len)) {
>   btrfs_put_block_group(cache);
(..snip..)

This first hunk should also remove total_bytes, otherwise we get an
unused-variable warning:

@@ -10972,19 +10972,11 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
struct fstrim_range *range)
u64 start;
u64 end;
u64 trimmed = 0;
-   u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
int bg_ret = 0;
int dev_ret = 0;
int ret = 0;
..etc..
 
cheers,
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Kernel 4.14 RAID5 multi disk array on bcache not mounting

2017-11-22 Thread Holger Hoffstätte

On 11/21/17 23:22, Lionel Bouton wrote:
> Le 21/11/2017 à 23:04, Andy Leadbetter a écrit :
>> I have a 4 disk array on top of 120GB bcache setup, arranged as follows
> [...]
>> Upgraded today to 4.14.1 from their PPA and the
> 
> 4.14 and 4.14.1 have a nasty bug affecting bcache users. See for example
> :
> https://www.reddit.com/r/linux/comments/7eh2oz/serious_regression_in_linux_414_using_bcache_can/

4.14.2 (just out as rc1) will have the fix.

-h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 4.13.12: kernel BUG at fs/btrfs/ctree.h:1802!

2017-11-17 Thread Holger Hoffstätte

On 11/17/17 18:48, Marc MERLIN wrote:
> On Thu, Nov 16, 2017 at 09:53:15PM -0800, Marc MERLIN wrote:
>>> I suggest that you try lvmcache instead. It's much more flexible than 
>>> bcache,
>>> does pretty much the same job, and has much less of the "hacky" feel to it.
>>
>> I can read up on it, it's going to be a big pain to convert from one to
>> the other, but I can look at it for new filesystems.
> 
> I had a quick read. As expected, it's slower since it goes through all
> the LVM overhead that I got rid of recently
> https://github.com/stec-inc/EnhanceIO/wiki/PERFORMANCE-COMPARISON-AMONG-dm-cache,-bcache-and-EnhanceIO
> 
> Given the pain it would be for me to switch, I'm going to stick with
> bcache and hope it improves.

My understanding is that bcache until recently was more or less unmaintained,
but got quite a few patches for 4.14 and now has a new maintainer as well.
So things are looking up.

-h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 4.13.12: kernel BUG at fs/btrfs/ctree.h:1802!

2017-11-16 Thread Holger Hoffstätte

On 11/16/17 22:45, Marc MERLIN wrote:
(snip)
>> This BUG() was recently removed and seems to be caused by some kind
>> of persistent corruption, which is seen as invalid inline extent.
>> See [1], [2] for details. Maybe you can backport them?
>> Alternatively just give 4.14 a whirl, it's great.
>>
>> -h
>>
>> [1] 
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=167ce953ca55bdee20fe56c3c0fa51002435f745
>> [2] 
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4335958de2a43c6790c7f6aa0682aa7189983fa4
> 
> First thanks a lot for the quick reply, it was super timely considering
> my server was rebooting every 20mn :)
> I've now been running 4.14 for a couple of hours, and things seem ok
> btrfs-wise.

Don't pop the champagne just yet, I just read that apprently 4.14 broke
bcache for some people [1]. Not sure how much that affects you, but it might
well make things worse. Yeah, I know, wonderful.

> So, just so that I understand:
> 1) I do have some kind of FS problem/corruption (minor? major?)

All I know is what's in those commits, I just remembered the description. ;)
If I understand the patches correctly you're still supposed to get an
"invalid extent inline ref type" message.

> 2) it started crashing 4.9.36 and then 4.13 today, every 20mn, probably due 
> to some background
> cleaner process that kept starting and hitting the problem spot

Sounds like.

> 3) 4.14 does not crash anymore, but it doesn't even report any problem 
> either. Does it mean
> the error that crashed the old kernel is minor enough that the new kernel 
> doesn't bother even
> logging it?

See above, you should still get a warning. OTOH it's hard to tell what is
going on when you seem to have dm/dmcrypt/bcache lasagne going on..

> 4) I just ran scrub on the filesystem and it ran fine.

That's not too depressing. :)

> I'm asusming that running btrfs check --force on a mounted filesystem
> that is being used is not going to give useful results, unless I leave
> the FS read only. Correct?

Think so, yes.

> As for 4.14, the serial console code seems broken though, I can't get login 
> or bash
> to work anymore on them:
> [ 2786.305004] INFO: task login:5636 blocked for more than 120 seconds.
> [ 2786.324648]   Tainted: G U  W   
> 4.14.0-amd64-stkreg-sysrq-20171018 #1
> [ 2786.347692] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [ 2786.371742] login   D0  5636  1 0xa0020006

I'm out. :/

-h

[1] https://marc.info/?t=15108212601
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 4.13.12: kernel BUG at fs/btrfs/ctree.h:1802!

2017-11-16 Thread Holger Hoffstätte

On 11/16/17 18:07, Marc MERLIN wrote:
> Sorry, was missing the kernel number in the subject, just fixed that.
> 
> On Thu, Nov 16, 2017 at 09:04:45AM -0800, Marc MERLIN wrote:
>> My server now reboots every 20mn or so, with this.
>> Sadly another BUG_ON() and it won't even tell me which filesystem
>> it's on
>>
>> static inline u32 btrfs_extent_inline_ref_size(int type)
>> {
>>  if (type == BTRFS_TREE_BLOCK_REF_KEY ||
>>  type == BTRFS_SHARED_BLOCK_REF_KEY)
>>  return sizeof(struct btrfs_extent_inline_ref);
>>  if (type == BTRFS_SHARED_DATA_REF_KEY)
>>  return sizeof(struct btrfs_shared_data_ref) +
>> sizeof(struct btrfs_extent_inline_ref);
>>  if (type == BTRFS_EXTENT_DATA_REF_KEY)
>>  return sizeof(struct btrfs_extent_data_ref) +
>> offsetof(struct btrfs_extent_inline_ref, offset);
>>  BUG();
>>  return 0;
>> }

This BUG() was recently removed and seems to be caused by some kind
of persistent corruption, which is seen as invalid inline extent.
See [1], [2] for details. Maybe you can backport them?
Alternatively just give 4.14 a whirl, it's great.

-h

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=167ce953ca55bdee20fe56c3c0fa51002435f745
[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4335958de2a43c6790c7f6aa0682aa7189983fa4
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs-progs: add option to only list parent subvolumes

2017-09-30 Thread Holger Hoffstätte

On 09/30/17 19:56, Holger Hoffstätte wrote:
> shell hackery as alternative. Anyway, I was sure that at the time the
> other letters sounded even worse/were taken, but that may just have been
> in my head. ;-)
> 
> I just rechecked and -S is still available, so that's good.

Except that it isn't really, since there is already an 'S'
case in cmds-subvolume.c as shortcut to --sort:

--
case 'S':
ret = btrfs_list_parse_sort_string(optarg,
   _set);
--

..which is why I picked P at the time.

Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs-progs: add option to only list parent subvolumes

2017-09-30 Thread Holger Hoffstätte

On 09/30/17 18:37, Graham Cobb wrote:
> On 30/09/17 14:08, Holger Hoffstätte wrote:
>> A "root" subvolume is identified by a null parent UUID, so adding a new
>> subvolume filter and flag -P ("Parent") does the trick. 
> 
> I don't like the naming. The flag you are proposing is really nothing to
> do with whether a subvolume is a parent or not: it is about whether it
> is a snapshot or not (many subvolumes are both snapshots and also
> parents of other snapshots, and many non-snapshots are not the parent of
> any subvolumes).

You're completely right. I wrote that patch about a year ago because I
needed it for my own homegrown backup program and spent approx. 5 seconds
finding a free option letter; ultimately I ended up using the mentioned
shell hackery as alternative. Anyway, I was sure that at the time the
other letters sounded even worse/were taken, but that may just have been
in my head. ;-)

I just rechecked and -S is still available, so that's good.

> I have two suggestions:
> 
> 1) Use -S (meaning "not a snapshot", the inverse of -s). Along with this
> change. I would change the usage text to say something like:
> 
>  -s list subvolumes originally created as snapshots
>  -S list subvolumes originally created not as snapshots
> 
> Presumably specifying both -s and -S should be an error.

That sounds much better indeed and is a quick fix. While I agree
that the "-P /none" filter would be useful too, it's also
a different feature and more work than I want to invest in this
right now. Maybe later "-S" can simply delegate to "-P none".

cheers
Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] btrfs-progs: right-align number columns in btrfs-debugfs output

2017-09-30 Thread Holger Hoffstätte


The values for block group offset, length etc. in btrfs-debugfs' output
are left-aligned, which creates unaligned output and makes the usage
percentage hard to read/process further. This patch adds right-aligning
format specifiers for the number values.
Ideally the format values wouldn't be hardcoded but instead derived from
the filesystem size, but this seems to work for now.

Signed-off-by: Holger Hoffstätte <hol...@applied-asynchrony.com>
---
 btrfs-debugfs | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/btrfs-debugfs b/btrfs-debugfs
index dfb88539..a7ecd16d 100755
--- a/btrfs-debugfs
+++ b/btrfs-debugfs
@@ -356,8 +356,13 @@ def print_block_groups(mountpoint):
 
 ctypes.memmove(ctypes.addressof(bg), p, ctypes.sizeof(bg))
 if bg.flags & BTRFS_BLOCK_GROUP_DATA:
-print "block group offset %Lu len %Lu used %Lu 
chunk_objectid %Lu flags %Lu usage %.2f" %\
- (header.objectid, header.offset, bg.used, 
bg.chunk_objectid, bg.flags, float(bg.used) / float(header.offset))
+print "block group offset %s len %s used %s chunk_objectid 
%Lu flags %Lu usage %.2f" %\
+ ('{:>14}'.format(header.objectid),
+  '{:>10}'.format(header.offset),
+  '{:>10}'.format(bg.used),
+  bg.chunk_objectid,
+  bg.flags,
+  float(bg.used) / float(header.offset))
 
 total_free += (header.offset - bg.used)
 if min_used >= bg.used:
-- 
2.14.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] btrfs-progs: add option to only list parent subvolumes

2017-09-30 Thread Holger Hoffstätte

Hi,

When listing subvolumes it can be useful to filter out any snapshots;
surprisingly enough I couldn't find an option to do this easily, only
the opposite (list only snapshots).

A "root" subvolume is identified by a null parent UUID, so adding a new
subvolume filter and flag -P ("Parent") does the trick. 

The same can of course be accomplished with shell hackery, e.g.:

 btrfs subvol list -q -o  | grep -i "parent_uuid -" | cut -d " " -f 11

but a built-in way seems less fragile.

I originally cooked this up for myself, but David Sterba encouraged me to
send this to the list, so here it is. I'm not too proud of the -P but
couldn't find a better option letter; suggestions welcome. :)

cheers,
Holger

Signed-off-by: Holger Hoffstätte <hol...@applied-asynchrony.com>
---
 btrfs-list.c | 6 ++
 btrfs-list.h | 1 +
 cmds-subvolume.c | 8 +++-
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/btrfs-list.c b/btrfs-list.c
index 4cc2ed49..6aa7a290 100644
--- a/btrfs-list.c
+++ b/btrfs-list.c
@@ -1175,6 +1175,11 @@ static int filter_deleted(struct root_info *ri, u64 data)
return ri->deleted;
 }
 
+static int filter_parent_subvol_only(struct root_info *ri, u64 data)
+{
+   return uuid_is_null(ri->puuid);
+}
+
 static btrfs_list_filter_func all_filter_funcs[] = {
[BTRFS_LIST_FILTER_ROOTID]  = filter_by_rootid,
[BTRFS_LIST_FILTER_SNAPSHOT_ONLY]   = filter_snapshot,
@@ -1189,6 +1194,7 @@ static btrfs_list_filter_func all_filter_funcs[] = {
[BTRFS_LIST_FILTER_FULL_PATH]   = filter_full_path,
[BTRFS_LIST_FILTER_BY_PARENT]   = filter_by_parent,
[BTRFS_LIST_FILTER_DELETED] = filter_deleted,
+   [BTRFS_LIST_FILTER_PARENT_SUBVOL_ONLY]  = filter_parent_subvol_only,
 };
 
 struct btrfs_list_filter_set *btrfs_list_alloc_filter_set(void)
diff --git a/btrfs-list.h b/btrfs-list.h
index 13f44c3a..54aab123 100644
--- a/btrfs-list.h
+++ b/btrfs-list.h
@@ -142,6 +142,7 @@ enum btrfs_list_filter_enum {
BTRFS_LIST_FILTER_FULL_PATH,
BTRFS_LIST_FILTER_BY_PARENT,
BTRFS_LIST_FILTER_DELETED,
+   BTRFS_LIST_FILTER_PARENT_SUBVOL_ONLY,
BTRFS_LIST_FILTER_MAX,
 };
 
diff --git a/cmds-subvolume.c b/cmds-subvolume.c
index e7ef67d3..2371338e 100644
--- a/cmds-subvolume.c
+++ b/cmds-subvolume.c
@@ -404,6 +404,7 @@ static const char * const cmd_subvol_list_usage[] = {
"-q   print the parent uuid of the snapshots",
"-R   print the uuid of the received snapshots",
"-t   print the result as a table",
+   "-P   list parent subvolumes only",
"-s   list snapshots only in the filesystem",
"-r   list readonly subvolumes (including snapshots)",
"-d   list deleted subvolumes that are not yet cleaned",
@@ -445,7 +446,7 @@ static int cmd_subvol_list(int argc, char **argv)
};
 
c = getopt_long(argc, argv,
-   "acdgopqsurRG:C:t", long_options, NULL);
+   "acdgopPqsurRG:C:t", long_options, NULL);
if (c < 0)
break;
 
@@ -473,6 +474,11 @@ static int cmd_subvol_list(int argc, char **argv)
case 't':
is_tab_result = 1;
break;
+   case 'P':
+   btrfs_list_setup_filter(_set,
+   
BTRFS_LIST_FILTER_PARENT_SUBVOL_ONLY,
+   0);
+   break;
case 's':
btrfs_list_setup_filter(_set,
BTRFS_LIST_FILTER_SNAPSHOT_ONLY,
-- 
2.14.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: kernel BUG at fs/btrfs/extent_io.c:1989

2017-09-18 Thread Holger Hoffstätte

On 09/18/17 19:09, Liu Bo wrote:
> This 'mirror 0' looks fishy, (as mirror comes from
> btrfs_io_bio->mirror_num, which should be at least 1 if raid1 setup is
> in use.)
> 
> Not sure if 4.13.2-gentoo made any changes on btrfs, but can you

No, it did not; Gentoo always strives to be as close to mainline as
possible except for urgent security & low-risk convenience fixes.

-h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: fix unexpected result when dio reading corrupted blocks

2017-09-18 Thread Holger Hoffstätte


Hello, quick question for backporting..

On 09/15/17 23:06, Liu Bo wrote:
> commit 4246a0b63bd8 ("block: add a bi_error field to struct bio")
> changed the logic of how dio read endio reports errors.
[snip]

I've tried to merge this into my 4.9.x++ tree but have a question since
the DIO APIs changed recently and itt's hard to tell what is a bug
and what is a feature.. :-/

> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8267,11 +8267,8 @@ static void btrfs_endio_direct_read(struct bio *bio)
>   struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
>   blk_status_t err = bio->bi_status;
>  
> - if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED) {
> + if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED)
>   err = btrfs_subio_endio_read(inode, io_bio, err);
> - if (!err)
> - bio->bi_status = 0;
> - }
>  
>   unlock_extent(_I(inode)->io_tree, dip->logical_offset,
> dip->logical_offset + dip->bytes - 1);

This hunk is fairly easy, just reverse bi_status to bi->error.
However..

> @@ -8279,7 +8276,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
>  
>   kfree(dip);
>  
> - dio_bio->bi_status = bio->bi_status;
> + dio_bio->bi_status = err;
>   dio_end_io(dio_bio);
^^^

Same here, except that the call to dio_end_io used to take a second parameter
(the error code, which has been moved into bi_status in 4.10+) and looked
like this:

dio_end_io(dio_bio, bio->bi_error);

Given that "bio->bi_error" should have been "err" instead, I think err should
also be passed to dio_end_io(), so that the whole hunk would look like:

..
-dio_bio->bi_error = bio->bi_error;
-dio_end_io(dio_bio, bio->bi_error);
+dio_bio->bi_error = err;
+dio_end_io(dio_bio, err);
..

Would this be correct or did I misunderstand some subtle aspect about the
DIO error handling?

Thanks :)

Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Odd fallocate behavior on BTRFS.

2017-08-01 Thread Holger Hoffstätte

On 08/01/17 20:15, Holger Hoffstätte wrote:
> On 08/01/17 19:34, Austin S. Hemmelgarn wrote:
> [..]
>> Apparently, if you call fallocate() on a file with an offset of 0 and
>> a length longer than the length of the file itself, BTRFS will
>> allocate that exact amount of space, instead of just filling in holes
>> in the file and allocating space to extend it.  If there isn't enough
>> space on the filesystem for this, then it will fail, even though it
>> would succeed on ext4, XFS, and F2FS.
> [..]
>> I'm curious to hear anybody's thoughts on this, namely: 1. Is this
>> behavior that should be considered implementation defined? 2. If not,
>> is my assessment that BTRFS is behaving incorrectly in this case
>> accurate?
> 
> IMHO no and yes, respectively. Both fallocate(2) and posix_fallocate(3)
> make it very clear that the expected default behaviour is to extend.
> I don't think this can be interpreted in any other way than incorrect
> behaviour on behalf of btrfs.
> 
> Your script reproduces for me, so that's a start.

Your reproducer should never ENOSPC because it requires exactly 0 new bytes
to be allocated, yet it also fails with --keep-size.

>From a quick look it seems that btrfs_fallocate() unconditionally calls
btrfs_alloc_data_chunk_ondemand(inode, alloc_end - alloc_start) to lazily
allocate the necessary extent(s), which goes ENOSPC because that size
is again the full size of the requested range, not the difference between
the existing file size and the new range length. 
But I might be misreading things..

-h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Odd fallocate behavior on BTRFS.

2017-08-01 Thread Holger Hoffstätte

On 08/01/17 19:34, Austin S. Hemmelgarn wrote:
[..]
> Apparently, if you call fallocate() on a file with an offset of 0 and
> a length longer than the length of the file itself, BTRFS will
> allocate that exact amount of space, instead of just filling in holes
> in the file and allocating space to extend it.  If there isn't enough
> space on the filesystem for this, then it will fail, even though it
> would succeed on ext4, XFS, and F2FS.
[..]
> I'm curious to hear anybody's thoughts on this, namely: 1. Is this
> behavior that should be considered implementation defined? 2. If not,
> is my assessment that BTRFS is behaving incorrectly in this case
> accurate?

IMHO no and yes, respectively. Both fallocate(2) and posix_fallocate(3)
make it very clear that the expected default behaviour is to extend.
I don't think this can be interpreted in any other way than incorrect
behaviour on behalf of btrfs.

Your script reproduces for me, so that's a start.

-h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs-progs: add support to sort by topid

2017-07-13 Thread Holger Hoffstätte

On 07/13/17 15:25, Jeffrey Mahoney wrote:
> On 7/13/17 12:16 AM, Anand Jain wrote:
>> As users generally organize the subvols and snapshots based on the subvol
>> directory hierarchy. So providing an ability to sort them by topid would
>> help. Thanks.
> 
> What is a topid?  I needed to look at the code to discover this and it's
> not even documented as part of the root_info structure.  Users are going
> to have no idea what this means.

The idea seems similar to my attempt to add listing of toplevel-only
(i.e. non-snapshot) subvolumes, which can be found here:

https://github.com/hhoffstaette/btrfs-progs/commit/f0a065e02b

I ended up not submitting it since nobody seemed interested at the time
and I could work around it for my own purposes by filtering manually on
parent uuid (more precisely the lack thereof).

cheers
Holger

signature.asc
Description: OpenPGP digital signature

Re: [PATCH] btrfs: fix integer overflow in calc_reclaim_items_nr

2017-06-23 Thread Holger Hoffstätte

On 06/23/17 16:32, Chris Mason wrote:
[..]
> -static inline int calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
> +static inline u64 calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
>   u64 to_reclaim)
>  {
>   u64 bytes;
> - int nr;
> + u64 nr;
>  
>   bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
>   nr = (int)div64_u64(to_reclaim, bytes);
^^

Isn't this a bit odd? I can't even tell whether it matters, just randomly
noticed it because I genuinely dislike type casts..

-h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-06-18 Thread Holger Hoffstätte

On 06/18/17 13:23, Qu Wenruo wrote:
> Well, still no good news.
> 
> I created file extents which returns 0x2008 and the last extent with 0x9.
> But still failed to reproduce the error message.
> 
> BTW, I noticed that your output is a little strange.
> Normally we should have "offset=%llu phys=%llu len=%llu and flags=0x%x".
> 
> But your output seems a little out of shape.
> And further more, the len (if it's correct) is not aligned even to 512 bytes.
> 
> Seems something went wrong totally.

That seems to be a mail decoding/display problem at your side.
Adam's original mail contained:

  offset=0 phys=2435798867968 len=131072 flags=0x2008

So length looks fine.

-h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/7] Btrfs: fix total_bytes_pinned counter

2017-06-07 Thread Holger Hoffstätte

On 06/07/17 01:45, Omar Sandoval wrote:
> From: Omar Sandoval <osan...@fb.com>
> 
> This series fixes several problems with the total_bytes_pinned counter.
> Patches 1 and 2 are cleanups. Patches 3 and 4 are straightforward fixes.
> Patch 5 is prep for patch 6. Patch 6 is the most complicated fix.
> Patches 5 and 6 are ugly, I'd love any suggestions for a cleaner fix.
> Finally, patch 7 adds a warning to catch similar issues in the future.

Since this didn't really change the code significantly from the previous
version (except for the addition of minor patches 2+7) have a

Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com>

Thanks!
Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: skip commit transaction if we don't have enough pinned bytes

2017-06-03 Thread Holger Hoffstätte

On 06/02/17 20:14, Omar Sandoval wrote:
> On Fri, May 19, 2017 at 11:39:15AM -0600, Liu Bo wrote:
>> We commit transaction in order to reclaim space from pinned bytes because
>> it could process delayed refs, and in may_commit_transaction(), we check
>> first if pinned bytes are enough for the required space, we then check if
>> that plus bytes reserved for delayed insert are enough for the required
>> space.
>>
>> This changes the code to the above logic.
>>
>> Signed-off-by: Liu Bo 
>> ---
>>  fs/btrfs/extent-tree.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index e390451c72e6..bded1ddd1bb6 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4837,7 +4837,7 @@ static int may_commit_transaction(struct btrfs_fs_info 
>> *fs_info,
>>  
>>  spin_lock(_rsv->lock);
>>  if (percpu_counter_compare(_info->total_bytes_pinned,
>> -   bytes - delayed_rsv->size) >= 0) {
>> +   bytes - delayed_rsv->size) < 0) {
>>  spin_unlock(_rsv->lock);
>>  return -ENOSPC;
>>  }
> 
> I found this bug in my latest enospc investigation, too. However, the
> total_bytes_pinned counter is pretty broken. E.g., on my laptop:
> 
> $ sync; grep -H '' /sys/fs/btrfs/*/allocation/*/total_bytes_pinned
> /sys/fs/btrfs/f31cc926-37d3-442d-b50a-10c62d47badc/allocation/data/total_bytes_pinned:48693501952
> /sys/fs/btrfs/f31cc926-37d3-442d-b50a-10c62d47badc/allocation/metadata/total_bytes_pinned:-258146304
> /sys/fs/btrfs/f31cc926-37d3-442d-b50a-10c62d47badc/allocation/system/total_bytes_pinned:0
> 
> I have a patch to fix it that I haven't cleaned up yet, below. Without
> it, Bo's fix will probably cause early ENOSPCs. Dave, should we pull
> Bo's patch out of for-next? In any case, I'll get my fix sent out.
[..patch snipped..]

This made me curious since I also found the underflowed metadata counter
on my system. I tried to reproduce it after un/remount (to reset the counters)
and noticed that I could reliably cause the metadata underflow by defragging
a few large subvolumes, which apparently creates enough extent tree movement
that the counter quickly goes bananas. It took some backporting, but with your
patch applied I can defrag away and have so far not seen a single counter
underflow; all of data/metadata/system are always positive after writes and
then reliably drop to 0 after sync/commit. Nice!
Just thought you'd want to know.

Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Question re. required since btrfs-progs-4.10

2017-03-11 Thread Holger Hoffstätte

Hi,

I'm on Gentoo and wanted to update Docker to 17.03.0, which failed when it
couldn't build the btrfs driver due to a missing .
This worked fine on another machine the other day, so I dug in and found that
the only difference was an intermediate update to btrfs-progs 4.10.
Sure enough: since 4.10 ctree.h includes , which is nowhere
to be found in my package of linux-headers-4.10 (also not in 4.9).

Simply copying sizes.h from the 4.9 kernel sources made everything work,
but I'm curious whether this is actually correct - hence my question whether
 _should_ actually be installed? Is this a bug with
Gentoo's kernel header package or a new problem with btrfs-progs?

I don't see any good reason why sizes.h should not be installed, but wanted
to verify since there are probably users of other distributions here as
well. :)

Thanks,
Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: duperemove : some real world figures on BTRFS deduplication

2016-12-09 Thread Holger Hoffstätte

On 12/09/16 16:43, Chris Murphy wrote:
>> If compression has nothing to do with this, then this is heavy
>> fragmentation.
> 
> It's probably not that fragmented. Due to compression, metadata
> describes 128KiB extents even though the data is actually contiguous.
> 
> And it might be the same thing in my case also, even though no
> compression is involved.

In that case you can quickly collapse physically contiguous ranges by
reflink-mv'ing (ie. a recent mv) the file across subvolume boundaries
and back. :)

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: resend: Re: Btrfs: adjust len of writes if following a preallocated extent

2016-11-23 Thread Holger Hoffstätte

On 11/23/16 18:21, Stefan Priebe - Profihost AG wrote:
> Am 04.11.2016 um 20:20 schrieb Liu Bo:
>> If we have
>>
>> |0--hole--4095||4096--preallocate--12287|
>>
>> instead of using preallocated space, a 8K direct write will just
>> create a new 8K extent and it'll end up with
>>
>> |0--new extent--8191||8192--preallocate--12287|
>>
>> It's because we find a hole em and then go to create a new 8K
>> extent directly without adjusting @len.
> 
> after applying that one on top of my 4.4 btrfs branch (includes patches
> up to 4.10 / next). i'm getting deadlocks in btrfs.

*ctrl+f sectorsize* .. 

That's not surprising if you did what I suspect. If your tree is based
on my - now really very retired - 4.4.x queue, then you are likely missing
_all the other blocksize/sectorsize patches_ that came in from Chandra
Seetharaman et al., which I _really_ carefully patched around, for many
good reasons.

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: deal with existing encompassing extent map in btrfs_get_extent()

2016-11-10 Thread Holger Hoffstätte

On 11/10/16 17:20, Omar Sandoval wrote:
> On Thu, Nov 10, 2016 at 05:01:44PM +0100, Holger Hoffstätte wrote:
>> On 11/10/16 16:37, Omar Sandoval wrote:
>>> On Thu, Nov 10, 2016 at 04:11:35PM +0100, Holger Hoffstätte wrote:
>>>> On 11/10/16 00:26, Omar Sandoval wrote:
>>>>> From: Omar Sandoval <osan...@fb.com>
>>>>>
>>>>> My QEMU VM was seeing inexplicable I/O errors that I tracked down to
>>>>> errors coming from the qcow2 virtual drive in the host system. The qcow2
>>>>> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
>>>>> Every once in awhile, pread() or pwrite() would return EEXIST, which
>>>>> makes no sense. This turned out to be a bug in btrfs_get_extent().
>>>>>
>>>>> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
>>>>> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
>>>>> two threads race on adding the same extent map to an inode's extent map
>>>>> tree. However, if the added em is merged with an adjacent em in the
>>>>> extent tree, then we'll end up with an existing extent that is not
>>>>> identical to but instead encompasses the extent we tried to add. When we
>>>>> call merge_extent_mapping() to find the nonoverlapping part of the new
>>>>> em, the arithmetic overflows because there is no such thing. We then end
>>>>> up trying to add a bogus em to the em_tree, which results in a EEXIST
>>>>> that can bubble all the way up to userspace.
>>>>>
>>>>> Fix it by extending the identical extent map special case.
>>>>>
>>>>> Signed-off-by: Omar Sandoval <osan...@fb.com>
>>>>> ---
>>>>> Applies to 4.9-rc4.
>>>>>
>>>>> Here [1] is a reproducer for this bug that doesn't involve firing up a
>>>>> QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible
>>>>> to debug this on my laptop without compiling a custom kernel and
>>>>> rebooting just to add printks [3].
>>>>>
>>>>> 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b
>>>>
>>>> I can't really make this reproducer fail. It builds and runs fine, but just
>>>> exits with no messages (other than the one about drop_caches in dmesg).
>>>> It creates the 1MB file and always returns 0. Ideas?
>>>>
>>>> -h
>>>
>>> It's a race condition, so it doesn't happen 100% of the time. I imagine
>>> it depends on the storage speed, as well. On my laptop, which is
>>> dm-crypt on top of an SSD, it works about 50% of the time. Could you
>>> just try running it 100 times or something and see if it fails?
>>
>> $for i ($(seq 1 1000)) ./pread_eexist_repro /mnt/test/$i || echo "fail"
>>
>> ..couple of thousand runs without problem, only lots of fallocating and
>> cache dropping.
>>
>> Oh well, I tried. :)
>>
>> -h
> 
> Just out of curiousity, what kind of disk were you trying this on? I've
> only been able to trigger it on my laptop and a VM running on my laptop.

Tried on both an SSD and an old slowpoke 2.5" rotational disk on USB2.
But I also have a ton of other patches and a custom CPU scheduler, so
everything it likely my fault anyway. Don't sweat it. :)

>From what I can tell the explanation of the problem and the change itself
make sense. Would have been nice to be able to repro.

cheers,

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: deal with existing encompassing extent map in btrfs_get_extent()

2016-11-10 Thread Holger Hoffstätte

On 11/10/16 16:37, Omar Sandoval wrote:
> On Thu, Nov 10, 2016 at 04:11:35PM +0100, Holger Hoffstätte wrote:
>> On 11/10/16 00:26, Omar Sandoval wrote:
>>> From: Omar Sandoval <osan...@fb.com>
>>>
>>> My QEMU VM was seeing inexplicable I/O errors that I tracked down to
>>> errors coming from the qcow2 virtual drive in the host system. The qcow2
>>> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
>>> Every once in awhile, pread() or pwrite() would return EEXIST, which
>>> makes no sense. This turned out to be a bug in btrfs_get_extent().
>>>
>>> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
>>> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
>>> two threads race on adding the same extent map to an inode's extent map
>>> tree. However, if the added em is merged with an adjacent em in the
>>> extent tree, then we'll end up with an existing extent that is not
>>> identical to but instead encompasses the extent we tried to add. When we
>>> call merge_extent_mapping() to find the nonoverlapping part of the new
>>> em, the arithmetic overflows because there is no such thing. We then end
>>> up trying to add a bogus em to the em_tree, which results in a EEXIST
>>> that can bubble all the way up to userspace.
>>>
>>> Fix it by extending the identical extent map special case.
>>>
>>> Signed-off-by: Omar Sandoval <osan...@fb.com>
>>> ---
>>> Applies to 4.9-rc4.
>>>
>>> Here [1] is a reproducer for this bug that doesn't involve firing up a
>>> QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible
>>> to debug this on my laptop without compiling a custom kernel and
>>> rebooting just to add printks [3].
>>>
>>> 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b
>>
>> I can't really make this reproducer fail. It builds and runs fine, but just
>> exits with no messages (other than the one about drop_caches in dmesg).
>> It creates the 1MB file and always returns 0. Ideas?
>>
>> -h
> 
> It's a race condition, so it doesn't happen 100% of the time. I imagine
> it depends on the storage speed, as well. On my laptop, which is
> dm-crypt on top of an SSD, it works about 50% of the time. Could you
> just try running it 100 times or something and see if it fails?

$for i ($(seq 1 1000)) ./pread_eexist_repro /mnt/test/$i || echo "fail"

..couple of thousand runs without problem, only lots of fallocating and
cache dropping.

Oh well, I tried. :)

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: deal with existing encompassing extent map in btrfs_get_extent()

2016-11-10 Thread Holger Hoffstätte

On 11/10/16 16:06, David Sterba wrote:
> On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote:
>> From: Omar Sandoval 
>> [snip]
>> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
>> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
>> two threads race on adding the same extent map to an inode's extent map
>> tree. However, if the added em is merged with an adjacent em in the
>> extent tree, then we'll end up with an existing extent that is not
>> identical to but instead encompasses the extent we tried to add. When we
>> call merge_extent_mapping() to find the nonoverlapping part of the new
>> em, the arithmetic overflows because there is no such thing. We then end
>> up trying to add a bogus em to the em_tree, which results in a EEXIST
>> that can bubble all the way up to userspace.
> 
> Is this possibly the same bug that Liu Bo fixes in
> https://patchwork.kernel.org/patch/9413129/ ?

Seem similar, but I can't reproduce without that patch either..

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: deal with existing encompassing extent map in btrfs_get_extent()

2016-11-10 Thread Holger Hoffstätte

On 11/10/16 00:26, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> errors coming from the qcow2 virtual drive in the host system. The qcow2
> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> Every once in awhile, pread() or pwrite() would return EEXIST, which
> makes no sense. This turned out to be a bug in btrfs_get_extent().
> 
> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> two threads race on adding the same extent map to an inode's extent map
> tree. However, if the added em is merged with an adjacent em in the
> extent tree, then we'll end up with an existing extent that is not
> identical to but instead encompasses the extent we tried to add. When we
> call merge_extent_mapping() to find the nonoverlapping part of the new
> em, the arithmetic overflows because there is no such thing. We then end
> up trying to add a bogus em to the em_tree, which results in a EEXIST
> that can bubble all the way up to userspace.
> 
> Fix it by extending the identical extent map special case.
> 
> Signed-off-by: Omar Sandoval 
> ---
> Applies to 4.9-rc4.
> 
> Here [1] is a reproducer for this bug that doesn't involve firing up a
> QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible
> to debug this on my laptop without compiling a custom kernel and
> rebooting just to add printks [3].
> 
> 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b

I can't really make this reproducer fail. It builds and runs fine, but just
exits with no messages (other than the one about drop_caches in dmesg).
It creates the 1MB file and always returns 0. Ideas?

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2] btrfs: make block group flags in balance printks human-readable

2016-11-08 Thread Holger Hoffstätte

On 11/07/16 22:40, Adam Borowski wrote:
> They're not even documented anywhere, letting users with no recourse but
> to RTFS.  It's no big burden to output the bitfield as words.
> 
> Also, display unknown flags as hex.
> 
> Signed-off-by: Adam Borowski 
[..]
>  
>  /*
> + * explain bit flags, prefixed by a '|' that'll be dropped
> + */
> +static char *describe_block_group_flags(char *buf, u64 flags)
> +{
> +#define BUF_SIZE 128
> + char *buf0 = buf = kmalloc(BUF_SIZE, GFP_NOFS);
[..]

Maybe I'm missing some clever (?) trick here, but what's the point of passing
in a potentially uninitialized 'buf' when it's immediately reassigned locally,
and a new value is returned and assigned at the call site?
IMHO you'd probably either want to pass the buffer in or return it, but not
both - and in that case the allocation should probably be hoisted out
into the caller as well, if only to make things a bit more symmetric.

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: make block group flags in balance printks human-readable

2016-11-07 Thread Holger Hoffstätte

On 11/04/16 08:26, Adam Borowski wrote:
> They're not even documented anywhere, letting users with no recourse but
> to RTFS.  It's no big burden to output the bitfield as words.
> 
> Also, display unknown flags as hex.
> 
> Signed-off-by: Adam Borowski <kilob...@angband.pl>

Very helpful and works (for me) as advertised.

Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com>

> ---
>  fs/btrfs/relocation.c | 34 --
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 0ec8ffa..388216f 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -4326,6 +4326,34 @@ static struct reloc_control 
> *alloc_reloc_control(struct btrfs_fs_info *fs_info)
>  }
>  
>  /*
> + * explain bit flags, prefixed by a '|' that'll be dropped
> + */
> +static void describe_block_group_flags(char *buf, u64 flags)
> +{
> + if (!flags)
> + *buf += sprintf(buf, "|NONE");
> + else {
> +#define DESCRIBE_FLAG(f, d) \
> + if (flags & BTRFS_BLOCK_GROUP_##f) { \
> + buf += sprintf(buf, "|%s", d); \
> + flags &= ~BTRFS_BLOCK_GROUP_##f; \
> + }
> + DESCRIBE_FLAG(DATA, "data");
> + DESCRIBE_FLAG(SYSTEM,   "system");
> + DESCRIBE_FLAG(METADATA, "metadata");
> + DESCRIBE_FLAG(RAID0,"raid0");
> + DESCRIBE_FLAG(RAID1,"raid1");
> + DESCRIBE_FLAG(DUP,  "dup");
> + DESCRIBE_FLAG(RAID10,   "raid10");
> + DESCRIBE_FLAG(RAID5,"raid5");
> + DESCRIBE_FLAG(RAID6,"raid6");
> + if (flags)
> + buf += sprintf(buf, "|0x%llx", flags);
> + }
> + *buf = 0;
> +}
> +
> +/*
>   * function to relocate all extents in a block group.
>   */
>  int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 
> group_start)
> @@ -4337,6 +4365,7 @@ int btrfs_relocate_block_group(struct btrfs_root 
> *extent_root, u64 group_start)
>   int ret;
>   int rw = 0;
>   int err = 0;
> + char flags_str[128];
>  
>   rc = alloc_reloc_control(fs_info);
>   if (!rc)
> @@ -4381,9 +4410,10 @@ int btrfs_relocate_block_group(struct btrfs_root 
> *extent_root, u64 group_start)
>   goto out;
>   }
>  
> + describe_block_group_flags(flags_str, rc->block_group->flags);
>   btrfs_info(extent_root->fs_info,
> -"relocating block group %llu flags %llu",
> -rc->block_group->key.objectid, rc->block_group->flags);
> +"relocating block group %llu flags %s",
> +rc->block_group->key.objectid, flags_str+1);
>  
>   btrfs_wait_block_group_reservations(rc->block_group);
>   btrfs_wait_nocow_writers(rc->block_group);
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: imporve delayed refs iterations

2016-10-24 Thread Holger Hoffstätte

On 10/24/16 18:46, David Sterba wrote:
> On Fri, Oct 21, 2016 at 05:05:07PM +0800, Wang Xiaoguang wrote:
>> This issue was found when I tried to delete a heavily reflinked file,
>> when deleting such files, other transaction operation will not have a
>> chance to make progress, for example, start_transaction() will blocked
>> in wait_current_trans(root) for long time, sometimes it even triggers
>> soft lockups, and the time taken to delete such heavily reflinked file
>> is also very large, often hundreds of seconds. Using perf top, it reports
>> that:
> 
>> [...] With this patch, it just took about 10~15 seconds to
>> delte the same file.
> 
> Great improvement! Patch looks good on a quick skim so I'll add it to
> next, but proper review is still required.

If it helps, I've been running it for ~2 days now with no negative side
effects, mostly rsync creating & deleting files with various levels of
reflinking (via snapshots). No problems at all.
Also tried to manually create & delete a large file with heavy CoW and
hundreds of reflinked copies - no problem either and pretty fast.

So..

Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com>

cheers,
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: pass correct args to btrfs_async_run_delayed_refs()

2016-10-18 Thread Holger Hoffstätte

On Tue, 18 Oct 2016 15:56:13 +0800, Wang Xiaoguang wrote:

> In btrfs_truncate_inode_items()->btrfs_async_run_delayed_refs(), we
> swap the arg2 and arg3 wrongly, fix this.
> 
> Signed-off-by: Wang Xiaoguang <wangxg.f...@cn.fujitsu.com>
> ---
>  fs/btrfs/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 2b790bd..2f1372b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4605,8 +4605,8 @@ int btrfs_truncate_inode_items(struct 
> btrfs_trans_handle *trans,
>   BUG_ON(ret);
>   if (btrfs_should_throttle_delayed_refs(trans, root))
>   btrfs_async_run_delayed_refs(root,
> -  trans->transid,
> - trans->delayed_ref_updates * 2, 0);
> + trans->delayed_ref_updates * 2,
> + trans->transid, 0);
>   if (be_nice) {
>   if (truncate_space_check(trans, root,
>    extent_num_bytes)) {

Reviewed-by: Holger Hoffstätte <hol...@applied-asynchrony.com>

Passing the wrong transid..why did this ever work?

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] btrfs: fix false enospc for compression

2016-10-14 Thread Holger Hoffstätte

On 10/06/16 04:51, Wang Xiaoguang wrote:
> When testing btrfs compression, sometimes we got ENOSPC error, though fs
> still has much free space, xfstests generic/171, generic/172, generic/173,
> generic/174, generic/175 can reveal this bug in my test environment when
> compression is enabled.
> 
> After some debuging work, we found that it's btrfs_delalloc_reserve_metadata()
> which sometimes tries to reserve plenty of metadata space, even for very small
> data range. In btrfs_delalloc_reserve_metadata(), the number of metadata bytes
> we try to reserve is calculated by the difference between outstanding_extents
> and reserved_extents. Please see below case for how ENOSPC occurs:
> 
>   1, Buffered write 128MB data in unit of 128KB, so finially we'll have inode
> outstanding extents be 1, and reserved_extents be 1024. Note it's
> btrfs_merge_extent_hook() that merges these 128KB units into one big
> outstanding extent, but do not change reserved_extents.
> 
>   2, When writing dirty pages, for compression, cow_file_range_async() will
> split above big extent in unit of 128KB(compression extent size is 128KB).
> When first split opeartion finishes, we'll have 2 outstanding extents and 1024
> reserved extents, and just right now the currently generated ordered extent is
> dispatched to run and complete, then btrfs_delalloc_release_metadata()(see
> btrfs_finish_ordered_io()) will be called to release metadata, after that we
> will have 1 outstanding extents and 1 reserved extents(also see logic in
> drop_outstanding_extent()). Later cow_file_range_async() continues to handles
> left data range[128KB, 128MB), and if no other ordered extent was dispatched
> to run, there will be 1023 outstanding extents and 1 reserved extent.
> 
>   3, Now if another bufferd write for this file enters, then
> btrfs_delalloc_reserve_metadata() will at least try to reserve metadata
> for 1023 outstanding extents' metadata, for 16KB node size, it'll be 
> 1023*16384*2*8,
> about 255MB, for 64K node size, it'll be 1023*65536*8*2, about 1GB metadata, 
> so
> obviously it's not sane and can easily result in enospc error.
> 
> The root cause is that for compression, its max extent size will no longer be
> BTRFS_MAX_EXTENT_SIZE(128MB), it'll be 128KB, so current metadata reservation
> method in btrfs is not appropriate or correct, here we introduce:
>   enum btrfs_metadata_reserve_type {
>   BTRFS_RESERVE_NORMAL,
>   BTRFS_RESERVE_COMPRESS,
>   };
> and expand btrfs_delalloc_reserve_metadata() and 
> btrfs_delalloc_reserve_space()
> by adding a new enum btrfs_metadata_reserve_type argument. When a data range 
> will
> go through compression, we use BTRFS_RESERVE_COMPRESS to reserve metatata.
> Meanwhile we introduce EXTENT_COMPRESS flag to mark a data range that will go
> through compression path.
> 
> With this patch, we can fix these false enospc error for compression.
> 
> Signed-off-by: Wang Xiaoguang <wangxg.f...@cn.fujitsu.com>

I took some time again to get this into my tree on top of what's in
btrfs-4.9rc1 and managed to merge it after all.

Both this and patch #1 seem to work fine, and they don't seem to cause any
regressions; ran a couple of both full and incremental rsync backups with
>100GB on a new and now compressed subvolume without problem. Also Stefan
just reported that his ENOSPC seems to be gone as well, so it seems to be
good. \o/

So for both this and patch #1 have a careful:

Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com>

Also a comment about something I found while resolving conflicts caused
by the preliminary dedupe suppoort:

[..]
>  int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
> -   struct extent_state **cached_state);
> +   struct extent_state **cached_state, int flag);
>  int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
> - struct extent_state **cached_state);
> + struct extent_state **cached_state, int flag);
[..]
>  int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
> struct page **pages, size_t num_pages,
> loff_t pos, size_t write_bytes,
> -   struct extent_state **cached);
> +   struct extent_state **cached, int flag);

Instead of adding "int flag" why not use the already defined
btrfs_metadata_reserve_type enum? I know it's just an int at the end of
the day, but the dedupe support already added another "int dedupe" argument
and it's probably easy to cause confusion. 
Maybe later it would be beneficial to consolidate the flags into a consistent
set of enum values to prevent more "int flag" infla

Re: BTRFS: space_info 4 has 18446742286429913088 free, is not full

2016-10-07 Thread Holger Hoffstätte

On 10/07/16 09:17, Wang Xiaoguang wrote:
> Hi,
> 
> On 10/07/2016 03:03 PM, Stefan Priebe - Profihost AG wrote:
>> Dear Wang,
>>
>> can't use v4.8.0 as i always get OOMs and total machine crashes.
>>
>> Complete traces with your patch and some more btrfs patches applied (in
>> the hope in fixes the OOM but it did not):
>> http://pastebin.com/raw/6vmRSDm1
> I didn't see any such OOMs...
> Can you try holger's tree with my patches.

They don't really apply to either 4.4.x (because it has diverged too
much by now) or 4.8.x because of the initial dedupe support which came
in as part of 4.9rc1 - there are way too many conflicts all over the
place and merging them manually took way too much time.
It would be useful if you could rebase your patches to for-next.

Stefan, have you tried setting THP to 'madvise' or 'never'?
Try 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled'
or boot with transparent_hugepage=madvise (or never) kernel flag.
I have no idea if it will help, but it's worth a try.

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 4.8rc8 & OOM panic

2016-10-03 Thread Holger Hoffstätte

On 09/28/16 21:45, Holger Hoffstätte wrote:
> On 09/28/16 20:46, E V wrote:
>> I just booted my backup box with 4.8rc8 and started an rsync onto
>> btrfs and it panic'd with OOM a couple hours later. I thought the OOM
>> problems from 4.7 we're supposed to be fixed in 4.8, or did I get that
>> wrong? No users or anything else on the system.
> 
> Keep in mind that those problems are generic, other filesystems also
> suffer: http://www.spinics.net/lists/linux-mm/msg114123.html
> 
> I don't see any recent compaction-related patches in Linus' tree at
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/mm/

Sorry, that was not quite correct. -stable 4.7.x and 4.8 contain
a workaround (http://goo.gl/DiwfPH) that is supposed to help.
That went into 4.8rc8 already, so not sure what happened in your case.

> So unless they get merged in the last minute it looks like 4.8 will
> be DOA.

Running 4.8 final now, let's see what happens..

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 0/6] Btrfs: free space tree and sanity test fixes

2016-09-29 Thread Holger Hoffstätte

On 09/29/16 14:21, Anatoly Pugachev wrote:
>> ...
>>
>> This is fixed by patch
>>
>>   "Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf"
>>
>> that's in the 4.9 queue. Other than that, the self-tests seem to pass,
>> thanks for the test. Would be good if you can test with the mentioned
>> patch included or without integrity checker. Thanks for testing.
> 
> updated git kernel to v4.8-rc8-8-gae6dd8d , applied this 
> "Btrfs: free space tree and sanity test fixes" patchset and added/applied 
> "Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf" :
> 
(snip)
> Sep 29 00:40:55 ttip kernel: BTRFS: device fsid 
> 7bb81df9-0e2b-47f2-81ff-c08502d38da6 devid 1 transid 5 /dev/loop4
> Sep 29 00:41:30 ttip kernel: BTRFS info (device loop4): disk space caching is 
> enabled
> Sep 29 00:41:30 ttip kernel: BTRFS info (device loop4): has skinny extents
> Sep 29 00:41:30 ttip kernel: BTRFS info (device loop4): flagging fs with big 
> metadata feature
> Sep 29 00:41:30 ttip kernel: BTRFS info (device loop4): creating UUID tree
> Sep 29 00:41:31 ttip kernel: BTRFS: device fsid 
> d0ee7ca3-3be0-465f-857b-19e681181923 devid 1 transid 5 /dev/loop0
> Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): enabling free space 
> tree
> Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): using free space tree
> Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): has skinny extents
> Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): flagging fs with big 
> metadata feature
> Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): creating free space 
> tree
> Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): setting 1 ro feature 
> flag
> Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): setting 2 ro feature 
> flag
> Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): creating UUID tree
> Sep 29 00:41:32 ttip kernel: BTRFS critical (device loop4): corrupt leaf, 
> non-root leaf's nritems is 0: block=29556736,root=1, slot=0
> Sep 29 00:41:32 ttip kernel: BTRFS info (device loop4): leaf 29556736 total 
> ptrs 0 free space 16283
> Sep 29 00:41:32 ttip kernel: BTRFS: assertion failed: 0, file: 
> fs/btrfs/disk-io.c, line: 4059kernel BUG at fs/btrfs/ctree.h:3369!

Try to add https://patchwork.kernel.org/patch/9332707/ aka
"Btrfs: improve check_node to avoid reading corrupted nodes" which should return
-EIO and prevent the BUG().

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 4.8rc8 & OOM panic

2016-09-28 Thread Holger Hoffstätte

On 09/28/16 20:46, E V wrote:
> I just booted my backup box with 4.8rc8 and started an rsync onto
> btrfs and it panic'd with OOM a couple hours later. I thought the OOM
> problems from 4.7 we're supposed to be fixed in 4.8, or did I get that
> wrong? No users or anything else on the system.

Keep in mind that those problems are generic, other filesystems also
suffer: http://www.spinics.net/lists/linux-mm/msg114123.html

I don't see any recent compaction-related patches in Linus' tree at
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/mm/

..so it seems they have not been merged yet.

So unless they get merged in the last minute it looks like 4.8 will
be DOA.

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BTRFS: space_info 4 has 18446742286429913088 free, is not full

2016-09-28 Thread Holger Hoffstätte

On 09/28/16 15:06, Stefan Priebe - Profihost AG wrote:
> 
> Yes this is 4.4.22 and no i don't have qgroups enabled so it can't help.
> 
> # btrfs qgroup show /path/
> ERROR: can't perform the search - No such file or directory
> ERROR: can't list qgroups: No such file or director
> 
> This is the same output on all backup machines.

OK, that is really good to know (your other mails arrived just after I
sent mine). The fact that you see this problem with all kernels - even with
4.8rc - *and* on all machines is good (in a way) because it means I haven't
messed up anything, and we're not chasing ghosts caused by broken backport
patches.

>> would be unfortunate, but you could try to disable compression for a
>> while and see what  happens, assuming the space requirements allow this
>> experiment.
> Good idea but it does not. I hope i can reproduce this with my already
> existing testscript which i've now bumped to use a 37TB partition and
> big files rather than a 15GB part and small files. If i can reproduce it
> i can also check whether disabling compression fixes this.

Great. Remember to undo the compression on existing files, or create
them from scratch.

> No that's not the case. No rsync nor inplace is involved. I'm dumping
> differences directly from ceph and put them on top of a base image but
> only for 7 days. So it's not endless fragmenting the file. After 7 days
> a clean whole image is dumped.

That sounds sane but it's also not at all how you described things to me
previosuly ;) But OK. How do you "dump differences directly from
Ceph"? I'd assume the VM images are RBDs, but it sounds you're somehow
using overlayfs.

> yes and no - this is not idea and even very slow if your customers need
> backups on a daily basis. So you must be able to mount a specific backup
> very fast. And stacking on demand is mostly too slow - but this is far
> away from the topic in this thread.

I understand the desire to mount & immediately access backups - it's what
I do here at home too (every machine can access its own last #n backups
via NFS) and it's very useful.

Anyway..something is off and you successfully cause it while other
people apparently do not. Do you still use those nonstandard mount
options with extremely long transaction flush times?

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BTRFS: space_info 4 has 18446742286429913088 free, is not full

2016-09-28 Thread Holger Hoffstätte

On 09/28/16 13:35, Wang Xiaoguang wrote:
> hello,
> 
> On 09/28/2016 07:15 PM, Stefan Priebe - Profihost AG wrote:
>> Dear list,
>>
>> is there any chance anybody wants to work with me on the following issue?
> Though I'm also somewhat new to btrfs, but I'd like to.
> 
>>
>> BTRFS: space_info 4 has 18446742286429913088 free, is not full
>> BTRFS: space_info total=98247376896, used=77036814336, pinned=0,
>> reserved=0, may_use=1808490201088, readonly=0
>>
>> i get this nearly every day.
>>
>> Here are some msg collected from today and yesterday from different servers:
>> | BTRFS: space_info 4 has 18446742182612910080 free, is not full |
>> | BTRFS: space_info 4 has 18446742254739439616 free, is not full |
>> | BTRFS: space_info 4 has 18446743980225085440 free, is not full |
>> | BTRFS: space_info 4 has 18446743619906420736 free, is not full |
>> | BTRFS: space_info 4 has 18446743647369576448 free, is not full |
>> | BTRFS: space_info 4 has 18446742286429913088 free, is not full
>>
>> What i tried so far without success:
>> - use vanilla 4.8-rc8 kernel
>> - use latest vanilla 4.4 kernel
>> - use latest 4.4 kernel + patches from holger hoffstaette

Was that 4.4.22? It contains a patch by Goldwyn Rodrigues called
"Prevent qgroup->reserved from going subzero" which should prevent
this from happening. This should only affect filesystems with enabled
quota; you said you didn't have quota enabled, yet some quota-only
patches caused problems on your system (despite being scheduled for
4.9 and apparently working fine everywhere else, even when I
specifically tested them *with* quota enabled).

So, long story short: something doesn't add up.

It means either:
- you tried my patchset for 4.4.21 (i.e. *without* the above patch)
  and should bump to .22 right away
- you _do_ have qgroups enabled for some reason (systemd?)
- your fs is corrupted and needs nuking
- you did something else entirely
- unknown unknowns aka. ¯\_(ツ)_/¯

There is also the chance that your use of compress-force (or rather
compression in general) causes leakage; compression runs asynchronously
and I wouldn't be surprised if that is still full of racy races..which
would be unfortunate, but you could try to disable compression for a
while and see what  happens, assuming the space requirements allow this
experiment.

You have also not told us whether this happens only on one (potentially
corrupted/confused) fs or on every one - my impression was that you have
several sharded backup filesystems/machines; not sure if that is still
the case. If it happens only on one specific fs chances are it's hosed.

> I also met enospc error in 4.8-rc6 when doing big files create and delete 
> tests,
> for my cases, I have written some patches to fix it.
> Would you please apply my patches to have a try:
> btrfs: try to satisfy metadata requests when every flush_space() returns
> btrfs: try to write enough delalloc bytes when reclaiming metadata space
> btrfs: make shrink_delalloc() try harder to reclaim metadata space

These are all in my series for 4.4.22 and seem to work fine, however
Stefan's workload has nothing directly to do with big files; instead
it's the worst case scenario in terms of fragmentation (of huge files) and
a huge number of extents: incremental backups of VMs via rsync --inplace 
with forced compression.

IMHO this way of making backups is suboptimal in basically every possible
way, despite its convenience appeal. With such huge space requirements
it would be more effective to have a "current backup" to rsync into and
then take a snapshot (for fs consistency), pack the snapshot to a tar.gz
(massively better compression than with btrfs), dump them into your Ceph
cluster as objects with expiry (preferrably a separate EC pool) and then
immediately delete the snapshot from the local fs. That should relieve the
landing fs from getting overloaded by COWing and too many snapshots (approx.
#VMs * #versions). The obvious downside is that restoring an archived
snapshot would require some creative efforts.

Other alternatives exist, but are probably even more (too) expensive.

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: handle quota reserve failure properly

2016-09-16 Thread Holger Hoffstätte

On Thu, 15 Sep 2016 14:57:48 -0400, Josef Bacik wrote:

> btrfs/022 was spitting a warning for the case that we exceed the quota.  If we
> fail to make our quota reservation we need to clean up our data space
> reservation.  Thanks,
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 03da2f6..d72eaae 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4286,13 +4286,10 @@ int btrfs_check_data_free_space(struct inode *inode, 
> u64 start, u64 len)
>   if (ret < 0)
>   return ret;
>  
> - /*
> -  * Use new btrfs_qgroup_reserve_data to reserve precious data space
> -  *
> -  * TODO: Find a good method to avoid reserve data space for NOCOW
> -  * range, but don't impact performance on quota disable case.
> -  */
> + /* Use new btrfs_qgroup_reserve_data to reserve precious data space. */
>   ret = btrfs_qgroup_reserve_data(inode, start, len);
> + if (ret)
> + btrfs_free_reserved_data_space_noquota(inode, start, len);
>   return ret;
>  }
>  
> -- 
> 2.7.4

This came up before, though slightly different:
http://www.spinics.net/lists/linux-btrfs/msg56644.html

Which version is correct - with or without _noquota ?

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: fix memory leak in do_walk_down

2016-09-14 Thread Holger Hoffstätte

On 09/14/16 04:02, Liu Bo wrote:
> The extent buffer 'next' needs to be free'd conditionally.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/extent-tree.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5a940ab..779fd72 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8882,6 +8882,7 @@ static noinline int do_walk_down(struct 
> btrfs_trans_handle *trans,
>  >flags[level - 1]);
>   if (ret < 0) {
>   btrfs_tree_unlock(next);
> + free_extent_buffer(next);
>   return ret;
>   }
>  
> 

This was fixed a long time ago by the following patch by Josef:

"Btrfs: don't BUG() during drop snapshot"
https://patchwork.kernel.org/patch/7002791/

..which never got merged. I've been using it since 4.1.x until today
without problems.

Note that apparently Patchwork got confused here: downloading only the
"patch" fragment won't work; you'll need the full "mbox" attachment.
It also seems to have been sorted into the wrong list bucket for some
reason.

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Commit 'synchronize incompat feature bits with sysfs files' still missing in for-next?

2016-09-13 Thread Holger Hoffstätte


I've noticed that the 4.5-rc commit 14e46e04:
'btrfs: synchronize incompat feature bits with sysfs files' [1]
was reverted later in [2], but despite fixes to protect sysfs
with locks & exorcise GFP_NOFS in favor of GFP_KERNEL it was
never reinstated - neither for 4.5-final, nor later..and it's
been quite a while since then. Is this still valid?

I ask because I've just noticed I've had this in my tree since
forever, but have never encountered a problem during balance -
probably because of all the other fixes.

thanks,
Holger

[1] goo.gl/2DBMSe
[2] goo.gl/cIKgv5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space

2016-09-12 Thread Holger Hoffstätte

On 09/12/16 09:38, Wang Xiaoguang wrote:

>> Actually even that is not true; both patches seem to be wrong in subtle
>> ways. Naohiro's patch seems to prevent the deletion during balance, whereas
>> yours prevents the cleaner from kicking in.
> Indeed in my patch, I just change "struct mutex delete_unused_bgs_mutex"
> to "struct rw_semaphore bg_delete_sem", and try to get bg_delete_sem when
> we allocate data space, so this patch should work as before :)
> 
>>
>> As a simple reproducer you can convert from -mdup to -msingle (to create
>> bloat) and then balance with -musage=10. Depending on which of the two
>> patches are applied, you end with bloat that only grows and never shrinks,
>> or bloat that ends up in mixed state (dup and single).
> Can you give me a simple test script to reproduce your problem.
> (I can write it myself, but I'm afraid I may misunderstand you :) )

I don't have a script and no time this week to play with this.
Just create an fs with dup metadata, balance -mconvert=single and then
back with -mconvert=dup. That's all I tried.

Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space

2016-09-09 Thread Holger Hoffstätte

On 09/09/16 12:18, Holger Hoffstätte wrote:
> On Fri, 09 Sep 2016 16:17:48 +0800, Wang Xiaoguang wrote:
> 
>> cleaner_kthread() may run at any time, in which it'll call 
>> btrfs_delete_unused_bgs()
>> to delete unused block groups. Because this work is asynchronous, it may 
>> also result
>> in false ENOSPC error. 
> 
> 
> With this v3 I can now no longer balance (tested only with metadata).
> New chunks are allocated (as balance does) but nothing ever shrinks, until
> after unmount/remount, when the cleaner eventually kicks in.
> 
> This might be related to the recent patch by Naohiro Aota:
> "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
> 
> which by itself doesn't seem to do any harm (i.e. everything still seems
> to work as expected).

Actually even that is not true; both patches seem to be wrong in subtle
ways. Naohiro's patch seems to prevent the deletion during balance, whereas
yours prevents the cleaner from kicking in.

As a simple reproducer you can convert from -mdup to -msingle (to create
bloat) and then balance with -musage=10. Depending on which of the two
patches are applied, you end with bloat that only grows and never shrinks,
or bloat that ends up in mixed state (dup and single).

Undoing both makes both balancing and cleaning work again.

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space

2016-09-09 Thread Holger Hoffstätte

On Fri, 09 Sep 2016 16:17:48 +0800, Wang Xiaoguang wrote:

> cleaner_kthread() may run at any time, in which it'll call 
> btrfs_delete_unused_bgs()
> to delete unused block groups. Because this work is asynchronous, it may also 
> result
> in false ENOSPC error. 

With this v3 I can now no longer balance (tested only with metadata).
New chunks are allocated (as balance does) but nothing ever shrinks, until
after unmount/remount, when the cleaner eventually kicks in.

This might be related to the recent patch by Naohiro Aota:
"btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"

which by itself doesn't seem to do any harm (i.e. everything still seems
to work as expected).

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Finding only non-snapshots via btrfs subvol list

2016-09-08 Thread Holger Hoffstätte

On 07/21/16 16:55, Holger Hoffstätte wrote:
> I'm trying to find non-snapshots, i.e. 'top-level' subvolumes in a
> filesystem and this seems harder than it IMHO should be.
> 
> The fs is just like:
> 
> /mnt/stuff
>  subvolA
>  subvolA-date1
>  subvolA-date2
>  subvolB
>  subvolB-date1
>  subvolB-date2
> ..
> 
> All I want are the subvol{A,B} *without* the snapshots, but so
> far I haven't been able to accomplish this easily with "subvol list"
> and its options. -s lists only snapshots, but what I want is the
> exact opposite.

This question received a deafening lack of feedback, so I just took
a swing at this and apparently hit something.

When you have a set of subvols and snapshots like so:

$./btrfs subvolume list /t/btrfs
ID 257 gen 13 top level 5 path a
ID 258 gen 16 top level 5 path b
ID 259 gen 15 top level 5 path c
ID 260 gen 11 top level 5 path a1
ID 261 gen 12 top level 5 path a2
ID 263 gen 14 top level 5 path b1
ID 264 gen 15 top level 5 path c1
ID 265 gen 16 top level 5 path b2

where a,b,c are subvolumes and ?{1,2,3} are snapshots, you can now do:

$./btrfs subvolume list -P /t/btrfs
ID 257 gen 13 top level 5 path a
ID 258 gen 16 top level 5 path b
ID 259 gen 15 top level 5 path c

Is this of interest? I find it useful to iterate over all parent
subvols (though you'll still need cut or awk to get only the name)
without accidentally hitting the snapshots, or relying on fragile inhouse
naming conventions.

The -P was the only meaningful letter left (P for Parent). I first used
-S (for grown-up -s ;) but that was already used for matching getopt
on --sort. If -S is deemed better I can reroute that to -Z or something,
since it's unused in short form.

The patch is surprisingly small and was quite easy to write. Nice!

cheers
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Please disable balance auto-resume for 4.9 (or even 4.8)

2016-08-25 Thread Holger Hoffstätte


Automatically resuming an interrupted balance has repeatedly caused all
sorts of problems because it creates a possible failure mode when a user
can least use it: after a crash/power loss/sudden reboot (which, like it
or not, is the de facto "fix random problems" approach for many people).

The idea behind the automnatic resume is good and important for cases
like automation and unattended operation, but nevertheless right now
it creates more problems than it fixes.

As far as I can see it should be easy enough to simply disable calling
btrfs_resume_balance_async() at least on mount (in open_ctree()) and
possibly on remount as well. The skip_balance flag could then simply
be ignored or removed.

I can't say how much work it would be to completely remove the persistent
balance state or whether it is useful to be kept around for resume, but at
least not continuing would stop filesystems from eating themselves further
on mount.

Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: check btree node's nritems

2016-08-05 Thread Holger Hoffstätte

On 08/05/16 11:24, Holger Hoffstätte wrote:
> On Wed, 03 Aug 2016 12:57:28 -0700, Liu Bo wrote:
> 
>> When btree node (level = 1) has nritems which equals to zero,
>> we can end up with panic due to insert_ptr()'s
>>
>> BUG_ON(slot > nritems);
>>
>> where slot is 1 and nritems is 0, as copy_for_split() calls
>> insert_ptr(.., path->slots[1] + 1, ...);
>>
>> A invalid value results in the whole mess, this adds the check
>> for btree's node nritems so that we stop reading block when
>> when something is wrong.
>>
>> Signed-off-by: Liu Bo <bo.li@oracle.com>
>> ---
>>  fs/btrfs/disk-io.c | 17 +
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 37d1780..a5a22be 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -612,6 +612,20 @@ static noinline int check_leaf(struct btrfs_root *root,
>>  return 0;
>>  }
>>  
>> +static noinline int check_node(struct btrfs_root *root,
>> +   struct extent_buffer *node)
>> +{
>> +unsigned long nr = btrfs_header_nritems(node);
>> +
>> +if (nr <= 0 || nr >= BTRFS_NODEPTRS_PER_BLOCK(root)) {
>> +btrfs_crit(root->fs_info,
>> +   "corrupt node: block %llu root %llu nritems %lu\n",
> 
> I think the trailing \n can be dropped here, btrfs_crit() already provides
> a proper newline.

On top of that I get a whole bunch of false positives with this patch.
Files that are perfectly readable without it now error out, in which
case the logged nritems is always 493 - regardless of file or containing
subvolume. Something is fishy here.

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: check btree node's nritems

2016-08-05 Thread Holger Hoffstätte

On Wed, 03 Aug 2016 12:57:28 -0700, Liu Bo wrote:

> When btree node (level = 1) has nritems which equals to zero,
> we can end up with panic due to insert_ptr()'s
> 
> BUG_ON(slot > nritems);
> 
> where slot is 1 and nritems is 0, as copy_for_split() calls
> insert_ptr(.., path->slots[1] + 1, ...);
> 
> A invalid value results in the whole mess, this adds the check
> for btree's node nritems so that we stop reading block when
> when something is wrong.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/disk-io.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 37d1780..a5a22be 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -612,6 +612,20 @@ static noinline int check_leaf(struct btrfs_root *root,
>   return 0;
>  }
>  
> +static noinline int check_node(struct btrfs_root *root,
> +struct extent_buffer *node)
> +{
> + unsigned long nr = btrfs_header_nritems(node);
> +
> + if (nr <= 0 || nr >= BTRFS_NODEPTRS_PER_BLOCK(root)) {
> + btrfs_crit(root->fs_info,
> +"corrupt node: block %llu root %llu nritems %lu\n",

I think the trailing \n can be dropped here, btrfs_crit() already provides
a proper newline.

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: memory overflow or undeflow in free space tree / space_info?

2016-07-29 Thread Holger Hoffstätte


On Fri, 29 Jul 2016 22:57:36 +, Holger Hoffstätte wrote:

> The only other patch I just found missing and which looks like it
> could/should (I think?) work on top of the 4.4.x pagesize-based
> calculations in file.c is:
> 
> a2af23b7 "__btrfs_buffered_write: Pass valid file offset when
> releasing delalloc space"
> 
> Would that make sense?

No it wouldn't, not without some other sectorsize-related patches
that came before...and those would just make matters worse.

So forget the above.

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: memory overflow or undeflow in free space tree / space_info?

2016-07-29 Thread Holger Hoffstätte

On Fri, 29 Jul 2016 17:03:43 -0400, Josef Bacik wrote:

> On 07/29/2016 03:14 PM, Omar Sandoval wrote:
>> On Fri, Jul 29, 2016 at 12:11:53PM -0700, Omar Sandoval wrote:
>>> On Fri, Jul 29, 2016 at 08:40:26PM +0200, Stefan Priebe - Profihost AG 
>>> wrote:
 Dear list,

 i'm seeing btrfs no space messages frequently on big filesystems (> 30TB).

 In all cases i'm getting a trace like this one a space_info warning.
 (since commit [1]). Could someone please be so kind and help me
 debugging / fixing this bug? I'm using space_cache=v2 on all those systems.
>>>
>>> Hm, so I think this indicates a bug in space accounting somewhere else
>>> rather than the free space tree itself. I haven't debugged one of these
>>> issues before, I'll see if I can reproduce it. Cc'ing Josef, too.
>>
>> I should've asked, what sort of filesystem activity triggers this?
>>
> 
> Chris just fixed this I think, try his next branch from his git tree
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
> 
> and see if it still happens.  Thanks,
> 
> Josef

Hi Josef,

can you say which patch you have in mind? The tree in question
doesn't have any of Chandra's pagesize/sectorsize patches (carefully
patched around, for stability and LTS patchability) so I hope it's
not the recent commit

8b8b08cb "fix delalloc accounting after copy_from_user faults"

because that would be too fiddly (at least for me) to backport
correctly.

The only other patch I just found missing and which looks like it
could/should (I think?) work on top of the 4.4.x pagesize-based
calculations in file.c is:

a2af23b7 "__btrfs_buffered_write: Pass valid file offset when
releasing delalloc space"

Would that make sense? Neither I nor any other users of that tree
have observed weird space-info underflows so far (and I use my
fs daily), so it's definitely something peculiar Stefan is doing
with his weird compressed rsync-inplace workload. Odd sector offsets
causing slowly creeping space_info underflow sounds to me like it
just might be the problem.

thanks,
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: deal with unexpected return value in flush_space

2016-07-28 Thread Holger Hoffstätte

On Wed, 27 Jul 2016 18:42:03 -0700, Liu Bo wrote:

> Function start_transaction() can return ERR_PTR(1) when flush is
> BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is
> 
> start_transaction (return ERR_PTR(1))
>   -> btrfs_block_rsv_add (return 1)
>  -> reserve_metadata_bytes (return 1)
> -> flush_space (return 1)
>-> do_chunk_alloc  (return 1)
> 
> With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the
> flush_state of ALLOC_CHUNK and it successfully allocates a new
> chunk, then instead of trying to reserve space again,
> reserve_metadata_bytes returns 1 immediately.
> 
> Eventually the callers who call start_transaction() usually just
> do the IS_ERR() check which ERR_PTR(1) can pass, then it'll get
> a panic when dereferencing a pointer which is ERR_PTR(1).
> 
> This makes flush_space() translate 'ret = 1' to 'ret = 0'.
> 
> Signed-off-by: Liu Bo 
> ---
> We found this 'NULL pointer dereference' on an old 3.8 kernel but
> it's not going to happen on the upstream since there is no caller
> of btrfs_start_transaction_lflush().
> 
>  fs/btrfs/extent-tree.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 7a35c9d..a00fb67 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4457,6 +4457,15 @@ void check_system_chunk(struct btrfs_trans_handle 
> *trans,
>   }
>  }
>  
> +/*
> + * If force is CHUNK_ALLOC_FORCE:
> + *- return 1 if it successfully allocates a chunk,
> + *- return errors including -ENOSPC otherwise.
> + * If force is NOT CHUNK_ALLOC_FORCE:
> + *- return 0 if it doesn't need to allocate a new chunk,
> + *- return 1 if it successfully allocates a chunk,
> + *- return errors including -ENOSPC otherwise.
> + */
>  static int do_chunk_alloc(struct btrfs_trans_handle *trans,
> struct btrfs_root *extent_root, u64 flags, int force)
>  {
> @@ -4857,7 +4866,7 @@ static int flush_space(struct btrfs_root *root,
>btrfs_get_alloc_profile(root, 0),
>CHUNK_ALLOC_NO_FORCE);
>   btrfs_end_transaction(trans, root);
> - if (ret == -ENOSPC)
> + if (ret == -ENOSPC || ret == 1)
>   ret = 0;
>   break;
>   case COMMIT_TRANS:
> -- 
> 2.5.5

For reviewers - this came up before here:
https://patchwork.kernel.org/patch/7778651/

Same fix basically.

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Finding only non-snapshots via btrfs subvol list

2016-07-21 Thread Holger Hoffstätte


I'm trying to find non-snapshots, i.e. 'top-level' subvolumes in a
filesystem and this seems harder than it IMHO should be.

The fs is just like:

/mnt/stuff
 subvolA
 subvolA-date1
 subvolA-date2
 subvolB
 subvolB-date1
 subvolB-date2
..

All I want are the subvol{A,B} *without* the snapshots, but so
far I haven't been able to accomplish this easily with "subvol list"
and its options. -s lists only snapshots, but what I want is the
exact opposite.

So far the best I could find - except for relying on my ad-hoc naming
conventions and inverse-grepping for that - is via -q (print parent UUID)
and matching on that:

btrfs subvol list -q /mnt/stuff | grep "parent_uuid -" | cut -f 11 -d " "

gives me what I want - but eeww. So somehow I think I'm missing something
trivial. Is there a better, non-greppy way?

Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] update bytes_may_use timely to avoid false ENOSPC issue

2016-07-20 Thread Holger Hoffstätte

On 07/20/16 07:56, Wang Xiaoguang wrote:
> Currently in btrfs, for data space reservation, it does not update
> bytes_may_use in btrfs_update_reserved_bytes() and the decrease operation
> will be delayed to be done in extent_clear_unlock_delalloc(), for
> fallocate(2), decrease operation is even delayed to be done in end
> of btrfs_fallocate(), which is too late. Obviously such delay will
> cause unnecessary pressure to enospc system, in [PATCH 4/4], there is
> a simpel test script that can reveal such false ENOSPC bug.
> 
> So in this patch set, it will remove RESERVE_FREE, RESERVE_ALLOC and
> RESERVE_ALLOC_NO_ACCOUNT, and we always update bytes_may_use timely.
> 
> I'll also commit a fstests test case for this issue.
> 
> Wang Xiaoguang (4):
>   btrfs: use correct offset for reloc_inode in
> prealloc_file_extent_cluster()
>   btrfs: divide btrfs_update_reserved_bytes() into two functions
>   btrfs: introduce new EXTENT_CLEAR_DATA_RESV flag
>   btrfs: update btrfs_space_info's bytes_may_use timely
> 

Just like the previous version, for all 4 patches:

Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com>

via the reproducer script & some very large manual fallocates.

thanks!
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: ENOSPC / no space on very large devices

2016-07-20 Thread Holger Hoffstätte

On 07/20/16 07:31, Stefan Priebe - Profihost AG wrote:
> Hi list,
> 
> while i didn't had the problem for some month i'm now getting ENOSPC on
> a regular basis on one host.

Well, it's getting better. :)

> if i umount the volume i get traces (i already did a clear_cache 4 days
> ago to recalculate the space_tree):
> 
> [545031.675797] [ cut here ]
> [545031.725166] WARNING: CPU: 1 PID: 17711 at
> fs/btrfs/extent-tree.c:5710 btrfs_free_block_groups+0x35a/0x400 [btrfs]()

This is "only" a warning, but as we can see below it indicates a real
problem. The warning was added only recently to for-next by the patch called
"Btrfs: warn_on for unaccounted spaces" [1], but I've had it in my tree
forever. Never seen the warning myself.

(snip)
> [545037.909700] BTRFS: space_info 4 has 18446743523026157568 free, is
> not full

Wow, ~18.4 exabytes really is a lot of free space. :)
So it looks like something underflowed the space_info and now things are
confused for about ~550 GB. Unfortunately I have no good idea how to fix
that. :(

> The kernel is something special - i'm using this one from holger:
> https://github.com/hhoffstaette/kernel-patches
> 
> which is basically a 4.4.15 + several patches especially a lot of btrfs
> patches up to 4.8 i think.

More like for-next with all the pagesize/sectorsize stuff carefully
avoided. I'm really looking forward to 4.8, this is becoming unwieldy..

-h

[1] 
https://git.kernel.org/cgit/linux/kernel/git/kdave/linux.git/commit/?h=for-next=d555b6c380c644af63dbdaa7cc14bba041a4e4dd

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: kdave/for-next commit 26112f7f472

2016-07-08 Thread Holger Hoffstätte

On 07/08/16 13:55, Jeff Mahoney wrote:
> On 7/8/16 7:19 AM, Holger Hoffstätte wrote:
>> On 07/08/16 06:24, Jeff Mahoney wrote:
>>> Hi Dave -
>>>
>>> This commit introduces a bug.  I ran across it when running xfstests
>>> against my own integrated branch.
>>
>> I can't find that commit id anywhere...?
> 
> Hi Holger -
> 
> This is the for-next branch.  It's not in any mainline branch yet.

Yes, I understand that. I searched in the github/kdave tree, which only
has the for-next-xyz branches. Found it in the one on kernel.org.

-h




signature.asc
Description: OpenPGP digital signature

Re: kdave/for-next commit 26112f7f472

2016-07-08 Thread Holger Hoffstätte

On 07/08/16 06:24, Jeff Mahoney wrote:
> Hi Dave -
> 
> This commit introduces a bug.  I ran across it when running xfstests
> against my own integrated branch.

I can't find that commit id anywhere...?

> The problem is that btrfs_calc_reclaim_metadata_size didn't used to be
> called from recovery, so it was safe to use fs_info->fs_root.  With
> commit 7c83c6a09 (Btrfs: don't bother kicking async if there's nothing
> to reclaim) we do call it from recovery context and fs_info->fs_root is
> NULL.
> 
> The fix is to just not switch btrfs_calc_reclaim_metadata_size to take
> an fs_info.  All the other call sites were using fs_info->fs_root
> anyway, so it's not like we're pinning a root somewhere just for this call.

I've had this patch from last October in my 4.4.x tree forever:
http://www.spinics.net/lists/linux-btrfs/msg48457.html

Apparently it fell off the table. Shouldn't that fix it?

-h




signature.asc
Description: OpenPGP digital signature

Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file

2016-07-06 Thread Holger Hoffstätte

On 07/06/16 17:20, Hugo Mills wrote:
> On Thu, Jul 07, 2016 at 12:16:01AM +0900, Wang Shilong wrote:
>> On Wed, Jul 6, 2016 at 10:35 PM, Holger Hoffstätte
>> <hol...@applied-asynchrony.com> wrote:
>>> On 07/06/16 14:25, Wang Shilong wrote:
>>>> 'btrfs file du' is a very useful tool to watch my system
>>>> file usage with snapshot aware.
>>>>
>>>> when trying to run following commands:
>>>> [root@localhost btrfs-progs]# btrfs file du /
>>>>  Total   Exclusive  Set shared  Filename
>>>> ERROR: Failed to lookup root id - Inappropriate ioctl for device
>>>> ERROR: cannot check space of '/': Unknown error -1
>>>>
>>>> and My Filesystem looks like this:
>>>> [root@localhost btrfs-progs]# df -Th
>>>> Filesystem Type  Size  Used Avail Use% Mounted on
>>>> devtmpfs   devtmpfs   16G 0   16G   0% /dev
>>>> tmpfs  tmpfs  16G  368K   16G   1% /dev/shm
>>>> tmpfs  tmpfs  16G  1.4M   16G   1% /run
>>>> tmpfs  tmpfs  16G 0   16G   0% /sys/fs/cgroup
>>>> /dev/sda3  btrfs  60G   19G   40G  33% /
>>>> tmpfs  tmpfs  16G  332K   16G   1% /tmp
>>>> /dev/sdc   btrfs 2.8T  166G  1.7T   9% /data
>>>> /dev/sda2  xfs   2.0G  452M  1.6G  23% /boot
>>>> /dev/sda1  vfat  1.9G   11M  1.9G   1% /boot/efi
>>>> tmpfs  tmpfs 3.2G   24K  3.2G   1% /run/user/1000
>>>>
>>>> So I installed Btrfs as my root partition, but boot partition
>>>> can be other fs.
>>>>
>>>> We can Let btrfs tool aware of this is not a btrfs file or
>>>> directory and skip those files, so that someone like me
>>>> could just run 'btrfs file du /' to scan all btrfs filesystems.
>>>>
>>>> After patch, it will look like:
>>>>Total   Exclusive  Set shared  Filename
>>>> skipping not btrfs dir/file: boot
>>>> skipping not btrfs dir/file: dev
>>>> skipping not btrfs dir/file: proc
>>>> skipping not btrfs dir/file: run
>>>> skipping not btrfs dir/file: sys
>>>>  0.00B   0.00B   -  //root/.bash_logout
>>>>  0.00B   0.00B   -  //root/.bash_profile
>>>>  0.00B   0.00B   -  //root/.bashrc
>>>>  0.00B   0.00B   -  //root/.cshrc
>>>>  0.00B   0.00B   -  //root/.tcshrc
>>>>
>>>> This works for me to analysis system usage and analysis
>>>> performaces.
>>>
>>> This is great, but can we please skip the "skipping .." messages?
>>> Maybe it's just me but I really don't see the value of printing them
>>> when they don't contribute to the result.
>>> They also mess up the display. :)
>>
>> I don't have a taste whether it needed or not, because it is somehow
>> useful to let users know some files/directories skipped

When you run "find /path -type d" you don't get messages for all the
things you just didn't want to find either.

>At the absolute minimum, I think that these messages should go to
> stderr (like du does when it deosn't have permissions), and should go
> away with -q. They're still irritating, but at least you can get rid
> of them easily.

If anything this should require a --verbose, not the other way
around. Maybe instead of breaking the output just indicate the
special status via "-- --" values, or default to 0.00?
Still, we're explicitly only interested in btrfs stuff and not
anything else, so printing non-information can only yield noise.

This is very much orthogonal to not printing anything after an
otherwise successful command execution.

-h




signature.asc
Description: OpenPGP digital signature

Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file

2016-07-06 Thread Holger Hoffstätte

On 07/06/16 14:25, Wang Shilong wrote:
> 'btrfs file du' is a very useful tool to watch my system
> file usage with snapshot aware.
> 
> when trying to run following commands:
> [root@localhost btrfs-progs]# btrfs file du /
>  Total   Exclusive  Set shared  Filename
> ERROR: Failed to lookup root id - Inappropriate ioctl for device
> ERROR: cannot check space of '/': Unknown error -1
> 
> and My Filesystem looks like this:
> [root@localhost btrfs-progs]# df -Th
> Filesystem Type  Size  Used Avail Use% Mounted on
> devtmpfs   devtmpfs   16G 0   16G   0% /dev
> tmpfs  tmpfs  16G  368K   16G   1% /dev/shm
> tmpfs  tmpfs  16G  1.4M   16G   1% /run
> tmpfs  tmpfs  16G 0   16G   0% /sys/fs/cgroup
> /dev/sda3  btrfs  60G   19G   40G  33% /
> tmpfs  tmpfs  16G  332K   16G   1% /tmp
> /dev/sdc   btrfs 2.8T  166G  1.7T   9% /data
> /dev/sda2  xfs   2.0G  452M  1.6G  23% /boot
> /dev/sda1  vfat  1.9G   11M  1.9G   1% /boot/efi
> tmpfs  tmpfs 3.2G   24K  3.2G   1% /run/user/1000
> 
> So I installed Btrfs as my root partition, but boot partition
> can be other fs.
> 
> We can Let btrfs tool aware of this is not a btrfs file or
> directory and skip those files, so that someone like me
> could just run 'btrfs file du /' to scan all btrfs filesystems.
> 
> After patch, it will look like:
>Total   Exclusive  Set shared  Filename
> skipping not btrfs dir/file: boot
> skipping not btrfs dir/file: dev
> skipping not btrfs dir/file: proc
> skipping not btrfs dir/file: run
> skipping not btrfs dir/file: sys
>  0.00B   0.00B   -  //root/.bash_logout
>  0.00B   0.00B   -  //root/.bash_profile
>  0.00B   0.00B   -  //root/.bashrc
>  0.00B   0.00B   -  //root/.cshrc
>  0.00B   0.00B   -  //root/.tcshrc
> 
> This works for me to analysis system usage and analysis
> performaces.

This is great, but can we please skip the "skipping .." messages?
Maybe it's just me but I really don't see the value of printing them
when they don't contribute to the result.
They also mess up the display. :)

thanks,
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] btrfs: fix false ENOSPC for btrfs_fallocate()

2016-07-06 Thread Holger Hoffstätte

On 07/06/16 12:37, Wang Xiaoguang wrote:
> Below test scripts can reproduce this false ENOSPC:
>   #!/bin/bash
>   dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
>   dev=$(losetup --show -f fs.img)
>   mkfs.btrfs -f -M $dev
>   mkdir /tmp/mntpoint
>   mount /dev/loop0 /tmp/mntpoint
>   cd mntpoint
>   xfs_io -f -c "falloc 0 $((40*1024*1024))" testfile
> 
> Above fallocate(2) operation will fail for ENOSPC reason, but indeed
> fs still has free space to satisfy this request. The reason is
> btrfs_fallocate() dose not decrease btrfs_space_info's bytes_may_use
> just in time, and it calls btrfs_free_reserved_data_space_noquota() in
> the end of btrfs_fallocate(), which is too late and have already added
> false unnecessary pressure to enospc system. See call graph:
> btrfs_fallocate()
> |-> btrfs_alloc_data_chunk_ondemand()
> It will add btrfs_space_info's bytes_may_use accordingly.
> |-> btrfs_prealloc_file_range()
> It will call btrfs_reserve_extent(), but note that alloc type is
> RESERVE_ALLOC_NO_ACCOUNT, so btrfs_update_reserved_bytes() will
> only increase btrfs_space_info's bytes_reserved accordingly, but
> will not decrease btrfs_space_info's bytes_may_use, then obviously
> we have overestimated real needed disk space, and it'll impact
> other processes who do write(2) or fallocate(2) operations, also
> can impact metadata reservation in mixed mode, and bytes_max_use
> will only be decreased in the end of btrfs_fallocate(). To fix
> this false ENOSPC, we need to decrease btrfs_space_info's
> bytes_may_use in btrfs_prealloc_file_range() in time, as what we
> do in cow_file_range(),
> See call graph in :
> cow_file_range()
> |-> extent_clear_unlock_delalloc()
> |-> clear_extent_bit()
> |-> btrfs_clear_bit_hook()
> |-> btrfs_free_reserved_data_space_noquota()
> This function will decrease bytes_may_use accordingly.
> 
> So this patch choose to call btrfs_free_reserved_data_space() in
> __btrfs_prealloc_file_range() for both successful and failed path.
> 
> Also this patch removes some old and useless comments.
> 
> Signed-off-by: Wang Xiaoguang <wangxg.f...@cn.fujitsu.com>

Verified that the reproducer script indeed fails (with btrfs ~4.7) and
the patch (on top of 1/2) fixes it. Also ran a bunch of other fallocating
things without problem. Free space also still seems sane, as far as I
could tell.

So for both patches:

Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com>

cheers,
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/3] btrfs-progs: check improve 'checking extents' scalability

2016-06-23 Thread Holger Hoffstätte

On 06/23/16 21:26, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> While running xfstests generic/291, which creates a single file populated
> with reflinks to the same extent, I found that fsck had been running for
> hours.  perf top lead me to find_data_backref as the culprit, and a litte
> more digging made it clear: For every extent record we add, we iterate
> the entire list first.  My test case had ~2M records.  That math doesn't
> go well.
> 
> This patchset converts the extent_backref list to an rbtree.  The test
> that used to run for more than 8 hours without completing now takes
> less than 20 seconds.

Very interesting. Time for science!

unpatched:

holger>time btrfs check /dev/sdc1 
Checking filesystem on /dev/sdc1
..
btrfs check /dev/sdc1  17.03s user 3.79s system 25% cpu 1:22.82 total

patched:

holger>time btrfs check /dev/sdc1
Checking filesystem on /dev/sdc1
..
btrfs check /dev/sdc1  17.03s user 3.74s system 24% cpu 1:23.24 total

This is a 1TB disk with ~850GB data in 4 subvolumes, ~2 snapshots each.
I guess it only starts to matter (relative to the necessary I/O cost per
extent) when the level of sharing is higher, i.e. many more snapshots?

OTOH it doesn't explode, so that's good. :)

cheers
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 2/2] btrfs: wait for bdev put

2016-06-21 Thread Holger Hoffstätte

On 06/21/16 12:24, Anand Jain wrote:
> From: Anand Jain <anand.j...@oracle.com>
> 
> Further to the commit
>  bc178622d40d87e75abc131007342429c9b03351
>  btrfs: use rcu_barrier() to wait for bdev puts at unmount
> 
> This patch implements a method to time wait on the __free_device()
> which actually does the bdev put. This is needed as the user space
> running 'btrfs fi show -d' immediately after the replace and
> unmount, is still reading older information from the device.
> 
>  mail-archive.com/linux-btrfs@vger.kernel.org/msg54188.html
> 
> Signed-off-by: Anand Jain <anand.j...@oracle.com>
> [updates: bc178622d40d87e75abc131007342429c9b03351]
> ---
> v2: Also to make sure bdev_closing is set it needs rcu_barrier(),
> restored rcu_barrier().

Looks like this one works reliably again. ;)
Tested with a slow disk, no long unmounts or timeout messages.

Tested-by: Holger Hoffstätte <holger.hoffstae...@applied-asynchrony.com>

thanks!
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] btrfs: wait for bdev put

2016-06-18 Thread Holger Hoffstätte

On Tue, 14 Jun 2016 18:55:26 +0800, Anand Jain wrote:

> Further to the previous commit
>  bc178622d40d87e75abc131007342429c9b03351
>  btrfs: use rcu_barrier() to wait for bdev puts at unmount
> 
> Since free_device() spinoff __free_device() the rcu_barrier() for
>   call_rcu(>rcu, free_device);
> didn't help.
> 
> This patch reverts changes by
>  bc178622d40d87e75abc131007342429c9b03351
> and implement a method to wait on the __free_device() by using
> a new bdev_closing member in struct btrfs_device.
> 
> Signed-off-by: Anand Jain 
> [rework: bc178622d40d87e75abc131007342429c9b03351]
> ---
>  fs/btrfs/volumes.c | 44 ++--
>  fs/btrfs/volumes.h |  1 +
>  2 files changed, 39 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a4e8d48acd4b..404ce1daebb1 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "ctree.h"
>  #include "extent_map.h"
> @@ -254,6 +255,17 @@ static struct btrfs_device *__alloc_device(void)
>   return dev;
>  }
>  
> +static int is_device_closing(struct list_head *head)
> +{
> + struct btrfs_device *dev;
> +
> + list_for_each_entry(dev, head, dev_list) {
> + if (dev->bdev_closing)
> + return 1;
> + }
> + return 0;
> +}
> +
>  static noinline struct btrfs_device *__find_device(struct list_head *head,
>  u64 devid, u8 *uuid)
>  {
> @@ -832,12 +844,22 @@ again:
>  static void __free_device(struct work_struct *work)
>  {
>   struct btrfs_device *device;
> + struct btrfs_device *new_device_addr;
>  
>   device = container_of(work, struct btrfs_device, rcu_work);
>  
>   if (device->bdev)
>   blkdev_put(device->bdev, device->mode);
>  
> + /*
> +  * If we are coming here from btrfs_close_one_device()
> +  * then it allocates a new device structure for the same
> +  * devid, so find device again with the devid
> +  */
> + new_device_addr = __find_device(>fs_devices->devices,
> + device->devid, NULL);
> +
> + new_device_addr->bdev_closing = 0;
>   rcu_string_free(device->name);
>   kfree(device);
>  }
> @@ -884,6 +906,12 @@ static void btrfs_close_one_device(struct btrfs_device 
> *device)
>   list_replace_rcu(>dev_list, _device->dev_list);
>   new_device->fs_devices = device->fs_devices;
>  
> + /*
> +  * So to wait for kworkers to finish all blkdev_puts,
> +  * so device is really free when umount is done.
> +  */
> + new_device->bdev_closing = 1;
> +
>   call_rcu(>rcu, free_device);
>  }
>  
> @@ -912,6 +940,7 @@ int btrfs_close_devices(struct btrfs_fs_devices 
> *fs_devices)
>  {
>   struct btrfs_fs_devices *seed_devices = NULL;
>   int ret;
> + int retry_cnt = 5;
>  
>   mutex_lock(_mutex);
>   ret = __btrfs_close_devices(fs_devices);
> @@ -927,12 +956,15 @@ int btrfs_close_devices(struct btrfs_fs_devices 
> *fs_devices)
>   __btrfs_close_devices(fs_devices);
>   free_fs_devices(fs_devices);
>   }
> - /*
> -  * Wait for rcu kworkers under __btrfs_close_devices
> -  * to finish all blkdev_puts so device is really
> -  * free when umount is done.
> -  */
> - rcu_barrier();
> +
> + while (is_device_closing(_devices->devices) &&
> + --retry_cnt) {
> + mdelay(1000); //1 sec
> + }
> +
> + if (!(retry_cnt > 0))
> + printk(KERN_WARNING "BTRFS: %pU bdev_put didn't complete, 
> giving up\n",
> + fs_devices->fsid);
>   return ret;
>  }
>  
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 0ac90f8d85bd..945e49f5e17d 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -150,6 +150,7 @@ struct btrfs_device {
>   /* Counter to record the change of device stats */
>   atomic_t dev_stats_ccnt;
>   atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
> + int bdev_closing;
>  };
>  
>  /*
> -- 
> 2.7.0

I gave this a try and somehow it seems to make unmounting worse:
it now always takes ~5s (max retry time) and I see the warning every
time. Without the patch unmounting a single volume (disk) is much
faster (1-2s), without problems.

Any ideas?

cheers,
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Another 'traling page extent' observation

2016-06-14 Thread Holger Hoffstätte

Hi,

I have made another observation regarding extra extents; it seems I'm
good at finding these things. Sorry. ;-)

This time it's with rsync. I found it when I started to use the --inplace
option, which doesn't do rsync's usual write-to-temporary/rename cycle, but
instead writes to a destination file directly. All of a sudden many newly
backed up files had a traling 4k extent, for no good reason.

This has nothing to do with extending overwrites (where new extents would
of course be fine); it happens when the file is new. It is also independent
of the file size or the filesystem state: it does not seem to be caused by
fragmented free space.

Reproducer example (current dir is btrfs):

$ls -al /tmp/data
-rw-r--r-- 1 root root 17569552 Jun 14 16:33 /tmp/data

$rm -f data && rsync /tmp/data . && sync && filefrag -ek data
Filesystem type is: 9123683e
File size of data is 17569552 (17160 blocks of 1024 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   17159:   53918020..  53935179:  17160: last,eof
data: 1 extent found

$rm -f data && rsync --inplace /tmp/data . && sync && filefrag -ek data
Filesystem type is: 9123683e
File size of data is 17569552 (17160 blocks of 1024 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   17155:   48133532..  48150687:  17156:
   1:17156..   17159:   36734592..  36734595:  4:   48150688: last,eof
data: 2 extents found

This is repeatable and independent of the file, so I suspect that after
Liu Bo's fix for the previously reported stray extents in the middle
of the file with slow buffered writes [1] there's an edge case where a page
is still treated differently at the end after close()-ing the file - which
rsync does correctly.

This is on my 4.4.x++ kernel with btrfs ~4.7 and space_cache=v2, but since
it also happens on a fresh volume with v1 it's probably just another
off-by-one somewhere in the writeback/page handling.

thanks,
Holger

[1] commit a91326679f aka "Btrfs: make mapping->writeback_index point to the
last written page"
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bug] Kernel 4.6.1 errors

2016-06-07 Thread Holger Hoffstätte

On Tue, 07 Jun 2016 17:54:26 +0200, g6094199 wrote:

> he guys!
> 
> I´m running Debian Sid where i have found several kernel errors today
> and most of them are btrfs related.
> 
> uname -a
> Linux NAS 4.6.0-trunk-amd64 #1 SMP Debian 4.6-1~exp1 (2016-05-17) x86_64
> GNU/Linux

This is 4.6.0-whatever, not 4.6.1. The problem is fixed in the real
4.6.1 (released last week), which you can find described in commit 4704fa54..
at https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.6.1

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Hot data tracking / hybrid storage

2016-05-29 Thread Holger Hoffstätte

On 05/29/16 19:53, Chris Murphy wrote:
> But I'm skeptical of bcache using a hidden area historically for the
> bootloader, to put its device metadata. I didn't realize that was the
> case. Imagine if LVM were to stuff metadata into the MBR gap, or
> mdadm. Egads.

On the matter of bcache in general this seems noteworthy:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4d1034eb7c2f5e32d48ddc4dfce0f1a723d28667

bummer..

Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/7] Btrfs: replace BUG() with WARN_ONCE in raid56

2016-05-15 Thread Holger Hoffstätte

On 05/14/16 02:06, Liu Bo wrote:
> This BUG() has been triggered by a fuzz testing image, but in fact
> btrfs can handle this gracefully by returning -EIO.
> 
> Thus, use WARN_ONCE for warning purpose and don't leave a possible
> kernel panic.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/raid56.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index 0b7792e..863f7fe 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -2139,7 +2139,7 @@ int raid56_parity_recover(struct btrfs_root *root, 
> struct bio *bio,
>  
>   rbio->faila = find_logical_bio_stripe(rbio, bio);
>   if (rbio->faila == -1) {
> - BUG();
> + WARN_ONCE(1, KERN_WARNING "rbio->faila is -1\n");

I'm generally in favor of not BUGing out for no good reason, but what
is e.g. an admin (or user) supposed to do when he sees this message?
Same for the other rather cryptic WARNs - they contain no actionable
information, and are most likely going to be ignored as "debug spam".
IMHO things that can be ignored can be deleted.

thanks,
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: fix unexpected balance crash due to BUG_ON

2016-05-03 Thread Holger Hoffstätte

On Tue, May 3, 2016 at 1:01 AM, Liu Bo  wrote:
> Mounting a btrfs can resume previous balance operations asynchronously.
> An user got a crash when one drive has some corrupt sectors.
>
> Since balance can cancel itself in case of any error, we can gracefully
> return errors to upper layers and let balance do the cancel job.
>
> Reported-by: sash 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/volumes.c | 18 +++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index bd0f45f..5aed2e2 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3418,13 +3418,25 @@ static int __btrfs_balance(struct btrfs_fs_info 
> *fs_info)
> ret = btrfs_shrink_device(device, old_size - size_to_free);
> if (ret == -ENOSPC)
> break;
> -   BUG_ON(ret);
> +   if (ret) {
> +   /* btrfs_shrink_device never returns ret > 0 */
> +   WARN_ON_ONCE(ret > 0);
> +   goto error;
> +   }
>
> trans = btrfs_start_transaction(dev_root, 0);
> -   BUG_ON(IS_ERR(trans));
> +   if (IS_ERR(trans)) {
> +   ret = PTR_ERR(trans);
> +   goto error;
> +   }
>
> ret = btrfs_grow_device(trans, device, old_size);
> -   BUG_ON(ret);
> +   if (ret) {
> +   btrfs_end_transaction(trans, dev_root);
> +   /* btrfs_grow_device never returns ret > 0 */
> +   WARN_ON_ONCE(ret > 0);
> +   goto error;
> +   }
>
> btrfs_end_transaction(trans, dev_root);
> }

Just a heads up that this seems to introduce a valid warning, since it now
can goto error before the first initializing use of path:

fs/btrfs/volumes.c: In function 'btrfs_balance':
fs/btrfs/volumes.c:3601:2: warning: 'path' may be used uninitialized
in this function [-Wmaybe-uninitialized]
  btrfs_free_path(path);
  ^
fs/btrfs/volumes.c:3385:21: note: 'path' was declared here
  struct btrfs_path *path;
 ^
(it's really in __btrfs_balance which got inlined, so gcc thinks it's
at the call site).

Simply setting path = NULL at the beginning of __btrfs_balance fixes it, since
btrfs_free_path allows NULL values.

cheers
Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Add device while rebalancing

2016-04-27 Thread Holger Hoffstätte

On 04/27/16 18:40, Juan Alberto Cirez wrote:
> If this is so, then it leaves even confused. I was under the
> impression that the driving imperative for the creation of btrfs was
> to address the shortcomings of current filesystems (within the context
> of distributed data). That the idea was to create a low level
> filesystem that would be the primary choice as a block/brick layer for a
> scale-out, distributed data storage...

I can't speak for who was or is motivated by what. Btrfs was a necessary
reaction to ZFS, and AFAIK this had nothing to do with distributed storage
but rather growing concerns around reliability (checksumming), scalability
and operational ease: snapshotting, growing/shrinking etc.

It's true that some of btrfs' capabilities make it look like a a good
candidate, and e.g. Ceph started out using it. For many reasons that
didn't work out (AFAIK btrfs maturity + extensibility) - but it also
did not address a fundamental mismatch in requirements, which other
filesystems (ext4, xfs) could not address either. btrfs simply
does "too much" because it has to; you cannot remove or turn off half
of what makes a kernel-based filesystem a usable filesystem. This is
kind of sad because at its core btrfs *is* an object store with
various trees for metadata handling and whatnot - but there's no
easy way to turn off all the "Unix is stupid" stuff.

AFAIK Gluster will soon also start managing xattrs differently,
so this is not limited to Ceph.

I've been following this saga for several years now and it's
absolutely *astounding* how many bugs and performance problems
Ceph has unearthed in existing filesystems, simply because it
stresses them in ways they never have been stressed before..only to
create the illusion of a distributed key/value store, badly.
I don't want to argue about details, you can read more about some
of the reasons in [1].

[grumble grumble exokernels and composable things in userland grumble]

cheers
Holger

[1] http://www.slideshare.net/sageweil1/ceph-and-rocksdb

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Add device while rebalancing

2016-04-27 Thread Holger Hoffstätte

On 04/27/16 17:58, Juan Alberto Cirez wrote:
> Correct me if I'm wrong but the sum total of the above seems to
> suggest (at first glance) that BRTFS add several layers of complexity,
> but for little real benefit (at least in the case use of btrfs at the
> brick layer with a distributed filesystem on top)...

This may come as a surprise, but the same can be said for every other
(common) filesystem (+ device management stack) that can be used
standalone.

Jeff Darcy (of GlusterFS) just wrote a really nice blog post why
current filesystems and their historically grown requirements (mostly
as they relate to the POSIX interface standard) are in many ways
just not a good fit for scale-out/redundant storage:
http://pl.atyp.us/2016-05-updating-posix.html

Quite a few of the capabilities & features which are useful or
necessary in standalone operation (regardless of single- or multi-
device setup) are *actively unhelpful* in a distributed context, which
is why e.g. Ceph will soon do away with the on-disk filesystem for
data, and manage metadata exclusively by itself.

cheers,
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Question: raid1 behaviour on failure

2016-04-26 Thread Holger Hoffstätte

On 04/26/16 18:19, Henk Slager wrote:
> It looks like a JMS567 + SATA port multipliers behaind it are used in
> this drivebay. The command   lsusb -v  could show that. So your HW
> setup is like JBOD, not RAID.

I hate to quote the "harmful" trope, but..

SATA Port Multipliers Considered Harmful
https://www.usenix.org/system/files/fastpw13-paper7_0.pdf

aka: how to make any RAID setup useless in 1 easy step.

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/3] Fixes for races in relocation and avoid start and wait for unrelated IO

2016-04-26 Thread Holger Hoffstätte

On Mon, Apr 25, 2016 at 3:01 AM,   wrote:
> The following patches fix 2 hard to hit races in relocation that make its
> first phase (MOVE_DATA_EXTENTS) miss extents, triggers a warning in the
> second phase (UPDATE_DATA_PTRS) and leaves metadata in an invalid state
> (file extent items pointing to areas corresponding to the deleted block
> group), leading to a BUG_ON() when attempting to read those extents after
> the relocation finishes.

Never saw this particular race/error, but decided to give these
patches a workout
to see whether they cause any new or unrelated problems.

Continuous rebalancing (full, partial) for ~30m while unpacking and
deleting kernel
trees on a 16GB tmpfs-backed loopback device did not cause any problem;
balance just cruises along at (sometimes) up to ~1GB/s and does its thing.
Finally btrfs check also found nothing wrong.

Not sure if this qualifies as testing, but anyway:

Tested-by: Holger Hoffstaette 

cheers,
Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot

2016-04-20 Thread Holger Hoffstätte

On Tue, 19 Apr 2016 15:19:46 -0700, Mark Fasheh wrote:

> On Fri, Apr 15, 2016 at 05:08:22PM +0800, Qu Wenruo wrote:
>> Current btrfs qgroup design implies a requirement that after calling
>> btrfs_qgroup_account_extents() there must be a commit root switch.
>> 
>> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
>> inside btrfs_commit_transaction() just be commit_cowonly_roots().
>> 
>> However there is a exception at create_pending_snapshot(), which will
>> call btrfs_qgroup_account_extents() but no any commit root switch.
>> 
>> In case of creating a snapshot whose parent root is itself (create a
>> snapshot of fs tree), it will corrupt qgroup by the following trace:
>> (skipped unrelated data)
>> ==
>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, 
>> nr_old_roots = 0, nr_new_roots = 1
>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer 
>> = 0, excl = 0
>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer 
>> = 16384, excl = 16384
>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, 
>> nr_old_roots = 0, nr_new_roots = 0
>> ==
>> 
>> The problem here is in first qgroup_account_extent(), the
>> nr_new_roots of the extent is 1, which means its reference got
>> increased, and qgroup increased its rfer and excl.
>> 
>> But at second qgroup_account_extent(), its reference got decreased, but
>> between these two qgroup_account_extent(), there is no switch roots.
>> This leads to the same nr_old_roots, and this extent just got ignored by
>> qgroup, which means this extent is wrongly accounted.
>> 
>> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
>> create_pending_snapshot(), with needed preparation.
>> 
>> Reported-by: Mark Fasheh 
>> Signed-off-by: Qu Wenruo 
> 
> Ok, this version seems to be giving me the right numbers. I'll send a test
> for it shortly. I'd still like to know if this patch introduces an
> unintended side effects but otherwise, thanks Qu.
>   --Mark

Hi Mark,

Can't speak about other side effects since I have not observed any so far,
but I can confirm that the previously failing case of deleting a renamed
snapshot [1] now works properly with v4 without getting the commit roots
in a twist. So:

Tested-by: holger.hoffstae...@googlemail.com

cheers
Holger

[1] http://thread.gmane.org/gmane.comp.file-systems.btrfs/55052

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: WARN_ON in record_root_in_trans() when deleting freshly renamed subvolume

2016-04-08 Thread Holger Hoffstätte

[cc: Mark and Qu]

On 04/08/16 13:51, Holger Hoffstätte wrote:
> On 04/08/16 13:14, Filipe Manana wrote:
>> Using Chris' for-linus-4.6 branch, which is 4.5-rc6 + all 4.6 btrfs
>> patches, it didn't reproduce here:
> 
> Great, that's good to know (sort of :). Thanks also to Liu Bo.
> 
>> Are you sure that you are not using some patches not in 4.6?

We have a bingo!

Reverting "qgroup: Fix qgroup accounting when creating snapshot"
from last Wednesday immediately fixes the problem.

Was quite easy to find - the triggered WARN_ON was the second one that
complained about a mismatch between roots. The only patch that even
remotely did something in that area was said qgroup fix.

Looks like something is missing there. Suggestions welcome. :)

Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: WARN_ON in record_root_in_trans() when deleting freshly renamed subvolume

2016-04-08 Thread Holger Hoffstätte

On 04/08/16 13:14, Filipe Manana wrote:
> Using Chris' for-linus-4.6 branch, which is 4.5-rc6 + all 4.6 btrfs
> patches, it didn't reproduce here:

Great, that's good to know (sort of :). Thanks also to Liu Bo.

> Are you sure that you are not using some patches not in 4.6?

Quite a few, but to offset that I also left out some that have diverged
too much or were not that important (block/sectorsize, device handling).
But those should not have anything to do with this particular bug.

Except for this everything works rock-solid, I use it daily.
Should be easy to track down..

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

WARN_ON in record_root_in_trans() when deleting freshly renamed subvolume

2016-04-07 Thread Holger Hoffstätte

Hi,

Looks like I just found an exciting new corner case.
kernel 4.4.6 with btrfs ~4.6, so 4.6 should reproduce.

Try on a fresh volume:

$btrfs subvolume create foo
Create subvolume './foo'
$sync
$btrfs subvolume snapshot foo foo-1
Create a snapshot of 'foo' in './foo-1'
$sync
$mv foo-1 foo.new
$btrfs subvolume delete foo.new 
Delete subvolume (no-commit): '/mnt/test/foo.new'
$dmesg 
[  226.923316] [ cut here ]
[  226.923339] WARNING: CPU: 1 PID: 5863 at fs/btrfs/transaction.c:319 
record_root_in_trans+0xd6/0x100 [btrfs]()
[  226.923340] Modules linked in: auth_rpcgss oid_registry nfsv4 btrfs xor 
raid6_pq loop nfs lockd grace sunrpc autofs4 sch_fq_codel radeon 
snd_hda_codec_realtek x86_pkg_temp_thermal snd_hda_codec_generic coretemp 
crc32_pclmul crc32c_intel aesni_intel i2c_algo_bit uvcvideo snd_hda_codec_hdmi 
aes_x86_64 drm_kms_helper videobuf2_vmalloc glue_helper videobuf2_memops 
syscopyarea lrw sysfillrect gf128mul videobuf2_v4l2 sysimgblt snd_usb_audio 
fb_sys_fops ablk_helper snd_hda_intel videobuf2_core ttm cryptd snd_hwdep 
v4l2_common usbhid snd_hda_codec snd_usbmidi_lib videodev snd_rawmidi drm 
snd_hda_core snd_seq_device i2c_i801 snd_pcm i2c_core snd_timer snd r8169 
soundcore mii parport_pc parport
[  226.923365] CPU: 1 PID: 5863 Comm: ls Not tainted 4.4.6 #1
[  226.923366] Hardware name: Gigabyte Technology Co., Ltd. 
P67-DS3-B3/P67-DS3-B3, BIOS F1 05/06/2011
[  226.923367]   8800da677d20 813181a8 

[  226.923368]  a0aacdbf 8800da677d58 810507b2 
880601e90800
[  226.923369]  8800dacf10a0 880601e90800 880601e909f0 
0001
[  226.923371] Call Trace:
[  226.923374]  [] dump_stack+0x4d/0x65
[  226.923376]  [] warn_slowpath_common+0x82/0xc0
[  226.923378]  [] warn_slowpath_null+0x1a/0x20
[  226.923387]  [] record_root_in_trans+0xd6/0x100 [btrfs]
[  226.923395]  [] btrfs_record_root_in_trans+0x44/0x70 
[btrfs]
[  226.923404]  [] start_transaction+0x9e/0x4c0 [btrfs]
[  226.923412]  [] btrfs_join_transaction+0x17/0x20 [btrfs]
[  226.923421]  [] btrfs_dirty_inode+0x35/0xd0 [btrfs]
[  226.923430]  [] btrfs_update_time+0x7d/0xb0 [btrfs]
[  226.923432]  [] touch_atime+0x88/0xa0
[  226.923434]  [] iterate_dir+0xdb/0x120
[  226.923435]  [] SyS_getdents+0x88/0xf0
[  226.923437]  [] ? fillonedir+0xd0/0xd0
[  226.923439]  [] entry_SYSCALL_64_fastpath+0x12/0x6a
[  226.923440] ---[ end trace 9c78caf253e284fe ]---

Code looks like:

..
static int record_root_in_trans(struct btrfs_trans_handle *trans,
   struct btrfs_root *root)
{
if (test_bit(BTRFS_ROOT_REF_COWS, >state) &&
root->last_trans < trans->transid) {
WARN_ON(root == root->fs_info->extent_root);
WARN_ON(root->commit_root != root->node);
..

There's been a few journal/recovery/directory consistency patches recently,
so maybe it's a corner case or an older problem. I'll try to bisect, but
meanwhile wanted to report it for discussion.

Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] delete obsolete function btrfs_print_tree()

2016-04-04 Thread Holger Hoffstätte

On 04/04/16 18:02, Filipe Manana wrote:
> I use this function frequently during development, and there's a good
> reason to use it instead of the user space tool btrfs-debug-tree.

Good to know, that's why I asked. Printing unwritten extents makes sense.

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

1 2 3 >

1 - 100 of 233 matches

Mail list logo