Re: [PATCH] btrfs: Fix UAF

2018-01-28 Thread Nikolay Borisov


On 29.01.2018 04:38, Anand Jain wrote:
> 
> 
> On 01/26/2018 09:20 PM, Nikolay Borisov wrote:
>> Commit 4fde46f0cc71 ("Btrfs: free the stale device") introduced
>> btrfs_free_stale_device which iterates the device lists for all
>> registered btrfs filesystems and deletes those devices which aren't
>> mounted. In a btrfs_devices structure has only 1 device attached to it
>> and it is unused then btrfs_free_stale_devices will proceed to also
>> free the btrfs_fs_devices struct itself. Currently this leads to a UAF
>> since list_for_each_entry will try to perform a check on the already-
>> freed memory to see if it has to terminated the loop.
>>
>> The fix is to use 'break' when we know we are freeing the current
>> fs_devs.
> 
>  No break is needed as we need to iterate all stale devices and delete
>  the found stale entry, so commit [1] used list_for_each_entry_safe()
>  and removed the break,

We only do the break if we know we have a single device in the current
fs_devs struct. And executing free_fs_devices would have already freed
the device + fs_devs.

> 
>  [1]
>   commit 38cf665d338fca33af4b16f9ec7cad6637fc0fec
>   Author: Anand Jain 
>     btrfs: make btrfs_free_stale_device() to iterate all stales
> 
> 
>  I am guessing UAF might be in[2], instead ?
> 
>  [2]
>     free_fs_devices(fs_devs)
> ::
>     while (!list_empty(_devices->devices)) {
>     device = list_entry(fs_devices->devices.next,
>     struct btrfs_device, dev_list);

It's not that, I thought so at first. But here using list_empty you are
awlays accessing the head the of the list, which is guaranteed to be
valid since you do the freeing of fs_devices outside of the while loop.


> 
> Thanks, Anand
> 
>> Fixes: 4fde46f0cc71 ("Btrfs: free the stale device")
>> Signed-off-by: Nikolay Borisov 
>>
>> ---
>>   fs/btrfs/volumes.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index f7147740b68e..c3ab55336ee0 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -645,6 +645,7 @@ static void btrfs_free_stale_devices(const char
>> *path,
>>   btrfs_sysfs_remove_fsid(fs_devs);
>>   list_del(_devs->list);
>>   free_fs_devices(fs_devs);
>> +    break;
>>   } else {
>>   fs_devs->num_devices--;
>>   list_del(>dev_list);
>>
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 00/17] btrfs-progs: lowmem check: avoid extents overwrite

2018-01-28 Thread Su Yue

Just ping since some patches are meaningful
for further fixes.

On 01/11/2018 03:34 PM, Su Yue wrote:

This patchset can be fetched from my github:
https://github.com/Damenly/btrfs-progs/tree/lowmem
based on kdave/devel.


It can be rebased to kdave/master now but not changing
kdave/devel.

And the below solves little conflicts with kdave/devel:
https://github.com/Damenly/btrfs-progs/tree/lowmem_dev

Thanks,
Su


Thanks to Qu Wenruo's ideas and suggestions first.

Patch[1-3] fix minor problems of lowmem repair.

Patch[4-8] introduce two ways to avoid extents overwrite:
1) Traverse trees and exclude all metadata blocks.
It's time-inefficient for large filesystems.
2) Mark all existed chunks full, allocate new chunk for CoW and
records chunk start.
If the last allocated chunk is almost full, allocated a new one.
2) is More efficient than 1). However, it can't handle situations
like no space(fsck/004).
Lowmem repair will try method 2 first and then method 1.

Patch[9-17] remove parameters @trans in functions for lowmem repair.
They try to avoid extents overwrite if necessary and start
transactions by themselves.

Those patches are mainly for lowmem repair. Original mode is not
influenced.

---
Changlog:
v2->v3:
  - check_btrfs_root() returns FATAL_ERROR if check_fs_first_inode()
failed. Thanks Nikolay Borisov.
  - Add function try_to_force_cow_in_new_chunk() and global u64
varaiable to record start of the last allocated chunk.
  - Remove unused EXTENTS_PIN in enum lowmem_extents_operation.
  
v1->v2:

  - Let @err in check_btrfs_root() record err bits but excluded
negative values.
  - Do not delete a line of code to release path after extent item'
insertion in repair_extent_data_item().
  - Add patch[3].
  - Force CoW in new allocated chunk to avoid extents overwrite.
  - Remove parameters @trans in check_chunks_and_extents_v2() and
related callees.
  - Repair functions for lowmem mode call try_avoid_extents_overwrite()
and start transactions.

Su Yue (17):

   btrfs-progs: lowmem check: release path in repair_extent_data_item()
   btrfs-progs: lowmem check: record returned errors after
 walk_down_tree_v2()
   btrfs-progs: lowmem check: assign @parent early in
 repair_extent_data_item()
   btrfs-progs: lowmem check: exclude extents of metadata blocks
   btrfs-progs: lowmem check: introduce mark/clear_block_groups_full()
   btrfs-progs: lowmem check: introduce force_cow_in_new_chunk()
   btrfs-progs: lowmem check: introduce try_avoid_extents_overwrite()
   btrfs-progs: lowmem check: exclude extents if init-extent-tree in
 lowmem
   btrfs-progs: lowmem check: start to remove parameters @trans in lowmem
   btrfs-progs: lowmem check: remove parameter @trans of
 delete_extent_item()
   btrfs-progs: lowmem check: remove parameter @trans of
 repair_chunk_item()
   btrfs-progs: lowmem check: remove parameter @trans of
 repair_extent_item()
   btrfs-progs: lowmem check: remove parameter @trans of
 check_leaf_items()
   btrfs-progs: lowmem check: remove parameter @trans of
 repair_tree_back_ref()
   btrfs-progs: lowmem check: remove parameter @trans of
 check_btrfs_root()
   btrfs-progs: lowmem check: introduce repair_block_accounting()
   btrfs-progs: lowmem check: end of removing parameters @trans in lowmem

  cmds-check.c | 710 ++-
  1 file changed, 604 insertions(+), 106 deletions(-)




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Fix UAF

2018-01-28 Thread Anand Jain



On 01/26/2018 09:20 PM, Nikolay Borisov wrote:

Commit 4fde46f0cc71 ("Btrfs: free the stale device") introduced
btrfs_free_stale_device which iterates the device lists for all
registered btrfs filesystems and deletes those devices which aren't
mounted. In a btrfs_devices structure has only 1 device attached to it
and it is unused then btrfs_free_stale_devices will proceed to also
free the btrfs_fs_devices struct itself. Currently this leads to a UAF
since list_for_each_entry will try to perform a check on the already-
freed memory to see if it has to terminated the loop.

The fix is to use 'break' when we know we are freeing the current
fs_devs.


 No break is needed as we need to iterate all stale devices and delete
 the found stale entry, so commit [1] used list_for_each_entry_safe()
 and removed the break,

 [1]
  commit 38cf665d338fca33af4b16f9ec7cad6637fc0fec
  Author: Anand Jain 
btrfs: make btrfs_free_stale_device() to iterate all stales


 I am guessing UAF might be in[2], instead ?

 [2]
free_fs_devices(fs_devs)
::
while (!list_empty(_devices->devices)) {
device = list_entry(fs_devices->devices.next,
struct btrfs_device, dev_list);

Thanks, Anand


Fixes: 4fde46f0cc71 ("Btrfs: free the stale device")
Signed-off-by: Nikolay Borisov 

>

---
  fs/btrfs/volumes.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f7147740b68e..c3ab55336ee0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -645,6 +645,7 @@ static void btrfs_free_stale_devices(const char *path,
btrfs_sysfs_remove_fsid(fs_devs);
list_del(_devs->list);
free_fs_devices(fs_devs);
+   break;
} else {
fs_devs->num_devices--;
list_del(>dev_list);


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/26] Add libbtrfsutil

2018-01-28 Thread Qu Wenruo


On 2018年01月27日 02:40, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Currently, users wishing to manage Btrfs filesystems programatically
> have to shell out to btrfs-progs and parse the output. This isn't ideal.
> The goal of libbtrfsutil is to provide a library version of as many of
> the operations of btrfs-progs as possible and to migrate btrfs-progs to
> use it.
> 
> Rather than simply refactoring the existing btrfs-progs code, the code
> has to be written from scratch for a couple of reasons:
> 
> * A lot of the btrfs-progs code was not designed with a nice library API
>   in mind in terms of reusability, naming, and error reporting.
> * libbtrfsutil is licensed under the LGPL, whereas btrfs-progs is under
>   the GPL, which makes it dubious to directly copy or move the code.
> 
> Eventually, most of the low-level btrfs-progs code should either live in
> libbtrfsutil or the shared kernel/userspace filesystem code, and
> btrfs-progs will just be the CLI wrapper.
> 
> This first commit just includes the build system changes, license,
> README, and error reporting helper.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  .gitignore  |   2 +
>  Makefile|  47 +--
>  libbtrfsutil/COPYING| 674 
> 
>  libbtrfsutil/COPYING.LESSER | 165 +++
>  libbtrfsutil/README.md  |  32 +++
>  libbtrfsutil/btrfsutil.h|  72 +
>  libbtrfsutil/errors.c   |  55 
>  7 files changed, 1031 insertions(+), 16 deletions(-)
>  create mode 100644 libbtrfsutil/COPYING
>  create mode 100644 libbtrfsutil/COPYING.LESSER
>  create mode 100644 libbtrfsutil/README.md
>  create mode 100644 libbtrfsutil/btrfsutil.h
>  create mode 100644 libbtrfsutil/errors.c
> 
> diff --git a/.gitignore b/.gitignore
> index 8e607f6e..272d53e4 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -43,6 +43,8 @@ libbtrfs.so.0.1
>  library-test
>  library-test-static
>  /fssum
> +/libbtrfsutil.so*
> +/libbtrfsutil.a
>  
>  /tests/*-tests-results.txt
>  /tests/test-console.txt
> diff --git a/Makefile b/Makefile
> index 6369e8f4..062f7f3c 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -73,6 +73,7 @@ CFLAGS = $(SUBST_CFLAGS) \
>-fPIC \
>-I$(TOPDIR) \
>-I$(TOPDIR)/kernel-lib \
> +  -I$(TOPDIR)/libbtrfsutil \
>$(EXTRAWARN_CFLAGS) \
>$(DEBUG_CFLAGS_INTERNAL) \
>$(EXTRA_CFLAGS)
> @@ -121,12 +122,14 @@ libbtrfs_headers = send-stream.h send-utils.h send.h 
> kernel-lib/rbtree.h btrfs-l
>  kernel-lib/crc32c.h kernel-lib/list.h kerncompat.h \
>  kernel-lib/radix-tree.h kernel-lib/sizes.h kernel-lib/raid56.h \
>  extent-cache.h extent_io.h ioctl.h ctree.h btrfsck.h version.h
> +libbtrfsutil_version := 0.1
> +libbtrfsutil_objects = libbtrfsutil/errors.o
>  convert_objects = convert/main.o convert/common.o convert/source-fs.o \
> convert/source-ext2.o convert/source-reiserfs.o
>  mkfs_objects = mkfs/main.o mkfs/common.o
>  image_objects = image/main.o image/sanitize.o
>  all_objects = $(objects) $(cmds_objects) $(libbtrfs_objects) 
> $(convert_objects) \
> -   $(mkfs_objects) $(image_objects)
> +   $(mkfs_objects) $(image_objects) $(libbtrfsutil_objects)
>  
>  TESTS = fsck-tests.sh convert-tests.sh
>  
> @@ -248,10 +251,10 @@ static_convert_objects = $(patsubst %.o, %.static.o, 
> $(convert_objects))
>  static_mkfs_objects = $(patsubst %.o, %.static.o, $(mkfs_objects))
>  static_image_objects = $(patsubst %.o, %.static.o, $(image_objects))
>  
> -libs_shared = libbtrfs.so.0.1
> -libs_static = libbtrfs.a
> +libs_shared = libbtrfs.so.0.1 libbtrfsutil.so.$(libbtrfsutil_version)
> +libs_static = libbtrfs.a libbtrfsutil.a
>  libs = $(libs_shared) $(libs_static)
> -lib_links = libbtrfs.so.0 libbtrfs.so
> +lib_links = libbtrfs.so.0 libbtrfs.so libbtrfsutil.so.0 libbtrfsutil.so
>  headers = $(libbtrfs_headers)
>  
>  # make C=1 to enable sparse
> @@ -289,7 +292,7 @@ endif
>   $(Q)$(CC) $(STATIC_CFLAGS) -c $< -o $@ $($(subst 
> -,_,$(@:%.static.o=%)-cflags)) \
>   $($(subst -,_,btrfs-$(@:%/$(notdir $@)=%)-cflags))
>  
> -all: $(progs) libbtrfs $(BUILDDIRS)
> +all: $(progs) $(libs) $(lib_links) $(BUILDDIRS)
>  $(SUBDIRS): $(BUILDDIRS)
>  $(BUILDDIRS):
>   @echo "Making all in $(patsubst build-%,%,$@)"
> @@ -353,20 +356,31 @@ kernel-lib/tables.c:
>   @echo "[TABLE]  $@"
>   $(Q)./mktables > $@ || ($(RM) -f $@ && exit 1)
>  
> -libbtrfs: $(libs_shared) $(lib_links)
> +libbtrfs.so.0.1: $(libbtrfs_objects)
> + @echo "[LD] $@"
> + $(Q)$(CC) $(CFLAGS) $^ $(LDFLAGS) $(LIBBTRFS_LIBS) \
> + -shared -Wl,-soname,libbtrfs.so.0 -o $@
> +
> +libbtrfs.a: $(libbtrfs_objects)
> + @echo "[AR] $@"
> + $(Q)$(AR) cr $@ $^
> +
> +libbtrfs.so.0 libbtrfs.so: libbtrfs.so.0.1
> + @echo "[LN] $@"
> + $(Q)$(LN_S) -f $< $@
>  
> -$(libs_shared): $(libbtrfs_objects) 

Re: btrfs check: backref lost, mismatch with its hash -- can't repair

2018-01-28 Thread Qu Wenruo


On 2018年01月26日 23:15, ^m'e wrote:
> On Fri, Jan 26, 2018 at 12:16 PM, Qu Wenruo  wrote:
>> Branch updated, problem in 1399 should be fixed.
>>
>> Seems the remaining problems are less and less now.
>>
>> Thanks,
>> Qu
>>
> 
> Great! The fix worked, and repair goes throught at last :-) though
> we're still left with some inconsistencies at check:
> (1st run)
> --
> checking fs roots
> ERROR: root 1385 INODE[30039323] is orphan item
> ERROR: root 1385 INODE[18446744073709551361] is orphan item
> ERROR: root 1399 DIR INODE [30039322] size 99 not equal to 96

As you could see, dir isize can be fixed by both repair mode.
(Just in case, lowmem mode also supports repair)

> ERROR: root 1399 INODE[18446744073709551361] is orphan item
> ERROR: errors found in fs roots
> ...
> --
> 
> repair (1st run)
> --
> Fixed 0 roots.
> checking extents
> checking free space cache
> checking fs roots
> warning line 4682
> checking csums
> checking root refs
> enabling repair mode
> Checking filesystem on /dev/sdb3
> UUID: de1723e2-150c-4448-bb36-be14d7d96093
> No device size related problem found
> cache and super generation don't match, space cache will be invalidated
> reset isize for dir 30039322 root 1399
> found 104509362176 bytes used, no error found
> total csum bytes: 99690492
> total tree bytes: 2394259456
> total fs tree bytes: 2173468672
> total extent tree bytes: 87900160
> btree space waste bytes: 574264314
> file data blocks allocated: 190189735936
>  referenced 150679924736
> --
> 
> check (2nd run)
> --
> checking fs roots
> ERROR: root 1385 INODE[30039323] is orphan item
> ERROR: root 1385 INODE[18446744073709551361] is orphan item
> ERROR: root 1399 INODE[18446744073709551361] is orphan item

I'm not sure if the orphan item report is necessary for lowmem mode.

IIRC orphan item is completely valid.

I'll dig it further to fit it.

> ERROR: errors found in fs roots
> Checking filesystem on /dev/sdb3
> UUID: de1723e2-150c-4448-bb36-be14d7d96093
> cache and super generation don't match, space cache will be invalidated
> found 104509362176 bytes used, error(s) found
> total csum bytes: 99690492
> total tree bytes: 14561427456
> total fs tree bytes: 14340636672
> total extent tree bytes: 87900160
> btree space waste bytes: 3163391144
> file data blocks allocated: 362059718656
>  referenced 315586768896
> --
> 
> repair (2nd run)
> --
> Fixed 0 roots.
> checking extents
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> enabling repair mode
> Checking filesystem on /dev/sdb3
> UUID: de1723e2-150c-4448-bb36-be14d7d96093
> No device size related problem found
> cache and super generation don't match, space cache will be invalidated
> found 104509362176 bytes used, no error found
> total csum bytes: 99690492
> total tree bytes: 2394259456
> total fs tree bytes: 2173468672
> total extent tree bytes: 87900160
> btree space waste bytes: 574264314
> file data blocks allocated: 190189735936
>  referenced 150679924736
> --
> 
> It looks like it didn't detect anything bad...
> As a side note, check in "original" mode doesn't detect the issues as
> in runs reported above.
> 
> Debugging:
> --
> # ./btrfs.static inspect dump-tree -t 1385 /dev/sdb3 | grep -C 20
> 18446744073709551361
> item 45 key (47329988 INODE_REF 256) itemoff 9768 itemsize 20
> index 28 namelen 10 name: lost+found
> item 46 key (47329988 DIR_ITEM 2438219243) itemoff 9726 itemsize 42
> location key (30039324 INODE_ITEM 0) type FILE
> transid 0 data_len 0 name_len 12
> name: metadata.xml
> item 47 key (47329988 DIR_INDEX 2) itemoff 9684 itemsize 42
> location key (30039324 INODE_ITEM 0) type FILE
> transid 0 data_len 0 name_len 12
> name: metadata.xml
> item 48 key (MULTIPLE INODE_ITEM 0) itemoff 9524 itemsize 160
> generation 461639 transid 0 size 0 nbytes 0
> block group 0 mode 100700 links 0 uid 0 gid 0 rdev 0
> sequence 0 flags 0x0(none)
> atime 1516888573.0 (2018-01-25 13:56:13)
> ctime 1516888573.0 (2018-01-25 13:56:13)
> mtime 1516888573.0 (2018-01-25 13:56:13)
> otime 0.0 (1970-01-01 00:00:00)
> item 49 key (ORPHAN ORPHAN_ITEM 30039323) itemoff 9524 itemsize 0
> orphan item
> item 50 key (ORPHAN ORPHAN_ITEM 

Re: degraded permanent mount option

2018-01-28 Thread Chris Murphy
On Sun, Jan 28, 2018 at 3:39 PM, Tomasz Pala  wrote:
> On Sun, Jan 28, 2018 at 13:02:08 -0700, Chris Murphy wrote:
>
>>> Tell me please, if you mount -o degraded btrfs - what would
>>> BTRFS_IOC_DEVICES_READY return?
>>
>> case BTRFS_IOC_DEVICES_READY:
>> ret = btrfs_scan_one_device(vol->name, FMODE_READ,
>> _fs_type, _devices);
>> if (ret)
>> break;
>> ret = !(fs_devices->num_devices == fs_devices->total_devices);
>> break;
>>
>>
>> All it cares about is whether the number of devices found is the same
>> as the number of devices any of that volume's supers claim make up
>> that volume. That's it.
>>
>>> This is not "outsmarting" nor "knowing better", on the contrary, this is 
>>> "FOLLOWING the
>>> kernel-returned data". The umounting case is simply a bug in btrfs.ko
>>> that should change to READY state *if* someone has tried and apparently
>>> succeeded mounting the not-ready volume.
>>
>> Nope. That is not what the ioctl does.
>
> So who is to blame for creating utterly useless code? Userspace
> shouldn't depend on some stats (as number of devices is nothing more
> than that), but overall _availability_.

There's quite a lot missing. Btrfs doesn't even really have a degraded
state concept. It has a degraded mount option, but this is not a
state. e.g. if you have a normally mounted volume, and a drive dies or
vanishes, there's no way for the user to know the array is degraded.
They can only infer that it's degraded by a.) metric f tons of
read/write errors to a bdev b.) the application layer isn't pissed off
about it; or in lieu of a. they see via 'btrfs fi show' that a device
is missing. Likewise, when a device is failing to read and write,
Btrfs doesn't consider it faulty and boot it out of the array, it just
keeps on trying, the spew of which can cause disk contention of those
errors are written to a log on spinning rust.

Anyway, the fact many state features are missing doesn't mean the
necessary information to do the right thing is missing.



> I do not care if there are 2, 5 or 100 devices. I do care if there is
> ENOUGH devices to run regular (including N-way mirroring and hot spares)
> and if not - if there is ENOUGH devices to run degraded. Having ALL the
> devices is just the edge case.

systemd can't possibly need to know more information than a person
does in the exact same situation in order to do the right thing. No
human would wait 10 minutes, let alone literally the heat death of the
planet for "all devices have appeared" but systemd will. And it does
that by its own choice, its own policy. That's the complaint. It's
choosing to do something a person wouldn't do, given identical
available information. There's nothing the kernel is doing that's
telling systemd to wait for goddamn ever.



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-28 Thread Tomasz Pala
On Sun, Jan 28, 2018 at 13:28:55 -0700, Chris Murphy wrote:

>> Are you sure you really understand the problem? No mount happens because
>> systemd waits for indication that it can mount and it never gets this
>> indication.
> 
> "not ready" is rather vague terminology but yes that's how systemd
> ends up using the ioctl this rule depends on, even though the rule has
> nothing to do with readiness per se. If all devices for a volume

If you avoid using THIS ioctl, then you'd have nothing to fire the rule
at all. One way or another, this is btrfs that must emit _some_ event or
be polled _somehow_.

> aren't found, we can correctly conclude a normal mount attempt *will*
> fail. But that's all we can conclude. What I can't parse in all of
> this is if the udev rule is a one shot, if the ioctl is a one shot, if
> something is constantly waiting for "not all devices are found" to
> transition to "all devices are found" or what. I can't actually parse

It's not one shot. This works like this:

sda1 appears -> udev catches event -> udev detects btrfs and IOCTLs => not ready
sdb1 appears -> udev catches event -> udev detects btrfs and IOCTLs => ready

The end.

If there were some other device appearing after assembly, like /dev/md1,
or if there were some event generated by btrfs code itself, udev could
catch this and follow. Now, if you unplug sdb1, there's no such event at
all.

Since this IOCTL is the *only* thing that udev can rely on, it cannot be
removed from the logic. So even if you create a timer to force assembly,
you must do it by influencing the IOCTL response.

Or creating some other IOCTL for this purpose, or creating some
userspace daemon or whatever.

> the two critical lines in this rule. I
> 
> # let the kernel know about this btrfs filesystem, and check if it is complete
> IMPORT{builtin}="btrfs ready $devnode"

This sends IOCTL.

> # mark the device as not ready to be used by the system
> ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"
  ^^this is IOCTL response being checked

and SYSTEMD_READY set to 0 prevents systemd from mounting.

> I think the Btrfs ioctl is a one shot. Either they are all present or not.

The rules are called once per (block) device.
So when btrfs scans all the devices to return READY, this would finally
be systemd-ready. This is trivial to re-trigger udev rule (udevadm trigger),
but there is no way to force btrfs to return READY after any timeout.

> The waiting is a policy by systemd udev rule near as I can tell.

There is no problem in waiting or re-triggering. This can be done in ~10
lines of rules. The problem is that the IOCTL won't EVER return READY until
there are ALL the components present.

It's simple as that: there MUST be some mechanism at device-manager
level that tells if a compound device is mountable, degraded or not;
upper layers (systemd-mount) do not care about degradation, handling
redundancy/mirrors/chunks/stripes/spares is not it's job.
It (systemd) can (easily!) handle expiration timer to push pending
compound to be force-assembled, but currently there is no way to push.


If the IOCTL would be extended to return TRYING_DEGRADED (when
instructed to do so after expired timeout), systemd could handle
additional per-filesystem fstab options, like x-systemd.allow-degraded.

Then in would be possible to have best-effort policy for rootfs (to make
machine boot), and more strict one for crucial data (do not mount it
when there is no redundancy, wait for operator intervention).

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-28 Thread Tomasz Pala
On Sun, Jan 28, 2018 at 13:02:08 -0700, Chris Murphy wrote:

>> Tell me please, if you mount -o degraded btrfs - what would
>> BTRFS_IOC_DEVICES_READY return?
> 
> case BTRFS_IOC_DEVICES_READY:
> ret = btrfs_scan_one_device(vol->name, FMODE_READ,
> _fs_type, _devices);
> if (ret)
> break;
> ret = !(fs_devices->num_devices == fs_devices->total_devices);
> break;
> 
> 
> All it cares about is whether the number of devices found is the same
> as the number of devices any of that volume's supers claim make up
> that volume. That's it.
>
>> This is not "outsmarting" nor "knowing better", on the contrary, this is 
>> "FOLLOWING the
>> kernel-returned data". The umounting case is simply a bug in btrfs.ko
>> that should change to READY state *if* someone has tried and apparently
>> succeeded mounting the not-ready volume.
> 
> Nope. That is not what the ioctl does.

So who is to blame for creating utterly useless code? Userspace
shouldn't depend on some stats (as number of devices is nothing more
than that), but overall _availability_.

I do not care if there are 2, 5 or 100 devices. I do care if there is
ENOUGH devices to run regular (including N-way mirroring and hot spares)
and if not - if there is ENOUGH devices to run degraded. Having ALL the
devices is just the edge case.

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-28 Thread Chris Murphy
On Sun, Jan 28, 2018 at 1:06 AM, Andrei Borzenkov  wrote:
> 27.01.2018 18:22, Duncan пишет:
>> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
>>
>>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
 On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:

>> I just tested to boot with a single drive (raid1 degraded), even
>> with degraded option in fstab and grub, unable to boot !  The boot
>> process stop on initramfs.
>>
>> Is there a solution to boot with systemd and degraded array ?
>
> No. It is finger pointing. Both btrfs and systemd developers say
> everything is fine from their point of view.
>>>
>>> It's quite obvious who's the culprit: every single remaining rc system
>>> manages to mount degraded btrfs without problems.  They just don't try
>>> to outsmart the kernel.
>>
>> No kidding.
>>
>> All systemd has to do is leave the mount alone that the kernel has
>> already done,
>
> Are you sure you really understand the problem? No mount happens because
> systemd waits for indication that it can mount and it never gets this
> indication.

"not ready" is rather vague terminology but yes that's how systemd
ends up using the ioctl this rule depends on, even though the rule has
nothing to do with readiness per se. If all devices for a volume
aren't found, we can correctly conclude a normal mount attempt *will*
fail. But that's all we can conclude. What I can't parse in all of
this is if the udev rule is a one shot, if the ioctl is a one shot, if
something is constantly waiting for "not all devices are found" to
transition to "all devices are found" or what. I can't actually parse
the two critical lines in this rule. I


$ cat /usr/lib/udev/rules.d/64-btrfs.rules
# do not edit this file, it will be overwritten on update

SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

# let the kernel know about this btrfs filesystem, and check if it is complete
IMPORT{builtin}="btrfs ready $devnode"

# mark the device as not ready to be used by the system
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

LABEL="btrfs_end"




And udev builtin btrfs, which I guess the above rule is referring to:

https://github.com/systemd/systemd/blob/master/src/udev/udev-builtin-btrfs.c

I think the Btrfs ioctl is a one shot. Either they are all present or
not. The waiting is a policy by systemd udev rule near as I can tell.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-28 Thread Chris Murphy
On Sat, Jan 27, 2018 at 5:39 PM, Tomasz Pala  wrote:
> On Sat, Jan 27, 2018 at 15:22:38 +, Duncan wrote:
>
>>> manages to mount degraded btrfs without problems.  They just don't try
>>> to outsmart the kernel.
>>
>> No kidding.
>>
>> All systemd has to do is leave the mount alone that the kernel has
>> already done, instead of insisting it knows what's going on better than
>> the kernel does, and immediately umounting it.
>
> Tell me please, if you mount -o degraded btrfs - what would
> BTRFS_IOC_DEVICES_READY return?


case BTRFS_IOC_DEVICES_READY:
ret = btrfs_scan_one_device(vol->name, FMODE_READ,
_fs_type, _devices);
if (ret)
break;
ret = !(fs_devices->num_devices == fs_devices->total_devices);
break;


All it cares about is whether the number of devices found is the same
as the number of devices any of that volume's supers claim make up
that volume. That's it.


>
> This is not "outsmarting" nor "knowing better", on the contrary, this is 
> "FOLLOWING the
> kernel-returned data". The umounting case is simply a bug in btrfs.ko
> that should change to READY state *if* someone has tried and apparently
> succeeded mounting the not-ready volume.


Nope. That is not what the ioctl does.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: fix err_cast.cocci warnings

2018-01-28 Thread kbuild test robot
From: Fengguang Wu 

fs/btrfs/volumes.c:742:10-17: WARNING: ERR_CAST can be used with fs_devices


 Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...))

Generated by: scripts/coccinelle/api/err_cast.cocci

Fixes: bf155c98d312 ("btrfs: get device pointer from device_list_add()")
CC: Anand Jain 
Signed-off-by: Fengguang Wu 
---

 volumes.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -739,7 +739,7 @@ static noinline struct btrfs_device *dev
if (!fs_devices) {
fs_devices = alloc_fs_devices(disk_super->fsid);
if (IS_ERR(fs_devices))
-   return ERR_PTR(PTR_ERR(fs_devices));
+   return ERR_CAST(fs_devices);
 
list_add(_devices->list, _uuids);
 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-28 Thread Andrei Borzenkov
28.01.2018 18:57, Duncan пишет:
> Andrei Borzenkov posted on Sun, 28 Jan 2018 11:06:06 +0300 as excerpted:
> 
>> 27.01.2018 18:22, Duncan пишет:
>>> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
>>>
 On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>
>>> I just tested to boot with a single drive (raid1 degraded), even
>>> with degraded option in fstab and grub, unable to boot !  The boot
>>> process stop on initramfs.
>>>
>>> Is there a solution to boot with systemd and degraded array ?
>>
>> No. It is finger pointing. Both btrfs and systemd developers say
>> everything is fine from their point of view.

 It's quite obvious who's the culprit: every single remaining rc system
 manages to mount degraded btrfs without problems.  They just don't try
 to outsmart the kernel.
>>>
>>> No kidding.
>>>
>>> All systemd has to do is leave the mount alone that the kernel has
>>> already done,
>>
>> Are you sure you really understand the problem? No mount happens because
>> systemd waits for indication that it can mount and it never gets this
>> indication.
> 
> As Tomaz indicates, I'm talking about manual mounting (after the initr* 
> drops to a maintenance prompt if it's root being mounted, or on manual 
> mount later if it's an optional mount) here.  The kernel accepts the 
> degraded mount and it's mounted for a fraction of a second, but systemd 
> actually undoes the successful work of the kernel to mount it, so by the 
> time the prompt returns and a user can check, the filesystem is unmounted 
> again, with the only indication that it was mounted at all being the log.
> 

This is fixed in current systemd (actually for quite some time). If you
still observe it with more or less recent systemd, report a bug.

> He says that's because the kernel still says it's not ready, but that's 
> for /normal/ mounting.  The kernel accepted the degraded mount and 
> actually mounted the filesystem, but systemd undoes that.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-28 Thread Duncan
Andrei Borzenkov posted on Sun, 28 Jan 2018 11:06:06 +0300 as excerpted:

> 27.01.2018 18:22, Duncan пишет:
>> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
>> 
>>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
 On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:

>> I just tested to boot with a single drive (raid1 degraded), even
>> with degraded option in fstab and grub, unable to boot !  The boot
>> process stop on initramfs.
>>
>> Is there a solution to boot with systemd and degraded array ?
>
> No. It is finger pointing. Both btrfs and systemd developers say
> everything is fine from their point of view.
>>>
>>> It's quite obvious who's the culprit: every single remaining rc system
>>> manages to mount degraded btrfs without problems.  They just don't try
>>> to outsmart the kernel.
>> 
>> No kidding.
>> 
>> All systemd has to do is leave the mount alone that the kernel has
>> already done,
> 
> Are you sure you really understand the problem? No mount happens because
> systemd waits for indication that it can mount and it never gets this
> indication.

As Tomaz indicates, I'm talking about manual mounting (after the initr* 
drops to a maintenance prompt if it's root being mounted, or on manual 
mount later if it's an optional mount) here.  The kernel accepts the 
degraded mount and it's mounted for a fraction of a second, but systemd 
actually undoes the successful work of the kernel to mount it, so by the 
time the prompt returns and a user can check, the filesystem is unmounted 
again, with the only indication that it was mounted at all being the log.

He says that's because the kernel still says it's not ready, but that's 
for /normal/ mounting.  The kernel accepted the degraded mount and 
actually mounted the filesystem, but systemd undoes that.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-28 Thread Tomasz Pala
On Sun, Jan 28, 2018 at 01:00:16 +0100, Tomasz Pala wrote:

> It can't mount degraded, because the "missing" device might go online a
> few seconds ago.

s/ago/after/

>> The central problem is the lack of a timer and time out.
> 
> You got mdadm-last-resort@.timer/service above, if btrfs doesn't lack
> anything, as you all state here, this should be easy to make this work.
> Go ahead please.

And just to make it even easier - this is how you can react to events
inside udev (this is to eliminane btrfs-scan tool being required as it sux):

https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f

One could even try to trick systemd by SETTING (note the single '=')

ENV{ID_BTRFS_READY}="0"

- which would probably break as soon as btrfs.ko emits next 'changed' event.

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-28 Thread Tomasz Pala
On Sun, Jan 28, 2018 at 11:06:06 +0300, Andrei Borzenkov wrote:

>> All systemd has to do is leave the mount alone that the kernel has 
>> already done,
> 
> Are you sure you really understand the problem? No mount happens because
> systemd waits for indication that it can mount and it never gets this
> indication.

And even after successful manual mount (with -o degraded) btrfs.ko
insists that the device is not ready.

That schizophrenia makes systemd umount that immediately, because this
is the only proper way to handle missing devices (only the failed ones
should go r/o). And there is really nothing systemd can do about this,
until underlying code stops lying, unless we're going back to 1990s when
devices were never unplugged or detached during system uptime. But even
floppies could be ejected without system reboot.

BTRFS is no exception here - when marked as 'not available',
don't expect it to be kept used. Just fix the code to match reality.

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-28 Thread Andrei Borzenkov
27.01.2018 18:22, Duncan пишет:
> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
> 
>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>>>
> I just tested to boot with a single drive (raid1 degraded), even
> with degraded option in fstab and grub, unable to boot !  The boot
> process stop on initramfs.
>
> Is there a solution to boot with systemd and degraded array ?

 No. It is finger pointing. Both btrfs and systemd developers say
 everything is fine from their point of view.
>>
>> It's quite obvious who's the culprit: every single remaining rc system
>> manages to mount degraded btrfs without problems.  They just don't try
>> to outsmart the kernel.
> 
> No kidding.
> 
> All systemd has to do is leave the mount alone that the kernel has 
> already done,

Are you sure you really understand the problem? No mount happens because
systemd waits for indication that it can mount and it never gets this
indication.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html