cannot delete subvolumes
Hi all,

Our specifications are:
- Debian 7.7
- kernel version: 3.17.4

Two weeks ago we started a btrfs balance. Some days later we stopped the balance. The messages log shows some errors (see below). While balancing we removed some snapshots; these operations failed with error messages, for example:

  ERROR: error accessing '/mnt/btrfs/snapshots/home/2014-10-23--20-31-18-@daily'

Now there are some subvolumes without an owner, last-modified date etc. Additionally, the ls command returns an error:

  stuelpner@hsad-srv-03:/boot$ ls -l /mnt/btrfs/snapshots/home/
  ls: cannot access /mnt/btrfs/snapshots/home/2014-10-23--20-31-18-@daily: Cannot allocate memory
  total 0
  drwxr-xr-x 1 root root 952 May 20  2014 2014-06-30--18-49-09-@monthly
  drwxr-xr-x 1 root root 952 May 20  2014 2014-07-01--02-52-01-@monthly
  drwxr-xr-x 1 root root 952 May 20  2014 2014-08-01--02-52-01-@monthly
  drwxr-xr-x 1 root root 990 Aug 28 07:00 2014-09-01--02-52-01-@monthly
  drwxr-xr-x 1 root root 990 Aug 28 07:00 2014-10-01--02-52-02-@monthly
  d????????? ? ?    ?      ?            ? 2014-10-23--20-31-18-@daily
  drwxr-xr-x 1 root root 996 Oct 21 12:38 2014-11-01--02-52-01-@monthly

These subvolumes can be neither accessed nor deleted:

  ~# btrfs subvolume delete /mnt/btrfs/snapshots/transfer/2014-10-21--20-50-44-\@daily
  Transaction commit: none (default)
  ERROR: error accessing '/mnt/btrfs/snapshots/transfer/2014-10-21--20-50-44-@daily'

Filesystem info:

  ~# btrfs fi show
  Label: 'btrfs-data'  uuid: 47a6ce34-6b63-4202-a7da-c1f6dbe48676
          Total devices 4 FS bytes used 2.22TiB
          devid 1 size 2.73TiB used 2.42TiB path /dev/sda
          devid 2 size 2.73TiB used 2.42TiB path /dev/sdb
          devid 3 size 2.73TiB used 186.00GiB path /dev/sdc
          devid 4 size 2.73TiB used 186.00GiB path /dev/sdd

  Label: 'spectral-data'  uuid: aaaba9e0-7710-4295-88b1-c0ee9bd2eff8
          Total devices 2 FS bytes used 640.00KiB
          devid 1 size 238.47GiB used 2.03GiB path /dev/sdg
          devid 2 size 238.47GiB used 2.01GiB path /dev/sdh

  Btrfs v3.12

  root@hsad-srv-03:~# btrfs fi df /mnt/btrfs
  Data, RAID1: total=2.32TiB, used=2.06TiB
  Data, single: total=8.00MiB, used=0.00B
  System, RAID1: total=8.00MiB, used=400.00KiB
  System, single: total=4.00MiB, used=0.00B
  Metadata, RAID1: total=287.00GiB, used=159.31GiB
  Metadata, single: total=8.00MiB, used=0.00B
  GlobalReserve, single: total=512.00MiB, used=0.00B

message.log (the balance task is blocked in D state; the hex addresses were lost in transit):

  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323328] btrfs D 88083ed92dc0 0 7313 7296 0x
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323335] 88068d816310 0082 88083ae971d0 0086
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323340] 00012dc0 00012dc0 88068d816310 88058028ffd8
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323345] 88083ec12dc0 7fff 7fff 0002
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323351] Call Trace:
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323364] [] ? console_conditional_schedule+0xf/0xf
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323372] [] ? schedule_timeout+0x1c/0xf7
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323378] [] ? __queue_work+0x1ef/0x23d
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323384] [] ? __wait_for_common+0x120/0x158
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323390] [] ? try_to_wake_up+0x1c7/0x1c7
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323441] [] ? btrfs_async_run_delayed_refs+0xbd/0xda [btrfs]
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323475] [] ? __btrfs_end_transaction+0x2b3/0x2d6 [btrfs]
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323512] [] ? relocate_block_group+0x2a1/0x4cd [btrfs]
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323548] [] ? btrfs_relocate_block_group+0x14f/0x267 [btrfs]
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323584] [] ? btrfs_relocate_chunk.isra.58+0x58/0x5e2 [btrfs]
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323609] [] ? btrfs_item_key_to_cpu+0x12/0x30 [btrfs]
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323645] [] ? btrfs_get_token_64+0x76/0xc6 [btrfs]
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323681] [] ? release_extent_buffer+0x9d/0xa4 [btrfs]
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323717] [] ? btrfs_balance+0x9b0/0xb9d [btrfs]
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323752] [] ? btrfs_ioctl_balance+0x21a/0x297 [btrfs]
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323788] [] ? btrfs_ioctl+0x1134/0x20e5 [btrfs]
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323794] [] ? path_openat+0x233/0x4c5
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323801] [] ? __do_page_fault+0x339/0x3df
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323807] [] ? vma_link+0x6b/0x8a
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323813] [] ? do_vfs_ioctl+0x3ed/0x436
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323819] [] ? SyS_ioctl+0x49/0x77
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323824] [] ? page_fault+0x22/0x30
  Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323829] [] ? system_call_fastpath+0x16/0x1b

Do you have an idea? Can you help us please?

Regards
Re: [PATCH 1/5] Avoid to consider lvm snapshots when scanning devices.
Hi Goffredo,

> On 12/08/2014 03:02 AM, Qu Wenruo wrote:
>> -------- Original Message --------
>> Subject: [PATCH 1/5] Avoid to consider lvm snapshots when scanning devices.
>> From: Goffredo Baroncelli
>> Date: December 5, 2014, 02:39
>>
>>> LVM snapshots create a problem for btrfs device management. Btrfs
>>> assumes that each device has a unique 'device UUID'; an LVM snapshot
>>> breaks this assumption. This patch skips LVM snapshots during the
>>> device scan phase. If you need to consider an LVM snapshot, you have
>>> to set the environment variable BTRFS_SKIP_LVM_SNAPSHOT to "no".
>>
>> IMHO, it is better to skip an LVM snapshot if and only if the
>> snapshot contains a btrfs with a conflicting UUID.
>
> Hi Qu,
>
> Currently the "scan phase" in btrfs is done one device at a time (udev
> finds a new device and starts "btrfs dev scan"), and I haven't changed
> that. This means that btrfs-progs doesn't know which devices are
> already registered. And even if it had this information, you would
> have to consider the case where the snapshot appears before the real
> target. So btrfs-progs is not in a position to perform this analysis
> [see below my other comment].
>
>> Since LVM is such a flexible block-level volume manager, it is
>> possible that someone made a snapshot of a btrfs filesystem and then
>> reformatted the original to another filesystem. In that case the LVM
>> snapshot skip seems overkill.
>>
>> Also, personally, I would prefer an option like -i to let the user
>> choose which device is used when a conflicting UUID is detected. This
>> seems the best approach, as the user has full control over the device
>> scan; it also makes the environment variable unnecessary. The LVM
>> snapshot skip (only when conflicting) is better as the fallback
>> behavior if -i is not given.
>
> I understood your reasons, but I don't find any solution compatible
> with the current btrfs device registration model (asynchronous, when
> the device appears).
>
> In another patch set I proposed a mount.btrfs helper, which is in a
> position to perform this analysis and to pick the "right" device
> (even with a user suggestion).
>
> Today lvm-snapshot and btrfs behave very poorly together: it is not
> predictable which device is picked (the original or the snapshot).
> These patches *avoid* most problems by skipping the snapshots, which
> to me seems a reasonable default. For the other cases the user is
> still able to mount any disk [combination] by passing the devices
> explicitly on the command line
> (mount /dev/sdX -o device=/dev/sdY,device=/dev/sdZ...). Anyway, I
> think that for this kind of setup (btrfs on an lvm-snapshot), passing
> the disks explicitly is the only solution; in fact your suggestion
> about the '-i' switch is not very different.
>
> BR
> G.Baroncelli

Your explanation sounds quite reasonable; it's good enough until any better solution comes along.

Thanks,
Qu

>>> To check whether a device is an LVM snapshot, the udev device
>>> property 'DM_UDEV_LOW_PRIORITY_FLAG' is checked. If it is set to 1,
>>> the device is skipped. As a consequence, btrfs now depends on
>>> libudev. Programmatically you can control this behavior with the
>>> functions:
>>> - btrfs_scan_set_skip_lvm_snapshot(int new_value)
>>> - int btrfs_scan_get_skip_lvm_snapshot()
>>>
>>> Signed-off-by: Goffredo Baroncelli
>>> ---
>>>  Makefile |   4 +--
>>>  utils.c  | 107 +++
>>>  utils.h  |   9 +-
>>>  3 files changed, 117 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/Makefile b/Makefile
>>> index 4cae30c..9464361 100644
>>> --- a/Makefile
>>> +++ b/Makefile
>>> @@ -26,7 +26,7 @@ TESTS = fsck-tests.sh convert-tests.sh
>>>  INSTALL = install
>>>  prefix ?= /usr/local
>>>  bindir = $(prefix)/bin
>>> -lib_LIBS = -luuid -lblkid -lm -lz -llzo2 -L.
>>> +lib_LIBS = -luuid -lblkid -lm -lz -ludev -llzo2 -L.
>>>  libdir ?= $(prefix)/lib
>>>  incdir = $(prefix)/include/btrfs
>>>  LIBS = $(lib_LIBS) $(libs_static)
>>> @@ -99,7 +99,7 @@ lib_links = libbtrfs.so.0 libbtrfs.so
>>>  headers = $(libbtrfs_headers)
>>>
>>>  # make C=1 to enable sparse
>>> -check_defs := .cc-defines.h
>>> +check_defs := .cc-defines.h
>>>  ifdef C
>>>  #
>>>  # We're trying to use sparse against glibc headers which go wild
>>>
>>> diff --git a/utils.c b/utils.c
>>> index 2a92416..9887f8b 100644
>>> --- a/utils.c
>>> +++ b/utils.c
>>> @@ -29,6 +29,7 @@
>>>  #include
>>>  #include
>>>  #include
>>> +#include
>>>  #include
>>>  #include
>>>  #include
>>> @@ -52,6 +53,13 @@
>>>  #define BLKDISCARD _IO(0x12,119)
>>>  #endif
>>>
>>> +/*
>>> + * This variable controls whether lvm snapshots have to be skipped
>>> + * or not. Access this variable only via the
>>> + * btrfs_scan_[sg]et_skip_lvm_snapshot() functions.
>>> + */
>>> +static int __scan_device_skip_lvm_snapshot = -1;
>>> +
>>>  static int btrfs_scan_done = 0;
>>>  static char argv0_buf[ARGV0_BUF_SIZE] = "btrfs";
>>> @@ -1593,6 +1601,9 @@ int btrfs_scan_block_devices(int run_ioctl)
>>>  char fullpath[110];
>>>  int scans = 0;
>>>  int special;
>>> +int skip_snapshot;
>>> +
>>> +skip_snapshot = btrfs_scan_get_skip_lvm_snapshot();
>>>
>>>  scan_again:
>>>  proc_partitions = fopen("/proc/partitions","r");
>>> @@ -1642,6 +1653,9 @@ scan_again:
>>>  continue;
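[Editor's note] The skip policy the patch describes (honor BTRFS_SKIP_LVM_SNAPSHOT unless it is set to "no", and treat a device with udev property DM_UDEV_LOW_PRIORITY_FLAG=1 as an LVM snapshot to skip) can be sketched in a few lines. This is an illustrative Python model of the decision, not the patch's C code; the function name and the property-dictionary shape are mine:

```python
import os

def should_skip_device(udev_props, environ=os.environ):
    """Return True if the device looks like an LVM snapshot and the
    user has not opted out via BTRFS_SKIP_LVM_SNAPSHOT=no."""
    if environ.get("BTRFS_SKIP_LVM_SNAPSHOT") == "no":
        return False  # user explicitly asked to scan snapshots too
    # udev marks snapshot/low-priority device-mapper nodes with this flag
    return udev_props.get("DM_UDEV_LOW_PRIORITY_FLAG") == "1"

# A snapshot device is skipped by default...
print(should_skip_device({"DM_UDEV_LOW_PRIORITY_FLAG": "1"}, environ={}))
# ...but scanned when the user opts out of skipping.
print(should_skip_device({"DM_UDEV_LOW_PRIORITY_FLAG": "1"},
                         environ={"BTRFS_SKIP_LVM_SNAPSHOT": "no"}))
```

Note how the environment variable only widens the scan; the default is the conservative one, matching the thread's argument that skipping is the safer fallback.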
Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots
On 12/08/2014 02:38 PM, Konstantin wrote:
> For more important systems there are high-availability solutions which
> alleviate many of the problems you mention, but that's not the point
> here when speaking about the major bug in BTRFS which can make your
> system crash.

I think you missed the part where I told you that you could use GRUB2, and then you could use the 1.2 metadata on your raid, and then have your system work as desired. Trying to make this all about BTRFS is more than a touch disingenuous, as you are doing things that can make many systems fail in exactly the same way. Undefined behavior is undefined.

The MDADM people made the later metadata layouts to address your issue, and it's up to you to use them. Need it to boot? GRUB2 will boot it, and it's up to you to use it. New software fixes problems evident in the old, but once you decide to stick with the old despite the new, your problem becomes uninteresting because it was already fixed.

So yes, if you use the woefully out-of-date metadata and boot loader, you will have problems. If you use the distro scripts that scan the volumes you don't want scanned, you will have problems. People are working on making sure that those problems have workarounds. And sometimes the workaround for "doctor, it hurts when I do this" is "don't do that any more".

It is multiplicatively impossible to build BTRFS such that it can dance through the entire Cartesian product of all possible storage management solutions. Just as it was impossible for LVM and MDADM before it. If your system is layered, _you_ bear the burden of making sure that the layers are applied. Each tool is evolving to help you, but it's still you doing the system design. GRUB has put in modules for everything you need (so far) to boot. mdadm has better signatures if you use them. LVM always had device offsets built into its metadata block.

But answering the negative, the sort of question that might be phrased "how do you know it's _not_ mdadm old-style signatures", is an unbounded coding task: not because any one case is impossible to code, but because an endless stream of possibilities is coming down the pipe. A striped storage controller might make a system look like /dev/sdb is a stand-alone BTRFS file system if the controller doesn't start and the mdadm and lvm signatures are on /dev/sda and take up "just the right amount of room". If I do a mkfs.ext2 on a medium, then do a cryptsetup luksFormat on that same medium, I can mount it either way, with disastrous consequences for the other semantic layout. The bad combinations available are virtually limitless.

There comes a point where the System Architect who decided how to build the individual system has to take responsibility for his actions. Note that the same "it didn't protect me" errors can happen _easily_ with other filesystems. Try building an NTFS on a disk, then build an ext4 on the same disk, then mount as either or both. (Though nowadays you may need to build the ext4 and then the NTFS, since I think mkfs.ext4 may now have a little dedicated wiper to de-NTFS a disk after that went sour a few too many times.)

When storage signatures conflict you will get "exciting" outcomes. It will always be that way, and it's not an "error" in any of that filesystem code. You, the System Architect, bear a burden here. The system isn't shooting "itself" when you do certain things; the System Architect is shooting the system with a bad-layout bullet. You don't want some LV to be scanned? Don't scan it. If your tools scan it automatically, don't use those tools that way. "But my distro automatically..." is just a reason to look twice at your distro or your design.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
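[Editor's note] The ext2-plus-cryptsetup ambiguity described in the message above is easy to make concrete. The offsets below are the real ones (ext2/3/4 keep the 0xEF53 magic at byte 1080, i.e. offset 56 of the superblock at 1024; LUKS1 puts its 6-byte magic at byte 0), but the probe itself is a toy sketch, nowhere near the heuristics a tool like blkid applies:

```python
import struct

EXT_MAGIC_OFFSET = 1080          # 1024 (superblock) + 56 (s_magic)
LUKS_MAGIC = b"LUKS\xba\xbe"     # LUKS1 header magic at byte 0

def probe_signatures(image: bytes):
    """Return every filesystem signature found -- conflicts included."""
    found = []
    if image[:6] == LUKS_MAGIC:
        found.append("luks")
    if len(image) >= EXT_MAGIC_OFFSET + 2 and \
       struct.unpack_from("<H", image, EXT_MAGIC_OFFSET)[0] == 0xEF53:
        found.append("ext")
    return found

# A device that was mkfs.ext2'ed and then luksFormat'ed can carry both
# signatures at once -- exactly the ambiguity described above: which
# one a tool picks is a policy decision, not something the on-disk
# data can settle.
disk = bytearray(4096)
disk[0:6] = LUKS_MAGIC
struct.pack_into("<H", disk, EXT_MAGIC_OFFSET, 0xEF53)
print(probe_signatures(bytes(disk)))
```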
Re: Why is the actual disk usage of btrfs considered unknowable?
On Mon, Dec 08, 2014 at 03:47:23PM +0100, Martin Steigerwald wrote:
> On Sunday, 7 December 2014, 21:32:01 Robert White wrote:
> > On 12/07/2014 07:40 AM, Martin Steigerwald wrote:
> > Almost full filesystems are their own reward.
>
> So you basically say that BTRFS with compression does not meet the
> fallocate guarantee. Now that's interesting, because it basically
> violates the documentation for the system call:
>
> DESCRIPTION
>     The function posix_fallocate() ensures that disk space is allo-
>     cated for the file referred to by the descriptor fd for the bytes
>     in the range starting at offset and continuing for len bytes.
>     After a successful call to posix_fallocate(), subsequent writes
>     to bytes in the specified range are guaranteed not to fail
>     because of lack of disk space.
>
> So in order to be standard-compliant there, BTRFS would need to write
> fallocated files uncompressed... wow, this is getting complex.

...and nodatacow and no snapshots, since those require more space than was ever anticipated by fallocate. Given the choice, I'd just let fallocate fail. Usually when I come across a program using fallocate, I end up patching it so it doesn't use fallocate any more.
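[Editor's note] The contract under discussion is easy to exercise from user space; Python exposes the same call as os.posix_fallocate. On a filesystem that honors the guarantee, the whole range is allocated up front and later writes into it cannot hit ENOSPC; whether btrfs with compression or snapshots can truly promise that is exactly the dispute above:

```python
import os
import tempfile

# Reserve 1 MiB up front. Per POSIX, after this returns successfully,
# writes anywhere in [0, 1 MiB) must not fail for lack of disk space.
with tempfile.NamedTemporaryFile() as f:
    os.posix_fallocate(f.fileno(), 0, 1 << 20)
    # The file size reflects the reservation immediately, even though
    # nothing has been written yet.
    print(os.fstat(f.fileno()).st_size)  # 1048576
```

On btrfs, the thread's point is that compression, CoW, and snapshots can later need more space than this reservation accounted for, which is why honoring the letter of the guarantee gets complicated.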
Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots
Robert White wrote on 08.12.2014 at 18:20:
> On 12/07/2014 04:32 PM, Konstantin wrote:
>> I know this and I'm using 0.9 on purpose. I need to boot from these
>> disks so I can't use the 1.2 format, as the BIOS wouldn't recognize
>> the partitions. Having an additional non-RAID disk for booting
>> introduces a single point of failure, which is contrary to the idea
>> of RAID>0.
>
> GRUB2 has raid 1.1 and 1.2 metadata support via the mdraid1x module.
> LVM is also supported. I don't know if a stack of both is supported.
>
> There is, BTW, no such thing as a (commodity) computer without a
> single point of failure in it somewhere. I've watched government
> contracts chase this demon for decades. Be it disk, controller,
> network card, bus chip, CPU or stick of RAM, you've got a single
> point of failure somewhere. Actually you likely have several such
> points of potential failure.
>
> For instance, are you _sure_ your BIOS is going to check the second
> drive if it gets a read failure after starting in on your first
> drive? Chances are it won't, because that four-hundred-bytes-or-so
> boot loader on that first disk has no way to branch back into the
> BIOS.
>
> You can waste a lot of your life chasing that ghost and you'll still
> discover you've missed it and have to whip out your backup boot
> media.
>
> It may well be worth having a second copy of /boot around, but make
> sure you stay out of bandersnatch territory when designing your
> system. "The more you over-think the plumbing, the easier it is to
> stop up the pipes."

You are right, there is almost always a single point of failure somewhere, even if it is the power plant providing your electricity ;-). I should have written "introduces an additional single point of failure" to be 100% correct, but I thought this was obvious. As I have replaced dozens of damaged hard disks but only a few CPUs, RAM modules etc., it is more important for me to reduce the most frequent and easiest-to-solve points of failure.

For more important systems there are high-availability solutions which alleviate many of the problems you mention, but that's not the point here when speaking about the major bug in BTRFS which can make your system crash.
[PATCH][RFC] dm: log writes target
This is my latest attempt at a target for testing power fail and fs consistency. This is based on the idea Zach Brown had where we could just walk through all the operations done to an fs in order to verify we're doing the correct thing. There is a userspace component as well, which can be found here:

https://github.com/josefbacik/log-writes

It is very rough, as I just threw it together to test the various aspects of how you would want to replay a log. Again, I would love feedback on this; I really want to have something that we all think is useful and eventually incorporate it into xfstests.

From: Josef Bacik
Subject: [PATCH] dm: log writes target

This creates a new target that is meant for file system developers to test file system integrity at particular points in the life of a file system. We capture all write requests and their data, and log the requests and the data to a separate device for later replay. There is a userspace utility to do this replay. The idea behind this is to give file system developers a way to verify that the file system is always consistent.

Thanks,

Signed-off-by: Josef Bacik
---
 Documentation/device-mapper/dm-log-writes.txt | 136 +
 drivers/md/Kconfig                            |  16 +
 drivers/md/Makefile                           |   1 +
 drivers/md/dm-log-writes.c                    | 835 ++
 4 files changed, 988 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-log-writes.txt
 create mode 100644 drivers/md/dm-log-writes.c

diff --git a/Documentation/device-mapper/dm-log-writes.txt b/Documentation/device-mapper/dm-log-writes.txt
new file mode 100644
index 000..f3a9fa2
--- /dev/null
+++ b/Documentation/device-mapper/dm-log-writes.txt
@@ -0,0 +1,136 @@
+dm-log-writes
+=============
+
+This target takes 2 devices, one to pass all IO to normally, and one to log all
+of the write operations to. This is intended for file system developers wishing
+to verify the integrity of metadata or data as the file system is written to.
+There is a log_writes_entry written for every WRITE request and the target is
+able to take arbitrary data from userspace to insert into the log. The data
+that is in the WRITE requests is copied into the log to make the replay happen
+exactly as it happened originally.
+
+Log Ordering
+============
+
+We log things in order of completion once we are sure the write is no longer in
+cache. This means that normal WRITE requests are not actually logged until the
+next REQ_FLUSH request. This is to make it easier for userspace to replay the
+log in a way that correlates to what is on disk and not what is in cache, to
+make it easier to detect improper waiting/flushing.
+
+This works by attaching all WRITE requests to a list once the write completes.
+Once we see a REQ_FLUSH request we splice this list onto the request, and once
+the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only
+WRITEs completed at the time of the issue of the REQ_FLUSH are added, in order
+to simulate the worst-case scenario with regard to power failures. Consider the
+following example (W means write, C means complete):
+
+W1,W2,W3,C3,C2,Wflush,C1,Cflush
+
+The log would show the following:
+
+W3,W2,flush,W1
+
+Again this is to simulate what is actually on disk; this allows us to detect
+cases where a power failure at a particular point in time would create an
+inconsistent file system.
+
+Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
+they complete, as those requests will obviously bypass the device cache.
+
+Any REQ_DISCARD requests are treated like WRITE requests. This is because
+otherwise we would have all the DISCARD requests, then the WRITE requests,
+and then the FLUSH request. Consider the following example:
+
+WRITE block 1, DISCARD block 1, FLUSH
+
+If we logged DISCARD when it completed, the replay would look like this:
+
+DISCARD 1, WRITE 1, FLUSH
+
+which isn't quite what happened and wouldn't be caught during the log replay.
+
+Marks
+=====
+
+You can use dmsetup to set an arbitrary mark in a log. For example, say you
+want to fsck a file system after every write, but first you need to replay up
+to the mkfs to make sure we're fsck'ing something reasonable. You would do
+something like this:
+
+mkfs.btrfs -f /dev/mapper/log
+dmsetup message log 0 mark mkfs
+
+This would allow you to replay the log up to the mkfs mark and then replay
+from that point on, doing the fsck check in the interval that you want.
+
+Every log has a mark at the end labeled "log-writes-end".
+
+Userspace component
+===================
+
+There is a userspace tool that will replay the log for you in various ways.
+As of this writing the options are not well documented; they will be in the
+future. It can be found here:
+
+https://github.com/josefbacik/log-writes
+
+Example usage
+=============
+
+Say you want to test fsync on your file system. You would do something like
+this:
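[Editor's note] The Log Ordering rule above (W1,W2,W3,C3,C2,Wflush,C1,Cflush being logged as W3,W2,flush,W1) can be modeled in a few lines. This is my own toy simulation of the documented rule, not code from the target; treating writes completed after the last flush as appended at log end is an assumption on my part:

```python
def simulate_log(events):
    """Model of the dm-log-writes ordering rule.
    'W<n>' = WRITE n issued, 'C<n>' = WRITE n completed,
    'Wflush'/'Cflush' = REQ_FLUSH issued/completed."""
    completed = []   # WRITEs completed but not yet logged
    spliced = []     # list spliced onto the flush when it was issued
    log = []
    for ev in events:
        if ev == "Wflush":
            # only WRITEs already completed ride along with this flush
            spliced, completed = completed, []
        elif ev == "Cflush":
            # flush done: log the captured WRITEs, then the flush itself
            log += spliced + ["flush"]
            spliced = []
        elif ev.startswith("C"):
            completed.append("W" + ev[1:])  # write is out of the cache now
        # a bare 'W<n>' issue is never logged directly
    return log + completed  # writes completed after the last flush

print(simulate_log("W1,W2,W3,C3,C2,Wflush,C1,Cflush".split(",")))
# ['W3', 'W2', 'flush', 'W1'] -- matches the example in the document
```

Note that W1 lands after the flush in the log even though it was issued first: it had not completed when the flush was issued, which is precisely the power-failure window the target is designed to expose.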
Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots
Phillip Susi wrote on 08.12.2014 at 15:59:
> On 12/7/2014 7:32 PM, Konstantin wrote:
>>> I'm guessing you are using metadata format 0.9 or 1.0, which put
>>> the metadata at the end of the drive and the filesystem still
>>> starts in sector zero. 1.2 is now the default and would not have
>>> this problem as its metadata is at the start of the disk (well,
>>> 4k from the start) and the fs starts further down.
>>
>> I know this and I'm using 0.9 on purpose. I need to boot from these
>> disks so I can't use the 1.2 format, as the BIOS wouldn't recognize
>> the partitions. Having an additional non-RAID disk for booting
>> introduces a single point of failure, which is contrary to the idea
>> of RAID>0.
>
> The bios does not know or care about partitions. All you need is a
> partition table in the MBR and you can install grub there and have it
> boot the system from a mdadm 1.1 or 1.2 format array housed in a
> partition on the rest of the disk.

That's only true for older BIOSes. With current EFI boards they not only care, but some also mess around with GPT partition tables. I was thinking of this solution as well, but as I'm not aware of any partitioning tool that cares about mdadm metadata, I rejected it. It requires a non-standard layout leaving reserved empty spaces for the mdadm metadata; it's possible, but it isn't documented as far as I know, and rather than losing hours of trying I chose the obvious way.

> The only time you really *have* to use 0.9 or 1.0 (and you really
> should be using 1.0 instead, since it handles larger arrays and can't
> be confused vis. whole disk vs. partition components) is if you are
> running a raid1 on the raw disk, with no partition table, and then
> partition inside the array instead; and really, you just shouldn't
> be doing that.

That's exactly what I want to do: running RAID1 on the whole disk, as most hardware-based RAID systems do. Before that I was running RAID on disk partitions for some years, but this was quite a pain in comparison. Hot(un)plugging a drive brings you a lot of issues with failing mdadm commands, as they don't like concurrent execution when the same physical device is affected. And the rebuild of RAID partitions is done sequentially, in no deterministic order. We could talk for hours about that, but if you're interested, maybe better in private, as it is not BTRFS-related.

>> Anyway, to avoid a futile discussion, mdraid and its format is not
>> the problem, it is just an example of the problem. Using dm-raid
>> would cause the same trouble; LVM apparently, too. I could think of
>> a bunch of other cases, including the use of hardware-based RAID
>> controllers. OK, it's not the majority's problem, but that's no
>> argument to keep a bug/flaw capable of crashing your system.
>
> dmraid solves the problem by removing the partitions from the
> underlying physical device (/dev/sda), and only exposing them on the
> array (/dev/mapper/whatever). LVM only has the problem when you take
> a snapshot. User space tools face the same issue and they resolve it
> by ignoring or deprioritizing the snapshot.

I don't agree. dmraid and mdraid both remove the partitions; this is not a solution. BTRFS will still crash the PC using /dev/mapper/whatever, or whatever device appears in the system providing the BTRFS volume.

>> As nice a feature as it is that the kernel apparently scans for
>> drives and automatically identifies BTRFS ones, it seems to me that
>> this feature is useless. When in a live system a BTRFS RAID disk
>> fails, it is not sufficient to hot-replace it; the kernel will not
>> automatically rebalance. Commands are still needed for the task, as
>> with mdraid. So the only point I can see at the moment where this
>> auto-detect feature makes sense is when mounting the device for the
>> first time. If I remember the documentation correctly, you mount one
>> of the RAID devices and the others are automagically attached as
>> well. But outside of the mount process, what is this auto-detect
>> used for?
>>
>> So here are a couple of rather simple solutions which, as far as I
>> can see, could solve the problem:
>>
>> 1. Limit the auto-detect to the mount process and don't do it when
>> devices are appearing.
>>
>> 2. When a BTRFS device is detected and its metadata is identical to
>> one already mounted, just ignore it.
>
> That doesn't really solve the problem since you can still pick the
> wrong one to mount in the first place.

Oh, it does solve the problem; you are speaking of another problem, which is always there when you have several disks in a system. Mounting the wrong device can happen in the case I'm describing if you use the UUID, label or some other metadata-related information to mount it. You won't try to do that when you insert a disk you know has the same metadata. It will not happen (unless user tools outsmart you ;-)) when using the device name(s). I think it can be expected from a user mounting things manually to know or learn which device node is which
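[Editor's note] Konstantin's solution 2 amounts to keying a device registry on (fsid, devid) and keeping the first arrival. The sketch below is a toy model of that proposal (the registry shape and function name are mine); it is not how the kernel behaved at the time, which is the bug under discussion:

```python
def register_device(registry, fsid, devid, node):
    """Register a scanned device unless an identical (fsid, devid)
    pair is already known; the later arrival (e.g. a snapshot of an
    already-registered device) is ignored rather than allowed to
    displace the original."""
    key = (fsid, devid)
    if key in registry:
        return False  # duplicate metadata: likely a snapshot, skip it
    registry[key] = node
    return True

devices = {}
# first appearance wins...
print(register_device(devices, "47a6ce34", 1, "/dev/sda"))
# ...a later device with identical metadata (an LVM snapshot, say)
# is ignored instead of hijacking the mounted filesystem.
print(register_device(devices, "47a6ce34", 1, "/dev/dm-3"))
```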
Re: Running out of disk space during BTRFS_IOC_CLONE - rebalance doesn't help
On Sun, Nov 30, 2014 at 2:29 AM, Guenther Starnberger wrote:
> Here's the log output:
>
> dmesg:
>
> [235491.227888] ------------[ cut here ]------------
> [235491.227912] WARNING: CPU: 0 PID: 14837 at fs/btrfs/super.c:259 __btrfs_abort_transaction+0x50/0x110 [btrfs]()
> [235491.227914] BTRFS: Transaction aborted (error -28)
> [235491.227916] Modules linked in: fuse btrfs xor raid6_pq uas usb_storage ctr ccm toshiba_acpi sparse_keymap toshiba_haps joydev hp_accel lis3lv02d input_polldev hdaps(O) btusb bluetooth uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common videodev qcserial media usb_wwan usbserial arc4 iwldvm snd_hda_codec_hdmi mousedev snd_hda_codec_conexant snd_hda_codec_generic mac80211 iTCO_wdt iTCO_vendor_support coretemp intel_powerclamp snd_hda_intel snd_hda_controller snd_hda_codec kvm_intel snd_hwdep iwlwifi thinkpad_acpi mei_me mei cfg80211 snd_pcm nvram lpc_ich kvm evdev snd_timer i915 snd mac_hid ac serio_raw e1000e psmouse led_class wmi rfkill shpchp drm_kms_helper intel_ips i2c_i801 soundcore drm battery hwmon ptp thermal pps_core i2c_algo_bit i2c_core video intel_agp intel_gtt button
> [235491.227968] acpi_cpufreq processor sch_fq_codel tp_smapi(O) thinkpad_ec(O) nfs lockd sunrpc fscache ext4 crc16 mbcache jbd2 algif_skcipher af_alg dm_crypt dm_mod atkbd libps2 crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd ehci_pci ehci_hcd usbcore usb_common i8042 serio ata_piix sd_mod crct10dif_generic crct10dif_pclmul crc_t10dif crct10dif_common ahci libahci ata_generic libata scsi_mod
> [235491.228001] CPU: 0 PID: 14837 Comm: bedup Tainted: G W O 3.17.4-1-ARCH #1
> [235491.228003] Hardware name: LENOVO 3680U4M/3680U4M, BIOS 6QET68WW (1.38 ) 12/01/2011
> [235491.228004] 5deed0d1 880144a57a90 81537b0e
> [235491.228006] 880144a57ad8 880144a57ac8 8107078d ffe4
> [235491.228008] 8801719dcaa0 88009e273800 a09f7630 0c46
> [235491.228010] Call Trace:
> [235491.228017] [] dump_stack+0x4d/0x6f
> [235491.228021] [] warn_slowpath_common+0x7d/0xa0
> [235491.228024] [] warn_slowpath_fmt+0x5c/0x80
> [235491.228029] [] __btrfs_abort_transaction+0x50/0x110 [btrfs]
> [235491.228040] [] clone_finish_inode_update+0xda/0xf0 [btrfs]
> [235491.228046] [] btrfs_clone+0x6ae/0xcc0 [btrfs]
> [235491.228053] [] btrfs_ioctl_clone+0x779/0x7b0 [btrfs]
> [235491.228059] [] btrfs_ioctl+0x10d7/0x2810 [btrfs]
> [235491.228063] [] ? free_pages_and_swap_cache+0xb9/0xe0
> [235491.228066] [] ? tlb_flush_mmu_free+0x2c/0x50
> [235491.228068] [] ? tlb_finish_mmu+0x4d/0x50
> [235491.228070] [] ? unmap_region+0xe2/0x130
> [235491.228073] [] ? kmem_cache_free+0x199/0x1d0
> [235491.228075] [] do_vfs_ioctl+0x2d0/0x4b0
> [235491.228076] [] ? do_munmap+0x260/0x400
> [235491.228078] [] SyS_ioctl+0x81/0xa0
> [235491.228081] [] system_call_fastpath+0x16/0x1b
> [235491.228082] ---[ end trace 636d52c4c1dff6bc ]---

I'm seeing a near-exact stack trace. I'm running Syncthing. When making changes to a file, all of the machines that should receive the change appear to show the old file. An ls -l on the receiving machines indicates the file has zero links. Attempting to rename the file causes the filesystem to go read-only and produce the dmesg below. Upon reboot, the file displays the correct contents. I'm running Arch Linux with kernel 3.17.6. I'm seeing this error on four machines and can reproduce it consistently.

[ 184.546231] WARNING: CPU: 3 PID: 2529 at fs/btrfs/super.c:259 __btrfs_abort_transaction+0x50/0x110 [btrfs]()
[ 184.546267] BTRFS: Transaction aborted (error -2)
[ 184.546270] Modules linked in: md5 ecb ecryptfs cbc sha256_ssse3 sha256_generic encrypted_keys sha1_ssse3 sha1_generic hmac trusted joydev nf_conntrack_irc nf_conntrack_ftp xt_NFLOG nfnetlink_log nfnetlink xt_limit xt_helper pci_stub vboxpci(O) vboxnetflt(O) vboxnetadp(O) vboxdrv(O) nvidia_uvm(PO) nvidia(PO) ipt_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack dell_wmi sparse_keymap snd_hda_codec_hdmi ip6table_filter ip6_tables iptable_filter ip_tables x_tables snd_hda_codec_idt snd_hda_codec_generic iTCO_wdt iTCO_vendor_support arc4 brcmsmac cordic brcmutil b43 mac80211 cfg80211 ssb rng_core pcmcia pcmcia_core ppdev dell_laptop rfkill dcdbas nls_iso8859_1 coretemp hwmon intel_rapl nls_cp437 x86_pkg_temp_thermal vfat intel_powerclamp fat
[ 184.546363] kvm_intel kvm uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common mousedev videodev psmouse media serio_raw i2c_i801 parport_pc snd_hda_intel snd_hda_controller snd_hda_codec tpm_tis snd_hwdep bcma wmi tpm lpc_ich snd_pcm parport e1000e dell_smo8800 evdev battery snd_timer mac_hid ac snd mei_me thermal mei ptp shpchp pps_core soundcore processor sch_fq_codel btrfs
Re: [PATCH V2][BTRFS-PROGS] Don't use LVM snapshot device
On 12/08/2014 09:36 AM, Goffredo Baroncelli wrote:
> I like this approach, but as I wrote before, it seems that initramfs
> executes a "btrfs dev scan" (see my previous email 'Re: PROBLEM: #89121
> BTRFS mixes up mounted devices with their snapshots', dated 2014/12/03 9:34):

Roll your own for now. I haven't started doing any significant btrfs-specific work on it, but the initramfs builder in my project http://underdog.sourceforge.net might get you past your problem pretty easily for now. It would be easy to white/black-list which devices you want to submit to scan or mount. It is plumbed up to look at each storage region one by one, so you could assemble your file system that way. (Note that the eventual point of the project isn't really the initramfs stuff, but that's what I needed more/first.) It's fairly well documented and I use it for some non-trivially complex systems, but it's not (yet) so complex that it's hard to design hooks for it.

-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2][BTRFS-PROGS] Don't use LVM snapshot device
On 12/8/2014 12:36 PM, Goffredo Baroncelli wrote:
> I like this approach, but as I wrote before, it seems that
> initramfs executes a "btrfs dev scan" (see my previous email
> 'Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their
> snapshots', dated 2014/12/03 9:34):
>
> $ grep -r "btrfs dev.*scan" /usr/share/initramfs-tools/
> /usr/share/initramfs-tools/scripts/local-premount/btrfs:
> /sbin/btrfs device scan 2> /dev/null
>
> (this is from a Debian system). However, it has to be pointed out that
> Fedora doesn't seem to do the same...

Need to fix that initramfs script then. On the other hand, if one *does* run a scan with no arguments, then it probably is a good idea to ignore snapshots.
Re: [PATCH V2][BTRFS-PROGS] Don't use LVM snapshot device
On 12/08/2014 04:30 PM, Phillip Susi wrote:
> On 12/4/2014 1:39 PM, Goffredo Baroncelli wrote:
[...]
>> To check whether a device is an LVM snapshot, the 'udev' device
>> property 'DM_UDEV_LOW_PRIORITY_FLAG' is checked. If it is set to 1,
>> the device has to be skipped.
>>
>> As a consequence, btrfs now also depends on libudev.
>
> Rather than modify btrfs device scan to link to libudev and ignore the
> caller when commanded to scan a snapshot, wouldn't it be
> simpler/better to just fix the udev rule to not *call* btrfs device
> scan on the snapshot?

I like this approach, but as I wrote before, it seems that initramfs executes a "btrfs dev scan" (see my previous email 'Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots', dated 2014/12/03 9:34):

$ grep -r "btrfs dev.*scan" /usr/share/initramfs-tools/
/usr/share/initramfs-tools/scripts/local-premount/btrfs:/sbin/btrfs device scan 2> /dev/null

(this is from a Debian system). However, it has to be pointed out that Fedora doesn't seem to do the same... I have to investigate a bit.

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
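The rule-side fix Phillip suggests could look roughly like the following sketch. The property name DM_UDEV_LOW_PRIORITY_FLAG comes from the patch under discussion; the file path, label names, and exact match conditions here are illustrative assumptions, not the actual distribution rules:

```
# /etc/udev/rules.d/64-btrfs-no-snapshots.rules (hypothetical file)
# Skip devices that device-mapper marks as low priority (e.g. LVM
# snapshots) before handing anything to btrfs device scan.
SUBSYSTEM!="block", GOTO="btrfs_scan_end"
ENV{DM_UDEV_LOW_PRIORITY_FLAG}=="1", GOTO="btrfs_scan_end"
ENV{ID_FS_TYPE}=="btrfs", RUN+="/sbin/btrfs device scan $devnode"
LABEL="btrfs_scan_end"
```

Note this only covers scans triggered by udev events; as pointed out above, an initramfs script that runs a bare "btrfs device scan" bypasses the rule entirely, so that script would need the same filtering.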
Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots
On 12/07/2014 04:32 PM, Konstantin wrote:
> I know this and I'm using 0.9 on purpose. I need to boot from these disks,
> so I can't use the 1.2 format, as the BIOS wouldn't recognize the
> partitions. Having an additional non-RAID disk for booting introduces a
> single point of failure, which is contrary to the idea of RAID>0.

GRUB2 has mdraid 1.1 and 1.2 metadata support via the mdraid1x module. LVM is also supported. I don't know if a stack of both is supported.

There is, BTW, no such thing as a (commodity) computer without a single point of failure in it somewhere. I've watched government contracts chase this demon for decades. Be it disk, controller, network card, bus chip, CPU, or stick of RAM, you've got a single point of failure somewhere. Actually, you likely have several such points of potential failure. For instance, are you _sure_ your BIOS is going to check the second drive if it gets a read failure after starting in on your first drive? Chances are it won't, because that four-hundred-byte-or-so boot loader on the first disk has no way to branch back into the BIOS.

You can waste a lot of your life chasing that ghost and you'll still discover you've missed it and have to whip out your backup boot media. It may well be worth having a second copy of /boot around, but make sure you stay out of bandersnatch territory when designing your system. "The more you over-think the plumbing, the easier it is to stop up the pipes."

-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Why is the actual disk usage of btrfs considered unknowable?
On Monday, 8 December 2014, 09:57:50, Austin S Hemmelgarn wrote:
> On 2014-12-08 09:47, Martin Steigerwald wrote:
> > Hi,
> >
> > On Sunday, 7 December 2014, 21:32:01, Robert White wrote:
> >> On 12/07/2014 07:40 AM, Martin Steigerwald wrote:
> >>> Well, what would be possible, I bet, would be a kind of system call
> >>> like this:
> >>>
> >>> "I need to write 5 GB of data in 100 files to /opt/mynewshinysoftware;
> >>> can I do it, *and* give me a guarantee that I can."
> >>>
> >>> So, a more flexible fallocate approach, as fallocate just allocates one
> >>> file and you would need to run it for all files you intend to create.
> >>> The challenge would be to estimate the metadata allocation accurately
> >>> beforehand.
> >>>
> >>> Or have tar --fallocate -xf, which for all files in the archive would
> >>> first call fallocate and, only if that succeeded, actually write them.
> >>> But due to the nature of tar archives, with their content listing spread
> >>> across the whole archive, this means it may have to read the tar archive
> >>> twice, so ZIP archives might be better suited for that.
> >>
> >> What you suggest is Still Not Practical™ (the tar thing might have some
> >> viability if you were willing to analyze every file to the byte level).
> >>
> >> Compression _can_ make a file _bigger_ than its base size. BTRFS decides
> >> whether or not to compress a file based on the results it gets when
> >> trying to compress the first N bytes. (I do not know the value of N.) But
> >> it is _easy_ to have a file where the first N bytes compress well but
> >> the bytes after N take up more space than their byte count. So to
> >> fallocate() the right size in blocks you'd have to compress the input,
> >> determine what BTRFS _would_ _do_, and then allocate that much space
> >> instead of the file size.
> >>
> >> And even then, if you didn't create all the names and directories, you
> >> might find that the RB-tree had to expand (allocate another tree node)
> >> one or more times to accommodate the actual files. Lather, rinse, repeat
> >> for any checksum trees and anything hitting a flush barrier because of
> >> commit= or sync() events, or other writers perturbing your results,
> >> because it only matters if the filesystem is nearly full, and nearly full
> >> filesystems may not be quiescent at all.
> >>
> >> So while the core problem isn't insoluble, in real life it is _not_
> >> _worth_ _solving_.
> >>
> >> On a nearly empty filesystem, it's going to fit.
> >>
> >> On a reasonably empty filesystem, it's going to fit.
> >>
> >> On a nearly full filesystem, it may or may not fit.
> >>
> >> On a filesystem that is so close to full that you have reason to doubt
> >> it will fit, you are going to have a very bad time even if it fits.
> >>
> >> If you did manage to invent and implement an fallocate algorithm that
> >> could make this promise and make it stick, then some other running
> >> program is what's going to crash when you use up that last byte anyway.
> >>
> >> Almost full filesystems are their own reward.
> >
> > So you are basically saying that BTRFS with compression does not meet the
> > fallocate guarantee. Now that's interesting, because it basically violates
> > the documentation for the system call:
> >
> > DESCRIPTION
> >     The function posix_fallocate() ensures that disk space is allocated
> >     for the file referred to by the descriptor fd for the bytes in the
> >     range starting at offset and continuing for len bytes. After a
> >     successful call to posix_fallocate(), subsequent writes to bytes in
> >     the specified range are guaranteed not to fail because of lack of
> >     disk space.
> >
> > So in order to be standards-compliant there, BTRFS would need to write
> > fallocated files uncompressed… wow, this is getting complex.
> The other option would be to allocate based on the worst-case size
> increase for the compression algorithm (which works out to about 5%
> IIRC for zlib, and a bit more for lzo), and then possibly discard the
> unwritten extents at some later point.

Now that seems like a workable solution.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
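Austin's "about 5%" recollection can be sanity-checked against zlib itself. A small sketch: the bound function below mirrors the spirit of zlib's deflateBound() for default settings (which is in fact far tighter than 5%), and incompressible input demonstrates that compression really can make data bigger:

```python
import os
import zlib

def worst_case_zlib(n):
    # Conservative upper bound on deflate output for n input bytes, in
    # the spirit of zlib's deflateBound() for default settings: a few
    # bytes of block overhead plus a constant header/trailer.
    return n + (n >> 12) + (n >> 14) + (n >> 25) + 13

payload = os.urandom(64 * 1024)        # incompressible input
compressed = zlib.compress(payload)

assert len(compressed) > len(payload)                   # it grew
assert len(compressed) <= worst_case_zlib(len(payload)) # but stays bounded
```

So reserving the worst-case size up front, as suggested above, costs well under 1% for zlib-style deflate; the "discard the unwritten extents later" step is where the real complexity would live.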
Re: [PATCH V2][BTRFS-PROGS] Don't use LVM snapshot device
On 12/4/2014 1:39 PM, Goffredo Baroncelli wrote:
> LVM snapshots are a problem for btrfs device management. BTRFS
> assumes that each device has a unique 'device UUID'. An LVM
> snapshot breaks this assumption.
>
> This causes a lot of problems if some btrfs devices are snapshotted:
> - the set of devices for a btrfs multi-volume filesystem may be mixed
>   (i.e. some NON-snapshotted devices with some snapshotted devices)
> - /proc/mounts may return a wrong device.
>
> Some posts on the mailing list reported these incidents.
>
> This patch allows btrfs to skip LVM snapshots during the device scan
> phase.
>
> But if you need to consider an LVM snapshot, you can set the
> environment variable BTRFS_SKIP_LVM_SNAPSHOT to "no". In this case
> the old behavior is applied.
>
> To check whether a device is an LVM snapshot, the 'udev' device
> property 'DM_UDEV_LOW_PRIORITY_FLAG' is checked. If it is set to 1,
> the device has to be skipped.
>
> As a consequence, btrfs now also depends on libudev.

Rather than modify btrfs device scan to link to libudev and ignore the caller when commanded to scan a snapshot, wouldn't it be simpler/better to just fix the udev rule to not *call* btrfs device scan on the snapshot?
Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots
On 12/7/2014 7:32 PM, Konstantin wrote:
>> I'm guessing you are using metadata format 0.9 or 1.0, which put
>> the metadata at the end of the drive, and the filesystem still
>> starts in sector zero. 1.2 is now the default and would not have
>> this problem, as its metadata is at the start of the disk (well,
>> 4k from the start) and the fs starts further down.
> I know this and I'm using 0.9 on purpose. I need to boot from
> these disks, so I can't use the 1.2 format, as the BIOS wouldn't
> recognize the partitions. Having an additional non-RAID disk for
> booting introduces a single point of failure, which is contrary to
> the idea of RAID>0.

The BIOS does not know or care about partitions. All you need is a partition table in the MBR; you can install grub there and have it boot the system from an mdadm 1.1 or 1.2 format array housed in a partition on the rest of the disk. The only time you really *have* to use 0.9 or 1.0 (and you really should be using 1.0 instead, since it handles larger arrays and can't be confused about whole-disk vs. partition components) is if you are running a raid1 on the raw disk, with no partition table, and then partitioning inside the array instead, and really, you just shouldn't be doing that.

> Anyway, to avoid a futile discussion, mdraid and its format is not
> the problem; it is just an example of the problem. Using dm-raid
> would cause the same trouble, and LVM apparently does, too. I could
> think of a bunch of other cases, including the use of hardware-based
> RAID controllers. OK, it's not the majority's problem, but that's not
> an argument to keep a bug/flaw capable of crashing your system.

dmraid solves the problem by removing the partitions from the underlying physical device (/dev/sda) and only exposing them on the array (/dev/mapper/whatever). LVM only has the problem when you take a snapshot. User-space tools face the same issue, and they resolve it by ignoring or deprioritizing the snapshot.
> As nice a feature as it is that the kernel apparently scans for drives
> and automatically identifies BTRFS ones, it seems to me that this
> feature is useless. When, in a live system, a BTRFS RAID disk fails,
> it is not sufficient to hot-replace it; the kernel will not
> automatically rebalance. Commands are still needed for the task, as
> they are with mdraid. So the only point I can see at the moment where
> this auto-detect feature makes sense is when mounting the device
> for the first time. If I remember the documentation correctly, you
> mount one of the RAID devices and the others are automagically
> attached as well. But outside of the mount process, what is this
> auto-detect used for?
>
> So here are a couple of rather simple solutions which, as far as I can
> see, could solve the problem:
>
> 1. Limit the auto-detect to the mount process and don't do it when
> devices are appearing.
>
> 2. When a BTRFS device is detected and its metadata is identical to
> one already mounted, just ignore it.

That doesn't really solve the problem, since you can still pick the wrong one to mount in the first place.
Re: Why is the actual disk usage of btrfs considered unknowable?
On 2014-12-08 09:47, Martin Steigerwald wrote:
> Hi,
>
> On Sunday, 7 December 2014, 21:32:01, Robert White wrote:
>> On 12/07/2014 07:40 AM, Martin Steigerwald wrote:
>>> Well, what would be possible, I bet, would be a kind of system call
>>> like this:
>>>
>>> "I need to write 5 GB of data in 100 files to /opt/mynewshinysoftware;
>>> can I do it, *and* give me a guarantee that I can."
>>>
>>> So, a more flexible fallocate approach, as fallocate just allocates one
>>> file and you would need to run it for all files you intend to create.
>>> The challenge would be to estimate the metadata allocation accurately
>>> beforehand.
>>>
>>> Or have tar --fallocate -xf, which for all files in the archive would
>>> first call fallocate and, only if that succeeded, actually write them.
>>> But due to the nature of tar archives, with their content listing spread
>>> across the whole archive, this means it may have to read the tar archive
>>> twice, so ZIP archives might be better suited for that.
>>
>> What you suggest is Still Not Practical™ (the tar thing might have some
>> viability if you were willing to analyze every file to the byte level).
>>
>> Compression _can_ make a file _bigger_ than its base size. BTRFS decides
>> whether or not to compress a file based on the results it gets when
>> trying to compress the first N bytes. (I do not know the value of N.) But
>> it is _easy_ to have a file where the first N bytes compress well but
>> the bytes after N take up more space than their byte count. So to
>> fallocate() the right size in blocks you'd have to compress the input,
>> determine what BTRFS _would_ _do_, and then allocate that much space
>> instead of the file size.
>>
>> And even then, if you didn't create all the names and directories, you
>> might find that the RB-tree had to expand (allocate another tree node)
>> one or more times to accommodate the actual files.
>> Lather, rinse, repeat for any checksum trees and anything hitting a
>> flush barrier because of commit= or sync() events, or other writers
>> perturbing your results, because it only matters if the filesystem is
>> nearly full, and nearly full filesystems may not be quiescent at all.
>>
>> So while the core problem isn't insoluble, in real life it is _not_
>> _worth_ _solving_.
>>
>> On a nearly empty filesystem, it's going to fit.
>>
>> On a reasonably empty filesystem, it's going to fit.
>>
>> On a nearly full filesystem, it may or may not fit.
>>
>> On a filesystem that is so close to full that you have reason to doubt
>> it will fit, you are going to have a very bad time even if it fits.
>>
>> If you did manage to invent and implement an fallocate algorithm that
>> could make this promise and make it stick, then some other running
>> program is what's going to crash when you use up that last byte anyway.
>>
>> Almost full filesystems are their own reward.
>
> So you are basically saying that BTRFS with compression does not meet the
> fallocate guarantee. Now that's interesting, because it basically violates
> the documentation for the system call:
>
> DESCRIPTION
>     The function posix_fallocate() ensures that disk space is allocated
>     for the file referred to by the descriptor fd for the bytes in the
>     range starting at offset and continuing for len bytes. After a
>     successful call to posix_fallocate(), subsequent writes to bytes in
>     the specified range are guaranteed not to fail because of lack of
>     disk space.
>
> So in order to be standards-compliant there, BTRFS would need to write
> fallocated files uncompressed… wow, this is getting complex.

The other option would be to allocate based on the worst-case size increase for the compression algorithm (which works out to about 5% IIRC for zlib, and a bit more for lzo) and then possibly discard the unwritten extents at some later point.
Re: [PATCH 1/5] Avoid to consider lvm snapshots when scanning devices.
On 12/08/2014 03:02 AM, Qu Wenruo wrote:
> -------- Original Message --------
> Subject: [PATCH 1/5] Avoid to consider lvm snapshots when scanning devices.
> From: Goffredo Baroncelli
> To:
> Date: 2014-12-05 02:39
>> LVM snapshots create a problem for btrfs device management.
>> BTRFS assumes that each device has a unique 'device UUID'.
>> An LVM snapshot breaks this assumption.
>>
>> This patch skips LVM snapshots during the device scan phase.
>> If you need to consider an LVM snapshot, you have to set the
>> environment variable BTRFS_SKIP_LVM_SNAPSHOT to "no".
> IMHO, it is better to skip an LVM snapshot if and only if the snapshot
> contains a btrfs with a conflicting UUID.

Hi Qu,

Currently the "scan phase" in btrfs is done one device at a time (udev finds a new device and starts "btrfs dev scan "), and I haven't changed that. This means that btrfs[-progs] doesn't know which devices are already registered, or not. And even if it knew this information, you would have to consider the case where the snapshot appears before the real target. So btrfs[-progs] is not in a position to perform this analysis [see my other comment below].

> Since LVM is such a flexible block-level volume management layer, it is
> possible that someone did a snapshot of a btrfs fs and then reformatted
> the original one to another fs. In that case, the LVM snapshot skip
> seems overkill.
>
> Also, personally, I would prefer some option like -i to allow the user
> to choose which device is used when a conflicting UUID is detected.
> This seems to be the best case, and the user can have full control over
> the device scan. This also makes the environment variable unnecessary.
>
> LVM snapshot skip (only when conflicting) is better as the fallback
> behavior if -i is not given.

I understand your reasons, but I don't find any solution compatible with the current btrfs device registration model (asynchronous, when the device appears).
In another patch set, I proposed a mount.btrfs helper which is in a position to perform this analysis and pick the "right" device (even with a user suggestion).

Today lvm-snapshot and btrfs together behave very poorly: it is not predictable which device is picked (the original or the snapshot). These patches *avoid* most problems by skipping the snapshots, which to me seems a reasonable default. For the other cases, the user is still able to mount any disk [combination] by passing them directly via the command line (mount /dev/sdX -o device=/dev/sdY,device=/dev/sdZ...). Anyway, I think for this kind of setup (btrfs on an lvm-snapshot), passing the disks explicitly is the only solution; in fact, your suggestion about the '-i' switch is not very different.

> Thanks,
> Qu

BR
G.Baroncelli

>>
>> To check whether a device is an LVM snapshot, the 'udev'
>> device property 'DM_UDEV_LOW_PRIORITY_FLAG' is checked.
>> If it is set to 1, the device has to be skipped.
>>
>> As a consequence, btrfs now depends on libudev.
>>
>> Programmatically you can control this behavior with the functions:
>> - btrfs_scan_set_skip_lvm_snapshot(int new_value)
>> - int btrfs_scan_get_skip_lvm_snapshot( )
>>
>> Signed-off-by: Goffredo Baroncelli
>> ---
>> Makefile | 4 +--
>> utils.c | 107 +++
>> utils.h | 9 +-
>> 3 files changed, 117 insertions(+), 3 deletions(-)
>>
>> diff --git a/Makefile b/Makefile
>> index 4cae30c..9464361 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -26,7 +26,7 @@ TESTS = fsck-tests.sh convert-tests.sh
>> INSTALL = install
>> prefix ?= /usr/local
>> bindir = $(prefix)/bin
>> -lib_LIBS = -luuid -lblkid -lm -lz -llzo2 -L.
>> +lib_LIBS = -luuid -lblkid -lm -lz -ludev -llzo2 -L.
>> libdir ?= $(prefix)/lib
>> incdir = $(prefix)/include/btrfs
>> LIBS = $(lib_LIBS) $(libs_static)
>> @@ -99,7 +99,7 @@ lib_links = libbtrfs.so.0 libbtrfs.so
>> headers = $(libbtrfs_headers)
>> # make C=1 to enable sparse
>> -check_defs := .cc-defines.h
>> +check_defs := .cc-defines.h
>> ifdef C
>> #
>> # We're trying to use sparse against glibc headers which go wild
>> diff --git a/utils.c b/utils.c
>> index 2a92416..9887f8b 100644
>> --- a/utils.c
>> +++ b/utils.c
>> @@ -29,6 +29,7 @@
>> #include
>> #include
>> #include
>> +#include
>> #include
>> #include
>> #include
>> @@ -52,6 +53,13 @@
>> #define BLKDISCARD _IO(0x12,119)
>> #endif
>> +/*
>> + * This variable controls whether lvm snapshots have to be skipped or not.
>> + * Access this variable only via the btrfs_scan_[sg]et_skip_lvm_snapshot()
>> + * functions
>> + */
>> +static int __scan_device_skip_lvm_snapshot = -1;
>> +
>> static int btrfs_scan_done = 0;
>> static char argv0_buf[ARGV0_BUF_SIZE] = "btrfs";
>> @@ -1593,6 +1601,9 @@ int btrfs_scan_block_devices(int run_ioctl)
>> char fullpath[110];
>> int scans = 0;
>> int special;
>> +int skip_snapshot;
>> +
>> +skip_snapshot = btrfs_scan_get_skip_lvm_snapshot();
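The decision logic the patch adds can be modeled in a few lines. This is a sketch of the policy only (the real code is C and reads the property through libudev); the function name and dict-based property lookup are illustrative assumptions:

```python
import os

def should_skip_device(udev_props, env=os.environ):
    """Model of the patch's scan-time filter: skip a device when udev
    marks it low priority (an LVM snapshot), unless the user opted out
    via BTRFS_SKIP_LVM_SNAPSHOT=no."""
    if env.get("BTRFS_SKIP_LVM_SNAPSHOT", "yes").lower() == "no":
        return False  # old behavior: scan everything, snapshots included
    return udev_props.get("DM_UDEV_LOW_PRIORITY_FLAG") == "1"

# A snapshot is skipped by default...
assert should_skip_device({"DM_UDEV_LOW_PRIORITY_FLAG": "1"}, env={})
# ...an ordinary device is not...
assert not should_skip_device({}, env={})
# ...and the environment variable restores the old behavior.
assert not should_skip_device({"DM_UDEV_LOW_PRIORITY_FLAG": "1"},
                              env={"BTRFS_SKIP_LVM_SNAPSHOT": "no"})
```

As Goffredo notes above, this filter runs per device with no global view, which is exactly why a conflict-aware policy like Qu's would need a different registration model.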
Re: Possible to undo subvol delete?
On 2014-12-08 09:16, Shriramana Sharma wrote:
> On Mon, Dec 8, 2014 at 6:31 PM, Austin S Hemmelgarn wrote:
>> Personally, I prefer a somewhat hybrid approach where everyone has *sbin
>> in their path, but file permissions are used to control what
>> non-administrators can run.
> This is exactly the same approach as Ubuntu, since a non-superuser can't
> really do anything active (whether creating or deleting) with */sbin
> commands, but only querying (like ifconfig, btrfs subvol list, etc.). So
> this is not really a hybrid of anything, it seems.

IIRC, Ubuntu relies on the fact that normal users don't have the capabilities required for the privileged operations, as opposed to just not letting them run the binaries at all.
Re: Why is the actual disk usage of btrfs considered unknowable?
Hi,

On Sunday, 7 December 2014, 21:32:01, Robert White wrote:
> On 12/07/2014 07:40 AM, Martin Steigerwald wrote:
> > Well, what would be possible, I bet, would be a kind of system call
> > like this:
> >
> > "I need to write 5 GB of data in 100 files to /opt/mynewshinysoftware;
> > can I do it, *and* give me a guarantee that I can."
> >
> > So, a more flexible fallocate approach, as fallocate just allocates one
> > file and you would need to run it for all files you intend to create.
> > The challenge would be to estimate the metadata allocation accurately
> > beforehand.
> >
> > Or have tar --fallocate -xf, which for all files in the archive would
> > first call fallocate and, only if that succeeded, actually write them.
> > But due to the nature of tar archives, with their content listing spread
> > across the whole archive, this means it may have to read the tar archive
> > twice, so ZIP archives might be better suited for that.
>
> What you suggest is Still Not Practical™ (the tar thing might have some
> viability if you were willing to analyze every file to the byte level).
>
> Compression _can_ make a file _bigger_ than its base size. BTRFS decides
> whether or not to compress a file based on the results it gets when
> trying to compress the first N bytes. (I do not know the value of N.) But
> it is _easy_ to have a file where the first N bytes compress well but
> the bytes after N take up more space than their byte count. So to
> fallocate() the right size in blocks you'd have to compress the input,
> determine what BTRFS _would_ _do_, and then allocate that much space
> instead of the file size.
>
> And even then, if you didn't create all the names and directories, you
> might find that the RB-tree had to expand (allocate another tree node)
> one or more times to accommodate the actual files.
> Lather, rinse, repeat for any checksum trees and anything hitting a
> flush barrier because of commit= or sync() events, or other writers
> perturbing your results, because it only matters if the filesystem is
> nearly full, and nearly full filesystems may not be quiescent at all.
>
> So while the core problem isn't insoluble, in real life it is _not_
> _worth_ _solving_.
>
> On a nearly empty filesystem, it's going to fit.
>
> On a reasonably empty filesystem, it's going to fit.
>
> On a nearly full filesystem, it may or may not fit.
>
> On a filesystem that is so close to full that you have reason to doubt
> it will fit, you are going to have a very bad time even if it fits.
>
> If you did manage to invent and implement an fallocate algorithm that
> could make this promise and make it stick, then some other running
> program is what's going to crash when you use up that last byte anyway.
>
> Almost full filesystems are their own reward.

So you are basically saying that BTRFS with compression does not meet the fallocate guarantee. Now that's interesting, because it basically violates the documentation for the system call:

DESCRIPTION
    The function posix_fallocate() ensures that disk space is allocated
    for the file referred to by the descriptor fd for the bytes in the
    range starting at offset and continuing for len bytes. After a
    successful call to posix_fallocate(), subsequent writes to bytes in
    the specified range are guaranteed not to fail because of lack of
    disk space.

So in order to be standards-compliant there, BTRFS would need to write fallocated files uncompressed… wow, this is getting complex.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
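The call being quoted is available from Python as os.posix_fallocate() on Linux. A minimal sketch exercising the reservation side of the guarantee (the thread's point is the other half, the no-ENOSPC promise for later writes, which is what btrfs compression makes hard to keep):

```python
import os
import tempfile

with tempfile.NamedTemporaryFile() as f:
    # Reserve 1 MiB starting at offset 0; per POSIX this also extends
    # the file size to cover the reserved range.
    os.posix_fallocate(f.fileno(), 0, 1 << 20)
    size = os.fstat(f.fileno()).st_size

assert size == 1 << 20  # the file now spans the reserved 1 MiB
```

Whether the reserved blocks can actually absorb 1 MiB of writes without ENOSPC is filesystem-dependent, which is exactly the compliance question raised above.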
Re: Why is the actual disk usage of btrfs considered unknowable?
On 12/08/2014 01:12 AM, ashf...@whisperpc.com wrote:
> Goffredo,
>
>> So in case you have a raid1 filesystem on two disks, each disk with 300GB
>> free: which is the free space that you expect, 300GB or 600GB, and why?
>
> You should see 300GB free. That's what you'll see with RAID-1 with a
> hardware RAID controller, and with MD RAID. Why would you expect to see
> anything else with BTRFS RAID?

I had to ask you because in a previous email of yours you stated something different:

On 12/07/2014 09:32 PM, ashf...@whisperpc.com wrote:
> I disagree. My experiences with other file-systems, including ZFS, show
> that the most common solution is to just deliver to the user the actual
> amount of *unused disk space*
                ^^^

So I expected you to answer 600GB. But you have told the truth: the user wants to know how much data he is able to store on the disk, not the unused disk space.

But I have to point out that the common case is a one-disk filesystem where the metadata chunks have a ratio of data stored to disk space consumed of 1:2, while the data chunks have a ratio of 1:1. This is one reason why it is difficult to evaluate the free space: if you had only metadata chunks, you would have to halve the disk space. Another reason is that there is the idea of allowing different raid profiles in the same filesystem. This will further complicate free space evaluation.

> Peter Ashford

G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
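Goffredo's point about copy ratios can be made concrete with a toy estimator. This is a sketch under a big simplifying assumption, that all remaining space will be allocated to a single profile; real btrfs accounting is chunk-based and mixes data and metadata profiles, which is exactly why the real answer is hard:

```python
def estimate_storable(unallocated_per_disk, copies):
    # With a replication profile that stores `copies` copies of each
    # byte, raw unallocated space is divided by the copy count:
    # raid1 data = 2 copies, single data = 1, DUP metadata = 2.
    raw = sum(unallocated_per_disk)
    return raw // copies

# Two disks with 300 GB of raw unallocated space each:
assert estimate_storable([300, 300], copies=2) == 300  # raid1: ~300 GB usable
assert estimate_storable([300, 300], copies=1) == 600  # single: ~600 GB usable
```

As soon as metadata (1:2) and data (1:1) chunks share the remaining space, or profiles can differ per chunk, the single `copies` divisor no longer exists, and the "free space" number becomes an estimate at best.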
Re: Possible to undo subvol delete?
On Mon, Dec 8, 2014 at 6:31 PM, Austin S Hemmelgarn wrote:
> Personally, I prefer a somewhat hybrid approach where everyone has *sbin
> in their path, but file permissions are used to control what
> non-administrators can run.

This is exactly the same approach as Ubuntu, since a non-superuser can't really do anything active (whether creating or deleting) with */sbin commands, but only querying (like ifconfig, btrfs subvol list, etc.). So this is not really a hybrid of anything, it seems.

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा
Re: [PATCH v2] Btrfs: fix fs corruption on transaction abort if device supports discard
On Mon, Dec 8, 2014 at 1:53 PM, Chris Mason wrote:
> On Sun, Dec 7, 2014 at 4:31 PM, Filipe Manana wrote:
>>
>> When we abort a transaction we iterate over all the ranges marked as dirty
>> in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
>> from those trees, add them back (unpin) to the free space caches and, if
>> the fs was mounted with "-o discard", perform a discard on those regions.
>> Also, after adding the regions to the free space caches, a fitrim ioctl
>> call can see those ranges in a block group's free space cache and perform
>> a discard on the ranges, so the same issue can happen without "-o discard"
>> as well.
>>
>> This causes corruption, affecting one or multiple btree nodes (in the
>> worst case leaving the fs unmountable) because some of those ranges (the
>> ones in the fs_info->pinned_extents tree) correspond to btree nodes/leafs
>> that are referred by the last committed super block - breaking the rule
>> that anything that was committed by a transaction is untouched until the
>> next transaction commits successfully.
>
> This is great work Filipe, thank you!
>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 7e2405a..ce65b0c 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -4134,12 +4134,6 @@ again:
>>  		if (ret)
>>  			break;
>>
>> -		/* opt_discard */
>> -		if (btrfs_test_opt(root, DISCARD))
>> -			ret = btrfs_error_discard_extent(root, start,
>> -							 end + 1 - start,
>> -							 NULL);
>> -
>
> While you're here, can you please just delete btrfs_error_discard_extent
> and use btrfs_discard_extent directly? It's already being used in
> non-error cases, and since it only discards, I don't see how we want to
> do that on errors anyway.

Agreed, the function doesn't make sense and its name is confusing since
it's just an alias to btrfs_discard_extent().
I've sent a separate cleanup patch to remove it:
https://patchwork.kernel.org/patch/5456261/

thanks

> -chris
>
>>  		clear_extent_dirty(unpin, start, end, GFP_NOFS);
>>  		btrfs_error_unpin_extent_range(root, start, end);
>>  		cond_resched();
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index c2fc261..fc74a9b 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -5728,7 +5728,8 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
>>  	update_global_block_rsv(fs_info);
>>  }
>>
>> -static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
>> +static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end,
>> +			      const bool return_free_space)
>>  {
>>  	struct btrfs_fs_info *fs_info = root->fs_info;
>>  	struct btrfs_block_group_cache *cache = NULL;
>> @@ -5752,7 +5753,8 @@ static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
>>
>>  		if (start < cache->last_byte_to_unpin) {
>>  			len = min(len, cache->last_byte_to_unpin - start);
>> -			btrfs_add_free_space(cache, start, len);
>> +			if (return_free_space)
>> +				btrfs_add_free_space(cache, start, len);
>>  		}
>>
>>  		start += len;
>> @@ -5816,7 +5818,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
>>  					      end + 1 - start, NULL);
>>
>>  		clear_extent_dirty(unpin, start, end, GFP_NOFS);
>> -		unpin_extent_range(root, start, end);
>> +		unpin_extent_range(root, start, end, true);
>>  		cond_resched();
>>  	}
>>
>> @@ -9705,7 +9707,7 @@ out:
>>
>>  int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
>>  {
>> -	return unpin_extent_range(root, start, end);
>> +	return unpin_extent_range(root, start, end, false);
>>  }
>>
>>  int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
>> -- 
>> 2.1.3

-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
[PATCH] Btrfs: remove non-sense btrfs_error_discard_extent() function
It doesn't do anything special, it just calls btrfs_discard_extent(), so
just remove it.

Signed-off-by: Filipe Manana
---
 fs/btrfs/ctree.h            |  4 ++--
 fs/btrfs/extent-tree.c      | 10 ++--------
 fs/btrfs/free-space-cache.c |  4 ++--
 3 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c7e5f2a..399a4e0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3429,8 +3429,8 @@ void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
 u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
 int btrfs_error_unpin_extent_range(struct btrfs_root *root,
				   u64 start, u64 end);
-int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
-			       u64 num_bytes, u64 *actual_bytes);
+int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
+			 u64 num_bytes, u64 *actual_bytes);
 int btrfs_force_chunk_alloc(struct btrfs_trans_handle *trans,
			    struct btrfs_root *root, u64 type);
 int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fc74a9b..18b63acc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1891,8 +1891,8 @@ static int btrfs_issue_discard(struct block_device *bdev,
	return blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
 }
 
-static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
-				u64 num_bytes, u64 *actual_bytes)
+int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
+			 u64 num_bytes, u64 *actual_bytes)
 {
	int ret;
	u64 discarded_bytes = 0;
@@ -9710,12 +9710,6 @@ int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
	return unpin_extent_range(root, start, end, false);
 }
 
-int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
-			       u64 num_bytes, u64 *actual_bytes)
-{
-	return btrfs_discard_extent(root, bytenr, num_bytes, actual_bytes);
-}
-
 int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range)
 {
	struct btrfs_fs_info *fs_info = root->fs_info;
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index edf32c5..d6c03f7 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2966,8 +2966,8 @@ static int do_trimming(struct btrfs_block_group_cache *block_group,
	spin_unlock(&block_group->lock);
	spin_unlock(&space_info->lock);
 
-	ret = btrfs_error_discard_extent(fs_info->extent_root,
-					 start, bytes, &trimmed);
+	ret = btrfs_discard_extent(fs_info->extent_root,
+				   start, bytes, &trimmed);
 
	if (!ret)
		*total_trimmed += trimmed;
-- 
2.1.3
Re: [PATCH v2] Btrfs: fix fs corruption on transaction abort if device supports discard
On Sun, Dec 7, 2014 at 4:31 PM, Filipe Manana wrote:
> When we abort a transaction we iterate over all the ranges marked as dirty
> in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
> from those trees, add them back (unpin) to the free space caches and, if
> the fs was mounted with "-o discard", perform a discard on those regions.
> Also, after adding the regions to the free space caches, a fitrim ioctl
> call can see those ranges in a block group's free space cache and perform
> a discard on the ranges, so the same issue can happen without "-o discard"
> as well.
>
> This causes corruption, affecting one or multiple btree nodes (in the
> worst case leaving the fs unmountable) because some of those ranges (the
> ones in the fs_info->pinned_extents tree) correspond to btree nodes/leafs
> that are referred by the last committed super block - breaking the rule
> that anything that was committed by a transaction is untouched until the
> next transaction commits successfully.

This is great work Filipe, thank you!

> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 7e2405a..ce65b0c 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4134,12 +4134,6 @@ again:
>  		if (ret)
>  			break;
>
> -		/* opt_discard */
> -		if (btrfs_test_opt(root, DISCARD))
> -			ret = btrfs_error_discard_extent(root, start,
> -							 end + 1 - start,
> -							 NULL);
> -

While you're here, can you please just delete btrfs_error_discard_extent
and use btrfs_discard_extent directly? It's already being used in
non-error cases, and since it only discards, I don't see how we want to do
that on errors anyway.
-chris

>  		clear_extent_dirty(unpin, start, end, GFP_NOFS);
>  		btrfs_error_unpin_extent_range(root, start, end);
>  		cond_resched();
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c2fc261..fc74a9b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5728,7 +5728,8 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
>  	update_global_block_rsv(fs_info);
>  }
>
> -static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
> +static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end,
> +			      const bool return_free_space)
>  {
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  	struct btrfs_block_group_cache *cache = NULL;
> @@ -5752,7 +5753,8 @@ static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
>
>  		if (start < cache->last_byte_to_unpin) {
>  			len = min(len, cache->last_byte_to_unpin - start);
> -			btrfs_add_free_space(cache, start, len);
> +			if (return_free_space)
> +				btrfs_add_free_space(cache, start, len);
>  		}
>
>  		start += len;
> @@ -5816,7 +5818,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
>  					      end + 1 - start, NULL);
>
>  		clear_extent_dirty(unpin, start, end, GFP_NOFS);
> -		unpin_extent_range(root, start, end);
> +		unpin_extent_range(root, start, end, true);
>  		cond_resched();
>  	}
>
> @@ -9705,7 +9707,7 @@ out:
>
>  int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
>  {
> -	return unpin_extent_range(root, start, end);
> +	return unpin_extent_range(root, start, end, false);
>  }
>
>  int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
> -- 
> 2.1.3
Re: Possible to undo subvol delete?
On 2014-12-05 13:11, Shriramana Sharma wrote:
> OK so from https://forums.opensuse.org/showthread.php/440209-ifconfig I
> learnt that it's because /sbin, /usr/sbin etc. is not on the normal user's
> path on openSUSE (they are, on Kubuntu). Adding them to PATH fixes the
> situation. (I wasn't even able to do ifconfig without giving the password.
> No idea why this is the openSUSE default...)

Probably because OpenSUSE/SLES are designed as enterprise distributions,
and their primary use case is having a very small number of sysadmins and
a potentially large number of normal users. Ubuntu et al. are designed
primarily for PCs, where everyone is assumed to be equivalent to an
administrator.

Personally, I prefer a somewhat hybrid approach where everyone has *sbin
in their path, but file permissions are used to control what
non-administrators can run.
Re: Why is the actual disk usage of btrfs considered unknowable?
On Sun, Dec 7, 2014 at 1:32 PM, wrote:
>
> I disagree. My experiences with other file-systems, including ZFS, show
> that the most common solution is to just deliver to the user the actual
> amount of unused disk space. Anything else changes this known value into
> a guess or prediction.

What is the "actual amount of unused disk space" in a 2x 8GB drive mirror?
Very literally, it's 16GB. Subtracting the space used for replication (the
n mirror copies, or parity) is a convenience. This is in fact how df
reported Btrfs volumes with kernel 3.16 and older.

A ZFS mirror vdev doesn't work this way; it reports the available space as
8GB. The level of replication and number of devices is a function of the
vdev, and is fixed. It can't be changed.

With Btrfs there isn't a zpool vs. vdev type of distinction, and the
replication level isn't a function of the volume but rather of individual
chunks. At some future point there will be a way to supply a hint (per
subvolume, maybe per directory or per file) for the allocator to put a
file in a particular chunk which has a particular level of replication and
number of devices. And that means "available space" isn't knowable.

-- 
Chris Murphy
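The two reporting conventions Chris Murphy contrasts can be sketched side by side. This is a hypothetical Python illustration of the arithmetic only, not how btrfs or ZFS actually computes anything:

```python
def raw_free(per_device_free_gb):
    """Literal unused disk space: sum the free bytes on every device
    (the style df used for Btrfs on kernel 3.16 and older)."""
    return sum(per_device_free_gb)

def mirror_free(per_device_free_gb, copies=2):
    """ZFS-mirror style: divide out the fixed replication factor of the
    mirror, reporting only space usable for file data."""
    return sum(per_device_free_gb) / copies

mirror = [8, 8]  # two 8GB drives in a mirror, both empty

print(raw_free(mirror))     # 16 -> the "actual unused disk space"
print(mirror_free(mirror))  # 8.0 -> what a ZFS mirror vdev would report
```

The second number is only well-defined because a mirror vdev's replication factor is fixed; with per-chunk profiles, as in Btrfs, there is no single `copies` value to divide by.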
Re: Why is the actual disk usage of btrfs considered unknowable?
> Original Message
> Subject: Re: Why is the actual disk usage of btrfs considered unknowable?
> From:
> To:
> Date: 2014-12-08 08:12
>> Goffredo,
>>
>>> So in case you have a raid1 filesystem on two disks; each disk has 300GB
>>> free; which is the free space that you expected: 300GB or 600GB and why?
>>
>> You should see 300GB free. That's what you'll see with RAID-1 with a
>> hardware RAID controller, and with MD RAID. Why would you expect to see
>> anything else with BTRFS RAID?
>>
>> Peter Ashford
>
> Yeah, you pointed out the real problem here:
>
> [DIFFERENT RESULT FROM DIFFERENT VIEW]
> Seen from *PURE ON-DISK* usage, it is still 600G, no matter what level of
> RAID. Seen from *BLOCK LEVEL RAID1* usage, it is 300G. If a fs (not
> btrfs) is built on block-level RAID1, then the *FILESYSTEM* usage will
> also be 300G.
>
> [BTRFS DOES NOT BELONG TO ANY TYPE]
> But btrfs is neither pure block-level management (that would be MD or HW
> RAID or LVM), nor a traditional filesystem!

For the purposes of reporting free space, it is reasonable to assume that
the default structure will be used. If the default for the volume or
subvolume is RAID-1, then that should be used for 'df' output. Obviously,
the same should be done for other RAID levels.

> So the root of the problem is that btrfs mixes the roles of block-level
> management and filesystem-level management, which makes everything hard
> to understand. You can't treat btrfs raid1 as a complete block-level
> raid1, due to its flexibility in having different metadata/data profiles.

It will have the same discrepancies as other file-systems with
compression, plus a few more of its own, due to chunking. If the
file-system can't give a completely accurate answer, it should give one
that makes sense.

> If the vanilla df command shows filesystem-level free space, then btrfs
> won't give an accurate one.
> [ONLY PREDICTABLE CASE]
> For the 300Gx2 case for btrfs, you can consider it 300G of free space
> only if you can ensure that there was/is/will be only RAID1 data/metadata
> stored on it (you also need to ignore small space usage from CoW).

I disagree. You can consider the RAID structure to be whatever the default
structure is. If the default is RAID-1, then that structure should be used
to compute the free space for 'df'. The user should understand that by
explicitly requesting a different RAID structure, different amounts of
space will be used.

> [RELIABLE DATA IS ON-DISK USAGE]
> Only pure on-disk level usage is *a little* reliable. There is still a
> problem with unbalanced metadata/data chunk allocation (e.g., all space
> is allocated for data, leaving no space for metadata CoW writes).

I agree. Unused disk space isn't always available to be used by data.
Sometimes it's reserved for metadata of one sort or another, and sometimes
it's too small to be of use. In addition, BTRFS sometimes (with small
files) uses the metadata chunks for data.

Yes, it's a complex problem. There is no simple solution that will make
everyone happy.

As for the 'df' output, I believe that the default should be the sum of
free space in data chunks, free space in metadata chunks and unallocated
space, ignoring any amounts that are small enough that BTRFS won't use
them, and adjusted for the RAID level of the volume/subvolume. While it's
possible to generate other values that make sense for specific cases, it's
not possible to create one value that is correct in all cases.

If it's not possible to be absolutely correct, considering every usage (or
even the most common usages), a 'reasonable' value should be returned.
That reasonable value should be based on the default volume/subvolume
settings, including RAID levels and any space limits that may exist on the
volume or subvolume. It should be neither the most optimistic nor the most
pessimistic.
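Peter's proposed 'df' value can be sketched roughly as follows. This is hypothetical Python; the function name, the toy numbers, and the assumption that chunk totals are already expressed in usable (profile-adjusted) bytes are all illustrative, and it deliberately ignores the small-remnant and reserved-space caveats discussed above:

```python
def proposed_df_free(data_total, data_used, meta_total, meta_used,
                     unallocated_raw, replication=2.0):
    """Proposed 'df' value: unused room inside existing data and metadata
    chunks (assumed already expressed in usable bytes), plus raw
    unallocated space divided by the default profile's replication
    factor (2.0 for RAID-1)."""
    in_data = data_total - data_used
    in_meta = meta_total - meta_used
    return in_data + in_meta + unallocated_raw / replication

# Toy numbers in GiB for a RAID-1 volume with slack everywhere.
print(proposed_df_free(data_total=2375, data_used=2109,
                       meta_total=287, meta_used=159,
                       unallocated_raw=620))  # 704.0
```

The sketch makes the trade-off visible: the chunk terms are exact for space already allocated, while the unallocated term is only a prediction based on the default profile, which is exactly why the result is "reasonable" rather than correct in all cases.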
Peter Ashford