cannot delete subvolumes

2014-12-08 Thread Florian Stuelpner

Hi all,

Our specifications are:
- Debian 7.7
- kernel version: 3.17.4

Two weeks ago we started a btrfs balance. Some days later we stopped the 
balance. The messages log shows some errors (see below).


While balancing we removed some snapshots; these operations failed with 
error messages.


For example: ERROR: error accessing 
'/mnt/btrfs/snapshots/home/2014-10-23--20-31-18-@daily'


Now there are some subvolumes without an owner, last-modified date, etc.

Additionally the ls command returns an error:
stuelpner@hsad-srv-03:/boot$ ls -l /mnt/btrfs/snapshots/home/
ls: cannot access /mnt/btrfs/snapshots/home/2014-10-23--20-31-18-@daily: 
Cannot allocate memory

total 0
drwxr-xr-x 1 root root 952 May 20 2014 2014-06-30--18-49-09-@monthly
drwxr-xr-x 1 root root 952 May 20 2014 2014-07-01--02-52-01-@monthly
drwxr-xr-x 1 root root 952 May 20 2014 2014-08-01--02-52-01-@monthly
drwxr-xr-x 1 root root 990 Aug 28 07:00 2014-09-01--02-52-01-@monthly
drwxr-xr-x 1 root root 990 Aug 28 07:00 2014-10-01--02-52-02-@monthly
d? ? ? ? ? ? 2014-10-23--20-31-18-@daily
drwxr-xr-x 1 root root 996 Oct 21 12:38 2014-11-01--02-52-01-@monthly

These subvolumes can be neither accessed nor deleted:

~# btrfs subvolume
delete /mnt/btrfs/snapshots/transfer/2014-10-21--20-50-44-\@daily
Transaction commit: none (default)
ERROR: error accessing
'/mnt/btrfs/snapshots/transfer/2014-10-21--20-50-44-@daily'

Filesystem info:

~# btrfs fi show
Label: 'btrfs-data' uuid: 47a6ce34-6b63-4202-a7da-c1f6dbe48676
Total devices 4 FS bytes used 2.22TiB
devid 1 size 2.73TiB used 2.42TiB path /dev/sda
devid 2 size 2.73TiB used 2.42TiB path /dev/sdb
devid 3 size 2.73TiB used 186.00GiB path /dev/sdc
devid 4 size 2.73TiB used 186.00GiB path /dev/sdd
Label: 'spectral-data' uuid: aaaba9e0-7710-4295-88b1-c0ee9bd2eff8
Total devices 2 FS bytes used 640.00KiB
devid 1 size 238.47GiB used 2.03GiB path /dev/sdg
devid 2 size 238.47GiB used 2.01GiB path /dev/sdh
Btrfs v3.12

root@hsad-srv-03:~# btrfs fi df /mnt/btrfs
Data, RAID1: total=2.32TiB, used=2.06TiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=8.00MiB, used=400.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=287.00GiB, used=159.31GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

message.log:
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323328] btrfs D 
88083ed92dc0 0 7313 7296 0x
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323335] 88068d816310 
0082 88083ae971d0 0086
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323340] 00012dc0 
00012dc0 88068d816310 88058028ffd8
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323345] 88083ec12dc0 
7fff 7fff 0002

Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323351] Call Trace:
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323364] [] ? 
console_conditional_schedule+0xf/0xf
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323372] [] ? 
schedule_timeout+0x1c/0xf7
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323378] [] ? 
__queue_work+0x1ef/0x23d
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323384] [] ? 
__wait_for_common+0x120/0x158
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323390] [] ? 
try_to_wake_up+0x1c7/0x1c7
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323441] [] ? 
btrfs_async_run_delayed_refs+0xbd/0xda [btrfs]
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323475] [] ? 
__btrfs_end_transaction+0x2b3/0x2d6 [btrfs]
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323512] [] ? 
relocate_block_group+0x2a1/0x4cd [btrfs]
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323548] [] ? 
btrfs_relocate_block_group+0x14f/0x267 [btrfs]
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323584] [] ? 
btrfs_relocate_chunk.isra.58+0x58/0x5e2 [btrfs]
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323609] [] ? 
btrfs_item_key_to_cpu+0x12/0x30 [btrfs]
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323645] [] ? 
btrfs_get_token_64+0x76/0xc6 [btrfs]
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323681] [] ? 
release_extent_buffer+0x9d/0xa4 [btrfs]
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323717] [] ? 
btrfs_balance+0x9b0/0xb9d [btrfs]
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323752] [] ? 
btrfs_ioctl_balance+0x21a/0x297 [btrfs]
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323788] [] ? 
btrfs_ioctl+0x1134/0x20e5 [btrfs]
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323794] [] ? 
path_openat+0x233/0x4c5
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323801] [] ? 
__do_page_fault+0x339/0x3df

Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323807] [] ? vma_link+0x6b/0x8a
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323813] [] ? 
do_vfs_ioctl+0x3ed/0x436

Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323819] [] ? SyS_ioctl+0x49/0x77
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323824] [] ? 
page_fault+0x22/0x30
Nov 24 01:22:44 hsad-srv-03 kernel: [112256.323829] [] ? 
system_call_fastpath+0x16/0x1b


Do you have any ideas? Can you help us, please?

Regards

Re: [PATCH 1/5] Avoid to consider lvm snapshots when scanning devices.

2014-12-08 Thread Qu Wenruo

Hi Goffredo

On 12/08/2014 03:02 AM, Qu Wenruo wrote:

 Original Message 
Subject: [PATCH 1/5] Avoid to consider lvm snapshots when scanning devices.
From: Goffredo Baroncelli 
To: 
Date: 2014-12-05 02:39

LVM snapshots create a problem for btrfs device management.
BTRFS assumes that each device has a unique 'device UUID'.
An LVM snapshot breaks this assumption.

This patch skips LVM snapshots during the device scan phase.
If you need to consider an LVM snapshot you have to set the
environment variable BTRFS_SKIP_LVM_SNAPSHOT to "no".

IMHO, it is better to skip an LVM snapshot if and only if the snapshot
contains a btrfs filesystem with a conflicting UUID.

Hi Qu,

Currently the "scan phase" in btrfs is done one device at a time
(udev finds a new device and starts "btrfs dev scan "),
and I haven't changed that. This means that btrfs[-progs] doesn't
know which devices are already registered and which are not. And even if
it knew this information, you have to consider the case
where the snapshot appears before the real target. So
btrfs[-progs] is not in a position to perform this analysis [see
my other comment below]


Since LVM is such a flexible block-level volume manager, it is possible that
someone took a snapshot of a btrfs fs and then reformatted the original one to another fs.
In that case, skipping the LVM snapshot seems like overkill.

Also, personally, I would prefer an option like -i to allow the user to
choose which device is used when a conflicting UUID is detected. This seems
to be the best approach, and the user would have full control over the device
scan. It would also make the environment variable unnecessary.

Skipping LVM snapshots (only when conflicting) would be better as the fallback
behavior if -i is not given.

I understand your reasons, but I can't find any solution
compatible with the current btrfs device registration model
(asynchronous, when the device appears).

In another patch set, I proposed a mount.btrfs helper which is
in a position to perform this analysis and to pick the "right"
device (even with a user suggestion).

Today lvm-snapshot and btrfs behave very poorly: it is not
predictable which device is picked (the original or the snapshot).
This patch *avoids* most problems by skipping the snapshots, which
to me seems a reasonable default.
For the other cases the user is still able to mount any disk
[combination] by passing them directly on the command line (
mount /dev/sdX -o device=/dev/sdY,device=/dev/sdz...  );

Anyway, I think for this kind of setup (btrfs on lvm-snapshot),
passing the disks explicitly is the only solution; in fact your
suggestion about the '-i' switch is not very different.


Thanks,
Qu

BR
G.Baroncelli
Your explanation sounds quite reasonable; it's good enough until there is a
better solution.


Thanks,
Qu

To check whether a device is an LVM snapshot, the 'udev'
device property 'DM_UDEV_LOW_PRIORITY_FLAG' is checked.
If it is set to 1, the device has to be skipped.

As a consequence, btrfs now depends on libudev.

Programmatically you can control this behavior with the functions:
- btrfs_scan_set_skip_lvm_snapshot(int new_value)
- int btrfs_scan_get_skip_lvm_snapshot( )

Signed-off-by: Goffredo Baroncelli 
---
   Makefile |   4 +--
   utils.c  | 107 +++
   utils.h  |   9 +-
   3 files changed, 117 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index 4cae30c..9464361 100644
--- a/Makefile
+++ b/Makefile
@@ -26,7 +26,7 @@ TESTS = fsck-tests.sh convert-tests.sh
   INSTALL = install
   prefix ?= /usr/local
   bindir = $(prefix)/bin
-lib_LIBS = -luuid -lblkid -lm -lz -llzo2 -L.
+lib_LIBS = -luuid -lblkid -lm -lz -ludev -llzo2 -L.
   libdir ?= $(prefix)/lib
   incdir = $(prefix)/include/btrfs
   LIBS = $(lib_LIBS) $(libs_static)
@@ -99,7 +99,7 @@ lib_links = libbtrfs.so.0 libbtrfs.so
   headers = $(libbtrfs_headers)
 # make C=1 to enable sparse
-check_defs := .cc-defines.h
+check_defs := .cc-defines.h
   ifdef C
   #
   # We're trying to use sparse against glibc headers which go wild
diff --git a/utils.c b/utils.c
index 2a92416..9887f8b 100644
--- a/utils.c
+++ b/utils.c
@@ -29,6 +29,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 
   #include 
@@ -52,6 +53,13 @@
   #define BLKDISCARD_IO(0x12,119)
   #endif
   +/*
+ * This variable controls if the lvm snapshot have to be skipped or not.
+ * Access this variable only via the btrfs_scan_[sg]et_skip_lvm_snapshot()
+ * functions
+ */
+static int __scan_device_skip_lvm_snapshot = -1;
+
   static int btrfs_scan_done = 0;
 static char argv0_buf[ARGV0_BUF_SIZE] = "btrfs";
@@ -1593,6 +1601,9 @@ int btrfs_scan_block_devices(int run_ioctl)
   char fullpath[110];
   int scans = 0;
   int special;
+int skip_snapshot;
+
+skip_snapshot = btrfs_scan_get_skip_lvm_snapshot();
 scan_again:
   proc_partitions = fopen("/proc/partitions","r");
@@ -1642,6 +1653,9 @@ scan_again:
   continue;
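
The heart of the change is a single udev property lookup. As a rough,
self-contained illustration of that check (this is not the patch's code;
the helper name and error handling here are assumptions), something like
the following would do, linked with -ludev:

#include <libudev.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <string.h>

/* Returns 1 if the block device should be skipped (udev marked it
 * DM_UDEV_LOW_PRIORITY_FLAG=1, e.g. an LVM snapshot), 0 if not,
 * -1 on error. */
static int device_is_lvm_snapshot(const char *devnode)
{
	struct stat st;
	struct udev *udev;
	struct udev_device *dev;
	const char *flag;
	int ret = -1;

	if (stat(devnode, &st) < 0 || !S_ISBLK(st.st_mode))
		return -1;

	udev = udev_new();
	if (!udev)
		return -1;

	dev = udev_device_new_from_devnum(udev, 'b', st.st_rdev);
	if (dev) {
		flag = udev_device_get_property_value(dev,
				"DM_UDEV_LOW_PRIORITY_FLAG");
		ret = (flag && strcmp(flag, "1") == 0) ? 1 : 0;
		udev_device_unref(dev);
	}
	udev_unref(udev);
	return ret;
}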

Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-08 Thread Robert White

On 12/08/2014 02:38 PM, Konstantin wrote:

For more important systems there are high availability
solutions which alleviate many of the problems you mention, but that's 
not the point here when speaking about the major bug in BTRFS which can
make your system crash.


I think you missed the part where I told you that you could use GRUB2 
and then you could use the 1.2 metadata on your raid and then have your 
system work as desired.


Trying to make this all about BTRFS is more than a touch disingenuous as 
you are doing things that can make many systems fail in exactly the same 
way.


Undefined behavior is undefined.

The MDADM people made the later metadata layouts to address your issue, 
and it's up to you to use them. If you need it to boot, GRUB2 will boot it, 
and it's up to you to use it.


New software fixes problems evident in the old, but once you decide to 
stick with the old despite the new, your problem becomes uninteresting 
because it was already fixed.


So yes, if you use the woefully out-of-date metadata and boot loader you 
will have problems. If you use the distro scripts that scan the volumes 
you don't want scanned, you will have problems. People are working on 
making sure that those problems have workarounds. And sometimes the 
workaround for "doctor, it hurts when I do this" is "don't do that any 
more".


It is multiplicatively impossible to build BTRFS such that it can dance 
through the entire Cartesian Product of all possible storage management 
solutions. Just as it was impossible for LVM and MDADM before them. If 
your system is layered, _you_ bear the burden of making sure that the 
layers are applied. Each tool is evolving to help you, but it's still you 
doing the system design.


GRUB has put in modules for everything you need (so far) to boot. mdadm 
has better signatures if you use them. LVM always had device offsets 
built into its metadata block.


But answering the negative, the sort of question that might be phrased 
"how do you know it's _not_ an old-style mdadm signature", is an unbounded 
coding problem, not because any one case is impossible to code, but because an 
endless stream of possibilities is coming down the pipe. A striped storage 
controller might make a system look like /dev/sdb is a stand-alone BTRFS 
file system if the controller doesn't start and the mdadm and lvm 
signatures are on /dev/sda and take up "just the right amount of room".


If I do an mkfs.ext2 on a medium, then do a cryptsetup luksFormat on that 
same medium, I can mount it either way, with disastrous consequences for 
the other semantic layout.


The bad combinations available are virtually limitless.

There comes a point where the System Architect that decided how to build 
the individual system has to take responsibility for his actions.


Note that the same "it didn't protect me" errors can happen _easily_ 
with other filesystems. Try building an NTFS on a disk, then build an 
ext4 on the same disk, then mount it as either or both. (Though nowadays you 
may need to build the ext4 then the NTFS, since I think mkfs.ext4 may now 
have a little dedicated wiper to de-NTFS a disk after that went sour a 
few too many times.)


When storage signatures conflict you will get "exciting" outcomes. It 
will always be that way, and it's not an "error" in any of that 
filesystem code. You, the System Architect, bear a burden here.


The system isn't shooting "itself" when you do certain things. The 
System Architect is shooting the system with a bad layout bullet.


You don't want some LV to be scanned... don't scan it... If your tools 
scan it automatically, don't use those tools that way. "But my distro 
automatically" is just a reason to look twice at your distro or your design.





Re: Why is the actual disk usage of btrfs considered unknowable?

2014-12-08 Thread Zygo Blaxell
On Mon, Dec 08, 2014 at 03:47:23PM +0100, Martin Steigerwald wrote:
> On Sunday, 7 December 2014, 21:32:01, Robert White wrote:
> > On 12/07/2014 07:40 AM, Martin Steigerwald wrote:
> > Almost full filesystems are their own reward.
> 
> So you basically say that BTRFS with compression does not meet the fallocate 
> guarantee. Now that's interesting, because it basically violates the 
> documentation for the system call:
> 
> DESCRIPTION
>The function posix_fallocate() ensures that disk space  is  allo‐
>cated for the file referred to by the descriptor fd for the bytes
>in the range starting at offset and  continuing  for  len  bytes.
>After  a  successful call to posix_fallocate(), subsequent writes
>to bytes in the  specified  range  are  guaranteed  not  to  fail
>because of lack of disk space.
> 
> So in order to be standard compliant there, BTRFS would need to write 
> fallocated files uncompressed… wow this is getting complex.

...and nodatacow and no snapshots, since those require more space than
was ever anticipated by fallocate.

Given the choice, I'd just let fallocate fail.  Usually when I come
across a program using fallocate, I end up patching it so it doesn't use
fallocate any more.
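
For reference, posix_fallocate() reports the failure being discussed
through its return value rather than through errno, so a caller that
wants the quoted guarantee has to check that value explicitly. A minimal
sketch (file name and size are just command-line placeholders):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <file> <bytes>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_CREAT | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Returns 0 on success or an errno value on failure; ENOSPC is
	 * the "lack of disk space" case the man page text talks about. */
	int err = posix_fallocate(fd, 0, atoll(argv[2]));
	if (err)
		fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

	close(fd);
	return err ? 1 : 0;
}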



Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-08 Thread Konstantin

Robert White wrote on 08.12.2014 at 18:20:
> On 12/07/2014 04:32 PM, Konstantin wrote:
>> I know this and I'm using 0.9 on purpose. I need to boot from these
>> disks so I can't use 1.2 format as the BIOS wouldn't recognize the
>> partitions. Having an additional non-RAID disk for booting introduces a
>> single point of failure, which is contrary to the idea of RAID>0.
>
> GRUB2 has raid 1.1 and 1.2 metadata support via the mdraid1x module.
> LVM is also supported. I don't know if a stack of both is supported.
>
> There is, BTW, no such thing as a (commodity) computer without a
> single point of failure in it somewhere. I've watched government
> contracts chase this demon for decades. Be it disk, controller,
> network card, bus chip, cpu or stick-of-ram you've got a single point
> of failure somewhere. Actually you likely have several such points of
> potential failure.
>
> For instance, are you _sure_ your BIOS is going to check the second
> drive if it gets a read failure after starting in on your first drive?
> Chances are it won't because that four-hundred bytes-or-so boot loader
> on that first disk has no way to branch back into the bios.
>
> You can waste a lot of your life chasing that ghost and you'll still
> discover you've missed it and have to whip out your backup boot media.
>
> It may well be worth having a second copy of /boot around, but make
> sure you stay out of bandersnatch territory when designing your
> system. "The more you over-think the plumbing, the easier it is to
> stop up the pipes."
You are right, there is almost always a single point of failure
somewhere, even if it is the power plant providing your electricity ;-).
I should have written "introduces an additional single point of failure"
to be 100% correct, but I thought this was obvious. As I have replaced
dozens of damaged hard disks but only a few CPUs, RAM modules etc., it is more
important for me to reduce the most frequent and easy-to-solve points of
failure. For more important systems there are high availability
solutions which alleviate many of the problems you mention, but that's
not the point here when speaking about the major bug in BTRFS which can
make your system crash.




[PATCH][RFC] dm: log writes target

2014-12-08 Thread Josef Bacik
This is my latest attempt at a target for testing power fail and fs consistency.
This is based on the idea Zach Brown had where we could just walk through all
the operations done to an fs in order to verify we're doing the correct thing.
There is a userspace component as well that can be found here

https://github.com/josefbacik/log-writes

It is very rough as I just threw it together to test the various aspects of how
you would want to replay a log to test it.  Again I would love feedback on this,
I really want to have something that we all think is useful and eventually
incorporate it into xfstests.


From: Josef Bacik 
Subject: [PATCH]dm: log writes target

This creates a new target that is meant for file system developers to test file
system integrity at particular points in the life of a file system.  We capture
all write requests and the data and log the requests and the data to a separate
device for later replay.  There is a userspace utility to do this replay.  The
idea behind this is to give file system developers a way to verify that the file
system is always consistent.  Thanks,

Signed-off-by: Josef Bacik 
---
 Documentation/device-mapper/dm-log-writes.txt | 136 +
 drivers/md/Kconfig|  16 +
 drivers/md/Makefile   |   1 +
 drivers/md/dm-log-writes.c| 835 ++
 4 files changed, 988 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-log-writes.txt
 create mode 100644 drivers/md/dm-log-writes.c

diff --git a/Documentation/device-mapper/dm-log-writes.txt 
b/Documentation/device-mapper/dm-log-writes.txt
new file mode 100644
index 000..f3a9fa2
--- /dev/null
+++ b/Documentation/device-mapper/dm-log-writes.txt
@@ -0,0 +1,136 @@
+dm-log-writes
+=
+
+This target takes 2 devices, one to pass all IO to normally, and one to log all
+of the write operations to.  This is intended for file system developers wishing
+to verify the integrity of metadata or data as the file system is written to.
+There is a log_writes_entry written for every WRITE request and the target is
+able to take arbitrary data from userspace to insert into the log.  The data
+that is in the WRITE requests is copied into the log to make the replay happen
+exactly as it happened originally.
+
+Log Ordering
+
+
+We log things in order of completion once we are sure the write is no longer in
+cache.  This means that normal WRITE requests are not actually logged until the
+next REQ_FLUSH request.  This is to make it easier for userspace to replay the
+log in a way that correlates to what is on disk and not what is in cache, to
+make it easier to detect improper waiting/flushing.
+
+This works by attaching all WRITE requests to a list once the write completes.
+Once we see a REQ_FLUSH request we splice this list onto the request and once
+the FLUSH request completes we log all of the WRITE's and then the FLUSH.  Only
+completed WRITEs at the time of the issue of the REQ_FLUSH are added in order
+to simulate the worst case scenario with regard to power failures.  Consider the
+following example (W means write, C means complete)
+
+W1,W2,W3,C3,C2,Wflush,C1,Cflush
+
+The log would show the following
+
+W3,W2,flush,W1
+
+Again this is to simulate what is actually on disk, this allows us to detect
+cases where a power failure at a particular point in time would create an
+inconsistent file system.
+
+Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
+they complete as those requests will obviously bypass the device cache.
+
+Any REQ_DISCARD requests are treated like WRITE requests.  This is because
+otherwise we would have all the DISCARD requests, and then the WRITE requests
+and then the FLUSH request.  Consider the following example
+
+WRITE block 1, DISCARD block 1, FLUSH
+
+If we logged DISCARD when it completed, the replay would look like this
+
+DISCARD 1, WRITE 1, FLUSH
+
+which isn't quite what happened and wouldn't be caught during the log replay.
+
+Marks
+=
+
+You can use dmsetup to set an arbitrary mark in a log.  For example, say you want
+to fsck a file system after every write, but first you need to replay up to the
+mkfs to make sure we're fsck'ing something reasonable; you would do something
+like this
+
+mkfs.btrfs -f /dev/mapper/log
+dmsetup message log 0 mark mkfs
+
+
+This would allow you to replay the log up to the mkfs mark and then replay from
+that point on doing the fsck check in the interval that you want.
+
+Every log has a mark at the end labeled "log-writes-end".
+
+Userspace component
+===
+
+There is a userspace tool that will replay the log for you in various ways.
+As of this writing the options are not well documented; they will be in the
+future.  It can be found here
+
+https://github.com/josefbacik/log-writes
+
+Example usage
+=
+
+Say you want to test fsync on your file system.  You would do something like
+this:
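
A hedged sketch of what such a test could look like, pieced together from
the mark example above; the device names are placeholders, and the table
syntax and replay-log options are assumptions about the userspace tool
rather than text taken from this patch:

# /dev/sdb is the device under test, /dev/sdc receives the log.
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
dmsetup create log --table "$TABLE"
mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs
mount /dev/mapper/log /mnt/test
# ... run a workload that ends with an fsync of the file under test ...
dmsetup message log 0 mark fsync
md5sum /mnt/test/foo
umount /mnt/test
dmsetup remove log
# Replay the log onto the origin up to the fsync mark, then check that
# the file survived the simulated power cut.
replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
mount /dev/sdb /mnt/test
md5sum /mnt/test/foo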

Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-08 Thread Konstantin

Phillip Susi wrote on 08.12.2014 at 15:59:
> On 12/7/2014 7:32 PM, Konstantin wrote:
> >> I'm guessing you are using metadata format 0.9 or 1.0, which put
> >> the metadata at the end of the drive and the filesystem still
> >> starts in sector zero.  1.2 is now the default and would not have
> >> this problem as its metadata is at the start of the disk ( well,
> >> 4k from the start ) and the fs starts further down.
> > I know this and I'm using 0.9 on purpose. I need to boot from
> > these disks so I can't use 1.2 format as the BIOS wouldn't
> > recognize the partitions. Having an additional non-RAID disk for
> >> booting introduces a single point of failure, which is contrary to the
> > idea of RAID>0.
>
> The bios does not know or care about partitions.  All you need is a
That's only true for older BIOSes. With current EFI boards they not only
care, but some also mess around with GPT partition tables.
> partition table in the MBR and you can install grub there and have it
> boot the system from a mdadm 1.1 or 1.2 format array housed in a
> partition on the rest of the disk.  The only time you really *have* to
I was thinking of this solution as well, but as I'm not aware of any
partitioning tool that cares about mdadm metadata, I rejected it. It
requires a non-standard layout leaving reserved empty spaces for mdadm
metadata. It's possible, but as far as I know it isn't documented, and before
losing hours of trying I chose the obvious one.
> use 0.9 or 1.0 ( and you really should be using 1.0 instead since it
> handles larger arrays and can't be confused vis. whole disk vs.
> partition components ) is if you are running a raid1 on the raw disk,
> with no partition table and then partition inside the array instead,
> and really, you just shouldn't be doing that.
That's exactly what I want to do - running RAID1 on the whole disk as
most hardware based RAID systems do. Before that I was running RAID on
disk partitions for some years but this was quite a pain in comparison.
Hot(un)plugging a drive brings you a lot of issues with failing mdadm
commands as they don't like concurrent execution when the same physical
device is affected. And rebuild of RAID partitions is done sequentially
with no deterministic order. We could talk for hours about that, but if
you're interested, maybe better in private, as it is not BTRFS related.
> > Anyway, to avoid a futile discussion, mdraid and its format is not
> > the problem, it is just an example of the problem. Using dm-raid
> > would do the same trouble, LVM apparently, too. I could think of a
> > bunch of other cases including the use of hardware based RAID
> > controllers. OK, it's not the majority's problem, but that's not
> > the argument to keep a bug/flaw capable of crashing your system.
>
> dmraid solves the problem by removing the partitions from the
> underlying physical device ( /dev/sda ), and only exposing them on the
> array ( /dev/mapper/whatever ).  LVM only has the problem when you
> take a snapshot.  User space tools face the same issue and they
> resolve it by ignoring or deprioritizing the snapshot.
I don't agree. dmraid and mdraid both remove the partitions. This is not
a solution; BTRFS will still crash the PC using /dev/mapper/whatever or
whatever device appears in the system providing the BTRFS volume.
> > As it is a nice feature that the kernel apparently scans for drives
> > and automatically identifies BTRFS ones, it seems to me that this
> > feature is useless. When in a live system a BTRFS RAID disk fails,
> > it is not sufficient to hot-replace it, the kernel will not
> > automatically rebalance. Commands are still needed for the task as
> > are with mdraid. So the only point I can see at the moment where
> > this auto-detect feature makes sense is when mounting the device
> > for the first time. If I remember the documentation correctly, you
> > mount one of the RAID devices and the others are automagically
> > attached as well. But outside of the mount process, what is this
> > auto-detect used for?
>
> > So here a couple of rather simple solutions which, as far as I can
> > see, could solve the problem:
>
> > 1. Limit the auto-detect to the mount process and don't do it when
> > devices are appearing.
>
> > 2. When a BTRFS device is detected and its metadata is identical to
> > one already mounted, just ignore it.
>
> That doesn't really solve the problem since you can still pick the
> wrong one to mount in the first place.
Oh, it does solve the problem; you are speaking of another problem
which is always there when having several disks in a system. Mounting
the wrong device can happen in the case I'm describing if you use UUID,
label or some other metadata-related information to mount it. You won't
try to do that when you insert a disk you know has the same metadata. It
will not happen (unless user tools outsmart you ;-)) when using the
device name(s). I think it could be expected from a user mounting things
manually to know or learn which device node is which.

Re: Running out of disk space during BTRFS_IOC_CLONE - rebalance doesn't help

2014-12-08 Thread Dave
On Sun, Nov 30, 2014 at 2:29 AM, Guenther Starnberger
 wrote:
> Here's the log output:
>
> dmesg:
>
> [235491.227888] [ cut here ]
> [235491.227912] WARNING: CPU: 0 PID: 14837 at fs/btrfs/super.c:259 
> __btrfs_abort_transaction+0x50/0x110 [btrfs]()
> [235491.227914] BTRFS: Transaction aborted (error -28)
> [235491.227916] Modules linked in: fuse btrfs xor raid6_pq uas usb_storage 
> ctr ccm toshiba_acpi sparse_keymap toshiba_haps joydev hp_accel lis3lv02d 
> input_polldev hdaps(O) btusb bluetooth uvcvideo videobuf2_vmalloc 
> videobuf2_memops videobuf2_core v4l2_common videodev qcserial media usb_wwan 
> usbserial arc4 iwldvm snd_hda_codec_hdmi mousedev snd_hda_codec_conexant 
> snd_hda_codec_generic mac80211 iTCO_wdt iTCO_vendor_support coretemp 
> intel_powerclamp snd_hda_intel snd_hda_controller snd_hda_codec kvm_intel 
> snd_hwdep iwlwifi thinkpad_acpi mei_me mei cfg80211 snd_pcm nvram lpc_ich kvm 
> evdev snd_timer i915 snd mac_hid ac serio_raw e1000e psmouse led_class wmi 
> rfkill shpchp drm_kms_helper intel_ips i2c_i801 soundcore drm battery hwmon 
> ptp thermal pps_core i2c_algo_bit i2c_core video intel_agp intel_gtt button
> [235491.227968]  acpi_cpufreq processor sch_fq_codel tp_smapi(O) 
> thinkpad_ec(O) nfs lockd sunrpc fscache ext4 crc16 mbcache jbd2 
> algif_skcipher af_alg dm_crypt dm_mod atkbd libps2 crc32_pclmul crc32c_intel 
> ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper 
> ablk_helper cryptd ehci_pci ehci_hcd usbcore usb_common i8042 serio ata_piix 
> sd_mod crct10dif_generic crct10dif_pclmul crc_t10dif crct10dif_common ahci 
> libahci ata_generic libata scsi_mod
> [235491.228001] CPU: 0 PID: 14837 Comm: bedup Tainted: GW  O   
> 3.17.4-1-ARCH #1
> [235491.228003] Hardware name: LENOVO 3680U4M/3680U4M, BIOS 6QET68WW (1.38 ) 
> 12/01/2011
> [235491.228004]   5deed0d1 880144a57a90 
> 81537b0e
> [235491.228006]  880144a57ad8 880144a57ac8 8107078d 
> ffe4
> [235491.228008]  8801719dcaa0 88009e273800 a09f7630 
> 0c46
> [235491.228010] Call Trace:
> [235491.228017]  [] dump_stack+0x4d/0x6f
> [235491.228021]  [] warn_slowpath_common+0x7d/0xa0
> [235491.228024]  [] warn_slowpath_fmt+0x5c/0x80
> [235491.228029]  [] __btrfs_abort_transaction+0x50/0x110 
> [btrfs]
> [235491.228040]  [] clone_finish_inode_update+0xda/0xf0 
> [btrfs]
> [235491.228046]  [] btrfs_clone+0x6ae/0xcc0 [btrfs]
> [235491.228053]  [] btrfs_ioctl_clone+0x779/0x7b0 [btrfs]
> [235491.228059]  [] btrfs_ioctl+0x10d7/0x2810 [btrfs]
> [235491.228063]  [] ? free_pages_and_swap_cache+0xb9/0xe0
> [235491.228066]  [] ? tlb_flush_mmu_free+0x2c/0x50
> [235491.228068]  [] ? tlb_finish_mmu+0x4d/0x50
> [235491.228070]  [] ? unmap_region+0xe2/0x130
> [235491.228073]  [] ? kmem_cache_free+0x199/0x1d0
> [235491.228075]  [] do_vfs_ioctl+0x2d0/0x4b0
> [235491.228076]  [] ? do_munmap+0x260/0x400
> [235491.228078]  [] SyS_ioctl+0x81/0xa0
> [235491.228081]  [] system_call_fastpath+0x16/0x1b
> [235491.228082] ---[ end trace 636d52c4c1dff6bc ]---

I'm seeing a nearly identical stack trace.  I'm running Syncthing.  When
making changes to a file, all of the machines that should receive the
change appear to show the old file.  An ls -l on the receiving
machines indicates the file has zero links.  Attempting to rename the
file causes the filesystem to go read-only and produce the below
dmesg.  Upon reboot, the file displays the correct contents.  I'm
running Archlinux with kernel 3.17.6.  I'm seeing this error on four
machines and can reproduce it consistently.


[  184.546231] WARNING: CPU: 3 PID: 2529 at fs/btrfs/super.c:259
__btrfs_abort_transaction+0x50/0x110 [btrfs]()
[  184.546267] BTRFS: Transaction aborted (error -2)
[  184.546270] Modules linked in: md5 ecb ecryptfs cbc sha256_ssse3
sha256_generic encrypted_keys sha1_ssse3 sha1_generic hmac trusted
joydev nf_conntrack_irc nf_conntrack_ftp xt_NFLOG nfnetlink_log
nfnetlink xt_limit xt_helper pci_stub vboxpci(O) vboxnetflt(O)
vboxnetadp(O) vboxdrv(O) nvidia_uvm(PO) nvidia(PO) ipt_REJECT
xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4
nf_defrag_ipv4 xt_conntrack nf_conntrack dell_wmi sparse_keymap
snd_hda_codec_hdmi ip6table_filter ip6_tables iptable_filter ip_tables
x_tables snd_hda_codec_idt snd_hda_codec_generic iTCO_wdt
iTCO_vendor_support arc4 brcmsmac cordic brcmutil b43 mac80211
cfg80211 ssb rng_core pcmcia pcmcia_core ppdev dell_laptop rfkill
dcdbas nls_iso8859_1 coretemp hwmon intel_rapl nls_cp437
x86_pkg_temp_thermal vfat intel_powerclamp fat
[  184.546363]  kvm_intel kvm uvcvideo videobuf2_vmalloc
videobuf2_memops videobuf2_core v4l2_common mousedev videodev psmouse
media serio_raw i2c_i801 parport_pc snd_hda_intel snd_hda_controller
snd_hda_codec tpm_tis snd_hwdep bcma wmi tpm lpc_ich snd_pcm parport
e1000e dell_smo8800 evdev battery snd_timer mac_hid ac snd mei_me
thermal mei ptp shpchp pps_core soundcore processor sch_fq_codel btrfs

Re: [PATCH V2][BTRFS-PROGS] Don't use LVM snapshot device

2014-12-08 Thread Robert White

On 12/08/2014 09:36 AM, Goffredo Baroncelli wrote:

I like this approach, but as I wrote before, it seems that
initramfs executes a "btrfs dev scan" (see my previous email 
'Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots'
date 2014/12/03 9:34):


Roll your own for now. I haven't started doing any significant btrfs-specific 
work on it, but the initramfs builder in my project 
http://underdog.sourceforge.net might get you past your problem pretty 
easily for now. It would be easy to whitelist/blacklist which devices you 
want to submit to scan or mount.


It is plumbed up to look at each storage region one-by-one so you could 
assemble your file system that way.


(Note that the eventual point of the project isn't really the initramfs 
stuff but that's what I needed more/first.)


It's fairly well documented and I use it for some non-trivially complex 
systems but it's not (yet) so complex that it's hard to design hooks for it.





Re: [PATCH V2][BTRFS-PROGS] Don't use LVM snapshot device

2014-12-08 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 12/8/2014 12:36 PM, Goffredo Baroncelli wrote:
> I like this approach, but as I wrote before, it seems that 
> initramfs executes a "btrfs dev scan" (see my previous email 
> 'Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their
> snapshots' date 2014/12/03 9:34):
> 
> $ grep -r "btrfs dev.*scan" /usr/share/initramfs-tools/ 
> /usr/share/initramfs-tools/scripts/local-premount/btrfs:
> /sbin/btrfs device scan 2> /dev/null
> 
> (this is from a Debian system). However, it has to be pointed out that 
> Fedora doesn't seem to do the same...

Need to fix that initramfs script then.  On the other hand, if one
*does* run a scan with no arguments, then it probably is a good idea
to ignore snapshots.


-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.17 (MingW32)

iQEcBAEBAgAGBQJUhesjAAoJENRVrw2cjl5RPRAIALK5ERJqRJLSsa6kBSIP8bWe
WP531ew49I0Tkc0o3YOCqq07tb4kZ5rsLsPaLE+s3adCe5/wYzQOox4x6ucak1gK
0igazFx9TYM65YRtFzIUAnj/CPN4WwIInwoAac4w2qwCKB56WUbSU60lEsOmFfRr
6m9EUYkBtMRiWfW2jjuj8iLnBW6glexAqTpW1eKWPfF0AGoUXc8AQboNwceFnHi3
vcjmQM6mhL5zH+FJ0Z/meTk/PwVdjEVJQIEcbMpvggAJeqxsm90GHVIsn8C7B80i
GcX8GHe+Gw3WJMsaW49slKa+MOjWt2SumN/lrKFYPVwQUguhvg0hC1UG5m3cJFo=
=64dH
-END PGP SIGNATURE-


Re: [PATCH V2][BTRFS-PROGS] Don't use LVM snapshot device

2014-12-08 Thread Goffredo Baroncelli
On 12/08/2014 04:30 PM, Phillip Susi wrote:
> On 12/4/2014 1:39 PM, Goffredo Baroncelli wrote:
[...]
>> To check whether a device is an LVM snapshot, the 'udev'
>> device property 'DM_UDEV_LOW_PRIORITY_FLAG' is checked. If it is set to 1,
>> the device has to be skipped.
>
>> As a consequence, btrfs now also depends on libudev.
> 
> Rather than modify btrfs device scan to link to libudev and ignore the
> caller when commanded to scan a snapshot, wouldn't it be
> simpler/better to just fix the udev rule to not *call* btrfs device
> scan on the snapshot?

I like this approach, but as I wrote before, it seems that 
initramfs executes a "btrfs dev scan" (see my previous email 
'Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots' 
date 2014/12/03 9:34):

$ grep -r "btrfs dev.*scan" /usr/share/initramfs-tools/
/usr/share/initramfs-tools/scripts/local-premount/btrfs:/sbin/btrfs 
device scan 2> /dev/null

(this is from a Debian system).
However, it has to be pointed out that
Fedora doesn't seem to do the same...

I have to investigate a bit


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-08 Thread Robert White

On 12/07/2014 04:32 PM, Konstantin wrote:

I know this and I'm using 0.9 on purpose. I need to boot from these
disks so I can't use 1.2 format as the BIOS wouldn't recognize the
partitions. Having an additional non-RAID disk for booting introduces a
single point of failure, which is contrary to the idea of RAID>0.


GRUB2 has raid 1.1 and 1.2 metadata support via the mdraid1x module. LVM 
is also supported. I don't know if a stack of both is supported.
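
As a concrete illustration of the GRUB2 side of that, a hypothetical
grub.cfg fragment might look like the following (the module list, kernel
paths and root device are placeholders, not a tested configuration):

# Hypothetical fragment: boot a kernel from /boot on an mdadm 1.x array.
insmod part_msdos
insmod mdraid1x        # mdadm 1.1/1.2 metadata support
insmod ext2            # also covers ext3/ext4
set root=(md/0)
linux  /vmlinuz root=/dev/md0 ro
initrd /initrd.img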


There is, BTW, no such thing as a (commodity) computer without a single 
point of failure in it somewhere. I've watched government contracts 
chase this demon for decades. Be it disk, controller, network card, bus 
chip, cpu or stick-of-ram you've got a single point of failure 
somewhere. Actually you likely have several such points of potential 
failure.


For instance, are you _sure_ your BIOS is going to check the second 
drive if it gets a read failure after starting in on your first drive? 
Chances are it won't because that four-hundred bytes-or-so boot loader 
on that first disk has no way to branch back into the bios.


You can waste a lot of your life chasing that ghost and you'll still 
discover you've missed it and have to whip out your backup boot media.


It may well be worth having a second copy of /boot around, but make sure 
you stay out of bandersnatch territory when designing your system. "The 
more you over-think the plumbing, the easier it is to stop up the pipes."



Re: Why is the actual disk usage of btrfs considered unknowable?

2014-12-08 Thread Martin Steigerwald
On Monday, 8 December 2014, 09:57:50, Austin S Hemmelgarn wrote:
> On 2014-12-08 09:47, Martin Steigerwald wrote:
> > Hi,
> > 
> > On Sunday, 7 December 2014, 21:32:01, Robert White wrote:
> >> On 12/07/2014 07:40 AM, Martin Steigerwald wrote:
> >>> Well what would be possible I bet would be a kind of system call like
> >>> this:
> >>> 
> >>> I need to write 5 GB of data in 100 of files to /opt/mynewshinysoftware,
> >>> can I do it *and* give me a guarantee I can.
> >>> 
> >>> So like a more flexible fallocate approach as fallocate just allocates
> >>> one
> >>> file and you would need to run it for all files you intend to create.
> >>> But
> >>> challenge would be to estimate metadata allocation beforehand
> >>> accurately.
> >>> 
> >>> Or have tar --fallocate -xf which for all files in the archive will
> >>> first
> >>> call fallocate and only if that succeeded, actually write them. But due
> >>> to the nature of tar archives with their content listing across the
> >>> whole
> >>> archive, this means it may have to read the tar archive twice, so ZIP
> >>> archives might be better suited for that.
> >> 
> >> What you suggest is Still Not Practical™ (the tar thing might have some
> >> ability if you were willing to analyze every file to the byte level).
> >> 
> >> Compression _can_ make a file _bigger_ than its base size. BTRFS decides
> >> whether or not to compress a file based on the results it gets when
> >> trying to compress the first N bytes. (I do not know the value of N). But
> >> it is _easy_ to have a file where the first N bytes compress well but
> >> the bytes after N take up more space than their byte count. So to
> >> fallocate() the right size in blocks you'd have to compress the input
> >> and determine what BTRFS _would_ _do_ and then allocate that much space
> >> instead of the file size.
> >> 
> >> And even then, if you didn't create all the names and directories you
> >> might find that the RBtree had to expand (allocate another tree node)
> >> one or more times to accommodate the actual files. Lather rinse repeat
> >> for any checksum trees and anything hitting a flush barrier because of
> >> commit= or sync() events or other writers perturbing your results
> >> because it only matters if the filesystem is nearly full and nearly full
> >> filesystems may not be quiescent at all.
> >> 
> >> So while the core problem isn't insoluble, in real life it is _not_
> >> _worth_ _solving_.
> >> 
> >> On a nearly empty filesystem, it's going to fit.
> >> 
> >> In a reasonably empty filesystem, it's going to fit.
> >> 
> >> On a nearly full filesystem, it may or may not fit.
> >> 
> >> On a filesystem that is so close to full that you have reason to doubt
> >> it will fit, you are going to have a very bad time even if it fits.
> >> 
> >> If you did manage to invent and implement an fallocate algorithm that
> >> could make this promise and make it stick, then some other running
> >> program is what's going to crash when you use up that last byte anyway.
> >> 
> >> Almost full filesystems are their own reward.
> > 
> > So you basically say that BTRFS with compression does not meet the
> > fallocate guarantee. Now that's interesting, because it basically violates the
> > documentation for the system call:
> > 
> > DESCRIPTION
> > 
> > The function posix_fallocate() ensures that disk space  is  allo‐
> > cated for the file referred to by the descriptor fd for the bytes
> > in the range starting at offset and  continuing  for  len  bytes.
> > After  a  successful call to posix_fallocate(), subsequent writes
> > to bytes in the  specified  range  are  guaranteed  not  to  fail
> > because of lack of disk space.
> > 
> > So in order to be standard compliant there, BTRFS would need to write
> > fallocated files uncompressed… wow this is getting complex.
> 
> The other option would be to allocate based on the worst-case size
> increase for the compression algorithm (which works out to about 5%
> IIRC for zlib and a bit more for lzo) and then possibly discard the
> unwritten extents at some later point.

Now that seems like a workable solution.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



Re: [PATCH V2][BTRFS-PROGS] Don't use LVM snapshot device

2014-12-08 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 12/4/2014 1:39 PM, Goffredo Baroncelli wrote:
> LVM snapshots are a problem for btrfs device management. BTRFS
> assumes that each device has a unique 'device UUID'. An LVM
> snapshot breaks this assumption.
>
> This causes a lot of problems if some btrfs devices are
> snapshotted: - the set of devices for a btrfs multi-volume
> filesystem may be mixed (i.e. some NON-snapshotted devices with some
> snapshotted devices) - /proc/mounts may return the wrong device.
>
> On the mailing list some posts reported these incidents.
>
> This patch allows btrfs to skip LVM snapshots during the device scan
> phase.
>
> But if you need to consider an LVM snapshot you can set the
> environment variable BTRFS_SKIP_LVM_SNAPSHOT to "no". In this case
> the old behavior is applied.
>
> To check whether a device is an LVM snapshot, the 'udev'
> device property 'DM_UDEV_LOW_PRIORITY_FLAG' is checked. If it is set to 1,
> the device has to be skipped.
>
> As a consequence, btrfs now also depends on libudev.

Rather than modify btrfs device scan to link to libudev and ignore the
caller when commanded to scan a snapshot, wouldn't it be
simpler/better to just fix the udev rule to not *call* btrfs device
scan on the snapshot?
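
As an illustration of how small that rule-side fix could be, here is a
hypothetical udev rule fragment (the file name and match keys are
assumptions, and real distro rules differ):

# e.g. /etc/udev/rules.d/64-btrfs-local.rules (hypothetical)
# Skip devices udev has flagged as low priority (LVM snapshots),
# register everything else that carries a btrfs signature.
ENV{DM_UDEV_LOW_PRIORITY_FLAG}=="1", GOTO="btrfs_end"
ENV{ID_FS_TYPE}=="btrfs", RUN+="/sbin/btrfs device scan $devnode"
LABEL="btrfs_end"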


-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.17 (MingW32)

iQEcBAEBAgAGBQJUhcQyAAoJENRVrw2cjl5R1YAIAJlCine4apKnM+01Tw6JpofZ
447+FVizi5SjdgkcjcREyU5zu1pa7ioOTdExF1v1irN1xMUrRBL/RJcRjjnjkvjB
dP8JU0x52MEvQABzQP9ANWJnkMqUJ0j+ryPn+3B7wLP/RtAnIn2P9Vh1EhiLkZ9N
TdxZIPtPROWPTFBl9ONTBghOHjWYEtcDMkuTS6ZhwLh5c1LE8d3A9c68ez++oSGz
TbS51ITFZCEUyF7E/r/xWHhrYagoRM+xdYqVACpi5eY8rFKl3oH4R96gBK8hNdiN
AIOilSsNFscXiflORMAaRquW/7tUolfNt+3TfzTYmaVnK4Hv5h0wiJjiKJhNgDY=
=HlmL
-END PGP SIGNATURE-


Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-08 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 12/7/2014 7:32 PM, Konstantin wrote:
>> I'm guessing you are using metadata format 0.9 or 1.0, which put
>> the metadata at the end of the drive and the filesystem still
>> starts in sector zero.  1.2 is now the default and would not have
>> this problem as its metadata is at the start of the disk ( well,
>> 4k from the start ) and the fs starts further down.
> I know this and I'm using 0.9 on purpose. I need to boot from
> these disks so I can't use 1.2 format as the BIOS wouldn't
> recognize the partitions. Having an additional non-RAID disk for
> booting introduces a single point of failure, which is contrary to the
> idea of RAID>0.

The bios does not know or care about partitions.  All you need is a
partition table in the MBR and you can install grub there and have it
boot the system from a mdadm 1.1 or 1.2 format array housed in a
partition on the rest of the disk.  The only time you really *have* to
use 0.9 or 1.0 ( and you really should be using 1.0 instead since it
handles larger arrays and can't be confused vis. whole disk vs.
partition components ) is if you are running a raid1 on the raw disk,
with no partition table and then partition inside the array instead,
and really, you just shouldn't be doing that.

> Anyway, to avoid a futile discussion, mdraid and its format is not
> the problem, it is just an example of the problem. Using dm-raid
> would do the same trouble, LVM apparently, too. I could think of a
> bunch of other cases including the use of hardware based RAID
> controllers. OK, it's not the majority's problem, but that's not
> the argument to keep a bug/flaw capable of crashing your system.

dmraid solves the problem by removing the partitions from the
underlying physical device ( /dev/sda ), and only exposing them on the
array ( /dev/mapper/whatever ).  LVM only has the problem when you
take a snapshot.  User space tools face the same issue and they
resolve it by ignoring or deprioritizing the snapshot.

> As it is a nice feature that the kernel apparently scans for drives
> and automatically identifies BTRFS ones, it seems to me that this
> feature is useless. When in a live system a BTRFS RAID disk fails,
> it is not sufficient to hot-replace it, the kernel will not
> automatically rebalance. Commands are still needed for the task as
> are with mdraid. So the only point I can see at the moment where
> this auto-detect feature makes sense is when mounting the device
> for the first time. If I remember the documentation correctly, you
> mount one of the RAID devices and the others are automagically
> attached as well. But outside of the mount process, what is this
> auto-detect used for?
> 
> So here a couple of rather simple solutions which, as far as I can
> see, could solve the problem:
> 
> 1. Limit the auto-detect to the mount process and don't do it when 
> devices are appearing.
> 
> 2. When a BTRFS device is detected and its metadata is identical to
> one already mounted, just ignore it.

That doesn't really solve the problem since you can still pick the
wrong one to mount in the first place.

-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.17 (MingW32)

iQEcBAEBAgAGBQJUhbztAAoJENRVrw2cjl5RomkH/26Q3M6LXVaF0qEcEzFTzGEL
uVAOKBY040Ui5bSK0WQYnH0XtE8vlpLSFHxrRa7Ygpr3jhffSsu6ZsmbOclK64ZA
Z8rNEmRFhOxtFYTcQwcUbeBtXEN3k/5H49JxbjUDItnVPBoeK3n7XG4i1Lap5IdY
GXyLbh7ogqd/p+wX6Om20NkJSx4xzyU85E4ZvDADQA+2RIBaXva5tDPx5/UD4XBQ
h8ai+wS1iC8EySKxwKBEwzwb7+Z6w7nOWO93v/lL34fwTg0OIY9uEfTaAy5KcDjz
z6QXWTmvrbiFpyy/qyGSqBGlPjZ+r98mVEDbYWCVfK8AoD6UmteD7R8WAWkWiWY=
=PJww
-END PGP SIGNATURE-


Re: Why is the actual disk usage of btrfs considered unknowable?

2014-12-08 Thread Austin S Hemmelgarn

On 2014-12-08 09:47, Martin Steigerwald wrote:

Hi,

On Sunday, 7 December 2014, 21:32:01, Robert White wrote:

On 12/07/2014 07:40 AM, Martin Steigerwald wrote:

Well what would be possible I bet would be a kind of system call like
this:

I need to write 5 GB of data in 100 of files to /opt/mynewshinysoftware,
can I do it *and* give me a guarantee I can.

So like a more flexible fallocate approach as fallocate just allocates one
file and you would need to run it for all files you intend to create. But
challenge would be to estimate metadata allocation beforehand accurately.

Or have tar --fallocate -xf which for all files in the archive will first
call fallocate and only if that succeeded, actually write them. But due
to the nature of tar archives with their content listing across the whole
archive, this means it may have to read the tar archive twice, so ZIP
archives might be better suited for that.


What you suggest is Still Not Practical™ (the tar thing might have some
ability if you were willing to analyze every file to the byte level).

Compression _can_ make a file _bigger_ than its base size. BTRFS decides
whether or not to compress a file based on the results it gets when
trying to compress the first N bytes. (I do not know the value of N). But
it is _easy_ to have a file where the first N bytes compress well but
the bytes after N take up more space than their byte count. So to
fallocate() the right size in blocks you'd have to compress the input
and determine what BTRFS _would_ _do_ and then allocate that much space
instead of the file size.

And even then, if you didn't create all the names and directories you
might find that the RBtree had to expand (allocate another tree node)
one or more times to accommodate the actual files. Lather rinse repeat
for any checksum trees and anything hitting a flush barrier because of
commit= or sync() events or other writers perturbing your results
because it only matters if the filesystem is nearly full and nearly full
filesystems may not be quiescent at all.

So while the core problem isn't insoluble, in real life it is _not_
_worth_ _solving_.

On a nearly empty filesystem, it's going to fit.

In a reasonably empty filesystem, it's going to fit.

On a nearly full filesystem, it may or may not fit.

On a filesystem that is so close to full that you have reason to doubt
it will fit, you are going to have a very bad time even if it fits.

If you did manage to invent and implement an fallocate algorithm that
could make this promise and make it stick, then some other running
program is what's going to crash when you use up that last byte anyway.

Almost full filesystems are their own reward.


So you basically say that BTRFS with compression does not meet the fallocate
guarantee. Now that's interesting, because it basically violates the
documentation for the system call:

DESCRIPTION
The function posix_fallocate() ensures that disk space  is  allo‐
cated for the file referred to by the descriptor fd for the bytes
in the range starting at offset and  continuing  for  len  bytes.
After  a  successful call to posix_fallocate(), subsequent writes
to bytes in the  specified  range  are  guaranteed  not  to  fail
because of lack of disk space.

So in order to be standard compliant there, BTRFS would need to write
fallocated files uncompressed… wow this is getting complex.
The other option would be to allocate based on the worst-case size 
increase for the compression algorithm (which works out to about 5% 
IIRC for zlib and a bit more for lzo) and then possibly discard the 
unwritten extents at some later point.
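
For a sense of scale, the compressors publish their own worst-case
bounds, so such a reservation could be computed directly; a minimal
sketch (the 128 KiB figure mirrors btrfs's compression chunk size, and
the LZO line uses the documented lzo1x bound rather than anything taken
from btrfs itself):

#include <stdio.h>
#include <zlib.h>	/* compressBound(), link with -lz */

int main(void)
{
	unsigned long len = 128 * 1024;	/* one 128 KiB compression chunk */

	/* zlib's own worst-case output size for a len-byte input. */
	unsigned long zlib_worst = compressBound(len);

	/* Documented lzo1x worst-case expansion: len + len/16 + 64 + 3. */
	unsigned long lzo_worst = len + len / 16 + 64 + 3;

	printf("input %lu -> zlib worst case %lu, lzo worst case %lu\n",
	       len, zlib_worst, lzo_worst);
	return 0;
}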







Re: [PATCH 1/5] Avoid to consider lvm snapshots when scanning devices.

2014-12-08 Thread Goffredo Baroncelli
On 12/08/2014 03:02 AM, Qu Wenruo wrote:
> 
>  Original Message 
> Subject: [PATCH 1/5] Avoid to consider lvm snapshots when scanning devices.
> From: Goffredo Baroncelli 
> To: 
> Date: 2014-12-05 02:39
>> LVM snapshots create a problem for btrfs device management.
>> BTRFS assumes that each device has a unique 'device UUID'.
>> An LVM snapshot breaks this assumption.
>>
>> This patch skips LVM snapshots during the device scan phase.
>> If you need to consider an LVM snapshot you have to set the
>> environment variable BTRFS_SKIP_LVM_SNAPSHOT to "no".
> IMHO, it is better to skip an LVM snapshot if and only if the snapshot
> contains a btrfs filesystem with a conflicting UUID.

Hi Qu,

Currently the "scan phase" in btrfs is done one device at a time
(udev finds a new device and starts "btrfs dev scan "),
and I haven't changed that. This means that btrfs[-progs] doesn't
know which devices are already registered and which are not. And even if
it knew this information, you have to consider the case
where the snapshot appears before the real target. So
btrfs[-progs] is not in a position to perform this analysis [see
my other comment below]

> Since LVM is such a flexible block-level volume manager, it is possible that
> someone took a snapshot of a btrfs fs and then reformatted the original one to another fs.
> In that case, skipping the LVM snapshot seems like overkill.
>
> Also, personally, I would prefer an option like -i to allow the user to
> choose which device is used when a conflicting UUID is detected. This seems
> to be the best approach, and the user would have full control over the device
> scan. It would also make the environment variable unnecessary.
>
> Skipping LVM snapshots (only when conflicting) would be better as the fallback
> behavior if -i is not given.

I understand your reasons, but I can't find any solution
compatible with the current btrfs device registration model
(asynchronous, when the device appears).

In another patch set, I proposed a mount.btrfs helper which is
in a position to perform this analysis and to pick the "right"
device (even with a user suggestion).

Today lvm-snapshot and btrfs behave very poorly: it is not
predictable which device is picked (the original or the snapshot).
This patch *avoids* most problems by skipping the snapshots, which
to me seems a reasonable default.
For the other cases the user is still able to mount any disk
[combination] by passing them directly on the command line (
mount /dev/sdX -o device=/dev/sdY,device=/dev/sdz...  );

Anyway, I think for this kind of setup (btrfs on lvm-snapshot),
passing the disks explicitly is the only solution; in fact your
suggestion about the '-i' switch is not very different.

> Thanks,
> Qu

BR
G.Baroncelli
>>
>> To check whether a device is an LVM snapshot, the 'udev'
>> device property 'DM_UDEV_LOW_PRIORITY_FLAG' is checked.
>> If it is set to 1, the device has to be skipped.
>>
>> As a consequence, btrfs now depends on libudev.
>>
>> Programmatically you can control this behavior with the functions:
>> - btrfs_scan_set_skip_lvm_snapshot(int new_value)
>> - int btrfs_scan_get_skip_lvm_snapshot( )
>>
>> Signed-off-by: Goffredo Baroncelli 
>> ---
>>   Makefile |   4 +--
>>   utils.c  | 107 +++
>>   utils.h  |   9 +-
>>   3 files changed, 117 insertions(+), 3 deletions(-)
>>
>> diff --git a/Makefile b/Makefile
>> index 4cae30c..9464361 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -26,7 +26,7 @@ TESTS = fsck-tests.sh convert-tests.sh
>>   INSTALL = install
>>   prefix ?= /usr/local
>>   bindir = $(prefix)/bin
>> -lib_LIBS = -luuid -lblkid -lm -lz -llzo2 -L.
>> +lib_LIBS = -luuid -lblkid -lm -lz -ludev -llzo2 -L.
>>   libdir ?= $(prefix)/lib
>>   incdir = $(prefix)/include/btrfs
>>   LIBS = $(lib_LIBS) $(libs_static)
>> @@ -99,7 +99,7 @@ lib_links = libbtrfs.so.0 libbtrfs.so
>>   headers = $(libbtrfs_headers)
>> # make C=1 to enable sparse
>> -check_defs := .cc-defines.h
>> +check_defs := .cc-defines.h
>>   ifdef C
>>   #
>>   # We're trying to use sparse against glibc headers which go wild
>> diff --git a/utils.c b/utils.c
>> index 2a92416..9887f8b 100644
>> --- a/utils.c
>> +++ b/utils.c
>> @@ -29,6 +29,7 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>>   #include 
>>   #include 
>>   #include 
>> @@ -52,6 +53,13 @@
>>   #define BLKDISCARD _IO(0x12,119)
>>   #endif
>>   +/*
>> + * This variable controls if the lvm snapshot have to be skipped or not.
>> + * Access this variable only via the btrfs_scan_[sg]et_skip_lvm_snapshot()
>> + * functions
>> + */
>> +static int __scan_device_skip_lvm_snapshot = -1;
>> +
>>   static int btrfs_scan_done = 0;
>> static char argv0_buf[ARGV0_BUF_SIZE] = "btrfs";
>> @@ -1593,6 +1601,9 @@ int btrfs_scan_block_devices(int run_ioctl)
>>   char fullpath[110];
>>   int scans = 0;
>>   int special;
>> +int skip_snapshot;
>> +
>> +skip_snapshot = btrfs_scan_get_skip_lvm_snapshot();
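
For reference, the DM_UDEV_LOW_PRIORITY_FLAG check described above boils
down to asking udev for one property of the block device. A minimal,
self-contained libudev sketch of that idea (illustrative only; the names
and error handling are not taken from the patch):

#include <libudev.h>
#include <string.h>
#include <sys/stat.h>

/* Return 1 if the block device at 'path' carries the udev property
 * DM_UDEV_LOW_PRIORITY_FLAG=1 (typically set for LVM snapshot devices),
 * 0 if it does not, and -1 on error. */
static int is_low_priority_dm_device(const char *path)
{
	struct stat st;
	struct udev *udev;
	struct udev_device *dev;
	const char *value;
	int ret = 0;

	if (stat(path, &st) < 0 || !S_ISBLK(st.st_mode))
		return -1;
	udev = udev_new();
	if (!udev)
		return -1;
	dev = udev_device_new_from_devnum(udev, 'b', st.st_rdev);
	if (!dev) {
		udev_unref(udev);
		return -1;
	}
	value = udev_device_get_property_value(dev, "DM_UDEV_LOW_PRIORITY_FLAG");
	if (value && !strcmp(value, "1"))
		ret = 1;
	udev_device_unref(dev);
	udev_unref(udev);
	return ret;
}

A scanner could then skip any device for which this returns 1, unless
BTRFS_SKIP_LVM_SNAPSHOT=no is set in the environment.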

Re: Possible to undo subvol delete?

2014-12-08 Thread Austin S Hemmelgarn

On 2014-12-08 09:16, Shriramana Sharma wrote:

On Mon, Dec 8, 2014 at 6:31 PM, Austin S Hemmelgarn
 wrote:

Personally, I prefer a somewhat hybrid approach where everyone has *sbin in
their path, but file permissions are used to control what non-administrators
can run.


This is exactly the same approach as Ubuntu's, since a non-superuser
can't really do anything active (whether creating or deleting) with
*/sbin commands, but can only query things (like ifconfig, btrfs subvol
list etc.). So this doesn't really seem to be a hybrid of anything.

IIRC, Ubuntu relies on the fact that normal users don't have the 
capabilities required for the privileged operations, as opposed to just 
not letting them run the binaries at all.






Re: Why is the actual disk usage of btrfs considered unknowable?

2014-12-08 Thread Martin Steigerwald
Hi,

On Sunday, 7 December 2014, 21:32:01, Robert White wrote:
> On 12/07/2014 07:40 AM, Martin Steigerwald wrote:
> > Well what would be possible I bet would be a kind of system call like
> > this:
> > 
> > I need to write 5 GB of data in 100 files to /opt/mynewshinysoftware,
> > can I do it *and* give me a guarantee that I can.
> > 
> > So it would be like a more flexible fallocate, since fallocate just
> > allocates one file and you would need to run it for every file you intend
> > to create. But the challenge would be to estimate the metadata allocation
> > accurately beforehand.
> > 
> > Or have tar --fallocate -xf, which would first call fallocate for every
> > file in the archive and only actually write them if that succeeded. But
> > due to the nature of tar archives, with their content listing spread
> > across the whole archive, this means it may have to read the tar archive
> > twice, so ZIP archives might be better suited for that.
> 
> What you suggest is Still Not Practical™ (the tar approach might be
> somewhat feasible if you were willing to analyze every file down to the
> byte level).
> 
> Compression _can_ make a file _bigger_ than its base size. BTRFS decides
> whether or not to compress a file based on the results it gets when
> trying to compress the first N bytes. (I do not know the value of N.) But
> it is _easy_ to have a file where the first N bytes compress well but
> the bytes after N take up more space than their byte count. So to
> fallocate() the right size in blocks you'd have to compress the input
> and determine what BTRFS _would_ _do_ and then allocate that much space
> instead of the file size.
> 
> And even then, if you didn't create all the names and directories you
> might find that the RB-tree had to expand (allocate another tree node)
> one or more times to accommodate the actual files. Lather, rinse, repeat
> for any checksum trees, for anything hitting a flush barrier because of
> commit= or sync() events, and for other writers perturbing your results -
> because it only matters if the filesystem is nearly full, and nearly-full
> filesystems may not be quiescent at all.
> 
> So while the core problem isn't insoluble, in real life it is _not_
> _worth_ _solving_.
> 
> On a nearly empty filesystem, it's going to fit.
> 
> On a reasonably empty filesystem, it's going to fit.
> 
> On a nearly full filesystem, it may or may not fit.
> 
> On a filesystem that is so close to full that you have reason to doubt
> it will fit, you are going to have a very bad time even if it fits.
> 
> If you did manage to invent and implement an fallocate algorythm that
> could make this promise and make it stick, then some other running
> program is what's going to crash when you use up that last byte anyway.
> 
> Almost full filesystems are their own reward.

So you are basically saying that BTRFS with compression does not meet the
fallocate guarantee. Now that's interesting, because it basically violates
the documentation for the system call:

DESCRIPTION
   The function posix_fallocate() ensures that disk space is allocated
   for the file referred to by the descriptor fd for the bytes in the
   range starting at offset and continuing for len bytes. After a
   successful call to posix_fallocate(), subsequent writes to bytes in
   the specified range are guaranteed not to fail because of lack of
   disk space.

So in order to be standards-compliant there, BTRFS would need to write
fallocated files uncompressed… wow, this is getting complex.
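
For reference, the contract under discussion is the one a caller relies on
in code like this minimal sketch (the directory comes from the example
above; the file name and size handling are made up):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const off_t len = (off_t)5 << 30;	/* 5 GiB, as in the example */
	int fd = open("/opt/mynewshinysoftware/blob.bin",
		      O_CREAT | O_WRONLY, 0644);
	int err;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* POSIX: if this returns 0, later writes within [0, len) must not
	 * fail for lack of disk space. The thread's point is that
	 * compression (and metadata CoW) make this hard for btrfs to keep. */
	err = posix_fallocate(fd, 0, len);
	if (err) {
		fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
		close(fd);
		return 1;
	}
	/* ... write up to len bytes, expecting no ENOSPC ... */
	close(fd);
	return 0;
}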

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: Why is the actual disk usage of btrfs considered unknowable?

2014-12-08 Thread Goffredo Baroncelli
On 12/08/2014 01:12 AM, ashf...@whisperpc.com wrote:
> Goffredo,
> 
>> So in case you have a raid1 filesystem on two disks; each disk has 300GB
>> free; which is the free space that you expected: 300GB or 600GB and why ?
> 
> You should see 300GB free.  That's what you'll see with RAID-1 with a
> hardware RAID controller, and with MD RAID.  Why would you expect to see
> anything else with BTRFS RAID?

I had to ask because in a previous email of yours you stated something
different:

On 12/07/2014 09:32 PM, ashf...@whisperpc.com wrote:
> I disagree.  My experiences with other file-systems, including ZFS, show
> that the most common solution is to just deliver to the user the actual
> amount of *unused disk space*
^^^

So I expected you to answer 600GB. But you have told the truth: the user
wants to know how much data he is able to store on the disk, not the
unused disk space.

But I have to point out that the common case is a single-disk filesystem
where the metadata chunks have a ratio of data stored to disk space
consumed of 1:2, while the data chunks have a ratio of 1:1. This is one
reason why it is difficult to evaluate the free space: if you had only
metadata chunks, you would have to halve the disk space.
Another reason is the idea of allowing different raid profiles in the
same filesystem. This will further complicate the free space
evaluation.
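
To give a concrete illustration of those ratios: on a single 100GiB disk,
20GiB of file data stored in data chunks consumes 20GiB of raw space
(1:1), while 5GiB of metadata consumes 10GiB (1:2), for 30GiB in total;
how much of the remaining 70GiB counts as "free" depends on whether
future allocations go to data chunks or to metadata chunks.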


> Peter Ashford

G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Possible to undo subvol delete?

2014-12-08 Thread Shriramana Sharma
On Mon, Dec 8, 2014 at 6:31 PM, Austin S Hemmelgarn
 wrote:
> Personally, I prefer a somewhat hybrid approach where everyone has *sbin in
> their path, but file permissions are used to control what non-administrators
> can run.

This is exactly the same approach as Ubuntu's, since a non-superuser
can't really do anything active (whether creating or deleting) with
*/sbin commands, but can only query things (like ifconfig, btrfs subvol
list etc.). So this doesn't really seem to be a hybrid of anything.

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा


Re: [PATCH v2] Btrfs: fix fs corruption on transaction abort if device supports discard

2014-12-08 Thread Filipe David Manana
On Mon, Dec 8, 2014 at 1:53 PM, Chris Mason  wrote:
> On Sun, Dec 7, 2014 at 4:31 PM, Filipe Manana  wrote:
>>
>> When we abort a transaction we iterate over all the ranges marked as dirty
>> in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
>> from those trees, add them back (unpin) to the free space caches and, if
>> the fs was mounted with "-o discard", perform a discard on those regions.
>> Also, after adding the regions to the free space caches, a fitrim ioctl
>> call
>> can see those ranges in a block group's free space cache and perform a
>> discard
>> on the ranges, so the same issue can happen without "-o discard" as well.
>>
>> This causes corruption, affecting one or multiple btree nodes (in the
>> worst
>> case leaving the fs unmountable) because some of those ranges (the ones in
>> the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
>> referred by the last committed super block - breaking the rule that
>> anything
>> that was committed by a transaction is untouched until the next
>> transaction
>> commits successfully.
>
>
> This is great work Filipe, thank you!
>
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 7e2405a..ce65b0c 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -4134,12 +4134,6 @@ again:
>> if (ret)
>> break;
>>
>> -   /* opt_discard */
>> -   if (btrfs_test_opt(root, DISCARD))
>> -   ret = btrfs_error_discard_extent(root, start,
>> -end + 1 - start,
>> -NULL);
>> -
>
>
> While you're here, can you please just delete btrfs_error_discard_extent and
> use btrfs_discard_extent directly?  It's already being used in non-error
> cases, and since it only discards, I don't see how we want to do that on
> errors anyway.

Agreed, the function doesn't make sense and its name is confusing
since it's just an alias to btrfs_discard_extent().
I've sent a separate cleanup patch to remove it:
https://patchwork.kernel.org/patch/5456261/

thanks

>
> -chris
>
>
>>
>> clear_extent_dirty(unpin, start, end, GFP_NOFS);
>> btrfs_error_unpin_extent_range(root, start, end);
>> cond_resched();
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index c2fc261..fc74a9b 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -5728,7 +5728,8 @@ void btrfs_prepare_extent_commit(struct
>> btrfs_trans_handle *trans,
>> update_global_block_rsv(fs_info);
>>  }
>>
>> -static int unpin_extent_range(struct btrfs_root *root, u64 start, u64
>> end)
>> +static int unpin_extent_range(struct btrfs_root *root, u64 start, u64
>> end,
>> + const bool return_free_space)
>>  {
>> struct btrfs_fs_info *fs_info = root->fs_info;
>> struct btrfs_block_group_cache *cache = NULL;
>> @@ -5752,7 +5753,8 @@ static int unpin_extent_range(struct btrfs_root
>> *root, u64 start, u64 end)
>>
>> if (start < cache->last_byte_to_unpin) {
>> len = min(len, cache->last_byte_to_unpin - start);
>> -   btrfs_add_free_space(cache, start, len);
>> +   if (return_free_space)
>> +   btrfs_add_free_space(cache, start, len);
>> }
>>
>> start += len;
>> @@ -5816,7 +5818,7 @@ int btrfs_finish_extent_commit(struct
>> btrfs_trans_handle *trans,
>>end + 1 - start, NULL);
>>
>> clear_extent_dirty(unpin, start, end, GFP_NOFS);
>> -   unpin_extent_range(root, start, end);
>> +   unpin_extent_range(root, start, end, true);
>> cond_resched();
>> }
>>
>> @@ -9705,7 +9707,7 @@ out:
>>
>>  int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start,
>> u64 end)
>>  {
>> -   return unpin_extent_range(root, start, end);
>> +   return unpin_extent_range(root, start, end, false);
>>  }
>>
>>  int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
>> --
>> 2.1.3
>>
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


[PATCH] Btrfs: remove non-sense btrfs_error_discard_extent() function

2014-12-08 Thread Filipe Manana
It doesn't do anything special, it just calls btrfs_discard_extent(),
so just remove it.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/ctree.h|  4 ++--
 fs/btrfs/extent-tree.c  | 10 ++
 fs/btrfs/free-space-cache.c |  4 ++--
 3 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c7e5f2a..399a4e0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3429,8 +3429,8 @@ void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
 u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
 int btrfs_error_unpin_extent_range(struct btrfs_root *root,
   u64 start, u64 end);
-int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
-  u64 num_bytes, u64 *actual_bytes);
+int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
+u64 num_bytes, u64 *actual_bytes);
 int btrfs_force_chunk_alloc(struct btrfs_trans_handle *trans,
struct btrfs_root *root, u64 type);
 int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fc74a9b..18b63acc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1891,8 +1891,8 @@ static int btrfs_issue_discard(struct block_device *bdev,
return blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
 }
 
-static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
-   u64 num_bytes, u64 *actual_bytes)
+int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
+u64 num_bytes, u64 *actual_bytes)
 {
int ret;
u64 discarded_bytes = 0;
@@ -9710,12 +9710,6 @@ int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
return unpin_extent_range(root, start, end, false);
 }
 
-int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
-  u64 num_bytes, u64 *actual_bytes)
-{
-   return btrfs_discard_extent(root, bytenr, num_bytes, actual_bytes);
-}
-
 int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range)
 {
struct btrfs_fs_info *fs_info = root->fs_info;
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index edf32c5..d6c03f7 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2966,8 +2966,8 @@ static int do_trimming(struct btrfs_block_group_cache *block_group,
spin_unlock(&block_group->lock);
spin_unlock(&space_info->lock);
 
-   ret = btrfs_error_discard_extent(fs_info->extent_root,
-start, bytes, &trimmed);
+   ret = btrfs_discard_extent(fs_info->extent_root,
+  start, bytes, &trimmed);
if (!ret)
*total_trimmed += trimmed;
 
-- 
2.1.3



Re: [PATCH v2] Btrfs: fix fs corruption on transaction abort if device supports discard

2014-12-08 Thread Chris Mason

On Sun, Dec 7, 2014 at 4:31 PM, Filipe Manana  wrote:
When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl
call can see those ranges in a block group's free space cache and perform
a discard on the ranges, so the same issue can happen without "-o discard"
as well.

This causes corruption, affecting one or multiple btree nodes (in the
worst case leaving the fs unmountable) because some of those ranges (the
ones in the fs_info->pinned_extents tree) correspond to btree nodes/leafs
that are referred by the last committed super block - breaking the rule
that anything that was committed by a transaction is untouched until the
next transaction commits successfully.


This is great work Filipe, thank you!



diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7e2405a..ce65b0c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4134,12 +4134,6 @@ again:
if (ret)
break;

-   /* opt_discard */
-   if (btrfs_test_opt(root, DISCARD))
-   ret = btrfs_error_discard_extent(root, start,
-end + 1 - start,
-NULL);
-


While you're here, can you please just delete 
btrfs_error_discard_extent and use btrfs_discard_extent directly?  It's 
already being used in non-error cases, and since it only discards, I 
don't see how we want to do that on errors anyway.


-chris



clear_extent_dirty(unpin, start, end, GFP_NOFS);
btrfs_error_unpin_extent_range(root, start, end);
cond_resched();
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c2fc261..fc74a9b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5728,7 +5728,8 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,

update_global_block_rsv(fs_info);
 }

-static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
+static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end,
+ const bool return_free_space)
 {
struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_block_group_cache *cache = NULL;
@@ -5752,7 +5753,8 @@ static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)


if (start < cache->last_byte_to_unpin) {
len = min(len, cache->last_byte_to_unpin - start);
-   btrfs_add_free_space(cache, start, len);
+   if (return_free_space)
+   btrfs_add_free_space(cache, start, len);
}

start += len;
@@ -5816,7 +5818,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,

   end + 1 - start, NULL);

clear_extent_dirty(unpin, start, end, GFP_NOFS);
-   unpin_extent_range(root, start, end);
+   unpin_extent_range(root, start, end, true);
cond_resched();
}

@@ -9705,7 +9707,7 @@ out:

 int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)

 {
-   return unpin_extent_range(root, start, end);
+   return unpin_extent_range(root, start, end, false);
 }

 int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
--
2.1.3





Re: Possible to undo subvol delete?

2014-12-08 Thread Austin S Hemmelgarn

On 2014-12-05 13:11, Shriramana Sharma wrote:

OK, so from https://forums.opensuse.org/showthread.php/440209-ifconfig
I learnt that it's because /sbin, /usr/sbin etc. are not on the normal
user's path on openSUSE (they are on Kubuntu). Adding them to PATH
fixes the situation. (I wasn't even able to run ifconfig without giving
the password. No idea why this is the openSUSE default...)
Probably because OpenSUSE/SLES are designed as enterprise distributions, 
and their primary use case is having a very small number of sysadmins 
and a potentially large number of normal users.  Ubuntu et al. are 
designed primarily for PCs, where everyone is assumed to be equivalent 
to an administrator.  Personally, I prefer a somewhat hybrid approach 
where everyone has *sbin in their path, but file permissions are used to 
control what non-administrators can run.







Re: Why is the actual disk usage of btrfs considered unknowable?

2014-12-08 Thread Chris Murphy
On Sun, Dec 7, 2014 at 1:32 PM,   wrote:
>
> I disagree.  My experiences with other file-systems, including ZFS, show
> that the most common solution is to just deliver to the user the actual
> amount of unused disk space.  Anything else changes this known value into
> a guess or prediction.

What is the "actual amount of unused disk space" in a 2x 8GB drives
mirror? Very literally, it's 16GB. It's a convenience subtracting the
space used for replication (the n mirror copies, or parity). This is
in fact how df reported Btrfs volumes with kernel 3.16 and older.

A ZFS mirror vdev doesn't work this way; it reports available space as
8GB. The level of replication and number of devices is a function of
the vdev, and is fixed. It can't be changed. With Btrfs there isn't a
zpool vs vdev type of distinction, and the replication level isn't a
function of the volume but rather of its chunks. At some future point
there will be a way to supply a hint (per subvolume, maybe per
directory or per file) for the allocator to put the file in a
particular chunk which has a particular level of replication and
number of devices. And that means "available space" isn't knowable.
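
For example (numbers purely illustrative): on that same 2x 8GB pair,
writing 3GB of files into RAID1 chunks consumes 6GB of raw space, while
writing them into single-profile chunks consumes only 3GB; and 10GB of
still-unallocated raw space means 5GB of "available" if future chunks
are RAID1, but 10GB if they are single profile.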


-- 
Chris Murphy


Re: Why is the actual disk usage of btrfs considered unknowable?

2014-12-08 Thread ashford
>
>  Original Message 
> Subject: Re: Why is the actual disk usage of btrfs considered unknowable?
> From: 
> To: 
> Date: 2014-12-08 08:12
>> Goffredo,
>>
>>> So in case you have a raid1 filesystem on two disks; each disk has
>>> 300GB
>>> free; which is the free space that you expected: 300GB or 600GB and why
>>> ?
>> You should see 300GB free.  That's what you'll see with RAID-1 with a
>> hardware RAID controller, and with MD RAID.  Why would you expect to see
>> anything else with BTRFS RAID?
>>
>> Peter Ashford
> Yeah, you pointed out the real problem here:
>
> [DIFFERENT RESULT FROM DIFFERENT VIEW]
> Seen from a *PURE ON-DISK* usage point of view, it is still 600G, no
> matter what level of RAID.
> Seen from a *BLOCK LEVEL RAID1* usage point of view, it is 300G. If a
> fs (not btrfs) is built on BLOCK LEVEL RAID1, then the *FILESYSTEM*
> usage will also be 300G.
>
> [BTRFS DOES NOT BELONG TO ANY TYPE]
> But btrfs is neither pure block-level management (that would be MD,
> HW RAID or LVM), nor a traditional filesystem!

For the purposes of reporting free space, it is reasonable to assume that
the default structure will be used.  If the default for the volume or
subvolume is RAID-1, then that should be used for 'df' output.  Obviously,
the same should be done for other RAID levels.

> So the root of the problem is that btrfs mixes block-level management
> and filesystem-level management, which makes everything hard to
> understand.
> You can't treat btrfs raid1 as a complete block-level raid1, due to
> its flexibility in having different metadata/data profiles.

It will have the same discrepancies as other file-systems with
compression, plus a few more of its own, due to chunking.  If the
file-system can't give a completely accurate answer, it should give one
that makes sense.

> If the vanilla df command shows filesystem-level free space, then
> btrfs won't give an accurate one.
>
> [ONLY PREDICTABLE CASE]
> For the 300Gx2 case, you can only consider btrfs to have 300G of free
> space if you can ensure that there was, is and will be only RAID1
> data/metadata stored on it (also ignoring the small space usage from
> CoW).

I disagree.  You can consider the RAID structure to be whatever the
default structure is.  If the default is RAID-1, then that structure
should be used to compute the free space for 'df'.  The user should
understand that by explicitly requesting a different RAID structure,
different amounts of space will be used.

> [RELIABLE DATA IS ON-DISK USAGE]
> Only pure on-disk level usage is *somewhat* reliable. There is still
> the problem of unbalanced metadata/data chunk allocation (e.g. all
> space is allocated for data, leaving no space for metadata CoW
> writes).

I agree.  Unused disk space isn't always available to be used by data. 
Sometimes it's reserved for metadata of one sort or another, and sometimes
it's too small to be of use.  In addition, BTRFS sometimes (with small
files) uses the Metadata chunks for data.  Yes, it's a complex problem. 
There is no simple solution that will make everyone happy.

-

As for the 'df' output, I believe that the default should be the sum of
free space in data chunks, free space in metadata chunks and unallocated
space, ignoring any amounts that are small enough that BTRFS won't use
them, and adjusted for the RAID level of the volume/subvolume.
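
As a rough sketch of that computation (the function and its inputs are
invented for illustration, not taken from btrfs-progs):

/* Estimate an 'available' figure for df as described above: space
 * already inside data/metadata chunks is counted as-is, and raw
 * unallocated space is divided by the number of copies the default
 * profile would make. Amounts too small to be usable are ignored
 * here for brevity. */
static unsigned long long estimate_available(
		unsigned long long data_chunk_free,     /* free space inside data chunks */
		unsigned long long metadata_chunk_free, /* free space inside metadata chunks */
		unsigned long long unallocated_raw,     /* raw bytes not assigned to any chunk */
		unsigned int raid_copies)               /* 2 for RAID-1/DUP, 1 for single */
{
	return data_chunk_free + metadata_chunk_free +
	       unallocated_raw / raid_copies;
}

With the earlier two-disk RAID-1 example (600GB of raw space free and
unallocated, raid_copies = 2), this yields the expected 300GB.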

While it's possible to generate other values that will make sense for
specific cases, it's not possible to create one value that is correct in
all cases.

If it's not possible to be absolutely correct, considering every usage (or
even the most common usages), a 'reasonable' value should be returned. 
That reasonable value should be based on the default volume/subvolume
settings, including RAID levels and any space limits that may exist on the
volume or subvolume.  It should neither be the most optimistic nor the
most pessimistic.

Peter Ashford
