Re: Transactional btrfs

2018-09-08 Thread Martin Raiber
Am 08.09.2018 um 18:24 schrieb Adam Borowski:
> On Thu, Sep 06, 2018 at 06:08:33AM -0400, Austin S. Hemmelgarn wrote:
>> On 2018-09-06 03:23, Nathan Dehnel wrote:
>>> So I guess my question is, does btrfs support atomic writes across
>>> multiple files? Or is anyone interested in such a feature?
>>>
>> I'm fairly certain that it does not currently, but in theory it would not be
>> hard to add.
>>
>> Realistically, the only cases I can think of where cross-file atomic
>> _writes_ would be of any benefit are database systems.
>>
>> However, if this were extended to include rename, unlink, touch, and a
>> handful of other VFS operations, then I can easily think of a few dozen use
>> cases.  Package managers in particular would likely be very interested in
>> being able to atomically rename a group of files as a single transaction, as
>> it would make their job _much_ easier.
> I wonder, what about:
> sync; mount -o remount,commit=999,flushoncommit
> eatmydata apt dist-upgrade
> sync; mount -o remount,commit=30,noflushoncommit
>
> Obviously, this gets fooled by fsyncs, and makes the transaction affect the
> whole system (if you have unrelated writes they won't get committed until
> the end of transaction).  Then there are nocow files, but you already made
> the decision to disable most features of btrfs for them.
>
> So unless something forces a commit, this should already work, giving
> cross-file atomic writes, renames and so on.

Now combine this with a snapshot of root, then on success rename-exchange it
back to root and you are there.
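
Roughly, as a sketch (the layout and names here are invented, and it
substitutes 'btrfs subvolume set-default' for the rename exchange, so the
swap only takes effect on the next mount):

  # assume the live root is the subvolume 'root' under the top-level subvolume
  mount -o subvolid=5 /dev/sdX /mnt/top
  btrfs subvolume snapshot /mnt/top/root /mnt/top/root-tx    # writable snapshot
  # ... apply the whole "transaction" inside /mnt/top/root-tx ...
  if run_transaction /mnt/top/root-tx; then                  # placeholder command
      btrfs subvolume set-default /mnt/top/root-tx           # becomes / on next mount
  else
      btrfs subvolume delete /mnt/top/root-tx                # cheap rollback
  fi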

Btrfs had in the past TRANS_START and TRANS_END ioctls (for ceph, I
think), but no rollback (and therefore no error handling incl. ENOSPC).

If you want to look at a working file system transaction mechanism, you
should look at transactional NTFS (TxF). Microsoft writes that it is
deprecated, so it's perhaps not very widely used. Windows uses it for
updates, I think.

Specifically for btrfs, the problem would be that it really needs to
support multiple simultaneous writers, otherwise one transaction can
block the whole system.




Re: Transactional btrfs

2018-09-08 Thread Adam Borowski
On Thu, Sep 06, 2018 at 06:08:33AM -0400, Austin S. Hemmelgarn wrote:
> On 2018-09-06 03:23, Nathan Dehnel wrote:
> > So I guess my question is, does btrfs support atomic writes across
> > multiple files? Or is anyone interested in such a feature?
> > 
> I'm fairly certain that it does not currently, but in theory it would not be
> hard to add.
> 
> Realistically, the only cases I can think of where cross-file atomic
> _writes_ would be of any benefit are database systems.
> 
> However, if this were extended to include rename, unlink, touch, and a
> handful of other VFS operations, then I can easily think of a few dozen use
> cases.  Package managers in particular would likely be very interested in
> being able to atomically rename a group of files as a single transaction, as
> it would make their job _much_ easier.

I wonder, what about:
sync; mount -o remount,commit=999,flushoncommit
eatmydata apt dist-upgrade
sync; mount -o remount,commit=30,noflushoncommit

Obviously, this gets fooled by fsyncs, and makes the transaction affect the
whole system (if you have unrelated writes they won't get committed until
the end of transaction).  Then there are nocow files, but you already made
the decision to disable most features of btrfs for them.

So unless something forces a commit, this should already work, giving
cross-file atomic writes, renames and so on.
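
Spelled out as a single throwaway script (mount point, commit interval and
the package command are just the example values from above; eatmydata is
only there to turn the package tools' fsync() calls into no-ops):

  #!/bin/sh
  set -e
  MNT=/                                    # filesystem holding the package database
  sync
  mount -o remount,commit=999,flushoncommit "$MNT"
  eatmydata apt dist-upgrade               # nothing forces a commit meanwhile
  sync                                     # write the whole "transaction" out
  mount -o remount,commit=30,noflushoncommit "$MNT"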


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


Re: Transactional btrfs

2018-09-08 Thread Adam Borowski
On Sat, Sep 08, 2018 at 08:45:47PM +, Martin Raiber wrote:
> Am 08.09.2018 um 18:24 schrieb Adam Borowski:
> > On Thu, Sep 06, 2018 at 06:08:33AM -0400, Austin S. Hemmelgarn wrote:
> >> On 2018-09-06 03:23, Nathan Dehnel wrote:
> >>> So I guess my question is, does btrfs support atomic writes across
> >>> multiple files? Or is anyone interested in such a feature?
> >>>
> >> I'm fairly certain that it does not currently, but in theory it would not
> >> be hard to add.

> >> However, if this were extended to include rename, unlink, touch, and a
> >> handful of other VFS operations, then I can easily think of a few dozen use
> >> cases.  Package managers in particular would likely be very interested in
> >> being able to atomically rename a group of files as a single transaction,
> >> as it would make their job _much_ easier.

> > I wonder, what about:
> > sync; mount -o remount,commit=999,flushoncommit
> > eatmydata apt dist-upgrade
> > sync; mount -o remount,commit=30,noflushoncommit
> >
> > Obviously, this gets fooled by fsyncs, and makes the transaction affect the
> > whole system (if you have unrelated writes they won't get committed until
> > the end of transaction).  Then there are nocow files, but you already made
> > the decision to disable most features of btrfs for them.

> Now combine this with a snapshot of root, then on success rename-exchange it
> back to root and you are there.

No need: no unsuccessful transactions ever get written to the disk.
(Not counting unreachable stuff.)

> Btrfs had in the past TRANS_START and TRANS_END ioctls (for ceph, I
> think), but no rollback (and therefore no error handling incl. ENOSPC).
> 
> If you want to look at a working file system transaction mechanism, you
> should look at transactional NTFS (TxF). Microsoft writes that it is
> deprecated, so it's perhaps not very widely used. Windows uses it for
> updates, I think.

You're talking about multiple simultaneous transactions; those have a massive
complexity cost.  And btrfs is already ridiculously complex.  I don't really
see a good way to tie this with the POSIX API without some serious
rethinking.

dpkg can already recover from a properly returned error (although not as
nicely as a full rollback); what is fatal for it is having its status
database corrupted/out of sync.  That's why it does a multiple fsync dance
and keeps fully rewriting its files over and over and over.
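
The pattern it relies on is roughly this (a sketch only, with hypothetical
paths and a placeholder update step -- not dpkg's actual code):

  db=/var/lib/example/status         # hypothetical status database
  cp -a "$db" "$db.new"              # rewrite the whole file under a new name
  update_db "$db.new"                # placeholder for the actual rewrite
  sync "$db.new"                     # fsync the new copy before renaming
  ln -f "$db" "$db.old"              # keep the old version reachable
  mv "$db.new" "$db"                 # atomically rename the new copy into place
  sync -f "$db"                      # crude stand-in for fsyncing the directory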

Atomic operations are pretty useful even without a rollback: you still need
to be able to handle failure, but not a crash.

> Specifically for btrfs, the problem would be that it really needs to
> support multiple simultaneous writers, otherwise one transaction can
> block the whole system.

My dirty hack above doesn't suffer from such a block: it only suffers from
compromising durability of concurrent writers.  During that userspace
transaction, there are no commits until it finishes; this means that if
there's unrelated activity it may suffer from losing writes that were done
between transaction start and crash.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


Re: btrfs send hung in pipe_wait

2018-09-08 Thread Chris Murphy
I don't see any blocked tasks. I wonder if you were too fast with
sysrq-w? Maybe it takes a little bit for the blocked task to actually
develop?

I suggest also 'btrfs check --mode=lowmem' because that is a separate
implementation of btrfs check that tends to catch different things
than the original. It is slow, however.
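
Roughly (with /dev/sdX standing in for the actual device; lowmem mode wants
the filesystem unmounted):

  echo w > /proc/sysrq-trigger            # dump blocked tasks once the hang shows
  dmesg | tail -n 200                     # or journalctl -k, to capture the output
  btrfs check --mode=lowmem /dev/sdX      # read-only, but slow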

-- 
Chris Murphy


Re: Does ssd auto detected work for microSD cards?

2018-09-08 Thread GWB
Great, thank you.  That would make sense, but I might have to specify
something for the mmcblk devices.

Here is the terminal output when the MicroSD is inserted into the USB 3 card reader:

$ mount | grep btrfs
/dev/sdb3 on / type btrfs (rw,ssd,subvol=@)
/dev/sdb3 on /home type btrfs (rw,ssd,subvol=@home)
/dev/sdd1 on /media/gwb09/btrfs-32G-MicroSDc type btrfs
(rw,nosuid,nodev,uhelper=udisks2)

$ cat /sys/block/sdd/queue/rotational
1

Now the same MicroSD in the SD slot on the computer:

$ mount | grep btrfs
/dev/sdb3 on / type btrfs (rw,ssd,subvol=@)
/dev/sdb3 on /home type btrfs (rw,ssd,subvol=@home)
/dev/mmcblk0p1 on /media/gwb09/btrfs-32G-MicroSDc type btrfs
(rw,nosuid,nodev,uhelper=udisks2)

$ cat /sys/block/mmcblk0/queue/rotational
0

So Ubuntu 14 knows the mmcblk is non-rotational.  It also looks as if
the block device has a partition table of some sort, judging by the
existence of:

/sys/block/mmcblk0/mmcblk0p1

I will see what happens after I install Ubuntu 18.  I probably
specified the mount options for /dev/sdb in /etc/fstab by using a
UUID.  I'll probably tweak the ssd mounts, using ssd_spread, ssd, etc.
at some point.
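
If the ssd heuristic still guesses wrong for the card, the option can simply
be forced; a hypothetical fstab line (UUID and mountpoint are placeholders,
and ssd_spread implies ssd):

  UUID=<card-uuid>  /media/microsd  btrfs  defaults,ssd_spread  0  0
  # one-off alternative:  mount -o ssd /dev/mmcblk0p1 /media/microsd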

I've been using nilfs2 for this, but it occurs to me that btrfs will
have more support on more platforms and on more OS's.  There's also a
mounting issue for nilfs2 in Ubuntu 14 which prevents the nilfs-clean
daemon from starting.  I will see if F2FS is in the kernel of the
other machines here.

No complaints here, just gratitude, for the money, time and effort on
the part of tech firms that support and develop btrfs.  I think Oracle
developed the first blueprints for btrfs, but I might be wrong.
Oracle also, of course, caught vast amounts of flak from some of the
open source zfs devs for changing the dev model after buying Sun.  But
I have no idea what parts of Sun would have survived without a buyer.

Gordon
On Mon, Sep 3, 2018 at 11:22 PM Chris Murphy  wrote:
>
> On Mon, Sep 3, 2018 at 7:53 PM, GWB  wrote:
> > Curious instance here, but perhaps this is the expected behaviour:
> >
> > mount | grep btrfs
> > /dev/sdb3 on / type btrfs (rw,ssd,subvol=@)
> > /dev/sdb3 on /home type btrfs (rw,ssd,subvol=@home)
> > /dev/sde1 on /media/gwb09/btrfs-32G-MicroSDc type btrfs
> > (rw,nosuid,nodev,uhelper=udisks2)
> >
> > This is on an Ubuntu 14 client.
> >
> > /dev/sdb is indeed an ssd, a Samsung 850 EVO 500Gig, where Ubuntu runs
> > on btrfs root.   It appears btrfs did indeed auto-detect an ssd
> > drive.   However:
> >
> > /dev/sde is a micro SD card (32Gig Samsung) sitting in a USB 3 card
> > reader, inserted into a USB 3 card slot.  But ssd is not detected.
> >
> > So is that the expected behavior?
>
> cat /sys/block/sde/queue/rotational
>
> That's what Btrfs uses for detection. I'm willing to bet the SD Card
> slot is not using the mmc driver but USB instead, and is therefore always
> treated as a rotational device.
>
>
> > If not, does it make a difference?
> >
> > Would it be best to mount an sd card with ssd_spread?
>
> For the described use case, it probably doesn't make much of a
> difference. It sounds like these are fairly large contiguous files,
> ZFS send files.
>
> I think for both SDXC and eMMC, F2FS is probably more applicable
> overall than Btrfs due to its reduced wandering trees problem. But
> again for your use case it may not matter much.
>
>
>
> > Yet another side note: both btrfs and zfs are now "housed" at Oracle
> > (and most of java, correct?).
>
> Not really. The ZFS we care about now is OpenZFS, forked from Oracle's
> ZFS. And a bunch of people not related to Oracle do that work. And
> Btrfs has a wide assortment of developers: Facebook, SUSE, Fujitsu,
> Oracle, and more.
>
>
> --
> Chris Murphy


Fwd: btrfs send hung in pipe_wait

2018-09-08 Thread Stefan Loewen
And this one as well.

-- Forwarded message -
From: Chris Murphy 
Date: Fr., 7. Sep. 2018 um 23:53 Uhr
Subject: Re: btrfs send hung in pipe_wait
To: Stefan Loewen 
Cc: Chris Murphy 


On Fri, Sep 7, 2018 at 3:19 PM, Stefan Loewen  wrote:
> Now I also tested with Fedora 28 (linux 4.16) from live-usb (so baremetal).
> Same result.
> Thanks for the pointer towards sys requests. sysrq-w is empty, but I
> sent a bunch of other sysrqs to get stacktraces etc. from the kernel.
> Logs are attached.

Needs a dev to take a look at this, or someone with usb or block
device knowledge to see if something unrelated to btrfs is hung up. I
can't parse it.

Using btrfs-progs 4.17.1, what do you get for 'btrfs check
--mode=lowmem ' ?



--
Chris Murphy


Re: List of known BTRFS Raid 5/6 Bugs?

2018-09-08 Thread Duncan
Stefan K posted on Fri, 07 Sep 2018 15:58:36 +0200 as excerpted:

> sorry for disturbing this discussion,
> 
> are there any plans/dates to fix the raid5/6 issue? Is somebody working
> on this issue? Because this is for me one of the most important things for
> a fileserver; with a raid1 config I lose too much diskspace.

There's a more technically complete discussion of this in at least two 
earlier threads you can find on the list archive, if you're interested, 
but here's the basics (well, extended basics...) from a btrfs-using-
sysadmin perspective.

"The raid5/6 issue" can refer to at least three conceptually separate 
issues, with different states of solution maturity:

1) Now generally historic bugs in btrfs scrub, etc, that are fixed (thus 
the historic) in current kernels and tools.  Unfortunately these will 
still affect many users of longer-term stale^H^Hble distros who don't 
update from other sources, because the raid56 feature wasn't yet stable 
at the lock-in time for whatever versions they stabilized on, so they're 
not likely to get the fixes, as it's new-feature material.

If you're using a current kernel and tools, however, this issue is 
fixed.  You can look on the wiki for the specific versions, but with the 
4.18 kernel the current latest stable, it and 4.17 (plus the matching 
tools versions, since the version numbers are synced) are the two latest 
release series, which are the best supported and considered "current" on 
this list.

Also see...

2) General feature maturity:  While raid56 mode should be /reasonably/ 
stable now, it remains one of the newer features and simply hasn't yet 
had the testing of time that tends to flush out the smaller and corner-
case bugs that more mature features such as raid1 have now had the 
benefit of.

There's nothing to do for this but test, report any bugs you find, and 
wait for the maturity that time brings.

Of course this is one of several reasons we so strongly emphasize and 
recommend "current" on this list, because even for reasonably stable and 
mature features such as raid1, btrfs itself remains new enough that they 
still occasionally get latent bugs found and fixed, and while /some/ of 
those fixes get backported to LTS kernels (with even less chance for 
distros to backport tools fixes), not all of them do and even when they 
do, current still gets the fixes first.

3) The remaining issue is the infamous parity-raid write-hole that 
affects all parity-raid implementations (not just btrfs) unless they take 
specific steps to work around the issue.

The first thing to point out here again is that it's not btrfs-specific.  
Between that and the fact that it *ONLY* affects parity-raid operating in 
degraded mode *WITH* an ungraceful-shutdown recovery situation, it could 
be argued not to be a btrfs issue at all, but rather one inherent to 
parity-raid mode, and an acceptable risk to those choosing parity-raid, 
because it's only a factor when operating degraded and an ungraceful 
shutdown does occur.

But btrfs' COW nature along with a couple technical implementation 
factors (the read-modify-write cycle for incomplete stripe widths and how 
that risks existing metadata when new metadata is written) does amplify 
the risk somewhat compared to that seen with the same write-hole issue in 
various other parity-raid implementations that don't avoid it due to 
taking write-hole avoidance countermeasures.


So what can be done right now?

As it happens there is a mitigation the admin can currently take -- btrfs 
allows specifying data and metadata modes separately, and even where 
raid1 loses too much space to be used for both, it's possible to specify 
data as raid5/6 and metadata as raid1.  While btrfs raid1 only covers 
loss of a single device, it doesn't have the parity-raid write-hole, as 
it's not parity-raid.  For most use-cases, at least, raid1 for metadata 
with raid5 for data should strictly limit both risks: the parity-raid 
write-hole is limited to data, which in most cases will be full-stripe 
writes and thus not subject to the problem, and the size-doubling of 
raid1 is limited to metadata.
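
Concretely, a sketch of that split (devices and mountpoint are placeholders):

  # new filesystem: data raid5, metadata raid1
  mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd
  # or convert an existing filesystem in place
  btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt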

Meanwhile, arguably, for a sysadmin properly following the sysadmin's 
first rule of backups (the true value of data isn't defined by arbitrary 
claims, but by the number of backups it is considered worth the 
time/trouble/resources to have of that data), this is a known parity-raid 
risk specifically limited to the corner-case of having an ungraceful 
shutdown *WHILE* already operating degraded.  As such, it can be managed 
along with all the other known risks to the data, including admin 
fat-fingering, the risk that more devices will go out than the array can 
tolerate, the risk of general bugs affecting the filesystem or other 
storage-function related code, etc.
IOW, in the context of the admin's first rule of backups, no matter the 
issue, 

Fwd: btrfs send hung in pipe_wait

2018-09-08 Thread Stefan Loewen
Oops. Forgot CCing the mailinglist

-- Forwarded message -
From: Stefan Loewen 
Date: Fr., 7. Sep. 2018 um 23:19 Uhr
Subject: Re: btrfs send hung in pipe_wait
To: Chris Murphy 


No, it does not only happen in VirtualBox. I already tested the following:
- Manjaro baremetal (btrfs-progs v4.17.1; linux v4.18.5 and v4.14.67)
- Ubuntu 18.04 in VirtualBox (btrfs-progs v4.15.1; linux v4.15.0-33-generic)
- ArchLinux in VirtualBox (btrfs-progs v4.17.1; linux v4.18.5-arch1-1-ARCH)
The logs I posted until now were mostly (iirc all of them) from the VM
with Arch.

Now I also tested with Fedora 28 (linux 4.16) from live-usb (so baremetal).
Same result.
Thanks for the pointer towards sys requests. sysrq-w is empty, but I
sent a bunch of other sysrqs to get stacktraces etc. from the kernel.
Logs are attached.

To recap:
I copied (reflink) a 3.8G iso file from a read-only snapshot (A) into
a new subvol (B), just to keep things small and manageable.
There's nothing special about this file other than that it happens to
be one of the files that trigger the later btrfs-send hang.
Not all files from A do this, but there are definitely multiple, and
the problem only occurs when the files are reflinked.
I then create a snapshot (C) of B to be able to btrfs-send it.
Then I run "btrfs send snap-C > /somewhere" (--no-data leads to the
same); that process reads some MB from the source disk, writes a few
bytes to /somewhere, and then just hangs without any further IO.
This is where I issued some sysrqs. The output is attached.
Then I tried killing the btrfs-send with ctrl-c and issued the sysrqs
again. I have no idea if that changed anything, but it didn't hurt, so
why not.
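
As a shell-level sketch of those steps (paths are placeholders for my real
ones):

  btrfs subvolume create /mnt/B
  cp --reflink=always /mnt/A/some.iso /mnt/B/      # A is the read-only snapshot
  btrfs subvolume snapshot -r /mnt/B /mnt/C        # read-only, so it can be sent
  btrfs send /mnt/C > /somewhere/C.stream          # hangs after a few MB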

I'll try to minimize the dataset and maybe get a small fs-image
without too much personal information that I can upload so the issue
is reproducible by others.

Am Fr., 7. Sep. 2018 um 21:17 Uhr schrieb Chris Murphy
:
>
> On Fri, Sep 7, 2018 at 11:07 AM, Stefan Loewen  wrote:
> > List of steps:
> > - 3.8G iso lays in read-only subvol A
> > - I create subvol B and reflink-copy the iso into it.
> > - I create a read-only snapshot C of B
> > - I "btrfs send --no-data C > /somefile"
> > So you got that right, yes.
>
> OK I can't reproduce it. Sending A and C complete instantly with
> --no-data, and complete in the same time with a full send/receive. In
> my case I used a 4.9G ISO.
>
> I can't think of what local difference accounts for what you're
> seeing. There is really nothing special about --reflinks. The extent
> and csum data are identical to the original file, and that's the bulk
> of the metadata for a given file.
>
> What I can tell you is usually the developers want to see sysrq+w
> whenever there are blocked tasks.
> https://fedoraproject.org/wiki/QA/Sysrq
>
> You'll want to enable all sysrq functions. And next you'll want three
> ssh shells:
>
> 1. sudo journalctl -fk
> 2. sudo -i to become root, and then echo w > /proc/sysrq-trigger but
> do not hit return yet
> 3. sudo btrfs send... to reproduce the problem.
>
> Basically the thing is gonna hang soon after you reproduce the
> problem, so you want to get to shell #2 and just hit return rather
> than dealing with long delays typing that echo command out. And then
> the journal command is so your local terminal captures the sysrq
> output because you're gonna kill the VM instead of waiting it out. I
> have no idea how to read these things but someone might pick up this
> thread and have some idea why these tasks are hanging.
>
>
>
>
> >
> > Unfortunately I don't have any way to connect the drive to a SATA port
> > directly but I tried to switch out as much of the used setup as
> > possible (all changes active at the same time):
> > - I got the original (not the clone) HDD out of the enclosure and used
> > this adapter to connect it:
> > https://www.amazon.de/DIGITUS-Adapterkabel-40pol-480Mbps-schwarz/dp/B007X86VZK
> > - I used a different Notebook
> > - I ran the test natively on that notebook (instead of from
> > VirtualBox. I used VirtualBox for most of the tests as I have to
> > force-poweroff the PC every time the btrfs-send hangs, as it is not
> > killable)
>
>
> This problem only happens in VirtualBox? Or it happens on baremetal
> also? And we've established it happens with two different source
> (send) devices, which means two different Btrfs volumes.
>
> All I can say is you need to keep changing things up, process of
> elimination. Rather tedious. Maybe you could try downloading a Fedora
> 28 ISO, make a boot stick out of it, and try to reproduce with the
> same drives. At least that's an easy way to isolate the OS from the
> equation.
>
>
> --
> Chris Murphy


fedora-kernelmsgs.log.gz
Description: application/gzip