Re: System unable to mount partition after a power loss

2018-12-06 Thread Chris Murphy
On Thu, Dec 6, 2018 at 10:24 PM Doni Crosby  wrote:
>
> All,
>
> I'm coming to you to see if there is a way to fix or at least recover
> most of the data I have from a btrfs filesystem. The system went down
> after both a breaker and the battery backup failed. I cannot currently
> mount the system, with the following error from dmesg:
>
> Note: The vda1 is just the entire disk being passed from the VM host
> to the VM it's not an actual true virtual block device

This is qemu-kvm? What's the cache mode being used? It's possible the
usual write guarantees are thwarted by VM caching.
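
If it's libvirt managed, a rough way to check (the domain name here is
just a placeholder):

virsh dumpxml guestname | grep -i cache

The disk <driver> element is where cache='none', 'writeback', 'unsafe'
and so on show up; if nothing is set, the hypervisor default is in
play. 'unsafe' in particular ignores guest flushes entirely.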



> btrfs check --recover also ends in a segmentation fault

I'm not familiar with a --recover option; the --repair option is flagged
with a warning in the man page:
   Warning
   Do not use --repair unless you are advised to do so by a
developer or an experienced user,


> btrfs --version:
> btrfs-progs v4.7.3

That's an old version of progs; I suggest upgrading to 4.17.1 and running:

btrfs insp dump-s -f /device/
btrfs rescue super -v /device/
btrfs check --mode=lowmem /device/

These are all read-only commands. Please post the output to the list;
hopefully a developer will get around to looking at it.

It is safe to try:

mount -o ro,norecovery,usebackuproot /device/ /mnt/

If that works, I suggest updating your backup while it's still possible.


-- 
Chris Murphy


Re: Need help with potential ~45TB dataloss

2018-12-04 Thread Chris Murphy
On Tue, Dec 4, 2018 at 3:09 AM Patrick Dijkgraaf
 wrote:
>
> Hi Chris,
>
> See the output below. Any suggestions based on it?

If they're SATA drives, they may not support SCT ERC; and if they're
SAS, depending on what controller they're behind, smartctl might need
a hint to properly ask the drive for SCT ERC status. The simplest way
to know is to run 'smartctl -x' on one drive, assuming they're all the
same basic make/model other than size.
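
Something like this, with placeholder device name and controller type:

smartctl -x /dev/sda
smartctl -x -d megaraid,0 /dev/sda   # hint needed if the drive sits behind e.g. a MegaRAID controller

If the drive supports it, SCT Error Recovery Control shows up in that
output.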


-- 
Chris Murphy


Re: experiences running btrfs on external USB disks?

2018-12-03 Thread Chris Murphy
On Mon, Dec 3, 2018 at 10:44 PM Tomasz Chmielewski  wrote:
>
> I'm trying to use btrfs on an external USB drive, without much success.
>
> When the drive is connected for 2-3+ days, the filesystem gets remounted
> readonly, with BTRFS saying "IO failure":
>
> [77760.444607] BTRFS error (device sdb1): bad tree block start, want
> 378372096 have 0
> [77760.550933] BTRFS error (device sdb1): bad tree block start, want
> 378372096 have 0
> [77760.550972] BTRFS: error (device sdb1) in __btrfs_free_extent:6804:
> errno=-5 IO failure
> [77760.550979] BTRFS info (device sdb1): forced readonly
> [77760.551003] BTRFS: error (device sdb1) in
> btrfs_run_delayed_refs:2935: errno=-5 IO failure
> [77760.553223] BTRFS error (device sdb1): pending csums is 4096
>
>
> Note that there are no other kernel messages (i.e. that would indicate a
> problem with disk, cable disconnection etc.).
>
> The load on the drive itself can be quite heavy at times (i.e. 100% IO
> for 1-2 h and more) - can it contribute to the problem (i.e. btrfs
> thinks there is some timeout somewhere)?
>
> Running 4.19.6 right now, but was experiencing the issue also with 4.18
> kernels.
>
>
>
> # btrfs device stats /data
> [/dev/sda1].write_io_errs0
> [/dev/sda1].read_io_errs 0
> [/dev/sda1].flush_io_errs0
> [/dev/sda1].corruption_errs  0
> [/dev/sda1].generation_errs  0


Hard to say without a complete dmesg; but errno=-5 IO failure is
pretty much some kind of hardware problem in my experience. I haven't
seen it be a bug.

-- 
Chris Murphy


Re: Ran into "invalid block group size" bug, unclear how to proceed.

2018-12-03 Thread Chris Murphy
On Mon, Dec 3, 2018 at 8:32 PM Mike Javorski  wrote:
>
> Need a bit of advice here ladies / gents. I am running into an issue
> which Qu Wenruo seems to have posted a patch for several weeks ago
> (see https://patchwork.kernel.org/patch/10694997/).
>
> Here is the relevant dmesg output which led me to Qu's patch.
> 
> [   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2
> block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104,
> invalid block group size, have 10804527104 expect (0, 10737418240]
> [   10.032493] BTRFS error (device sdb): failed to read block groups: -5
> [   10.053365] BTRFS error (device sdb): open_ctree failed
> 
>
> This server has a 16 disk btrfs filesystem (RAID6) which I boot
> periodically to btrfs-send snapshots to. This machine is running
> ArchLinux and I had just updated  to their latest 4.19.4 kernel
> package (from 4.18.10 which was working fine). I've tried updating to
> the 4.19.6 kernel that is in testing, but that doesn't seem to resolve
> the issue. From what I can see on kernel.org, the patch above is not
> pushed to stable or to Linus' tree.
>
> At this point the question is what to do. Is my FS toast? Could I
> revert to the 4.18.10 kernel and boot safely? I don't know if the 4.19
> boot process may have flipped some bits which would make reverting
> problematic.

That patch is not yet merged in linux-next, so to use it you'd need to
apply it yourself and compile a kernel. I can't tell for sure whether
it'd help.

But, the less you change the file system, the better chance of saving
it. I have no idea why there'd be a corrupt leaf just due to a kernel
version change, though.

Needless to say, raid56 just seems fragile once it runs into any kind
of trouble. I personally wouldn't boot off it at all. I would only
mount it from another system, ideally an installed system but a live
system with the kernel versions you need would also work. That way you
can get more information without changes, and booting will almost
immediately mount rw, if mount succeeds at all, and will write a bunch
of changes to the file system.

Whether it's a case of 4.18.10 not detecting corruption that 4.19
sees, or if 4.19 already caused it, the best chance is to not mount it
rw, and not run check --repair, until you get some feedback from a
developer.

The things I'd like to see are:
# btrfs rescue super -v /anydevice/
# btrfs insp dump-s -f /anydevice/

The first command will tell us if all the supers are the same and valid
across all devices. The second one, assuming it's pointed at a device
with a valid super, will tell us if there's a log root value other than
0. Both of those are read-only commands.


-- 
Chris Murphy


Re: Need help with potential ~45TB dataloss

2018-12-03 Thread Chris Murphy
Also useful information for autopsy, perhaps not for fixing, is to
know whether the SCT ERC value for every drive is less than the
kernel's SCSI driver block device command timeout value. It's super
important that the drive reports an explicit read failure before the
read command is considered failed by the kernel. If the drive is still
trying to do a read, and the kernel command timer times out, it'll
just do a reset of the whole link and we lose the outcome for the
hanging command. Upon explicit read error only, can Btrfs, or md RAID,
know what device and physical sector has a problem, and therefore how
to reconstruct the block, and fix the bad sector with a write of known
good data.

smartctl -l scterc /device/
and
cat /sys/block/sda/device/timeout

Only if SCT ERC is enabled with a value below 30, or if the kernel
command timer is changed to be well above 30 (like 180, which is
absolutely crazy but a separate conversation), can we be sure there
haven't just been link resets going on for a while. Those resets
prevent bad sectors from being fixed up all along, and can contribute
to the problem. This comes up on the linux-raid (mainly md driver)
list all the time, and it contributes to lost RAID arrays all the
time. And arguably it leads to unnecessary data loss in even the
single device desktop/laptop use case as well.
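
A sketch of the two ways to get into a sane configuration (device name
is a placeholder and the values are the commonly used ones, not
anything specific to this system):

smartctl -l scterc,70,70 /dev/sdX          # 7 second read/write recovery limit, if the drive supports SCT ERC
echo 180 > /sys/block/sdX/device/timeout   # otherwise raise the kernel command timer instead

Note the scterc setting normally doesn't survive a power cycle, so it
has to be reapplied at boot (udev rule or a oneshot service).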


Chris Murphy


Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Chris Murphy
On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton
 wrote:
>
> Le 03/12/2018 à 20:56, Lionel Bouton a écrit :
> > [...]
> > Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> > tuning of the io queue (switching between classic io-schedulers and
> > blk-mq ones in the virtual machines) and BTRFS mount options
> > (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> > in mount time (I managed to reduce the mount of IO requests
>
> Sent to quickly : I meant to write "managed to reduce by half the number
> of IO write requests for the same amount of data writen"
>
> >  by half on
> > one server in production though although more tests are needed to
> > isolate the cause).

Interesting. I wonder if it's ssd_spread or space_cache=v2 that
reduces the writes by half, or by how much for each? That's a major
reduction in writes, and suggests it might be possible for further
optimization, to help mitigate the wandering trees impact.


-- 
Chris Murphy


Re:

2018-11-22 Thread Chris Murphy
On Thu, Nov 22, 2018 at 11:41 PM Andy Leadbetter
 wrote:
>
> I have a failing 2TB disk that is part of a 4 disk RAID 6 system.  I
> have added a new 2TB disk to the computer, and started a BTRFS replace
> for the old and new disk.  The process starts correctly however some
> hours into the job, there is an error and kernel oops. relevant log
> below.

The relevant log is the entire dmesg, not a snippet. It's decently
likely there's more than one thing going on here. We also need the
full output of 'smartctl -x' for all four drives, 'smartctl -l scterc'
for all four drives, and 'cat /sys/block/sda/device/timeout' for all
four drives. And which bcache mode you're using.
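
Something like this gathers all of it in one pass, assuming the four
drives really are sda through sdd (adjust to match your system):

for d in sda sdb sdc sdd; do
  smartctl -x /dev/$d
  smartctl -l scterc /dev/$d
  cat /sys/block/$d/device/timeout
done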

The call trace provided is from kernel 4.15, which is long enough ago
that I think any dev working on raid56 would want to see where it's
getting tripped up on something a lot newer, and this is why:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v4.19.3&id2=v4.15.1

That's a lot of changes in just the raid56 code between 4.15 and 4.19.
And then in your call trace, btrfs_dev_replace_start is found in
dev-replace.c, which likewise has a lot of changes. But then also, I
think 4.15 might still be in the era where it was not recommended to
use 'btrfs dev replace' for raid56, only non-raid56. I'm not sure if
the problems with device replace were fixed, and if they were, whether
the fix landed kernel or progs side. Anyway, the latest I recall, the
recommendation on raid56 was to 'btrfs dev add' then 'btrfs dev remove'.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/dev-replace.c?id=v4.19.3&id2=v4.15.1

And that's only a few hundred changes for each. Check out inode.c -
there are over 2000 changes.


> The disks are configured on top of bcache, in 5 arrays with a small
> 128GB SSD cache shared.  The system in this configuration has worked
> perfectly for 3 years, until 2 weeks ago csum errors started
> appearing.  I have a crashplan backup of all files on the disk, so I
> am not concerned about data loss, but I would like to avoid rebuild
> the system.

btrfs-progs 4.17 still considers raid56 experimental, not for
production use. And three years ago the current upstream kernel
release was 4.3, so I'm gonna guess the kernel history of this file
system goes back older than that, very close to the birth of the
raid56 code. And then adding bcache to the mix just makes it all the
more complicated.



>
> btrfs dev stats shows
> [/dev/bcache0].write_io_errs0
> [/dev/bcache0].read_io_errs 0
> [/dev/bcache0].flush_io_errs0
> [/dev/bcache0].corruption_errs  0
> [/dev/bcache0].generation_errs  0
> [/dev/bcache1].write_io_errs0
> [/dev/bcache1].read_io_errs 20
> [/dev/bcache1].flush_io_errs0
> [/dev/bcache1].corruption_errs  0
> [/dev/bcache1].generation_errs  14
> [/dev/bcache3].write_io_errs0
> [/dev/bcache3].read_io_errs 0
> [/dev/bcache3].flush_io_errs0
> [/dev/bcache3].corruption_errs  0
> [/dev/bcache3].generation_errs  19
> [/dev/bcache2].write_io_errs0
> [/dev/bcache2].read_io_errs 0
> [/dev/bcache2].flush_io_errs0
> [/dev/bcache2].corruption_errs  0
> [/dev/bcache2].generation_errs  2


3 of 4 drives have at least one generation error. While there are no
corruptions reported, generation errors can be really tricky to
recover from at all. If only one device had only read errors, this
would be a lot less difficult.


> I've tried the latest kernel, and the latest tools, but nothing will
> allow me to replace, or delete the failed disk.

If the file system is mounted, I would try to make a local backup ASAP,
before you lose the whole volume. Whether it's an LVM pool of two
drives (linear/concat) with XFS, or Btrfs with -dsingle -mraid1 (also
basically a concat), doesn't really matter; I'd get whatever you can
off the drive. I expect avoiding a rebuild in some form or another is
very wishful thinking and not very likely.

The more changes that are made to the file system, whether repair
attempts or other writes, the lower the chance of recovery.

-- 
Chris Murphy


Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Chris Murphy
On Thu, Nov 22, 2018 at 6:07 AM Tomasz Chmielewski  wrote:
>
> On 2018-11-22 21:46, Nikolay Borisov wrote:
>
> >> # echo w > /proc/sysrq-trigger
> >>
> >> # dmesg -c
> >> [  931.585611] sysrq: SysRq : Show Blocked State
> >> [  931.585715]   taskPC stack   pid father
> >> [  931.590168] btrfs-cleaner   D0  1340  2 0x8000
> >> [  931.590175] Call Trace:
> >> [  931.590190]  __schedule+0x29e/0x840
> >> [  931.590195]  schedule+0x2c/0x80
> >> [  931.590199]  schedule_timeout+0x258/0x360
> >> [  931.590204]  io_schedule_timeout+0x1e/0x50
> >> [  931.590208]  wait_for_completion_io+0xb7/0x140
> >> [  931.590214]  ? wake_up_q+0x80/0x80
> >> [  931.590219]  submit_bio_wait+0x61/0x90
> >> [  931.590225]  blkdev_issue_discard+0x7a/0xd0
> >> [  931.590266]  btrfs_issue_discard+0x123/0x160 [btrfs]
> >> [  931.590299]  btrfs_discard_extent+0xd8/0x160 [btrfs]
> >> [  931.590335]  btrfs_finish_extent_commit+0xe2/0x240 [btrfs]
> >> [  931.590382]  btrfs_commit_transaction+0x573/0x840 [btrfs]
> >> [  931.590415]  ? btrfs_block_rsv_check+0x25/0x70 [btrfs]
> >> [  931.590456]  __btrfs_end_transaction+0x2be/0x2d0 [btrfs]
> >> [  931.590493]  btrfs_end_transaction_throttle+0x13/0x20 [btrfs]
> >> [  931.590530]  btrfs_drop_snapshot+0x489/0x800 [btrfs]
> >> [  931.590567]  btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs]
> >> [  931.590607]  cleaner_kthread+0x136/0x160 [btrfs]
> >> [  931.590612]  kthread+0x120/0x140
> >> [  931.590646]  ? btree_submit_bio_start+0x20/0x20 [btrfs]
> >> [  931.590658]  ? kthread_bind+0x40/0x40
> >> [  931.590661]  ret_from_fork+0x22/0x40
> >>
> >
> > It seems your filesystem is mounted with the DISCARD option, meaning
> > every delete will result in a discard; this is highly suboptimal for
> > ssds.
> > Try remounting the fs without the discard option see if it helps.
> > Generally for discard you want to submit it in big batches (what fstrim
> > does) so that the ftl on the ssd could apply any optimisations it might
> > have up its sleeve.
>
> Spot on!
>
> Removed "discard" from fstab and added "ssd", rebooted - no more
> btrfs-cleaner running.
>
> Do you know if the issue you described ("discard this is highly
> suboptimal for ssd") affects other filesystems as well to a similar
> extent? I.e. if using ext4 on ssd?

Quite a lot of the activity on ext4 and XFS is overwrites, so discard
isn't needed. And it might be that discard is subject to delays. On
Btrfs, it's almost immediate, to the degree that on a couple of SSDs
I've tested, stale trees referenced exclusively by the most recent
backup tree entries in the superblock are already zeros. That
functionally means no automatic recoveries at mount time if there's a
problem with any of the current trees.

I was using it for about a year to no ill effect, BUT without a lot of
file deletions either. I wouldn't recommend it, and instead suggest
enabling the fstrim.timer, which by default runs fstrim.service once a
week (which in turn issues fstrim, I think on all mounted volumes).
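
On a systemd-based distro that ships the unit (util-linux provides it
on most), that's just:

systemctl enable --now fstrim.timer
systemctl list-timers fstrim.timer   # shows when it last ran and when it fires next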

I am a bit more concerned about the read errors you had that were
being corrected automatically. The corruption suggests a firmware bug
related to trim. I'd check the affected SSD's firmware revision and
consider updating it (only after a backup; it's plausible the firmware
update is not guaranteed to be data safe). Does the volume use DUP or
raid1 metadata? I'm not sure how it's correcting these problems
otherwise.


-- 
Chris Murphy


Re: BTRFS on production: NVR 16+ IP Cameras

2018-11-16 Thread Chris Murphy
On Thu, Nov 15, 2018 at 10:39 AM Juan Alberto Cirez
 wrote:
>
> Is BTRFS mature enough to be deployed on a production system to underpin
> the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)?
>
> Based on our limited experience with BTRFS (1+ year) under the above
> scenario the answer seems to be no; but I wanted you ask the community
> at large for their experience before making a final decision to hold off
> on deploying BTRFS on production systems.
>
> Let us be clear: We think BTRFS has great potential, and as it matures
> we will continue to watch its progress, so that at some future point we
> can return to using it.
>
> The issue has been the myriad of problems we have encountered when
> deploying BTRFS as the storage fs for the NVR/VMS in cases were the
> camera count exceeds 10: Corrupted file systems, sudden read-only file
> system, re-balance kernel panics, broken partitions, etc.

Performance problems are separate from reliability problems. No matter
what, there shouldn't be corruptions or failures when your process is
writing through the Btrfs kernel driver. Period. So you've either got
significant hardware/firmware problems as the root cause, or your use
case is exposing Btrfs bugs.

But the burden is on you to provide sufficient details about the
hardware and storage stack configuration, including kernel and
btrfs-progs versions and the mkfs and mount options being used.
Without a way for a developer to reproduce your problem, it's unlikely
the source of the problem can be discovered and fixed.
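
As a rough minimum (mount point is a placeholder), something like:

uname -r
btrfs --version
btrfs fi show
btrfs fi df /mountpoint
grep btrfs /proc/mounts

plus a complete dmesg, gives a developer most of what they'd need to
start.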


> So, again, the question is: is BTRFS mature enough to be used in such
> use case and if so, what approach can be used to mitigate such issues.

What format are the cameras writing out in? It matters if this is a
continuous appending format, or if it's writing them out as individual
JPEG files, one per frame, or whatever. What rate, what size, and any
other concurrent operations, etc.

-- 
Chris Murphy


Re: Where is my disk space ?

2018-11-08 Thread Chris Murphy
On Thu, Nov 8, 2018 at 2:27 AM, Barbet Alain  wrote:
> Hi !
> Just to give you end of the story:
> I move my /var/lib/docker to my home (other partition), and my space
> come back ...

I'm not sure why that would matter. Both btrfs du and regular du
showed only ~350M used in /var which is about what I'd expect. And
also the 'btrfs sub list' output doesn't show any subvolumes/snapshots
for Docker. The upstream Docker behavior on Btrfs is that it uses
subvolumes and snapshots for everything, and you'll quickly see a lot
of them. However, many distributions override the default Docker
behavior, e.g. with Docker storage setup, and will cause it to always
favor a
particular driver. For example the Docker overlay2 driver, which
leverages kernel overlayfs, which will work on any file system
including Btrfs. And I'm not exactly sure where the upper dirs are
stored, but I'd be surprised if they're not in /var.

Anyway, if you're using Docker, moving stuff around will almost
certainly break it. And as I'm an extreme expert in messing up Docker
storage, I can vouch for the strategy of stopping the docker daemon,
recursively deleting everything in /var/lib/docker/ and then starting
Docker. Now you get to go fetch all your images again. And anyway, you
shouldn't be storing any data in the containers; they should be
throwaway things, and important data should be stored elsewhere,
including any state information for the container. :-D Avoid container
misery by having a workflow that expects containers to be transient,
disposable objects.
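
In shell form, that reset strategy looks roughly like this; it throws
away every local image, container and volume, so only do it when
nothing under /var/lib/docker matters:

systemctl stop docker
rm -rf /var/lib/docker/*
systemctl start docker

If the btrfs graph driver was in use, the subvolumes under
/var/lib/docker/btrfs/subvolumes may need 'btrfs subvolume delete'
before the rm will fully succeed.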


-- 
Chris Murphy


Re: BTRFS did it's job nicely (thanks!)

2018-11-05 Thread Chris Murphy
On Mon, Nov 5, 2018 at 6:27 AM, Austin S. Hemmelgarn
 wrote:
> On 11/4/2018 11:44 AM, waxhead wrote:
>>
>> Sterling Windmill wrote:
>>>
>>> Out of curiosity, what led to you choosing RAID1 for data but RAID10
>>> for metadata?
>>>
>>> I've flip flipped between these two modes myself after finding out
>>> that BTRFS RAID10 doesn't work how I would've expected.
>>>
>>> Wondering what made you choose your configuration.
>>>
>>> Thanks!
>>> Sure,
>>
>>
>> The "RAID"1 profile for data was chosen to maximize disk space utilization
>> since I got a lot of mixed size devices.
>>
>> The "RAID"10 profile for metadata was chosen simply because it *feels* a
>> bit faster for some of my (previous) workload which was reading a lot of
>> small files (which I guess was embedded in the metadata). While I never
>> remembered that I got any measurable performance increase the system simply
>> felt smoother (which is strange since "RAID"10 should hog more disks at
>> once).
>>
>> I would love to try "RAID"10 for both data and metadata, but I have to
>> delete some files first (or add yet another drive).
>>
>> Would you like to elaborate a bit more yourself about how BTRFS "RAID"10
>> does not work as you expected?
>>
>> As far as I know BTRFS' version of "RAID"10 means it ensure 2 copies (1
>> replica) is striped over as many disks it can (as long as there is free
>> space).
>>
>> So if I am not terribly mistaking a "RAID"10 with 20 devices will stripe
>> over (20/2) x 2 and if you run out of space on 10 of the devices it will
>> continue to stripe over (5/2) x 2. So your stripe width vary with the
>> available space essentially... I may be terribly wrong about this (until
>> someones corrects me that is...)
>
> He's probably referring to the fact that instead of there being a roughly
> 50% chance of it surviving the failure of at least 2 devices like classical
> RAID10 is technically able to do, it's currently functionally 100% certain
> it won't survive more than one device failing.

Right. Classic RAID10 is *two block device* copies, where you have
mirror1 drives and mirror2 drives, and each mirror pair becomes a
single virtual block device that are then striped across. If you lose
a single mirror1 drive, its mirror2 data is available and
statistically unlikely to also go away.

Whereas with Btrfs raid10, it's *two block group* copies. And it is
the block group that's striped. That means block group copy 1 is
striped across half the available drives (at the time the bg is
allocated), and block group copy 2 is striped across the other drives.
When a drive dies, there is no single remaining drive that contains
all the missing copies; they're distributed. Which means in a 2 drive
failure you've got a very good chance of losing both copies of either
metadata or data or both. While I'm not certain it's 100% not
survivable, the real gotcha is that it's possible, maybe even likely,
that it'll mount and seem to work fine, but as soon as it runs into
two missing bg's, it'll face plant.


-- 
Chris Murphy


Re: Where is my disk space ?

2018-10-30 Thread Chris Murphy
Also, since you don't have any snapshots, you could also find this
conventionally:

# du -sh /*


Chris Murphy


Re: Where is my disk space ?

2018-10-30 Thread Chris Murphy
On Tue, Oct 30, 2018 at 4:44 PM, Barbet Alain  wrote:
> Thanks for answer !
> alian@alian:~>  sudo btrfs sub list -ta /
> [sudo] Mot de passe de root :
> ID  gen top level   path
> --  --- -   
> 257 79379   5   /@
> 258 79386   257 @/var
> 259 79000   257 @/usr/local
> 260 79376   257 @/tmp
> 261 79001   257 @/srv
> 262 79062   257 @/root
> 263 79001   257 @/opt
> 264 78898   257 @/boot/grub2/x86_64-efi
> 265 78933   257 @/boot/grub2/i386-pc
>
> Yes it's opensuse, but I don't see any snapper config enable.
> For memory, I use docker that full my disk, I remove subvolume, but
> it's look like something is missing somewhere.

Try

mount -o subvolid=5 /device/ /mnt
cd /mnt
btrfs fi du -s *

Maybe that will help reveal where it's hiding. It's possible btrfs fi
du does not cross bind mounts. I know the Total column does include
amounts in nested subvolumes.



-- 
Chris Murphy


Re: Salvage files from broken btrfs

2018-10-30 Thread Chris Murphy
On Tue, Oct 30, 2018 at 4:11 PM, Mirko Klingmann  wrote:
> Hi all,
>
> my btrfs root file system on a SD card broke down and did not mount anymore.

It might mount with -o ro,nologreplay

Typically an SD card will break in a way that it can't write, and
mount will just hang (with mmcblk errors). Mounting with both ro and
nologreplay will ensure no writes are needed, allowing the mount to
succeed. Of course any changes that are in the log tree will be
missing, so recent transactions may be unrecoverable, but so far I've
had good luck recovering from broken SD cards this way.
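
i.e. something along the lines of (device name is just an example):

mount -o ro,nologreplay /dev/mmcblk0p1 /mnt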




-- 
Chris Murphy


Re: Where is my disk space ?

2018-10-30 Thread Chris Murphy
On Tue, Oct 30, 2018 at 9:17 AM, Barbet Alain  wrote:
> Hi,
> I experienced disk out of space issue:
> alian:~ # df -h
> Filesystem      Size  Used Avail Use% Mounted on
> devtmpfs        7.8G     0  7.8G   0% /dev
> tmpfs           7.8G   47M  7.8G   1% /dev/shm
> tmpfs           7.8G   18M  7.8G   1% /run
> tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
> /dev/sda6        41G   35G  5.1G  88% /
> /dev/sda6        41G   35G  5.1G  88% /var
> /dev/sda6        41G   35G  5.1G  88% /root
> /dev/sda6        41G   35G  5.1G  88% /srv
> /dev/sda6        41G   35G  5.1G  88% /opt
> /dev/sda6        41G   35G  5.1G  88% /boot/grub2/i386-pc
> /dev/sda6        41G   35G  5.1G  88% /usr/local
> /dev/sda6        41G   35G  5.1G  88% /tmp
> /dev/sda6        41G   35G  5.1G  88% /boot/grub2/x86_64-efi
> /dev/sda7       424G  200G  225G  48% /home
>
>
> It says I use 35GB out of 41GB, but I have only 5.8GB of data:
> alian:~ # btrfs fi du -s /
>  Total   Exclusive  Set shared  Filename
>5.84GiB 5.84GiB   0.00B  /
> alian:/ # du -h --exclude ./home --max-depth=0
> 6.2G.

I suspect there are snapshots taking up space that are not located in
the search path starting at /.

What do you get for:

$ sudo btrfs sub list -ta /

Is this an openSUSE system? If snapper is enabled, you'll need to ask
it to delete some of the snapshots to free up space rather than doing
it with btrfs user space tools.
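
For example (snapshot numbers are made up; pick them from the list
output):

snapper list
snapper delete 42 43

Deleting snapper's subvolumes directly with btrfs tools tends to leave
its bookkeeping confused, which is why going through snapper is
preferable.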




> alian:/ # btrfs fi df /
> Data, single: total=35.00GiB, used=34.18GiB
> System, DUP: total=32.00MiB, used=16.00KiB
> Metadata, DUP: total=384.00MiB, used=216.75MiB
> GlobalReserve, single: total=22.05MiB, used=0.00B
>
> I try to run btrfs balance multiple time with various parameters but
> it doesn't change anything nor trying btrf check in single user mode.
>
> Where is my 30 Go missing ?



-- 
Chris Murphy


Re: Understanding "btrfs filesystem usage"

2018-10-30 Thread Chris Murphy
On Tue, Oct 30, 2018 at 10:10 AM, Ulli Horlacher
 wrote:
>
> On Mon 2018-10-29 (17:57), Remi Gauvin wrote:
>> On 2018-10-29 02:11 PM, Ulli Horlacher wrote:
>>
>> > I want to know how many free space is left and have problems in
>> > interpreting the output of:
>> >
>> > btrfs filesystem usage
>> > btrfs filesystem df
>> > btrfs filesystem show
>>
>> In my not so humble opinion, the filesystem usage command has the
>> easiest to understand output.  It' lays out all the pertinent information.
>>
>> You can clearly see 825GiB is allocated, with 494GiB used, therefore,
>> filesystem show is actually using the "Allocated" value as "Used".
>> Allocated can be thought of "Reserved For".
>
> And what is "Device unallocated"? Not reserved?

That's a reasonable interpretation. Unallocated space is space that's
not used for anything: no data, no metadata, and it isn't referenced
by any block group.

It's not a relevant number day to day; I'd say it's advanced, leaning
toward esoteric, knowledge of Btrfs internals. At this point I'd like
to see a simpler output by default, a verbose option for advanced
users, and an export option that spits out a superset of all available
information in a format parsable by scripts. But I know there are
other projects that depend on btrfs user space output, rather than
having something specifically invented for them that's easily parsed
and can be kept consistent and extendible, separate from human user
consumption. Oh well!




>> The disparity between 498GiB used and 823Gib is pretty high.  This is
>> probably the result of using an SSD with an older kernel.  If your
>> kernel is not very recent, (sorry, I forget where this was fixed,
>> somewhere around 4.14 or 4.15), then consider mounting with the nossd
>> option.
>
> I am running kernel 4.4 (it is a Ubuntu 16.04 system)
> But /local is on a SSD. Should I really use nossd mount option?!

Yes. But it's not a file system integrity suggestion; it's an optimization.


>
>
>
>> You can improve this by running a balance.
>>
>> Something like:
>> btrfs balance start -dusage=55
>
> I run balance via cron weekly (adapted
> https://software.opensuse.org/package/btrfsmaintenance)

With a newer kernel you can probably reduce this further depending on
your workload and use case. And optionally follow it up with running
fstrim, or just enable fstrim.timer (we don't recommend the discard
mount option for most use cases, as it too aggressively discards very
recently stale Btrfs metadata and can make recovery from crashes
harder).

There is a trim bug that causes FITRIM to only get applied to
unallocated space on older file systems that have been balanced such
that block group logical addresses are outside the physical address
space of the device, which prevents the free space inside such block
groups from being passed to FITRIM. Looks like this will be fixed in
kernel 4.20/5.0.




-- 
Chris Murphy


Re: Kernel crash related to LZO compression

2018-10-25 Thread Chris Murphy
On Thu, Oct 25, 2018 at 9:56 AM, Dmitry Katsubo  wrote:
> Dear btrfs community,
>
> My excuses for the dumps for rather old kernel (4.9.25), nevertheless I
> wonder
> about your opinion about the below reported kernel crashes.
>
> As I could understand the situation (correct me if I am wrong), it happened
> that some data block became corrupted which resulted the following kernel
> trace
> during the boot:
>
> kernel BUG at /build/linux-fB36Cv/linux-4.9.25/fs/btrfs/extent_io.c:2318!
> invalid opcode:  [#1] SMP
> Call Trace:
>  [] ? end_bio_extent_readpage+0x4e9/0x680 [btrfs]
>  [] ? end_compressed_bio_read+0x3b/0x2d0 [btrfs]
>  [] ? btrfs_scrubparity_helper+0xce/0x2d0 [btrfs]
>  [] ? process_one_work+0x141/0x380
>  [] ? worker_thread+0x41/0x460
>  [] ? kthread+0xb4/0xd0
>  [] ? process_one_work+0x380/0x380
>  [] ? kthread_park+0x50/0x50
>  [] ? ret_from_fork+0x1b/0x28
>
> The problematic file turned out to be the one used by systemd-journald
> /var/log/journal/c496cea41ebc4700a0dfaabf64a21be4/system.journal
> which was trying to read it (or append to it) during the boot and that was
> causing the system crash (see attached bootN_dmesg.txt).
>
> I've rebooted in safe mode and tried to copy the data from this partition to
> another location using btrfs-restore, however kernel was crashing as well
> with
> a bit different symphom (see attached copyN_dmesg.txt):
>
> Call Trace:
>  [] ? lzo_decompress_biovec+0x1b0/0x2b0 [btrfs]
>  [] ? vmalloc+0x38/0x40
>  [] ? end_compressed_bio_read+0x265/0x2d0 [btrfs]
>  [] ? btrfs_scrubparity_helper+0xce/0x2d0 [btrfs]
>  [] ? process_one_work+0x141/0x380
>  [] ? worker_thread+0x41/0x460
>  [] ? kthread+0xb4/0xd0
>  [] ? ret_from_fork+0x1b/0x28
>
> Just to keep away from the problem, I've removed this file and also removed
> "compress=lzo" mount option.
>
> Are there any updates / fixes done in that area? Is lzo option safe to use?


It should be safe even with that kernel. I'm not sure this is
compression related. There is a corruption bug related to inline
extents that had been fairly elusive, but I think it's fixed now. I
haven't run into it myself, though.

I would say the first step, no matter what, if you're using an older
kernel is to boot current Fedora or Arch live or install media, mount
the Btrfs, try to read the problem files, and see if the problem still
happens. I can't even begin to estimate the tens of thousands of line
changes since kernel 4.9.

What profile are you using for this Btrfs? Is this a raid56? What do
you get for 'btrfs fi us ' ?



-- 
Chris Murphy


Re: Failover for unattached USB device

2018-10-25 Thread Chris Murphy
it can detect such problems in both
metadata and data and should be able to avoid them in the first place
due to always on COW (as long as you haven't disabled it).

But there is some evidence that old Btrfs bugs could induce corruption
in metadata, and not turn into a problem for a long time later. The
scrubs only check if metadata and its checksum match up (corruption
detection elsewhere in the storage stack) so the scrub most often
can't find bugs that cause corruption. You best bet for side stepping
such problems is backups, and using the most recent kernel you can. If
you encounter some problem that might be a bug, inevitably you'll need
to test with a newer kernel version anyway to see if it's still a bug.
Each merge cycle involves thousands of lines of changes just for Btrfs
and there's more to the storage stack in the kernel than just Btrfs.

In your use case with mostly reads, and probably you also don't care
about write performance, you could consider mounting with notreelog.
This will drop the use of the treelog which is used to improve
performance on operations that use fsync. With this option,
transactions calling fsync() fall back to sync() so it's safer but
slower.
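
A sketch of what that looks like (device, mount point and UUID are
placeholders):

mount -o notreelog /dev/sdX /mnt

or persistently in /etc/fstab:

UUID=<your-uuid>  /mnt  btrfs  defaults,notreelog  0 0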


-- 
Chris Murphy


Re: Failover for unattached USB device

2018-10-24 Thread Chris Murphy
On Wed, Oct 24, 2018 at 9:03 AM, Dmitry Katsubo  wrote:
> On 2018-10-17 00:14, Dmitry Katsubo wrote:
>>
>> As a workaround I can monitor dmesg output but:
>>
>> 1. It would be nice if I could tell btrfs that I would like to mount
>> read-only
>> after a certain error rate per minute is reached.
>> 2. It would be nice if btrfs could detect that both drives are not
>> available and
>> unmount (as mount read-only won't help much) the filesystem.
>>
>> Kernel log for Linux v4.14.2 is attached.
>
>
> I wonder if somebody could further advise the workaround. I understand that
> running
> btrfs volume over USB devices is not good, but I think btrfs could play some
> role
> as well.

I think about the best we can expect in the short term is that Btrfs
goes read-only before the file system becomes corrupted in a way it
can't recover with a normal mount. And I'm not certain it is in this
state of development right now for all cases. And I say the same thing
for other file systems as well.

Running Btrfs on USB devices is fine, so long as they're well behaved.
I have such a setup with USB 3.0 devices. Perhaps I got a bit lucky,
because there are a lot of known bugs with USB controllers, USB bridge
chipsets, and USB hubs.

Having user definable switches for when to go read-only is, I think,
misleading to the user, and very likely will mislead the file system.
The file system needs to go read-only when it gets confused, period.
It doesn't matter what the error rate is.

The work around is really to do the hard work making the devices
stable. Not asking Btrfs to paper over known unstable hardware.

In my case, I started out with rare disconnects and resets with
directly attached drives. This was a couple years ago. It was a Btrfs
raid1 setup, and the drives would not go missing at the same time, but
both would just drop off from time to time. Btrfs would complain of
dropped writes, I vaguely remember it going read only. But normal
mounts worked, sometimes with scary errors but always finding a good
copy on the other drive, and doing passive fixups. Scrub would always
fix up the rest. I'm still using those same file systems on those
devices, but now they go through a dyconn USB 3.0 hub with a decently
good power supply. I originally thought the drop offs were power
related, so I explicitly looked for a USB hub that could supply at
least 2A, and this one is 12VDC @ 2500mA. A laptop drive will draw
nearly 1A on spin up, but at that point P=AV. Laptop drives during
read/write use 1.5 W to 2.5 W @ 5VDC.

1.5-2.5 W = A * 5 V
Therefore A = 0.3-0.5A

And for 4 drives at possibly 0.5 A (although my drives are all at the
1.6 W read/write), that's 2 A @ 5 V, which is easily maintained for
the hub power supply (which by my calculation could do 6 A @ 5 V, not
accounting for any resistance).

Anyway, as it turns out I don't think it was power related, as the
Intel NUC in question probably had just enough amps per port. What it
really was, was an incompatibility between the Intel controller and
the bridge chipset in the USB-SATA cases. And the USB hub is similar
to an ethernet hub: it actually reads the USB stream and rewrites it
out. So hubs are actually pretty complicated little things, and having
a good one matters.

>
> In particular I wonder if btrfs could detect that all devices in RAID1
> volume became
> inaccessible and instead of reporting increasing "write error" counter to
> kernel log simply
> render the volume as read-only. "inaccessible" could be that if the same
> block cannot be
> written back to minimum number of devices in RAID volume, so btrfs gives up.

There are pending patches for something similar that you can find in
the archives. I think the reason they haven't been merged yet is there
haven't been enough comments and feedback (?). I think Anand Jain is
the author of those patches so you might dig around in the archives.
In a way you have an ideal setup for testing them out. Just make sure
you have backups...


>
> Maybe someone can advise some sophisticated way of quick checking that
> filesystems is
> healthy?

'btrfs check' without the --repair flag is safe and read only but
takes a long time because it'll read all metadata. The fastest safe
way is to mount it ro and read a directory recently being written to
and see if there are any kernel errors. You could recursively copy
files from a directory to /dev/null and then check kernel messages for
any errors. So long as metadata is DUP, there is a good chance a bad
copy of metadata can be automatically fixed up with a good copy. If
there's only single copy of metadata, or both copies get corrupt, then
it's difficult. Usually recovery of data is possible, but depending on
what's damaged, repair might not be possible.
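
A rough way to script that kind of spot check (device and paths are
placeholders):

mount -o ro /dev/sdX /mnt
find /mnt/recently-written-dir -type f -exec cat {} + > /dev/null
dmesg | grep -i btrfs

Any csum or read errors hit during the reads will show up in the
kernel log; with DUP metadata a bad copy just means the good copy gets
used.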


-- 
Chris Murphy


Re: reproducible builds with btrfs seed feature

2018-10-18 Thread Chris Murphy
On Tue, Oct 16, 2018 at 10:08 PM, Anand Jain  wrote:


>
>  So a possible solution for the reproducible builds:
>usual mkfs.btrfs dev
>Write the data
>unmount; create btrfs-image with uuid/fsid/time sanitized; mark it as a
> seed (RO).
>check/verify the hash of the image.

Gotcha. Generation/transid needs to be included in that list. Imagine
a fast system vs a slow system. The slow system certainly will end up
with with higher transid's for the latest completed transactions.

But also, I don't know how the kernel code chooses block numbers,
either physical (chunk allocation) or logical (extent allocation) and
if that could be made deterministic. Same for inode assignment.

Another question that comes up later when creating the sprout by
removing the seed device, is how a script can know when all block
groups have successfully copied from seed to sprout, and that the
sprout can be unmounted.



-- 
Chris Murphy


Re: CRC mismatch

2018-10-16 Thread Chris Murphy
On Tue, Oct 16, 2018 at 9:42 AM, Austin S. Hemmelgarn
 wrote:
> On 2018-10-16 11:30, Anton Shepelev wrote:
>>
>> Hello, all
>>
>> What may be the reason of a CRC mismatch on a BTRFS file in
>> a virutal machine:
>>
>> csum failed ino 175524 off 1876295680 csum 451760558
>> expected csum 1446289185
>>
>> Shall I seek the culprit in the host machine on in the guest
>> one?  Supposing the host machine healty, what operations on
>> the gueest might have caused a CRC mismatch?
>>
> Possible causes include:
>
> * On the guest side:
>   - Unclean shutdown of the guest system (not likely even if this did
> happen).
>   - A kernel bug on in the guest.
>   - Something directly modifying the block device (also not very likely).
>
> * On the host side:
>   - Unclean shutdown of the host system without properly flushing data from
> the guest.  Not likely unless you're using an actively unsafe caching mode
> for the guest's storage back-end.
>   - At-rest data corruption in the storage back-end.
>   - A bug in the host-side storage stack.
>   - A transient error in the host-side storage stack.
>   - A bug in the hypervisor.
>   - Something directly modifying the back-end storage.
>
> Of these, the statistically most likely location for the issue is probably
> the storage stack on the host.

Is there still that O_DIRECT related "bug" (or more of a limitation)
if the guest is using cache=none on the block device?

Anton, what virtual machine tech are you using? qemu/kvm managed with
virt-manager? The configuration affects host behavior, but the
negative effect manifests inside the guest as corruption, if I
remember correctly.

-- 
Chris Murphy


Re: reproducible builds with btrfs seed feature

2018-10-16 Thread Chris Murphy
On Tue, Oct 16, 2018 at 2:13 AM, Anand Jain  wrote:
>
>
> On 10/14/2018 06:28 AM, Chris Murphy wrote:
>>
>> Is it practical and desirable to make Btrfs based OS installation
>> images reproducible? Or is Btrfs simply too complex and
>> non-deterministic? [1]
>>
>> The main three problems with Btrfs right now for reproducibility are:
>> a. many objects have uuids other than the volume uuid; and mkfs only
>> lets us set the volume uuid
>> b. atime, ctime, mtime, otime; and no way to make them all the same
>> c. non-deterministic allocation of file extents, compression, inode
>> assignment, logical and physical address allocation
>>
>> I'm imagining reproducible image creation would be a mkfs feature that
>> builds on Btrfs seed and --rootdir concepts to constrain Btrfs
>> features to maybe make reproducible Btrfs volumes possible:
>>
>> - No raid
>> - Either all objects needing uuids can have those uuids specified by
>> switch, or possibly a defined set of uuids expressly for this use
>> case, or possibly all of them can just be zeros (eek? not sure)
>> - A flag to set all times the same
>> - Possibly require that target block device is zero filled before
>> creation of the Btrfs
>> - Possibly disallow subvolumes and snapshots
>> - Require the resulting image is seed/ro and maybe also a new
>> compat_ro flag to enforce that such Btrfs file systems cannot be
>> modified after the fact.
>> - Enforce a consistent means of allocation and compression
>>
>> The end result is creating two Btrfs volumes would yield image files
>> with matching hashes.
>
>
>> If I had to guess, the biggest challenge would be allocation. But it's
>> also possible that such an image may have problems with "sprouts". A
>> non-removable sprout seems fairly straightforward and safe; but if a
>> "reproducible build" type of seed is removed, it seems like removal
>> needs to be smart enough to refresh *all* uuids found in the sprout: a
>> hard break from the seed.
>
>
> Right. The seed fsid will be gone in a detached sprout.

I think already we get a new devid, volume uuid, and device uuid. Open
question is whether any other uuid's need to be refreshed, such as
chunk uuid since that appears in every node and leaf.


>> Any thoughts? Useful? Difficult to implement?
>
> Recently Nikolay sent a patch to change fsid on a mounted btrfs. However for
> a reproducible builds it also needs neutralized uuids, time, bytenr(s)
> further more though the ondisk layout won't change without notice but
> block-bytenr might.

Seems like the mkfs population method of such a seed could be made
very deterministic as to what the starting logical and physical
addresses are. The vast majority of non-deterministic behavior comes
from the nature of kernel code having to handle so many complex inputs
and outputs, and negotiate them.


> One question why not reproducible builds get the file data extents from the
> image and stitch the hashes together to verify the hash. And there could be
> a vfs ioctl to import and export filesystem images for a better
> support-ability of the use-case similar to the reproducible builds.

Perhaps. I don't know the reproducible build requirements very well:
whether all they really care about is the hash of the data extents,
and how important the fs metadata really is. Metadata coverage matters
when it comes to fuzzing file systems that have no metadata
checksumming, like squashfs; of course there you'd have to checksum
the whole file system image.

Another feature the mkfs variety of seed image would need is
deduplication. As far as I know, deduplication is kernel code only.
You'd want to be able to deduplicate, as well as compress, to have the
smallest distributed seed possible. And mksquashfs does deduplication
by default.


-- 
Chris Murphy


Re: Spurious mount point

2018-10-15 Thread Chris Murphy
On Mon, Oct 15, 2018 at 3:26 PM, Anton Shepelev  wrote:
> Chris Murphy to Anton Shepelev:
>
>> > How can I track down the origin of this mount point:
>> >
>> > /dev/sda2 on /home/hana type btrfs 
>> > (rw,relatime,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot/hana)
>> >
>> > if it is not present in /etc/fstab?  I shouldn't like to
>> > find/grep thoughout the whole filesystem.
>>
>> Sounds like some service taking snapshots regularly is
>> managing this.  Maybe this is Mint or Ubuntu and you're
>> using Timeshift?
>
> It is SUSE Linux and (probably) its tool called `snapper',
> but I have not found a clue in its documentation.

I wasn't aware that SUSE was now using the @ location for snapshots,
or that it was using Btrfs for /home. For a while it's been XFS with a
Btrfs sysroot.





-- 
Chris Murphy


Re: Spurious mount point

2018-10-15 Thread Chris Murphy
On Mon, Oct 15, 2018 at 9:05 AM, Anton Shepelev  wrote:
> Hello, all
>
> How can I track down the origin of this mount point:
>
>/dev/sda2 on /home/hana type btrfs 
> (rw,relatime,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot/hana)
>
> if it is not present in /etc/fstab?  I shouldn't like to
> find/grep thoughout the whole filesystem.
>
> --
> ()  ascii ribbon campaign - against html e-mail
> /\  http://preview.tinyurl.com/qcy6mjc [archived]

Sounds like some service taking snapshots regularly is managing this.
Maybe this is Mint or Ubuntu and you're using Timeshift?

Maybe it'll show up in the journal if you add the boot parameter
'systemd.log_level=debug' and reboot; then use 'journalctl -b | grep
mount' and it should show all logged instances of mount events:
systemd, udisks2, maybe others?



-- 
Chris Murphy


Re: reproducible builds with btrfs seed feature

2018-10-15 Thread Chris Murphy
On Mon, Oct 15, 2018 at 6:29 AM, Austin S. Hemmelgarn
 wrote:
> On 2018-10-13 18:28, Chris Murphy wrote:

>> The end result is creating two Btrfs volumes would yield image files
>> with matching hashes.
>
> So in other words, you care about matching the block layout _exactly_.

Only because that's the easiest way to verify reproducibility without
any ambiguity.

If someone's compromised a build system such that everyone is getting
the malicious payload, but they can hide it behind a subvolume or
reflink that's not used by default, could someone plausibly cause
selective use of their malicious payload? I dunno I leave that for
more crafty people. But even if it's a tiny bit of ambiguity, it's
non-zero. Hashing a file that contains the entire file system is
unambiguous.

I think populating the image with --rootdir at mkfs time should be
pretty deterministic. One stream in and out. No generations, no
snapshot, no delayed allocation. It'd be quite similar to mksquashfs.
I guess I'd have to try it a few times, and see if really the only
differences are uuids and times, and not allocation related things.



-- 
Chris Murphy


Re: reproducible builds with btrfs seed feature

2018-10-14 Thread Chris Murphy
On Sun, Oct 14, 2018 at 1:09 PM, Cerem Cem ASLAN  wrote:
> Thanks for the explanation, I got it now. I still think this is
> related with my needs, so I'll keep an eye on this.
>
> What is the possible use case? I can think of only one scenario: You
> have a rootfs that contains a distro installer and you want to
> generate distro.img files which uses Btrfs under the hood in different
> locations and still have the same hash, so you can publish your
> verified image hash by a single source (https://your-distro.org).

The first step is accepting reproducible builds as a worthy goal in
and of itself independent of Btrfs. Specifically "Why does it matter?"
found here https://reproducible-builds.org/

Btrfs does bring valuable features for installation images: always on
checksumming; seed feature permits a straightforward way to setup a
volatile overlay on zram device; ability to convert it to a
non-volatile overlay, and boot either the seed or overlay; and even
installation by adding the install target and removing both overlay
and seed. And yet it remains compatible with a conventional copy to
another file system if it's not desirable to use Btrfs as root. Win
win.

By subsetting Btrfs features we don't care about in the installation
seed context, can we achieve reproducibility as a consequence, while
retaining some of the more interesting features? Of course once
sprouted, those limitations wouldn't apply.

Basically it's a "btrfs seed device 2.0" idea. But Btrfs is so
complicated it's maybe too much work, hence the question.



-- 
Chris Murphy


Re: reproducible builds with btrfs seed feature

2018-10-14 Thread Chris Murphy
On Sun, Oct 14, 2018 at 6:20 AM, Cerem Cem ASLAN  wrote:
> I'm not sure I could fully understand the desired achievement but it
> sounds like (or this would be an example of selective perception) it's
> somehow related with "creating reproducible snapshots"
> (https://unix.stackexchange.com/q/462451/65781), no?

No the idea is to be able to consistently reproduce a distro installer
image (like an ISO file) with the same hash. Inside the ISO image, is
typically a root.img or squash.img which itself contains a file system
like ext4 or squashfs, to act as the system root. And that root.img is
the main thing I'm talking about here. There is work to make squashfs
deterministic, as well as ext4. And I'm wondering if there are sane
ways to constrain Btrfs features to make it likewise deterministic.

For example:

fallocate -l 5G btrfsroot.img
losetup /dev/loop0 btrfsroot.img
mkfs.btrfs -m single -d single -rseed --rootdir /tmp/ -T
"20181010T1200" --uuidv $X --uuidc $Y --uuidd $Z ...
shasum btrfsroot.img

And then do it again, and the shasum's should be the same. I realize
today it's not that way. And that inode assignment, extent allocation
(number, size, locality) are all factors in making Btrfs quickly
non-determinstic, and also why I'm assuming this needs to be done in
user space. That would be the point of the -rseed flag: set the seed
flag, possibly set a compat_ro flag, fix generation/transid to 1,
require the use of -T (similar to make_ext4) to set all timestamps to
this value, and configurable uuid's for everything that uses uuids,
and whatever other constraints are necessary.


-- 
Chris Murphy


Re: reproducible builds with btrfs seed feature

2018-10-13 Thread Chris Murphy
On Sat, Oct 13, 2018 at 4:28 PM, Chris Murphy  wrote:
> Is it practical and desirable to make Btrfs based OS installation
> images reproducible? Or is Btrfs simply too complex and
> non-deterministic? [1]
>
> The main three problems with Btrfs right now for reproducibility are:
> a. many objects have uuids other than the volume uuid; and mkfs only
> lets us set the volume uuid
> b. atime, ctime, mtime, otime; and no way to make them all the same
> c. non-deterministic allocation of file extents, compression, inode
> assignment, logical and physical address allocation

d. generation, just pick a consistent default because the entire image
is made with mkfs and then never rw mounted so it's not a problem

> - Possibly disallow subvolumes and snapshots

There's no actual mechanism to do either of these with mkfs, so it's
not a problem. And if a sprout is created, it's fine for newly created
subvolumes to follow the usual behavior of having unique UUID and
incrementing generation. Thing is, the sprout will inherit the seed's
preset chunk uuid, which, while it shouldn't cause a problem, is a
kind of violation of uuid uniqueness; but ultimately I'm not sure how
big of a problem it is for such uuids to spread.



-- 
Chris Murphy


reproducible builds with btrfs seed feature

2018-10-13 Thread Chris Murphy
Is it practical and desirable to make Btrfs based OS installation
images reproducible? Or is Btrfs simply too complex and
non-deterministic? [1]

The main three problems with Btrfs right now for reproducibility are:
a. many objects have uuids other than the volume uuid; and mkfs only
lets us set the volume uuid
b. atime, ctime, mtime, otime; and no way to make them all the same
c. non-deterministic allocation of file extents, compression, inode
assignment, logical and physical address allocation

I'm imagining reproducible image creation would be a mkfs feature that
builds on Btrfs seed and --rootdir concepts to constrain Btrfs
features to maybe make reproducible Btrfs volumes possible:

- No raid
- Either all objects needing uuids can have those uuids specified by
switch, or possibly a defined set of uuids expressly for this use
case, or possibly all of them can just be zeros (eek? not sure)
- A flag to set all times the same
- Possibly require that target block device is zero filled before
creation of the Btrfs
- Possibly disallow subvolumes and snapshots
- Require the resulting image is seed/ro and maybe also a new
compat_ro flag to enforce that such Btrfs file systems cannot be
modified after the fact.
- Enforce a consistent means of allocation and compression

The end result is creating two Btrfs volumes would yield image files
with matching hashes.

If I had to guess, the biggest challenge would be allocation. But it's
also possible that such an image may have problems with "sprouts". A
non-removable sprout seems fairly straightforward and safe; but if a
"reproducible build" type of seed is removed, it seems like removal
needs to be smart enough to refresh *all* uuids found in the sprout: a
hard break from the seed.

Competing file systems: ext4 with the make_ext4 fork, and squashfs. At
the moment I'm thinking it might be easier to teach squashfs integrity
checking than to make Btrfs reproducible. But then I also think
restricting Btrfs features, and applying some requirements to
constrain Btrfs to make it reproducible, really enhances the Btrfs
seed-sprout feature.

Any thoughts? Useful? Difficult to implement?

Squashfs might be a better fit for this use case *if* it can be taught
about integrity checking. It does per file checksums for the purpose
of deduplication but those checksums aren't retained for later
integrity checking.

[1] problems of reproducible system images
https://reproducible-builds.org/docs/system-images/

[2] purpose and motivation for reproducible builds
https://reproducible-builds.org/

[3] who is involved?
https://reproducible-builds.org/who/#Qubes%20OS




-- 
Chris Murphy


Re: errors reported by btrfs-check

2018-10-11 Thread Chris Murphy
What version of btrfs-progs?


fail to read only scrub, WARNING: at fs/btrfs/transaction.c:1847 cleanup_transaction

2018-10-11 Thread Chris Murphy
btrfs-progs 4.17.1
kernel 4.18.12

I've got another Samsung SDHC card that's gone read only, and any
writes cause it to hang. So I've used blockdev --setro on all the
partitions and the whole block device to make sure nothing can write
to it, and then mounted with

# mount -o ro,nologreplay /dev/mmcblk0p3 /mnt/sd

There is a log tree entry in the super. Next I want to do a scrub to
see if there are any corruptions detected. I'm curious if any
corruptions have happened in transactions near the time the device
failed and went read only.

# btrfs scrub start -Bdr /mnt/sd

And that results in a warning and call trace. Seems like a bug.

[97696.976887] mmc0: new ultra high speed SDR104 SDHC card at address 59b4
[97696.980825] mmcblk0: mmc0:59b4 EB2MW 29.8 GiB
[97696.996102]  mmcblk0: p1 p2 p3
[100363.736000] r8169 :03:00.0: invalid large VPD tag 7f at offset 0
[103726.761878] BTRFS info (device mmcblk0p3): disabling log replay at
mount time
[103726.764476] BTRFS info (device mmcblk0p3): using free space tree
[103726.767036] BTRFS info (device mmcblk0p3): has skinny extents
[103726.811136] BTRFS info (device mmcblk0p3): enabling ssd optimizations
[103780.058008] BTRFS warning (device mmcblk0p3): Skipping commit of
aborted transaction.
[103780.065633] [ cut here ]
[103780.070470] BTRFS: Transaction aborted (error -28)
[103780.070631] WARNING: CPU: 0 PID: 6670 at
fs/btrfs/transaction.c:1847 cleanup_transaction+0x8a/0xd0 [btrfs]
[103780.075561] Modules linked in: mmc_block veth xt_nat
ipt_MASQUERADE xt_addrtype br_netfilter bridge stp llc ccm
nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter
ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_raw
iptable_security nf_conntrack ip_set nfnetlink ebtable_filter ebtables
ip6table_filter ip6_tables sunrpc vfat fat arc4 intel_rapl iwlmvm
intel_powerclamp coretemp mac80211 kvm_intel kvm snd_hda_codec_hdmi
irqbypass btusb snd_hda_codec_realtek iwlwifi crct10dif_pclmul
snd_hda_codec_generic btrtl btbcm crc32_pclmul ghash_clmulni_intel
btintel bluetooth iTCO_wdt iTCO_vendor_support
[103780.097996]  intel_cstate cfg80211 snd_hda_intel snd_hda_codec
ecdh_generic intel_xhci_usb_role_switch roles rfkill ir_rc6_decoder
snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm rc_rc6_mce
mei_txe mei ite_cir snd_timer snd rc_core pcc_cpufreq soundcore
intel_int0002_vgpio i2c_i801 lpc_ich dm_crypt btrfs libcrc32c xor
zstd_decompress zstd_compress xxhash raid6_pq i915 i2c_algo_bit
crc32c_intel drm_kms_helper sdhci_pci drm cqhci r8169 sdhci mii
mmc_core video pwm_lpss_platform pwm_lpss lz4 lz4_compress
[103780.117383] CPU: 0 PID: 6670 Comm: btrfs-transacti Not tainted
4.18.12-300.fc29.x86_64 #1
[103780.123984] Hardware name:  /NUC5PPYB, BIOS
PYBSWCEL.86A.0074.2018.0709.1332 07/09/2018
[103780.130401] RIP: 0010:cleanup_transaction+0x8a/0xd0 [btrfs]
[103780.136079] Code: 83 f8 01 77 63 f0 48 0f ba ab 08 23 00 00 02 72
1b 41 83 fc fb 0f 84 45 30 08 00 44 89 e6 48 c7 c7 70 6f 6c c0 e8 60
1f a8 ec <0f> 0b 44 89 e1 ba 37 07 00 00 4d 8d 65 28 48 89 ef 48 c7 c6
70 b0
[103780.142943] RSP: 0018:b45401483dd0 EFLAGS: 00010286
[103780.150032] RAX:  RBX: 90d8f4724000 RCX:
0006
[103780.157155] RDX: 0007 RSI: 0086 RDI:
90d93fc16930
[103780.164218] RBP: 90d928f99750 R08: 0038 R09:
0007
[103780.171238] R10:  R11: 0001 R12:
ffe4
[103780.178171] R13: 90d8c44ce600 R14: ffe4 R15:
90d8f4725df8
[103780.185134] FS:  () GS:90d93fc0()
knlGS:
[103780.192265] CS:  0010 DS:  ES:  CR0: 80050033
[103780.199458] CR2: 55764829c720 CR3: 00017420a000 CR4:
001006f0
[103780.206756] Call Trace:
[103780.214090]  ? finish_wait+0x80/0x80
[103780.221519]  btrfs_commit_transaction+0x86d/0x8b0 [btrfs]
[103780.228946]  ? join_transaction+0x22/0x3e0 [btrfs]
[103780.236448]  ? start_transaction+0x9c/0x3e0 [btrfs]
[103780.243856]  transaction_kthread+0x155/0x170 [btrfs]
[103780.251166]  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
[103780.258368]  kthread+0x112/0x130
[103780.265481]  ? kthread_create_worker_on_cpu+0x70/0x70
[103780.272535]  ret_from_fork+0x35/0x40
[103780.279436] ---[ end trace 7470f1b607c73b6c ]---
[103780.285841] BTRFS warning (device mmcblk0p3):
cleanup_transaction:1847: Aborting unused transaction(No space left).
[103780.289891] BTRFS info (device mmcblk0p3): delayed_refs has NO entry


-- 
Chris Murphy


Re: Scrub aborts due to corrupt leaf

2018-10-10 Thread Chris Murphy
On Wed, Oct 10, 2018 at 9:07 PM, Larkin Lowrey
 wrote:
> On 10/10/2018 10:51 PM, Chris Murphy wrote:
>>
>> On Wed, Oct 10, 2018 at 8:12 PM, Larkin Lowrey
>>  wrote:
>>>
>>> On 10/10/2018 7:55 PM, Hans van Kranenburg wrote:
>>>>
>>>> On 10/10/2018 07:44 PM, Chris Murphy wrote:
>>>>>
>>>>>
>>>>> I'm pretty sure you have to umount, and then clear the space_cache
>>>>> with 'btrfs check --clear-space-cache=v1' and then do a one time mount
>>>>> with -o space_cache=v2.
>>>>
>>>> The --clear-space-cache=v1 is optional, but recommended, if you are
>>>> someone who do not likes to keep accumulated cruft.
>>>>
>>>> The v2 mount (rw mount!!!) does not remove the v1 cache. If you just
>>>> mount with v2, the v1 data keeps being there, doing nothing any more.
>>>
>>>
>>> Theoretically I have the v2 space_cache enabled. After a clean umount...
>>>
>>> # mount -onospace_cache /backups
>>> [  391.243175] BTRFS info (device dm-3): disabling free space tree
>>> [  391.249213] BTRFS error (device dm-3): cannot disable free space tree
>>> [  391.255884] BTRFS error (device dm-3): open_ctree failed
>>
>> "free space tree" is the v2 space cache, and once enabled it cannot be
>> disabled with nospace_cache mount option. If you want to run with
>> nospace_cache you'll need to clear it.
>>
>>
>>> # mount -ospace_cache=v1 /backups/
>>> mount: /backups: wrong fs type, bad option, bad superblock on
>>> /dev/mapper/Cached-Backups, missing codepage or helper program, or other
>>> error
>>> [  983.501874] BTRFS info (device dm-3): enabling disk space caching
>>> [  983.508052] BTRFS error (device dm-3): cannot disable free space tree
>>> [  983.514633] BTRFS error (device dm-3): open_ctree failed
>>
>> You cannot go back and forth between v1 and v2. Once v2 is enabled,
>> it's always used regardless of any mount option. You'll need to use
>> btrfs check to clear the v2 cache if you want to use v1 cache.
>>
>>
>>> # btrfs check --clear-space-cache v1 /dev/Cached/Backups
>>> Opening filesystem to check...
>>> couldn't open RDWR because of unsupported option features (3).
>>> ERROR: cannot open file system
>>
>> You're missing the '=' symbol for the clear option, that's why it fails.
>>
>
> # btrfs check --clear-space-cache=v2 /dev/Cached/Backups
> Opening filesystem to check...
> Checking filesystem on /dev/Cached/Backups
> UUID: acff5096-1128-4b24-a15e-4ba04261edc3
> Clear free space cache v2
> Segmentation fault (core dumped)
>
> [  109.686188] btrfs[2429]: segfault at 68 ip 555ff6394b1c sp
> 7ffcc4733ab0 error 4 in btrfs[555ff637c000+ca000]
> [  109.696732] Code: ff e8 68 ed ff ff 8b 4c 24 58 4d 8b 8f c7 01 00 00 4c
> 89 fe 85 c0 0f 44 44 24 40 45 31 c0 89 44 24 40 48 8b 84 24 90 00 00 00 <8b>
> 40 68 49 29 87 d0 00 00 00 6a 00 55 48 8b 54 24 18 48 8b 7c 24
>
> That's btrfs-progs v4.17.1 on 4.18.12-200.fc28.x86_64.
>
> I appreciate the help and advice from everyone who has contributed to this
> thread. At this point, unless there is something for the project to gain
> from tracking down this trouble, I'm just going to nuke the fs and start
> over.

Is this a 68T file system? Seems excessive. For now you should be able
to use the new v2 space tree. I think Qu or some dev will want to know
why you're getting a crash trying to clear the v2 space cache. Maybe
try clearing the v1 first, then v2?  While v1 is the default right now,
the plan is to move to v2 by default soonish, but the inability to
clear it is a bug worth investigating. I've just tried it on several of
my file systems and it clears without error and rebuilds at next mount
with the v2 option.
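
Something like this, reusing the device path and mount point from your
earlier commands (adjust if they differ):

umount /backups
btrfs check --clear-space-cache=v1 /dev/Cached/Backups
btrfs check --clear-space-cache=v2 /dev/Cached/Backups
mount -o space_cache=v2 /dev/Cached/Backups /backups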

If it is the 68T file system, I don't expect a btrfs-image is going to
be easy to capture or deliver: you've got 95GiB of metadata!
Compressed that's still a ~30-45GiB image.


-- 
Chris Murphy


Re: Scrub aborts due to corrupt leaf

2018-10-10 Thread Chris Murphy
On Wed, Oct 10, 2018 at 8:12 PM, Larkin Lowrey
 wrote:
> On 10/10/2018 7:55 PM, Hans van Kranenburg wrote:
>>
>> On 10/10/2018 07:44 PM, Chris Murphy wrote:
>>>
>>>
>>> I'm pretty sure you have to umount, and then clear the space_cache
>>> with 'btrfs check --clear-space-cache=v1' and then do a one time mount
>>> with -o space_cache=v2.
>>
>> The --clear-space-cache=v1 is optional, but recommended, if you are
>> someone who do not likes to keep accumulated cruft.
>>
>> The v2 mount (rw mount!!!) does not remove the v1 cache. If you just
>> mount with v2, the v1 data keeps being there, doing nothing any more.
>
>
> Theoretically I have the v2 space_cache enabled. After a clean umount...
>
> # mount -onospace_cache /backups
> [  391.243175] BTRFS info (device dm-3): disabling free space tree
> [  391.249213] BTRFS error (device dm-3): cannot disable free space tree
> [  391.255884] BTRFS error (device dm-3): open_ctree failed

"free space tree" is the v2 space cache, and once enabled it cannot be
disabled with nospace_cache mount option. If you want to run with
nospace_cache you'll need to clear it.


>
> # mount -ospace_cache=v1 /backups/
> mount: /backups: wrong fs type, bad option, bad superblock on
> /dev/mapper/Cached-Backups, missing codepage or helper program, or other
> error
> [  983.501874] BTRFS info (device dm-3): enabling disk space caching
> [  983.508052] BTRFS error (device dm-3): cannot disable free space tree
> [  983.514633] BTRFS error (device dm-3): open_ctree failed

You cannot go back and forth between v1 and v2. Once v2 is enabled,
it's always used regardless of any mount option. You'll need to use
btrfs check to clear the v2 cache if you want to use v1 cache.


>
> # btrfs check --clear-space-cache v1 /dev/Cached/Backups
> Opening filesystem to check...
> couldn't open RDWR because of unsupported option features (3).
> ERROR: cannot open file system

You're missing the '=' symbol for the clear option, that's why it fails.




-- 
Chris Murphy


Re: Recovery options for damaged beginning of the filesystem

2018-10-10 Thread Chris Murphy
On Tue, Oct 9, 2018 at 10:47 PM, Shapranov Vladimir
 wrote:
> Hi,
>
> I've got a filesystem with first ~50Mb accidentally dd'ed.
>
> "btrfs check" fails with a following error (regardless of "-s"):
> checksum verify failed on 21037056 found FC8A6557 wanted 2F51D090
> checksum verify failed on 21037056 found FC8A6557 wanted 2F51D090
> checksum verify failed on 21037056 found 1EDD5E47 wanted 222F7E7F
> checksum verify failed on 21037056 found 1EDD5E47 wanted 222F7E7F
> bytenr mismatch, want=21037056, have=13515002166904211737
> ERROR: cannot read chunk root
> ERROR: cannot open file system
>
> "mount -o ro /dev/sdf1 /mnt/tmp" fails, while "mount -o ro,subvol=X /mnt/tmp" 
> succeeds for "/" and couple subvolumes.

What do you get for 'btrfs rescue super-recover -v /dev/sdf1' ?

I thought the kernel code would not mount a Btrfs file system if the
first superblock is not present or valid (checksum match)?




-- 
Chris Murphy


Re: Scrub aborts due to corrupt leaf

2018-10-10 Thread Chris Murphy
On Wed, Oct 10, 2018 at 12:31 PM, Larkin Lowrey
 wrote:

> Interesting, because I do not see any indications of any other errors. The
> fs is backed by an mdraid array and the raid checks always pass with no
> mismatches, edac-util doesn't report any ECC errors, smartd doesn't report
> any SMART errors, and I never see any raid controller errors. I have the
> console connected through serial to a logging console server so if there
> were errors reported I would have seen them.

I think Holger is referring to the multiple reports like this:

[  817.883261] scsi_eh_0   S0   141  2 0x8000
[  817.66] Call Trace:
[  817.891391]  ? __schedule+0x253/0x860
[  817.895094]  ? scsi_try_target_reset+0x90/0x90
[  817.899631]  ? scsi_eh_get_sense+0x220/0x220
[  817.904045]  schedule+0x28/0x80
[  817.907260]  scsi_error_handler+0x1d2/0x5b0
[  817.911514]  ? __schedule+0x25b/0x860
[  817.915207]  ? scsi_eh_get_sense+0x220/0x220
[  817.919547]  kthread+0x112/0x130
[  817.922818]  ? kthread_create_worker_on_cpu+0x70/0x70
[  817.928015]  ret_from_fork+0x22/0x40


That isn't a SCSI controller or drive error itself; it's a capture of
a thread that's in the state of handling scsi errors (maybe).

I'm finding scsi_try_target_reset here at line 855
https://github.com/torvalds/linux/blob/master/drivers/scsi/scsi_error.c

And also line 2143 for scsi_error_handler
https://github.com/torvalds/linux/blob/master/drivers/scsi/scsi_error.c

Is the problem Btrfs on sysroot? Because if the sysroot file system is
entirely error free, I'd expect to eventually get a lot more error
information from the kernel even without sysrq+t rather than
faceplanting. Can you post the entire dmesg? The posted one starts at
~815 seconds, and the problems definitely start before then but as it
is we have nothing really to go on.


-- 
Chris Murphy


Re: Scrub aborts due to corrupt leaf

2018-10-10 Thread Chris Murphy
On Wed, Oct 10, 2018 at 10:04 AM, Holger Hoffstätte
 wrote:
> On 10/10/18 17:44, Larkin Lowrey wrote:
> (..)
>>
>> About once a week, or so, I'm running into the above situation where
>> FS seems to deadlock. All IO to the FS blocks, there is no IO
>> activity at all. I have to hard reboot the system to recover. There
>> are no error indications except for the following which occurs well
>> before the FS freezes up:
>>
>> BTRFS warning (device dm-3): block group 78691883286528 has wrong amount
>> of free space
>> BTRFS warning (device dm-3): failed to load free space cache for block
>> group 78691883286528, rebuilding it now
>>
>> Do I have any options other the nuking the FS and starting over?
>
>
> Unmount cleanly & mount again with -o space_cache=v2.

I'm pretty sure you have to umount, and then clear the space_cache
with 'btrfs check --clear-space-cache=v1' and then do a one time mount
with -o space_cache=v2.

But anyway, to me that seems premature because we don't even know
what's causing the problem.

a. Freezing means there's a kernel bug. Hands down.
b. Is it freezing on the rebuild? Or something else?
c. I think the devs would like to see the output from btrfs-progs
v4.17.1, 'btrfs check --mode=lowmem' and see if it finds anything, in
particular something not related to free space cache.

Rebuilding either version of space cache requires successfully reading
(and parsing) the extent tree.


-- 
Chris Murphy


Re: CoW behavior when writing same content

2018-10-09 Thread Chris Murphy
On Tue, Oct 9, 2018 at 11:25 AM, Andrei Borzenkov  wrote:
> 09.10.2018 18:52, Chris Murphy wrote:

>>> In this case is root/big_file and snapshot/big_file still share the same 
>>> data?
>>
>> You'll be left with three files. /big_file and root/big_file will
>> share extents,
>
> How comes they share extents? This requires --reflink, is it default now?

Good catch. It's not the default. I meant to write that initially only

root/big_file and snapshot/big_file have shared extents

And the shared extents are lost when snapshot/big_file is
"overwritten" by the copy into snapshot/


>> and snapshot/big_file will have its own extents. You'd
>> need to copy with --reflink for snapshot/big_file to have shared
>> extents with /big_file - or deduplicate.
>>
> This still overwrites the whole file in the sense original file content
> of "snapshot/big_file" is lost. That new content happens to be identical
> and that new content will probably be reflinked does not change the fact
> that original file is gone.

Agreed.

-- 
Chris Murphy


Re: CoW behavior when writing same content

2018-10-09 Thread Chris Murphy
On Tue, Oct 9, 2018 at 8:48 AM, Gervais, Francois
 wrote:
> Hi,
>
> If I have a snapshot where I overwrite a big file but which only a
> small portion of it is different, will the whole file be rewritten in
> the snapshot? Or only the different part of the file?

Depends on how the application modifies files. Many applications write
out a whole new file with a pseudorandom filename, fsync, then rename.

>
> Something like:
>
> $ dd if=/dev/urandom of=/big_file bs=1M count=1024
> $ cp /big_file root/
> $ btrfs sub snap root snapshot
> $ cp /big_file snapshot/
>
> In this case is root/big_file and snapshot/big_file still share the same data?

You'll be left with three files. /big_file and root/big_file will
share extents, and snapshot/big_file will have its own extents. You'd
need to copy with --reflink for snapshot/big_file to have shared
extents with /big_file - or deduplicate.
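
For example, to keep the new copy shared with /big_file, and then see
what's actually shared:

cp --reflink=always /big_file snapshot/
btrfs filesystem du /big_file root/big_file snapshot/big_file

The "Set shared" column shows how much of each file's data is shared
with something else.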


-- 
Chris Murphy


qgroups not enabled, but perf stats reports btrfs_qgroup_release_data and btrfs_qgroup_free_delayed_ref

2018-10-08 Thread Chris Murphy
[chris@flap ~]$ sudo perf stat -e 'btrfs:*' -a sleep 70
##And then I loaded a few sites in Firefox early on in those 70 seconds.

 Performance counter stats for 'system wide':

 5  btrfs:btrfs_transaction_commit
29  btrfs:btrfs_inode_new
29  btrfs:btrfs_inode_request
25  btrfs:btrfs_inode_evict
 1,602  btrfs:btrfs_get_extent
 0  btrfs:btrfs_handle_em_exist
 1  btrfs:btrfs_get_extent_show_fi_regular
88  btrfs:btrfs_truncate_show_fi_regular
19  btrfs:btrfs_get_extent_show_fi_inline
 2  btrfs:btrfs_truncate_show_fi_inline
   189  btrfs:btrfs_ordered_extent_add
   189  btrfs:btrfs_ordered_extent_remove
 9  btrfs:btrfs_ordered_extent_start
   592  btrfs:btrfs_ordered_extent_put
 1,207  btrfs:__extent_writepage
 1,203  btrfs:btrfs_writepage_end_io_hook
25  btrfs:btrfs_sync_file
 0  btrfs:btrfs_sync_fs
 0  btrfs:btrfs_add_block_group
 1,508  btrfs:add_delayed_tree_ref
 1,498  btrfs:run_delayed_tree_ref
   379  btrfs:add_delayed_data_ref
   336  btrfs:run_delayed_data_ref
 1,887  btrfs:add_delayed_ref_head
 1,839  btrfs:run_delayed_ref_head
 0  btrfs:btrfs_chunk_alloc
 0  btrfs:btrfs_chunk_free
   794  btrfs:btrfs_cow_block
 6,982  btrfs:btrfs_space_reservation
 0  btrfs:btrfs_trigger_flush
 0  btrfs:btrfs_flush_space
   952  btrfs:btrfs_reserved_extent_alloc
 0  btrfs:btrfs_reserved_extent_free
 1,005  btrfs:find_free_extent
 1,005  btrfs:btrfs_reserve_extent
   816  btrfs:btrfs_reserve_extent_cluster
 1  btrfs:btrfs_find_cluster
 0  btrfs:btrfs_failed_cluster_setup
 1  btrfs:btrfs_setup_cluster
 5,952  btrfs:alloc_extent_state
 6,034  btrfs:free_extent_state
   374  btrfs:btrfs_work_queued
   362  btrfs:btrfs_work_sched
   362  btrfs:btrfs_all_work_done
   116  btrfs:btrfs_ordered_sched
 0  btrfs:btrfs_workqueue_alloc
 0  btrfs:btrfs_workqueue_destroy
 0  btrfs:btrfs_qgroup_reserve_data
   201  btrfs:btrfs_qgroup_release_data
 1,839  btrfs:btrfs_qgroup_free_delayed_ref
 0  btrfs:btrfs_qgroup_account_extents
 0  btrfs:btrfs_qgroup_trace_extent
 0  btrfs:btrfs_qgroup_account_extent
 0  btrfs:qgroup_update_counters
 0  btrfs:qgroup_update_reserve
 0  btrfs:qgroup_meta_reserve
 0  btrfs:qgroup_meta_convert
 0  btrfs:qgroup_meta_free_all_pertrans
 0  btrfs:btrfs_prelim_ref_merge
 0  btrfs:btrfs_prelim_ref_insert
 2,663  btrfs:btrfs_inode_mod_outstanding_extents
 0  btrfs:btrfs_remove_block_group
 0  btrfs:btrfs_add_unused_block_group
 0  btrfs:btrfs_skip_unused_block_group

  70.004723586 seconds time elapsed

[chris@flap ~]$


Seems like a lot of activity for just a few transactions, but what
really caught my eye here is the qgroup reporting for a file system
that has never had qgroups enabled. Is it expected?


Chris Murphy


Re: btrfs problems

2018-09-20 Thread Chris Murphy
On Thu, Sep 20, 2018 at 3:36 PM Adrian Bastholm  wrote:
>
> Thanks a lot for the detailed explanation.
> Aabout "stable hardware/no lying hardware". I'm not running any raid
> hardware, was planning on just software raid.

Yep. I'm referring to the drives, their firmware, cables, logic board,
its firmware, the power supply, power, etc. Btrfs is by nature
intolerant of corruption. Other file systems are more tolerant because
they don't know about it (although recent versions of XFS and ext4 are
now defaulting to checksummed metadata and journals).


>three drives glued
> together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would
> this be a safer bet, or would You recommend running the sausage method
> instead, with "-d single" for safety ? I'm guessing that if one of the
> drives dies the data is completely lost
> Another variant I was considering is running a raid1 mirror on two of
> the drives and maybe a subvolume on the third, for less important
> stuff

RAID does not substantially reduce the chances of data loss. It's not
anything like a backup. It's an uptime enhancer. If you have backups,
and your primary storage dies, of course you can restore from backup
no problem, but it takes time and while the restore is happening,
you're not online - uptime is killed. If that's a negative, might want
to run RAID so you can keep working during the degraded period, and
instead of a restore you're doing a rebuild. But of course there is a
chance of failure during the degraded period. So you have to have a
backup anyway. At least with Btrfs/ZFS, there is another reason to run
with some replication like raid1 or raid5 and that's so that if
there's corruption or a bad sector, Btrfs doesn't just detect it, it
can fix it up with the good copy.

For what it's worth, make sure the drives have lower SCT ERC time than
the SCSI command timer. This is the same for Btrfs as it is for md and
LVM RAID. The command timer default is 30 seconds, and most drives
have SCT ERC disabled with very high recovery times well over 30
seconds. So either set SCT ERC to something like 70 deciseconds. Or
increase the command timer to something like 120 or 180 (either one is
absurdly high but what you want is for the drive to eventually give up
and report a discrete error message which Btrfs can do something
about, rather than do a SATA link reset in which case Btrfs can't do
anything about it).
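
For example, assuming /dev/sdX is one of the drives (note the SCT ERC
setting typically doesn't survive a power cycle, so it needs to be
reapplied at boot, e.g. by a udev rule or startup script):

smartctl -l scterc /dev/sdX          # show current SCT ERC setting
smartctl -l scterc,70,70 /dev/sdX    # set read/write recovery to 7.0s
cat /sys/block/sdX/device/timeout    # SCSI command timer, default 30
echo 180 > /sys/block/sdX/device/timeout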




-- 
Chris Murphy


Re: btrfs problems

2018-09-20 Thread Chris Murphy
On Thu, Sep 20, 2018 at 11:23 AM, Adrian Bastholm  wrote:
> On Mon, Sep 17, 2018 at 2:44 PM Qu Wenruo  wrote:
>
>>
>> Then I strongly recommend to use the latest upstream kernel and progs
>> for btrfs. (thus using Debian Testing)
>>
>> And if anything went wrong, please report asap to the mail list.
>>
>> Especially for fs corruption, that's the ghost I'm always chasing for.
>> So if any corruption happens again (although I hope it won't happen), I
>> may have a chance to catch it.
>
> You got it
>> >
>> >> Anyway, enjoy your stable fs even it's not btrfs
>
>> > My new stable fs is too rigid. Can't grow it, can't shrink it, can't
>> > remove vdevs from it , so I'm planning a comeback to BTRFS. I guess
>> > after the dust settled I realize I like the flexibility of BTRFS.
>> >
> I'm back to btrfs.
>
>> From the code aspect, the biggest difference is the chunk layout.
>> Due to the ext* block group usage, each block group header (except some
>> sparse bg) is always used, thus btrfs can't use them.
>>
>> This leads to highly fragmented chunk layout.
>
> The only thing I really understood is "highly fragmented" == not good
> . I might need to google these "chunk" thingies

Chunks are synonymous with block groups. They're like a super extent,
or an extent of extents.

The block group is how Btrfs abstracts the logical address used most
everywhere in Btrfs land from the device + physical location of
extents. It's how a file is referenced only by a logical address, and
doesn't need to know either where the extent is located, or how many
copies there are. The block group allocation profile is what determines
whether there's one copy, duplicate copies, raid1, 10, 5, 6 copies of a
chunk and where the copies are located. It's also fundamental to how
device add, remove, replace, file system resize, and balance all
interrelate.
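
You can see how space has been allocated into block groups on a mounted
file system with, for example (substitute your own mount point):

btrfs filesystem df /mnt      # data/metadata/system totals per profile
btrfs filesystem usage /mnt   # the same, plus a per-device breakdown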


>> If your primary concern is to make the fs as stable as possible, then
>> keep snapshots to a minimal amount, avoid any functionality you won't
>> use, like qgroup, routinely balance, RAID5/6.
>
> So, is RAID5 stable enough ? reading the wiki there's a big fat
> warning about some parity issues, I read an article about silent
> corruption (written a while back), and chris says he can't recommend
> raid56 to mere mortals.

Depends on how you define stable. In recent kernels it's stable on
stable hardware, i.e. no lying hardware (actually flushes when it
claims it has), no power failures, and no failed devices. Of course
it's designed to help protect against a clear loss of a device, but
there's tons of stuff here that's just not finished including ejecting
bad devices from the array like md and lvm raids will do. Btrfs will
just keep trying, through all the failures. There are some patches to
moderate this but I don't think they're merged yet.

You'd also want to be really familiar with how to handle degraded
operation, if you're going to depend on it, and how to replace a bad
device. Last I refreshed my memory on it, it's advised to use "btrfs
device add" followed by "btrfs device remove" for raid56; whereas
"btrfs replace" is preferred for all other profiles. I'm not sure if
the "btrfs replace" issues with parity raid were fixed.

Metadata as raid56 shows a lot more problem reports than metadata
raid1, so there's something goofy going on in those cases. I'm not
sure how well understood they are. But other people don't have
problems with it.

It's worth looking through the archives on this. Btrfs raid56 isn't
strictly COW: there is read-modify-write code, which means there can
be overwrites. I vaguely recall that it's COW at the logical layer,
but the physical writes can end up being RMW rather than strictly
COW.



-- 
Chris Murphy


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-20 Thread Chris Murphy
On Wed, Sep 19, 2018 at 1:41 PM, Jürgen Herrmann  wrote:
> Am 13.9.2018 14:35, schrieb Nikolay Borisov:
>>
>> On 13.09.2018 15:30, Jürgen Herrmann wrote:
>>>
>>> OK, I will install kdump later and perform a dump after the hang.
>>>
>>> One more noob question beforehand: does this dump contain sensitive
>>> information, for example the luks encryption key for the disk etc? A
>>> Google search only brings up one relevant search result which can only
>>> be viewed with a redhat subscription...
>>
>>
>>
>> So a kdump will dump the kernel memory so it's possible that the LUKS
>> encryption keys could be extracted from that image. Bummer, it's
>> understandable why you wouldn't want to upload it :). In this case you'd
>> have to install also the 'crash' utility to open the crashdump and
>> extract the calltrace of the btrfs process. The rough process should be :
>>
>>
>> crash 'path to vm linux' 'path to vmcore file', then once inside the
>> crash utility :
>>
>> set <pid>, you can acquire the pid by issuing 'ps'
>> which will give you a ps-like output of all running processes at the
>> time of crash. After the context has been set you can run 'bt' which
>> will give you a backtrace of the send process.
>>
>>
>>
>>>
>>> Best regards,
>>> Jürgen
>>>
>>> Am 13. September 2018 14:02:11 schrieb Nikolay Borisov
>>> :
>>>
>>>> On 13.09.2018 14:50, Jürgen Herrmann wrote:
>>>>>
>>>>> I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also
>>>>> after the hang because I started the loop before the hang. The dmesg
>>>>> output should show the hanging tasks from second 346 on or so. Still
>>>>> not
>>>>> useful?
>>>>>
>>>>
>>>> So from 346 it's evident that transaction commit is waiting for
>>>> commit_root_sem to be acquired. So something else is holding it and not
>>>> giving the transaction chance to finish committing. Now the only place
>>>> where send acquires this lock is in find_extent_clone around the  call
>>>> to extent_from_logical. The latter basically does an extent tree search
>>>> and doesn't loop so can't possibly deadlock. Furthermore I don't see any
>>>> userspace processes being hung in kernel space.
>>>>
>>>> Additionally looking at the userspace processes they indicate that
>>>> find_extent_clone has finished and are blocked in send_write_or_clone
>>>> which does the write. But I guess this actually happens before the hang.
>>>>
>>>>
>>>> So at this point without looking at the stacktrace of the btrfs send
>>>> process after the hung has occurred I don't think much can be done
>>>
>>>
> I know this is probably not the correct list to ask this question but maybe
> someone of the devs can point me to the right list?
>
> I cannot get kdump to work. The crashkernel is loaded and everything is
> setup for it afaict. I asked a question on this over at stackexchange but no
> answer yet.
> https://unix.stackexchange.com/questions/469838/linux-kdump-does-not-boot-second-kernel-when-kernel-is-crashing
>
> So i did a little digging and added some debug printk() statements to see
> whats going on and it seems that panic() is never called. maybe the second
> stack trace is the reason?
> Screenshot is here: https://t-5.eu/owncloud/index.php/s/OegsikXo4VFLTJN
>
> Could someone please tell me where I can report this problem and get some
> help on this topic?


Try kexec mailing list. They handle kdump.

http://lists.infradead.org/mailman/listinfo/kexec



-- 
Chris Murphy


Re: inline extents

2018-09-19 Thread Chris Murphy
Adding fsdevel@, linux-ext4, and btrfs@ (which has a separate subject
on this same issue)



On Wed, Sep 19, 2018 at 7:45 PM, Dave Chinner  wrote:
>On Wed, Sep 19, 2018 at 10:23:38AM -0600, Chris Murphy wrote:
>> Fedora 29 has a new feature to test if boot+startup fails, so the
>> bootloader can do a fallback at next boot, to a previously working
>> entry. Part of this means GRUB (the bootloader code, not the user
>> space code) uses "save_env" to overwrite the 1024 data bytes with
>> updated environment information.
>
> That's just broken. Illegal. Completely unsupportable. Doesn't
> matter what the filesystem is, nobody is allowed to write directly
> to the block device a filesystem owns.

Yeah, the word I'm thinking of is abomination.

However in their defense, grubenv and the 'save_env' command are old features:

line 3638 @node Environment block
http://git.savannah.gnu.org/cgit/grub.git/tree/docs/grub.texi

"For safety reasons, this storage is only available when installed on a plain
disk (no LVM or RAID), using a non-checksumming filesystem (no ZFS), and
using BIOS or EFI functions (no ATA, USB or IEEE1275)."

I haven't checked how it tests for this. But by now, it should list
the supported file systems, rather than what's exempt. That's a
shorter list.


> ext4 has inline data, too, so there's every chance grub will corrupt
> ext4 filesystems with tit's wonderful new feature. I'm not sure if
> the ext4 metadata cksums cover the entire inode and inline data, but
> if they do it's the same problem as btrfs.

I don't see inline used with a default mkfs, but I do see metadata_csum

e2fsprogs-1.44.3-1.fc29.x86_64

Filesystem features: has_journal ext_attr resize_inode dir_index
filetype extent 64bit flex_bg sparse_super large_file huge_file
dir_nlink extra_isize metadata_csum
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl

>
>> For XFS, I'm not sure how the inline extent is saved, and whether
>> metadata checksumming includes or excludes the inline extent.
>
> When XFS implements this, it will be like btrfs as the data will be
> covered by the metadata CRCs for the inode, and so writing directly
> to it would corrupt the inode and render it unreadable by the
> filesystem.

Good to know.


>
>> I'm also kinda ignoring the reflink ramifications of this behavior,
>> for now. Let's just say even if there's no corruption I'm really
>> suspicious of bootloader code writing anything, even what seems to be
>> a simple overwrite of two sectors.
>
> You're not the only one
>
> Like I said, it doesn't matter what the filesystem is, overwriting
> file data by writing directly to the block device is not
> supportable. It's essentially a filesystem corruption vector, and
> grub needs to have that functionality removed immediately.

I'm in agreement with respect to the more complex file systems. We've
already realized the folly of the bootloader being unable to do
journal replay, ergo it doesn't reliably have a complete picture of
the file system anyway.
in boot failure. But if it were going to use stale file system
information, get a wrong idea of the file system, and then use that to
do even 1024 bytes of writes? No, no, and no.

Meanwhile, also in Fedoraland, it's one of the distros where grubenv
and grub.cfg stuff is on the EFI System partition, which is FAT. This
overwrite behavior will work there, but even this case is a kind of
betrayal, in that the file is being modified without its metadata
being updated. I think it's an old-era hack that by today's standards simply
isn't good enough. I'm a little surprised that all UEFI
implementations permit arbitrary writes from the pre-boot environment
to arbitrary block devices, even with Secure Boot enabled. That seems
specious.

I know some of the file systems have reserve areas for bootloader
stuff. I'm not sure if that's preferred over bootloaders just getting
their own partition and controlling it stem to stern however they
want.


-- 
Chris Murphy


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Chris Murphy
On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy  wrote:
> https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F
>
> Does anyone know if this is still a problem on Btrfs if grubenv has
> xattr +C set? In which case it should be possible to overwrite and
> there's no csums that are invalidated.

I'm wrong.

$ sudo grub2-editenv --verbose grubenv create
[sudo] password for chris:
[chris@f29h ~]$ ll
-rw-r--r--. 1 root  root 1024 Sep 18 13:37 grubenv
[chris@f29h ~]$ stat -f grubenv
  File: "grubenv"
ID: ac9ba8ecdce5b017 Namelen: 255 Type: btrfs
Block size: 4096   Fundamental block size: 4096
Blocks: Total: 46661632   Free: 37479747   Available: 37422535
Inodes: Total: 0  Free: 0
[chris@f29h ~]$ sudo filefrag -v grubenv
Filesystem type is: 9123683e
File size of grubenv is 1024 (1 block of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..4095:  0..  4095:   4096:
last,not_aligned,inline,eof
grubenv: 1 extent found
[chris@f29h ~]$

So it's an inline extent, which means nocow doesn't apply. It's
metadata so it *must* be COW. And any overwrite would trigger a
metadata checksum error.

First I'd argue it should refuse to create the file on Btrfs. But if
it does create grubenv, instead it should know that on Btrfs it must
redirect it to the appropriate btrfs reserved area (no idea how this
works) rather than to a file.



-- 
Chris Murphy


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Chris Murphy
On Tue, Sep 18, 2018 at 1:11 PM, Goffredo Baroncelli  wrote:


>> I think it's a problem, and near as I can tell it'll be a problem for
>> all kinds of complex storage. I don't see how the bootloader itself
>> can do an overwrite onto raid5 or raid6.
>
>
>> That's certainly supported by GRUB for reading
> Not yet, I am working on that [1]


Sorry! I meant mdadm raid56. GRUB has definitely been able to read
that format for some time, and even degraded! It's pretty cool. But I
see no way that it's sane to have the bootloader write to such a
volume.

I've run into issues where grub2-mkconfig or grubby changes the
grub.cfg, and then a really fast reboot happens without cleanly
unmounting the volume - and what happens? Can't boot. The bootloader
can't do log replay so it doesn't see the new grub.cfg at all. If all
you do is mount the volume and unmount, log replay happens, the fs
metadata is all fixed up just fine, and now the bootloader can see it.
This same problem can happen with the kernel and initramfs
installations.

(Hilariously, the reason this can happen is a process exempting
itself from being forcibly killed by systemd, *against* the documented
advice of systemd devs that you should only do this for processes not
on the rootfs; as a consequence of this process doing the wrong thing,
systemd at reboot time ends up doing an unclean unmount and reboot
because it won't kill the kill-exempt process.)

So *already* we have file systems that are becoming too complicated
for the bootloader to reliably read, because they cannot do journal
replay, let alone have any chance of modifying them (nor would I want
them to do this). So yeah, I'm very rapidly becoming opposed to
grubenv on anything but super simple volumes, like maybe ext4 without
a journal (extents are nice); or perhaps GRUB should just implement
its own damn file system and we give it its own partition - similar to
BIOS Boot - but probably a little bigger.


>
>> but is the bootloader overwrite of gruvenv going to
>> recompute parity and write to multiple devices? Eek!
>
> Recompute the parity should not be a big deal. Updating all the (b)trees 
> would be a too complex goal.

I think it's just asking for trouble. Sometimes the best answer ends
up being no, no and definitely no.

-- 
Chris Murphy


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Chris Murphy
On Tue, Sep 18, 2018 at 1:01 PM, Andrei Borzenkov  wrote:
> 18.09.2018 21:57, Chris Murphy wrote:
>> On Tue, Sep 18, 2018 at 12:16 PM, Andrei Borzenkov  
>> wrote:
>>> 18.09.2018 08:37, Chris Murphy wrote:
>>
>>>> The patches aren't upstream yet? Will they be?
>>>>
>>>
>>> I do not know. Personally I think much easier is to make grub location
>>> independent of /boot, allowing grub be installed in separate partition.
>>> This automatically covers all other cases (like MD, LVM etc).
>>
>> The only case where I'm aware of this happens is Fedora on UEFI where
>> they write grubenv and grub.cfg on the FAT ESP. I'm pretty sure
>> upstream expects grubenv and grub.cfg at /boot/grub and I haven't ever
>> seen it elsewhere (except Fedora on UEFI).
>>
>> I'm not sure this is much easier. Yet another volume that would be
>> persistently mounted? Where? A nested mount at /boot/grub? I'm not
>> liking that at all. Even Windows and macOS have saner and simpler to
>> understand booting methods than this.
>>
>>
> That's exactly what Windows ended up with - separate boot volume with
> bootloader related files.

The OEM installer will absolutely install to a single partition. If
you point it to a blank drive on BIOS it will preferentially create a
"system" volume that's used for booting. But it's not mandatory. On
UEFI, it doesn't create a "system" volume, just "recovery" is ~500M
and "reserved" 16M. The reserved partition is blank unless you've done
some resizing on the main volume. The recovery volume contains
Winre.wim which is used for doing resets. If you blow away that
partition, you can still boot, but you can't do resets.

-- 
Chris Murphy


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Chris Murphy
On Tue, Sep 18, 2018 at 12:25 PM, Austin S. Hemmelgarn
 wrote:

> It actually is independent of /boot already.  I've got it running just fine
> on my laptop off of the EFI system partition (which is independent of my
> /boot partition), and thus have no issues with handling of the grubenv file.
> The problem is that all the big distros assume you want it in /boot, so they
> have no option for putting it anywhere else.
>
> Actually installing it elsewhere is not hard though, you just pass
> `--boot-directory=/wherever` to the `grub-install` script and turn off your
> distributions automatic reinstall mechanism so it doesn't get screwed up by
> the package manager when the GRUB package gets updated. You can also make
> `/boot/grub` a symbolic link pointing to the real GRUB directory, so that
> you don't have to pass any extra options to tools like grub-reboot or
> grub-set-default.

This is how Fedora builds their signed grubx64.efi to behave. But you
cannot ever run grub-install on a Secure Boot enabled computer, or you
now have to learn all about signing your own binaries. I don't even
like doing that, let alone saner users.

So for those distros that support Secure Boot, in practice you're
stuck with the behavior of their prebuilt GRUB binary that goes on the
ESP.


-- 
Chris Murphy


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Chris Murphy
On Tue, Sep 18, 2018 at 12:16 PM, Andrei Borzenkov  wrote:
> 18.09.2018 08:37, Chris Murphy wrote:

>> The patches aren't upstream yet? Will they be?
>>
>
> I do not know. Personally I think much easier is to make grub location
> independent of /boot, allowing grub be installed in separate partition.
> This automatically covers all other cases (like MD, LVM etc).

The only case where I'm aware of this happens is Fedora on UEFI where
they write grubenv and grub.cfg on the FAT ESP. I'm pretty sure
upstream expects grubenv and grub.cfg at /boot/grub and I haven't ever
seen it elsewhere (except Fedora on UEFI).

I'm not sure this is much easier. Yet another volume that would be
persistently mounted? Where? A nested mount at /boot/grub? I'm not
liking that at all. Even Windows and macOS have saner and simpler to
understand booting methods than this.


-- 
Chris Murphy


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Chris Murphy
On Tue, Sep 18, 2018 at 11:15 AM, Goffredo Baroncelli
 wrote:
> On 18/09/2018 06.21, Chris Murphy wrote:
>> b. The bootloader code, would have to have sophisticated enough Btrfs
>> knowledge to know if the grubenv has been reflinked or snapshot,
>> because even if +C, it may not be valid to overwrite, and COW must
>> still happen, and there's no way the code in GRUB can do full blow COW
>> and update a bunch of metadata.
>
> And what if GRUB ignore the possibility of COWing and overwrite the data ? Is 
> it a so big problem that the data is changed in all the snapshots ?
> It would be interested if the same problem happens for a swap file.

I think it's an abomination :-) It totally perverts the idea of
reflinks and snapshots and blurs the line between domains. Is it a
user file or not and are these user space commands or not and are they
reliable or do they have exceptions?

I have a boot subvolume mounted at /boot, and this boot subvolume gets
snapshotted, and if GRUB can overwrite grubenv, it overwrites the
purported GRUB state information in every one of those boots, going
back maybe months, even when these are read-only subvolumes.

I think it's a problem, and near as I can tell it'll be a problem for
all kinds of complex storage. I don't see how the bootloader itself
can do an overwrite onto raid5 or raid6. That's certainly supported by
GRUB for reading, but is the bootloader overwrite of gruvenv going to
recompute parity and write to multiple devices? Eek!


-- 
Chris Murphy


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-17 Thread Chris Murphy
On Mon, Sep 17, 2018 at 11:24 PM, Andrei Borzenkov  wrote:
> 18.09.2018 07:21, Chris Murphy wrote:
>> On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy  
>> wrote:
>>> https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F
>>>
>>> Does anyone know if this is still a problem on Btrfs if grubenv has
>>> xattr +C set? In which case it should be possible to overwrite and
>>> there's no csums that are invalidated.
>>>
>>> I kinda wonder if in 2018 it's specious for, effectively out of tree
>>> code, to be making modifications to the file system, outside of the
>>> file system.
>>
>> a. The bootloader code (pre-boot, not user space setup stuff) would
>> have to know how to read xattr and refuse to overwrite a grubenv
>> lacking xattr +C.
>> b. The bootloader code, would have to have sophisticated enough Btrfs
>> knowledge to know if the grubenv has been reflinked or snapshot,
>> because even if +C, it may not be valid to overwrite, and COW must
>> still happen, and there's no way the code in GRUB can do full blow COW
>> and update a bunch of metadata.
>>
>> So answering my own question, this isn't workable. And it seems the
>> same problem for dm-thin.
>>
>> There are a couple of reserve locations in Btrfs at the start and I
>> think after the first superblock, for bootloader embedding. Possibly
>> one or both of those areas could be used for this so it's outside the
>> file system. But other implementations are going to run into this
>> problem too.
>>
>
> That's what SUSE grub2 version does - it includes patches to redirect
> writes on btrfs to reserved area. I am not sure how it behaves in case
> of multi-device btrfs though.

The patches aren't upstream yet? Will they be?

They redirect writes to grubenv specifically? Or do they use the
reserved areas like a hidden and fixed location for what grubenv would
contain?

I guess the user space grub-editenv could write to grubenv, which even
if COW, GRUB can pick up that change. But GRUB itself writes its
changes to a reserved area.

Hmmm. Complicated.

-- 
Chris Murphy


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-17 Thread Chris Murphy
On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy  wrote:
> https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F
>
> Does anyone know if this is still a problem on Btrfs if grubenv has
> xattr +C set? In which case it should be possible to overwrite and
> there's no csums that are invalidated.
>
> I kinda wonder if in 2018 it's specious for, effectively out of tree
> code, to be making modifications to the file system, outside of the
> file system.

a. The bootloader code (pre-boot, not user space setup stuff) would
have to know how to read xattr and refuse to overwrite a grubenv
lacking xattr +C.
b. The bootloader code, would have to have sophisticated enough Btrfs
knowledge to know if the grubenv has been reflinked or snapshot,
because even if +C, it may not be valid to overwrite, and COW must
still happen, and there's no way the code in GRUB can do full blow COW
and update a bunch of metadata.

So answering my own question, this isn't workable. And it seems the
same problem for dm-thin.

There are a couple of reserve locations in Btrfs at the start and I
think after the first superblock, for bootloader embedding. Possibly
one or both of those areas could be used for this so it's outside the
file system. But other implementations are going to run into this
problem too.

I'm not sure how else to persist state, unless NVRAM is sufficiently
wear resilient to handle writes possibly every day, for every boot, to
indicate boot success/fail.

-- 
Chris Murphy


GRUB writing to grubenv outside of kernel fs code

2018-09-17 Thread Chris Murphy
https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F

Does anyone know if this is still a problem on Btrfs if grubenv has
xattr +C set? In which case it should be possible to overwrite and
there's no csums that are invalidated.

I kinda wonder if in 2018 it's specious for, effectively out of tree
code, to be making modifications to the file system, outside of the
file system.


-- 
Chris Murphy


Re: btrfs problems

2018-09-16 Thread Chris Murphy
[...] there's a known highly tested
backport, because they want "The Behavior" to be predictable, both
good and bad. That is not a model well suited for a file system that's
in a really active development state, like Btrfs. It's better now than it was
even a couple years ago, where I'd say: just don't use RHEL or Debian
or anything with old kernels except for experimenting; it's not worth
the hassle; you're inevitably gonna have to use a newer kernel because
all the Btrfs devs are busy making metric shittonnes of fixes in the
mainline version. Today, it's not as bad as that. But still 4.9 is old
in Btrfs terms. Should it be stable? For *your* problem for sure
because that's just damn strange and something very goofy is going on.
But is it possible there's a whole series of bugs happening in
sequence that results in this kind of corruption? No idea. Maybe.

And that's the main reason why quite a lot of users on this list use
Fedora, Arch, Gentoo - so they're using the newest stable or even
mainline rc kernels.

And so if you want to run any file system, including Btrfs, in
production with older kernels,  you pick a distro that's doing that
work. And right now it's openSUSE and SUSE that have the most Btrfs
developers supporting 4.9 and 4.14 kernels and Btrfs. Most of those
users are getting distro support, I don't often see SUSE users on
here.

OpenZFS is a different strategy because they're using out of tree
code. So you can run older kernels, and compile the current openzfs
code base against your older kernel. In effect you're using an older
distro kernel, but with a new file system code base supported by that
upstream.



-- 
Chris Murphy


Re: Move data and mount point to subvolume

2018-09-16 Thread Chris Murphy
On Sun, Sep 16, 2018 at 12:40 PM, Rory Campbell-Lange
 wrote:

> Thanks very much for spotting my error, Chris.
>
> # mount | grep bkp
> /dev/mapper/cdisk2 on /bkp type btrfs
> (rw,noatime,compress=lzo,space_cache,subvolid=5,subvol=/)
>
> # btrfs subvol list /bkp
> ID 258 gen 313636 top level 5 path backup
>
> I'm a bit confused about the difference between / and backup, which is
> at /bkp/backup.


top level, subvolid=5, subvolid=0, subvol=/, FS_TREE are all the same
thing. This is the subvolume that's created at mkfs time, it has no
name, it can't be deleted, and at mkfs time if you do

# btrfs sub get-default <mountpoint>
ID 5 (FS_TREE)

So long as you haven't changed the default subvolume, the top level
subvolume is what gets mounted, unless you use "-o subvol=" or "-o
subvolid=" mount option.

If you do
# btrfs sub list -ta /bkp

It might become a bit more clear what the layout is on disk. And for
an even more verbose output you can do:

# btrfs insp dump-t -t fs_tree /dev/### for this you need to
specify device not mountpoint, you don't need to umount, it's a read
only command


Anything that in the "top level" or the "file system root" you will
see listed. The first number is the inode, you'll see 256 is a special
inode for subvolumes. You can do 'ls -li' and compare. Any subvolume
you create is not FS_TREE, it is a "file tree". And note that each
subvolume has it's own pile of inode numbers, meaning
files/directories only have unique inode numbers *in a given
subvolume*. Those inode numbers start over in a new subvolume.

Subvolumes share extent, chunk, csum, uuid and other trees, so a
subvolume is not a completely isolated "file system".


>
> Anyhow I've verified I can snapshot /bkp/backup to another subvolume.
> This means I don't need to move anything, simply remount /bkp at
> /bkp/backup.

Uhh, that's the reverse of what you said in the first message. I'm not
sure what you want to do. It sounds like you want to mount the
subvolume "backup" at /bkp/ so that all the other files/dirs on this
Btrfs volume are not visible through the /bkp/ mount path?

Anyway if you want to explicitly mount the subvolume "backup"
somewhere, you use -o subvol=backup to specify "the subvolume named
backup, not the top level subvolume".





>
> Presumably I can therefore remount /bkp at subvolume /backup?
>
> # btrfs subvolume show /bkp/backup | egrep -i 'name|uuid|subvol'
> Name:   backup
> UUID:   d17cf2ca-a6db-ca43-8054-1fd76533e84b
> Parent UUID:-
> Received UUID:  -
> Subvolume ID:   258
>
> My fstab is presently
>
> UUID=da90602a-b98e-4f0b-959a-ce431ac0cdfa /bkp  btrfs  
> noauto,noatime,compress=lzo 0  2
>
> I guess it would now be
>
> UUID=d17cf2ca-a6db-ca43-8054-1fd76533e84b /bkp  btrfs  
> noauto,noatime,compress=lzo 0  2

No you can't mount by subvolume UUID. You continue to specify the
volume UUID, but then add a mount option


noauto,noatime,compress=lzo,subvol=backup

or

noauto,noatime,compress=lzo,subvolid=258


The advantage of subvolid is that it doesn't change when you rename
the subvolume.
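
So keeping the volume UUID you already have in fstab, the whole line
would look something like:

UUID=da90602a-b98e-4f0b-959a-ce431ac0cdfa /bkp  btrfs  noauto,noatime,compress=lzo,subvol=backup 0  2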


>
>> If you snapshot a subvolume, which itself contains subvolumes, the
>> nested subvolumes are not snapshot. In the snapshot, the nested
>> subvolumes are empty directories.
>>
>> >
>> > # btrfs fi du -s /bkp/backup-subvol/backup
>> >  Total   Exclusive  Set shared  Filename
>> > ERROR: cannot check space of '/bkp/backup-subvol/backup': Inappropriate
>> > ioctl for device
>>
>> That's a bug in older btrfs-progs. It's been fixed, but I'm not sure
>> what version, maybe by 4.14?
>
> Sounds about right -- my version is 4.7.3.

It's not dangerous to use it (though --repair is more dangerous; don't
use that without advice first, no matter the version). You just don't
get new features and bug fixes. It's also not dangerous to use
something much newer, again if the user space tools are very new and
the kernel is old, you just don't get certain features.




-- 
Chris Murphy


Re: btrfs problems

2018-09-16 Thread Chris Murphy
On Sun, Sep 16, 2018 at 7:58 AM, Adrian Bastholm  wrote:
> Hello all
> Actually I'm not trying to get any help any more, I gave up BTRFS on
> the desktop, but I'd like to share my efforts of trying to fix my
> problems, in hope I can help some poor noob like me.

There's almost no useful information provided for someone to even try
to reproduce your results, isolate cause and figure out the bugs.

No kernel version. No btrfs-progs version. No description of the
hardware and how it's laid out, and what mkfs and mount options are
being used. No one really has the time to speculate.


>BTRFS check --repair is not recommended

Right. So why did you run it anyway?

man btrfs check:

Warning
   Do not use --repair unless you are advised to do so by a
developer or an experienced user


It is always a legitimate complaint, despite this warning, if btrfs
check --repair makes things worse, because --repair shouldn't ever
make things worse. But Btrfs repairs are complicated, and that's why
the warning is there. I suppose the devs could have made the flag
--riskyrepair but I doubt this would really slow users down that much.
Many of the --repair fixes weren't known to make things worse at the
time, and edge cases where they made things worse kept popping up, so
only in hindsight does it make sense that --repair maybe should have
been called something different to catch the user's attention.

But anyway, I see this same sort of thing on the linux-raid list all
the time. People run into trouble, and they press full forward making
all kinds of changes, each change increases the chance of data loss.
And then they come on the list with WTF messages. And it's always a
lesson in patience for the list regulars and developers... if only
you'd come to us with questions sooner.

> Please have a look at the console logs.

These aren't logs. It's a record of shell commands. Logs would include
kernel messages, ideally all of them. Why is device 3 missing? We have
no idea. Most of Btrfs code is in the kernel, problems are reported by
the kernel. So we need kernel messages, user space messages aren't
enough.

Anyway, good luck with openzfs, cool project.


-- 
Chris Murphy


Re: Move data and mount point to subvolume

2018-09-16 Thread Chris Murphy
> So I did this:
>
> btrfs subvol snapshot /bkp /bkp/backup-subvol
>
> strangely while /bkp/backup has lots of files in it,
> /bkp/backup-subvol/backup has none.
>
> # btrfs subvol list /bkp
> ID 258 gen 313585 top level 5 path backup
> ID 4782 gen 313590 top level 5 path backup-subvol

OK so previously you said "/bkp which is a top level subvolume. There
are no other subvolumes."

But in fact backup is already a subvolume. So now it's confusing what
you were asking for in the first place; maybe you didn't realize
backup is not a dir but a subvolume.

If you snapshot a subvolume, which itself contains subvolumes, the
nested subvolumes are not snapshot. In the snapshot, the nested
subvolumes are empty directories.


>
> # btrfs fi du -s /bkp/backup-subvol/backup
>  Total   Exclusive  Set shared  Filename
> ERROR: cannot check space of '/bkp/backup-subvol/backup': Inappropriate
> ioctl for device

That's a bug in older btrfs-progs. It's been fixed, but I'm not sure
what version, maybe by 4.14?


>
> Any ideas about what could be going on?
>
> In the mean time I'm trying:
>
> btrfs subvol create /bkp/backup-subvol
> cp -prv --reflink=always /bkp/backup/* /bkp/backup-subvol/

Yeah that will take a lot of writes that are not necessary, now that
you see backup is a subvolume already. If you want a copy of it, just
snapshot it.

-- 
Chris Murphy


Re: Move data and mount point to subvolume

2018-09-16 Thread Chris Murphy
On Sun, Sep 16, 2018 at 5:14 AM, Rory Campbell-Lange
 wrote:
> Hi
>
> We have a backup machine that has been happily running its backup
> partitions on btrfs (on top of a luks encrypted disks) for a few years.
>
> Our backup partition is on /bkp which is a top level subvolume.
> Data, RAID1: total=2.52TiB, used=1.36TiB
> There are no other subvolumes.

and

> /dev/mapper/cdisk2 on /bkp type btrfs 
> (rw,noatime,compress=lzo,space_cache,subvolid=5,subvol=/)

I like Hans' 2nd email advice to snapshot the top level subvolume.

I would start out with:
btrfs sub snap -r /bkp /bkp/toplevel.ro

And that way I shouldn't be able to F this up irreversibly if I make a
mistake. :-D And then do another snapshot that's rw:

btrfs sub snap /bkp /bkp/bkpsnap
cd /bkp/bkpsnap

Now remove everything except "backupdir". Then move everything out of
backupdir including any hidden files. Then rmdir backupdir. Then you
can rename the snapshot/subvolume
cd ..
mv bkpsnap backup

That's less metadata writes than creating a new subvolume, and reflink
copying the backup dir, e.g. cp -a --reflink /bkp/backupdir
/bkp/backupsubvol

That could take a long time because all the metadata is fully read,
modified (new inodes) and written out.

But either way it should work.

-- 
Chris Murphy


Re: state of btrfs snapshot limitations?

2018-09-14 Thread Chris Murphy
On Fri, Sep 14, 2018 at 3:05 PM, James A. Robinson
 wrote:

> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup
>
> talks about the basic snapshot capabilities of btrfs and led
> me to look up what, if any, limits might apply.  I find some
> threads from a few years ago that talk about limiting the
> number of snapshots for a volume to 100.

It does seem variable and I'm not certain what the pattern is that
triggers pathological behavior. There's a container thread from about
a year ago with someone using Docker on Btrfs creating more than 100K
containers per day, but I don't know the turnover rate. That person
does say it's deletion that's expensive, but not intolerably so.

My advice is you come up with as many strategies as you can implement.
Because if one strategy starts to implode with terrible performance,
you can just bail on it (or try fixing it, or submitting bug reports
to make Btrfs better down the road, etc.), and yet you still have one
or more other strategies that are still viable.

By strategy, you might want to implement both your ideal and
conservative approaches, and also something in the middle. Also, it's
reasonable to mirror those strategies on a different storage stack,
e.g. LVM thin volumes and XFS. LVM thin volumes are semi-cheap to
create, and semi-cheap to delete; whereas Btrfs snapshots are almost
free to create, and expensive to delete (varies depending on changes
in it or the subvolume it's created from). But if the LVM thin pool's
metadata pool runs out of space, it's big trouble. I expect to lose
all the LV's if that ever happens. Also, this strategy doesn't have
send/receive, so ordinary use of rsync is expensive since it reads and
compares both source and destination. The first answer for this
question contains a possible work around depending on hard links.

https://serverfault.com/questions/489289/handling-renamed-files-or-directories-in-rsync
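
The general idea there (not necessarily the exact recipe in that
answer) is something like:

rsync -a --delete --link-dest=/backups/2018-09-13 /source/ /backups/2018-09-14/

where unchanged files in the new backup become hard links into the
previous backup instead of being transferred and stored again.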


With Btrfs, the big scalability issue is the extent tree, which is
shared among all snapshots and subvolumes. Therefore, the bigger the
file system gets, in effect the more fragile the extent tree becomes.
The other thing is that btrfs check is super slow with large volumes;
some people have file systems of a dozen or more TiB that take days
to check.

I also agree with the noatime suggestion from Hans. Note this is a
per-subvolume, mount-time option, so if you're using the subvol= or
subvolid= mount options, you need to pass noatime for each of those
mounts; once per file system isn't enough.
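
In fstab that might look something like this (UUID and subvolume
names are made up):

UUID=1234-abcd  /      btrfs  subvol=root,noatime  0 0
UUID=1234-abcd  /home  btrfs  subvol=home,noatime  0 0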



-- 
Chris Murphy


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Chris Murphy
(resend to all)

On Thu, Sep 13, 2018 at 9:44 AM, Nikolay Borisov  wrote:
>
>
> On 13.09.2018 18:30, Chris Murphy wrote:
>> This is the 2nd or 3rd thread containing hanging btrfs send, with
>> kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait"
>> and the other I can't find at the moment. In that case though the hang
>> is reproducible in 4.14.x and weirdly it only happens when a snapshot
>> contains (perhaps many) reflinks. Scrub and check lowmem find nothing
>> wrong.
>>
>> I have snapshots with a few reflinks (cp --reflink and also
>> deduplication), and I see maybe 15-30 second hangs where nothing is
>> apparently happening (in top or iotop), but I'm also not seeing any
>> blocked tasks or high CPU usage. Perhaps in my case it's just
>> recovering quickly.
>>
>> Are there any kernel config options in "# Debug Lockups and Hangs"
>> that might hint at what's going on? Some of these are enabled in
>> Fedora debug kernels, which are built practically daily, e.g. right
>> now the latest in the build system is 4.19.0-0.rc3.git2.1 - which
>> translates to git 54eda9df17f3.
>
> If it's a lock-related problem then you need Lock Debugging => Lock
> debugging: prove locking correctness

OK looks like that's under a different section as CONFIG_PROVE_LOCKING
which is enabled on Fedora debug kernels.


# Debug Lockups and Hangs
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0
# Lock Debugging (spinlocks, mutexes, etc...)
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
# CONFIG_DEBUG_LOCKDEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_LOCK_TORTURE_TEST=m



-- 
Chris Murphy


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Chris Murphy
This is the 2nd or 3rd thread containing hanging btrfs send, with
kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait"
and the other I can't find at the moment. In that case though the hang
is reproducible in 4.14.x and weirdly it only happens when a snapshot
contains (perhaps many) reflinks. Scrub and check lowmem find nothing
wrong.

I have snapshots with a few reflinks (cp --reflink and also
deduplication), and I see maybe 15-30 second hangs where nothing is
apparently happening (in top or iotop), but I'm also not seeing any
blocked tasks or high CPU usage. Perhaps in my case it's just
recovering quickly.

Are there any kernel config options in "# Debug Lockups and Hangs"
that might hint at what's going on? Some of these are enabled in
Fedora debug kernels, which are built practically daily, e.g. right
now the latest in the build system is 4.19.0-0.rc3.git2.1 - which
translates to git 54eda9df17f3.


Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-09 Thread Chris Murphy
On Sun, Sep 9, 2018 at 2:16 PM, Stefan Loewen  wrote:
> I'm not sure about the exact definition of "blocked" here, but I was
> also surprised that there were no blocked tasks listed since I'm
> definitely unable to kill (SIGKILL) that process.
> On the other hand it wakes up hourly to transfer a few bytes.
> The problem is definitely not, that I issued the sysrq too early. I
> think it was after about 45min of no IO.

Another one the devs have asked for in cases where things get slow or
hang, but without explicit blocked task messages, is sysrq + t. But
I'm throwing spaghetti at a wall at this point, none of it will fix
the problem, and I haven't learned how to read these outputs.



> So there is some problem with this "original" subvol. Maybe I should
> describe how that came into existence.
> Initially I had my data on a NTFS formatted drive. I then created a
> btrfs partition on my second drive and rsynced all my stuff over into
> the root subvol.
> Then I noticed that having all my data in the root subvol was a bad
> idea and created a "data" subvol and reflinked everything into it.
> I deleted the data from the root subvol, made a snapshot of the "data"
> subvol, tried sending that and ran into the problem we're discussing
> here.

That is interesting and useful information. I see nothing invalid
about it at all. However, just for future reference it is possible to
snapshot the top level (default) subvolume.

By default, the top level subvolume (sometimes referred to as
subvolid=5 or subvolid=0) is what is mounted if you haven't used
'btrfs sub set-default' to change it. You can snapshot that subvolume
by snapshotting the mount point. e.g.

mount /dev/sda1 /mnt
btrfs sub snap /mnt /mnt/subvolume1

So now you have a readwrite subvolume called "subvolume1" which
contains everything that was in the top level, which you can now
delete if you're trying to keep things tidy and just have subvolumes
and snapshots in the top level.

Anyway, what you did is possibly relevant to the problem. But if it
turns out it's the cause of the problem, it's definitely a bug.


>
> btrfs check in lowmem mode did not find any errors either:
>
> $ sudo btrfs check --mode=lowmem --progress /dev/sdb1
> Opening filesystem to check...
> Checking filesystem on /dev/sdb1
> UUID: cd786597-3816-40e7-bf6c-d585265ad372
> [1/7] checking root items  (0:00:30 elapsed,
> 1047408 items checked)
> [2/7] checking extents (0:03:55 elapsed,
> 309170 items checked)
> cache and super generation don't match, space cache will be invalidated
> [3/7] checking free space cache(0:00:00 elapsed)
> [4/7] checking fs roots(0:04:07 elapsed, 85373
> items checked)
> [5/7] checking csums (without verifying data)  (0:00:00 elapsed,
> 253106 items checked)
> [6/7] checking root refs done with fs roots in lowmem mode, skipping
> [7/7] checking quota groups skipped (not enabled on this FS)
> found 708354711552 bytes used, no error found
> total csum bytes: 689206904
> total tree bytes: 2423865344
> total fs tree bytes: 1542914048
> total extent tree bytes: 129843200
> btree space waste bytes: 299191292
> file data blocks allocated: 31709967417344
> referenced 928531877888

OK good to know.


-- 
Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-08 Thread Chris Murphy
I don't see any blocked tasks. I wonder if you were too fast with
sysrq w? Maybe it takes a little while for the blocked task to actually
develop?

I suggest also 'btrfs check --mode=lowmem' because that is a separate
implementation of btrfs check that tends to catch different things
than the original. It is slow, however.

-- 
Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-07 Thread Chris Murphy
On Fri, Sep 7, 2018 at 11:07 AM, Stefan Loewen  wrote:
> List of steps:
> - 3.8G iso lays in read-only subvol A
> - I create subvol B and reflink-copy the iso into it.
> - I create a read-only snapshot C of B
> - I "btrfs send --no-data C > /somefile"
> So you got that right, yes.

OK I can't reproduce it. Sending A and C complete instantly with
--no-data, and complete in the same time with a full send/receive. In
my case I used a 4.9G ISO.

I can't think of what local difference accounts for what you're
seeing. There is really nothing special about --reflinks. The extent
and csum data are identical to the original file, and that's the bulk
of the metadata for a given file.

What I can tell you is usually the developers want to see sysrq+w
whenever there are blocked tasks.
https://fedoraproject.org/wiki/QA/Sysrq

You'll want to enable all sysrq functions. And next you'll want three
ssh shells:

1. sudo journalctl -fk
2. sudo -i to become root, and then echo w > /proc/sysrq-trigger but
do not hit return yet
3. sudo btrfs send... to reproduce the problem.

Basically the thing is gonna hang soon after you reproduce the
problem, so you want to get to shell #2 and just hit return rather
than dealing with long delays typing that echo command out. And then
the journal command is so your local terminal captures the sysrq
output because you're gonna kill the VM instead of waiting it out. I
have no idea how to read these things but someone might pick up this
thread and have some idea why these tasks are hanging.




>
> Unfortunately I don't have any way to connect the drive to a SATA port
> directly but I tried to switch out as much of the used setup as
> possible (all changes active at the same time):
> - I got the original (not the clone) HDD out of the enclosure and used
> this adapter to connect it:
> https://www.amazon.de/DIGITUS-Adapterkabel-40pol-480Mbps-schwarz/dp/B007X86VZK
> - I used a different Notebook
> - I ran the test natively on that notebook (instead of from
> VirtualBox. I used VirtualBox for most of the tests as I have to
> force-poweroff the PC everytime the btrfs-send hangs as it is not
> killable)


This problem only happens in VirtualBox? Or it happens on baremetal
also? And we've established it happens with two different source
(send) devices, which means two different Btrfs volumes.

All I can say is you need to keep changing things up, process of
elimination. Rather tedious. Maybe you could try downloading a Fedora
28 ISO, make a boot stick out of it, and try to reproduce with the
same drives. At least that's an easy way to isolate the OS from the
equation.


-- 
Chris Murphy


Re: compiling btrfs-progs 4.17.1 gives error "reiserfs/misc.h: No such file or directory"

2018-09-07 Thread Chris Murphy
On Fri, Sep 7, 2018 at 3:56 AM, Jürgen Herrmann  wrote:
> Hello!
>
> I'm having a problem with btrfs send which stops after several seconds.
> The process hangs with 100% cpu time on one cpu. The system is still
> responsive to input but no io is happening anymore so the system
> basically becomes unuseable.

What kernel? Latest stable is 4.18.6, but I want to make sure that's
what you're using, someone else has reported btrfs send problems in
another thread with 4.18.5 that sound similar.




-- 
Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-07 Thread Chris Murphy
On Fri, Sep 7, 2018 at 6:47 AM, Stefan Loewen  wrote:
> Well... It seems it's not the hardware.
> I ran a long SMART check which ran through without errors and
> reallocation count is still 0.

That only checks the drive, it's an internal test. It doesn't check
anything else, including connections.

Also you do have a log with a read error and a sector LBA reported. So
there is a hardware issue, it could just be transient.


> So I used clonezilla (partclone.btrfs) to mirror the drive to another
> drive (same model).
> Everything copied over just fine. No I/O error im dmesg.
>
> The new disk shows the same behavior.

So now I'm suspicious of USB behavior. Like I said earlier, when I've
got USB enclosed drives connected to my NUC, regardless of file system,
I routinely get hangs and USB resets. I have to connect all of my USB
enclosed drives to a good USB hub, or I have problems.



> So I created another subvolume, reflinked stuff over and found that it
> is enough to reflink one file, create a read-only snapshot and try to
> btrfs-send that. It's not happening with every file, but there are
> definitely multiple different files. The one I tested with is a 3.8GB
> ISO file.
> Even better:
> 'btrfs send --no-data snap-one > /dev/null'
> (snap-one containing just one iso file) hangs as well.

Do you have a list of steps to make this clear? It sounds like first
you copy a 3.8G ISO file to one subvolume, then reflink copy it into
another subvolume, then snapshot that 2nd subvolume, and try to send
the snapshot? But I want to be clear.

I've got piles of reflinked files in snapshots and they send OK,
although like I said I do get sometimes a 15-30 second hang during
sends.

> Still dmesg shows no IO errors, only "INFO: task btrfs-transacti:541
> blocked for more than 120 seconds." with associated call trace.
> btrfs-send reads some MB in the beginning, writes a few bytes and then
> hangs without further IO.
>
> copying the same file without --reflink, snapshotting and sending
> works without problems.
>
> I guess that pretty much eleminates bad sectors and points towards
> some problem with reflinks / btrfs metadata.

That's pretty weird. I'll keep trying and see if I hit this. What
happens if you downgrade to an older kernel? Either 4.14 or 4.17 or
both. The send code is mainly in the kernel, where the receive code is
mainly in user space tools, for this testing you don't need to
downgrade user space tools. If there's a bug here, I expect it's
kernel.




-- 
Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-06 Thread Chris Murphy
On Thu, Sep 6, 2018 at 2:16 PM, Stefan Loewen  wrote:

> Data,single: Size:695.01GiB, Used:653.69GiB
> /dev/sdb1 695.01GiB
> Metadata,DUP: Size:4.00GiB, Used:2.25GiB
> /dev/sdb1   8.00GiB
> System,DUP: Size:40.00MiB, Used:96.00KiB


> Does that mean Metadata is duplicated?

Yes. Single copy for data. Duplicate for metadata+system, and there
are no single chunks for metadata/system.

>
> Ok so to summarize and see if I understood you correctly:
> There are bad sectors on disk. Running an extended selftest (smartctl -t
> long) could find those and replace them with spare sectors.

More likely if it finds a persistently failing sector, it will just
record the first failing sector LBA in its log, and then abort. You'll
see this info with 'smartctl -a' or with -x.

It is possible to resume the test using the selective option and
picking a 4K-aligned 512-byte LBA value just past the 4K sector with
the defect. Just because only one bad sector is reported in dmesg
doesn't mean there aren't more.
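
For example, something like this (the span is given in 512-byte LBAs
and 'max' means the end of the disk; I'm assuming the bad sector is
still the 354855120 one from your dmesg, so the next 4K sector starts
at LBA 354855128):

smartctl -t select,354855128-max /dev/sdb

Double check the selective self-test syntax in the smartctl man page
for your version before running it.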

It's unlikely the long test is going to actually fix anything; it'll
just give you more ammunition for getting a device that's likely still
under warranty replaced, because it really shouldn't have any issues
at this age.


> If it does not I can try calculating the physical (4K) sector number and
> write to that to make the drive notice and mark the bad sector.
> Is there a way to find out which file I will be writing to beforehand?

I'm not sure how to do it easily.

>Or is
> it easier to just write to the sector and then wait for scrub to tell me
> (and the sector is broken anyways)?

If it's a persistent read error, then it's lost. So you might as well
overwrite it. If it's data, scrub will tell you what file is corrupted
(and restore can help you recover the whole file, of course it'll have
a 4K hole of zeros in it). If it's metadata, Btrfs will fix up the 4K
hole with duplicate metadata.

Gotcha is to make certain you've got the right LBA to write to. You
can use dd to test this, by reading the suspect bad sector, and if
you've got the right one, you'll get an I/O error in user space and
dmesg will have a message like before with sector value. Use the dd
skip= flag for reading, but make *sure* you use seek= when writing
*and* make sure you always use bs=4096 count=1 so that if you make a
mistake you limit the damage haha.
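
As a rough sketch, using the sector from your earlier dmesg
(354855120 in 512-byte units, which is 4K sector 44356890), and
assuming the disk is still /dev/sdb:

dd if=/dev/sdb of=/dev/null bs=4096 count=1 skip=44356890

should reproduce the read error if that's really the bad sector, and
then, only once you're ready to lose those 4096 bytes:

dd if=/dev/zero of=/dev/sdb bs=4096 count=1 seek=44356890 conv=fsync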

>
> For the drive: Not under warranty anymore. It's an external HDD that I had
> lying around for years, mostly unused. Now I wanted to use it as part of my
> small DIY NAS.

Gotcha. Well you can read up on smartctl and smartd, and set it up for
regular extended tests, and keep an eye on rapidly changing values. It
might give you a 50/50 chance of an early heads up before it dies.

I've got an old Hitachi/Apple laptop drive that years ago developed
multiple bad sectors in different zones of the drive. They got
remapped and I haven't had a problem with that drive since. *shrug*
And in fact I did get a discrete error message from the drive for one
of those and Btrfs overwrote that bad sector with a good copy (it's in
a raid1 volume), so working as designed I guess.

Since you didn't get a fix up message from Btrfs, either the whole
thing just got confused with hanging tasks, or it's possible it's a
data block.


-- 
Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-06 Thread Chris Murphy
On Thu, Sep 6, 2018 at 12:36 PM, Stefan Loewen  wrote:
> Output of the commands is attached.

fdisk
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

smart
Sector Sizes: 512 bytes logical, 4096 bytes physical

So clearly the case is lying about the actual physical sector size of
the drive. It's very common. But it means that to fix the bad sector
by writing to it, the write must be a full 4K. A 512-byte write to the
reported LBA will fail because the drive turns it into a
read-modify-write, and the read part will fail. So if you write to
that sector, you'll get a read failure. Kinda confusing. So you can
convert the LBA to a 4K value, and use dd to write to that "4K
LBA" using bs=4096 and a count of 1, but only when you're ready to
lose all 4096 bytes in that sector. If it's data, it's fine. It's the
loss of one file, and scrub will find and report the path to the file
so you know what was affected.
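
For example, taking the failing sector from your dmesg (354855120,
reported in 512-byte units), and assuming I haven't botched the
arithmetic:

354855120 * 512 / 4096 = 44356890    (i.e. just divide the LBA by 8)

so 44356890 would be the skip=/seek= value to use with bs=4096.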

If it's metadata, it could be a problem. What do you get for 'btrfs fi
us <mountpoint>' for this volume? I'm wondering if DUP metadata is
being used across the board with no single chunks. If so, then you can
zero that sector, and Btrfs will detect the missing metadata in that
chunk on scrub, and fix it up from a copy. But if you only have single
copy metadata, it just depends what's on that block as to how
recoverable or repairable this is.


195 Hardware_ECC_Recovered  -O-RCK   100   100   000-0
196 Reallocated_Event_Count -O--CK   252   252   000-0
197 Current_Pending_Sector  -O--CK   252   252   000-0
198 Offline_Uncorrectable   CK   252   252   000-0

Interesting, no complaints there. Unexpected.

11 Calibration_Retry_Count -O--CK   100   100   000-8
200 Multi_Zone_Error_Rate   -O-R-K   100   100   000-31

https://kb.acronis.com/content/9136

This is a low hour device, probably still under warranty? I'd get it
swapped out. If you want more ammunition for arguing in favor of a
swap out under warranty you could do

smartctl -t long /dev/sdb

That will take just under 4 hours to run (you can use the drive in the
meantime, but it'll take a bit longer); and then after that

smartctl -x /dev/sdb

And see if it's found a bad sector or updated any of those smart
values for the worse in particular the offline values.




SCT (Get) Error Recovery Control command failed

OK so not configurable, it is whatever it is and we don't know what
that is. Probably one of the really long recoveries.




>
> The broken-sector-theory sounds plausible and is compatible with my new
> findings:
> I suspected the problem to be in one specific directory, let's call it
> "broken_dir".
> I created a new subvolume and copied broken_dir over.
> - If I copied it with cp --reflink, made a snapshot and tried to btrfs-send
> that, it hung
> - If I rsynced broken_dir over I could snapshot and btrfs-send without a
> problem.

Yeah I'm not sure what it is, maybe a data block.

>
> But shouldn't btrfs scrub or check find such errors?

Nope. Btrfs expects the drive to complete the read command, but always
second guesses the content of the read by comparing to checksums. So
if the drive just supplied corrupt data, Btrfs would detect that and
discretely report, and if there's a good copy it would self heal. But
it can't do that because the drive or USB bus also seems to hang in
such a way that a bunch of tasks are also hung, and none of them are
getting a clear pass/fail for the read. It just hangs.

Arguably the device or the link should not hang. So I'm still
wondering if something else is going on, but this is just the most
obvious first problem, and maybe it's being complicated by another
problem we haven't figured out yet. Anyway, once this problem is solved,
it'll become clear if there are additional problems or not.

In my case, I often get usb reset errors when I directly connect USB
3.0 drives to my Intel NUC, but I don't ever get them when plugging
the drive into a dyconn hub. So if you don't already have a hub in
between the drive and the computer, it might be worth considering.
Basically the hub is going to read and completely rewrite the whole
stream that goes through it (in both directions).



-- 
Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-06 Thread Chris Murphy
On Thu, Sep 6, 2018 at 10:03 AM, Stefan Löwen  wrote:
> I have one subvolume (rw) and 2 snapshots (ro) of it.
>
> I just tested 'btrfs send  > /dev/null' and that also shows no IO
> after a while but also no significant CPU usage.
> During this I tried 'ls' on the source subvolume and it hangs as well.
> dmesg has some interesting messages I think (see attached dmesg.log)
>

OK you've got a different problem.

[  186.898756] sd 2:0:0:0: [sdb] tag#0 FAILED Result:
hostbyte=DID_ERROR driverbyte=DRIVER_OK
[  186.898762] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 15 26 a0 d0
00 08 00 00
[  186.898764] print_req_error: I/O error, dev sdb, sector 354853072
[  187.109641] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.345245] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.657844] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.851336] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.026882] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.215881] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.247028] sd 2:0:0:0: [sdb] tag#0 FAILED Result:
hostbyte=DID_ERROR driverbyte=DRIVER_OK
[  188.247041] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 15 26 a8 d0
00 08 00 00
[  188.247048] print_req_error: I/O error, dev sdb, sector 354855120


This is a read error for a specific sector.  So your drive has media
problems. And I think that's the instigating problem here, from which
a bunch of other tasks that depend on one or more reads completing but
never do. But weirdly there also isn't any kind of libata reset. At
least on SATA, by default we see a link reset after a command has not
returned in 30 seconds. That reset would totally clear the drive's
command queue, and then things either can recover or barf. But in your
case, neither happens and it just sits there with hung tasks.

[  189.350360] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0,
rd 2, flush 0, corrupt 0, gen 0

And that's the last we really see from Btrfs. After that, it's all
just hung task traces and are rather unsurprising to me.

Drives in USB cases add a whole bunch of complicating factors for
troubleshooting and repair, including often masking the actual logical
and physical sector size, the min and max IO size, alignment offset,
and all kinds of things. They can have all sorts of bugs. And I'm also
not totally certain about the relationship between the usb reset
messages and the bad sector. As far as I know the only way we can get
a sector LBA expressly noted in dmesg along with the failed read(10)
command, is if the drive has reported back to libata that discrete
error with sense information. So I'm accepting that as a reliable
error, rather than it being something like a cable. But the reset
messages could possibly be something else in addition to that.

Anyway, the central issue is sector 354855120 is having problems. I
can't tell from the trace if it's transient or persistent. Maybe if
it's transient, that would explain how you sometimes get send to start
working again briefly but then it reverts to hanging. What do you get
for:

fdisk -l /dev/sdb
smartctl -x /dev/sdb
smartctl -l sct erc /dev/sdb

Those are all read only commands, nothing is written or changed.



-- 
Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-06 Thread Chris Murphy
On Thu, Sep 6, 2018 at 9:04 AM, Stefan Loewen  wrote:
> Update:
> It seems like btrfs-send is not completely hung. It somewhat regularly
> wakes up every hour to transfer a few bytes. I noticed this via a
> periodic 'ls -l' on the snapshot file. These are the last outputs
> (uniq'ed):
>
> -rw--- 1 root root 1492797759 Sep  6 08:44 intenso_white.snapshot
> -rw--- 1 root root 1493087856 Sep  6 09:44 intenso_white.snapshot
> -rw--- 1 root root 1773825308 Sep  6 10:44 intenso_white.snapshot
> -rw--- 1 root root 1773976853 Sep  6 11:58 intenso_white.snapshot
> -rw--- 1 root root 1774122301 Sep  6 12:59 intenso_white.snapshot
> -rw--- 1 root root 1774274264 Sep  6 13:58 intenso_white.snapshot
> -rw--- 1 root root 1774435235 Sep  6 14:57 intenso_white.snapshot
>
> I also monitor the /proc/3022/task/*/stack files with 'tail -f' (I
> have no idea if this is useful) but there are no changes, even during
> the short wakeups.

I have a sort of "me too" here. I definitely see btrfs send just hang
for no apparent reason, but in my case it's for maybe 15-30 seconds.
Not an hour. Looking at top and iotop at the same time as the LED
lights on the drives, there's  definitely zero activity happening. I
can make things happen during this time - like I can read a file or
save a file from/to any location including the send source or receive
destination. It really just behaves as if the send thread is saying
"OK I'm gonna nap now, back in a bit" and then it is.

So what I end up with on drives with a minimum read-write of 80M/s, is
a send receive that's getting me a net of about 30M/s.

I have around 100 snapshots on the source device. How many total
snapshots do you have on your source? That does appear to affect
performance for some things, including send/receive.


-- 
Chris Murphy


nbdkit as a flexible alternative to loopback mounts

2018-09-04 Thread Chris Murphy
https://rwmj.wordpress.com/2018/09/04/nbdkit-as-a-flexible-alternative-to-loopback-mounts/

This is a pretty cool writeup. I can vouch that Btrfs will format,
mount, write to, and scrub an 8EiB (virtual) disk, and that btrfs
check works on it.

The one thing I thought might cause a problem is that the nbd device
has a 1KiB sector size, but Btrfs (on x86_64) still uses a 4096-byte
"sector", and it all seems to work fine despite that.

Anyway, maybe it's useful for some fstests instead of file backed
losetup devices?
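
From memory the gist of it is roughly this (treat it as a sketch; the
exact nbdkit and nbd-client option syntax varies by version):

nbdkit -f -v memory size=$(( 2**63 - 1 ))    # ~8EiB sparse RAM disk over NBD
modprobe nbd
nbd-client localhost /dev/nbd0
mkfs.btrfs /dev/nbd0
mount /dev/nbd0 /mnt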


-- 
Chris Murphy


Re: RAID1 & BTRFS critical (device sda2): corrupt leaf, bad key order

2018-09-04 Thread Chris Murphy
On Tue, Sep 4, 2018 at 10:22 AM, Etienne Champetier
 wrote:

> Do you have a procedure to copy all subvolumes & skip error ? (I have
> ~200 snapshots)

If they're already read-only snapshots, then script an iteration of
btrfs send receive to a new volume.
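
Something along these lines (rough sketch, untested; assumes the
read-only snapshots all live directly under /mnt/old and the new
volume is mounted at /mnt/new):

set -o pipefail    # so a failed send is caught, not just a failed receive
for snap in /mnt/old/*; do
    btrfs send "$snap" | btrfs receive /mnt/new || echo "FAILED: $snap"
done

Anything that errors out gets reported and the loop moves on to the
next snapshot.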

Btrfs seed-sprout would be ideal, however in this case I don't think
can help because a.) it's temporarily one file system, which could
mean the corruption is inherited; and b.) I'm not sure it's multiple
device aware, so either the btrfs-tune -S1 might fail on 2+ device
Btrfs volumes, or possibly it insists on a two device sprout in order
to replicate a two device seed.

If they're not already read-only, it's tricky because it sounds like
mounting rw is possibly risky, and taking read only snapshots might
fail anyway. There is no way to make read only snapshots unless the
volume can be written to; and no way to force a rw subvolume to be
treated as if it were read only even if the volume is mounted read
only. And it takes a read only subvolume for send to work.


-- 
Chris Murphy


Re: Does ssd auto detected work for microSD cards?

2018-09-03 Thread Chris Murphy
On Mon, Sep 3, 2018 at 7:53 PM, GWB  wrote:
> Curious instance here, but perhaps this is the expected behaviour:
>
> mount | grep btrfs
> /dev/sdb3 on / type btrfs (rw,ssd,subvol=@)
> /dev/sdb3 on /home type btrfs (rw,ssd,subvol=@home)
> /dev/sde1 on /media/gwb09/btrfs-32G-MicroSDc type btrfs
> (rw,nosuid,nodev,uhelper=udisks2)
>
> This is on an Ubuntu 14 client.
>
> /dev/sdb is indeed an ssd, a Samsung 850 EVO 500Gig, where Ubuntu runs
> on btrfs root.   It appears btrfs did indeed auto detected an ssd
> drive.   However:
>
> /dev/sde is a micro SD card (32Gig Samsung) sitting in a USB 3 card
> reader, inserted into a USB 3 card slot.  But ssh is not detected.
>
> So is that the expected behavior?

cat /sys/block/sde/queue/rotational

That's what Btrfs uses for detection. I'm willing to bet the SD Card
slot is not using the mmc driver, but instead USB and therefore always
treated as a rotational device.


> If not, does it make a difference?
>
> Would it be best to mount an sd card with ssd_spread?

For the described use case, it probably doesn't make much of a
difference. It sounds like these are fairly large contiguous files,
ZFS send files.

I think for both SDXC and eMMC, F2FS is probably more applicable
overall than Btrfs due to its reduced wandering trees problem. But
again for your use case it may not matter much.



> Yet another side note: both btrfs and zfs are now "housed" at Oracle
> (and most of java, correct?).

Not really. The ZFS we care about now is OpenZFS, forked from Oracle's
ZFS. And a bunch of people not related to Oracle do that work. And
Btrfs has a wide assortment of developers: Facebook, SUSE, Fujitsu,
Oracle, and more.


-- 
Chris Murphy


Re: IO errors when building RAID1.... ?

2018-09-03 Thread Chris Murphy
On Mon, Sep 3, 2018 at 4:23 AM, Adam Borowski  wrote:
> On Sun, Sep 02, 2018 at 09:15:25PM -0600, Chris Murphy wrote:
>> For > 10 years drive firmware handles bad sector remapping internally.
>> It remaps the sector logical address to a reserve physical sector.
>>
>> NTFS and ext[234] have a means of accepting a list of bad sectors, and
>> will avoid using them. Btrfs doesn't. But also ZFS, XFS, APFS, HFS+
>> and I think even FAT, lack this capability.
>
> 
> FAT entry FF7 (FAT12)/FFF7 (FAT16)/...
> 

Oh yeah, even Linux mkdosfs does have a -c option to check for bad
sectors and presumably will remove them from use. It doesn't accept a
separate list though, like badblocks + mke2fs.

-- 
Chris Murphy


Re: RAID1 & BTRFS critical (device sda2): corrupt leaf, bad key order

2018-09-03 Thread Chris Murphy
> [/dev/sdb2].read_io_errs    0
> [/dev/sdb2].flush_io_errs   0
> [/dev/sdb2].corruption_errs 0
> [/dev/sdb2].generation_errs 0
>
> device stats report no errors :(
>
> # btrfs fi df /
> Data, RAID1: total=2.32TiB, used=2.23TiB
> System, RAID1: total=96.00MiB, used=368.00KiB
> Metadata, RAID1: total=22.00GiB, used=19.12GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> # btrfs scrub status /
> scrub status for 4917db5e-fc20-4369-9556-83082a32d4cd
> scrub started at Mon Sep  3 05:32:52 2018, interrupted after
> 00:27:35, not running
> total bytes scrubbed: 514.05GiB with 0 errors
>
> I've already tried 2 times to run btrfs scrub (after reboot), but it
> stops before the end, with the previous dmesg error
>
> My question is what is the safest way to rebuild this BTRFS RAID1?
> I haven't tried "btrfs check --repair" yet
> (I can boot on a more up to date Linux live if it helps)

Definitely do not run btrfs check --repair; that's nearly the last resort.

It's vaguely possible this is a bug that's been fixed in a newer
kernel version, so it's worth giving 4.17.x or 4.18.x a shot at it.
That is at least safe.

But I'm suspicious of "BTRFS: error (device sda2) in
btrfs_run_delayed_refs:2930: errno=-5 IO failure" which is usually a
hardware error. But I don't see any hardware related message in the
dmesg snippet provided, so you'd need to go through the whole thing
looking for suspicious items that explain why there was an IO failure.

It's clear Btrfs did receive all or part of the leaf, determined it's
corrupt, and the actual mystery is if that double message is for both
drives even though only sda2 is named both times (the first two lines
of your dmesg). There are some kinds of memory related corruption that
newer versions of btrfs-progs can fix. I'm not sure if 4.4 is new
enough, or if the particular corruption you're seeing is something
btrfs check can fix, but I still wouldn't use --repair until Qu or
another dev says to give it a shot.



-- 
Chris Murphy


Re: IO errors when building RAID1.... ?

2018-09-02 Thread Chris Murphy
On Sat, Sep 1, 2018 at 1:03 AM, Pierre Couderc  wrote:
>
>
> On 08/31/2018 08:52 PM, Chris Murphy wrote:
>>
>>
>> Bad sector which is failing write. This is fatal, there isn't anything
>> the block layer or Btrfs (or ext4 or XFS) can do about it. Well,
>> ext234 do have an option to scan for bad sectors and create a bad
>> sector map which then can be used at mkfs time, and ext234 will avoid
>> using those sectors. And also the md driver has a bad sector option
>> for the same, and does remapping. But XFS and Btrfs don't do that.
>>
>> If the drive is under warranty, get it swapped out, this is definitely
>> a warranty covered problem.
>>
>>
>>
>>
> Thank you very much.
>
> Once upon a time...(I am old), there were lists of bad sectors, and the
> software did avoid wrting in them. It seems to have disappeared. For which
> reason ? Maybe because these errors occur so  rarely, that it is not worth
> the trouble ?

For > 10 years drive firmware handles bad sector remapping internally.
It remaps the sector logical address to a reserve physical sector.

NTFS and ext[234] have a means of accepting a list of bad sectors, and
will avoid using them. Btrfs doesn't. But also ZFS, XFS, APFS, HFS+
and I think even FAT, lack this capability. I'm not aware of any file
system that once had bad sector tracking, that has since dropped the
capability.

-- 
Chris Murphy


Re: IO errors when building RAID1.... ?

2018-08-31 Thread Chris Murphy
If you want you can post the output from 'sudo smartctl -x /dev/sda'
which will contain more information... but this is in some sense
superfluous. The problem is very clearly a bad drive: the drive
explicitly reported a write error to libata and included the affected
sector LBA, and only the drive firmware would know that. It's not
likely a cable problem or something like that. And the fact that the
write error is reported at all means it's persistent, not transient.


Chris Murphy


Re: IO errors when building RAID1.... ?

2018-08-31 Thread Chris Murphy
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1
> errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: rejecting I/O to offline device
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] killing request
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: rejecting I/O to offline device
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1
> errs: wr 1, rd 3, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1
> errs: wr 2, rd 3, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1
> errs: wr 3, rd 3, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1
> errs: wr 4, rd 3, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1
> errs: wr 5, rd 3, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1
> errs: wr 6, rd 3, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1
> errs: wr 7, rd 3, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] FAILED Result:
> hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 00 61
> 9c 00 00 0a 00 00
> Aug 31 17:36:38 server kernel: blk_update_request: I/O error, dev sda,
> sector 6396928
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: rejecting I/O to offline device
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: rejecting I/O to offline device
>
> more than 100 identical lines...
>
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: rejecting I/O to offline device
> Aug 31 17:36:38 server kernel: ata1: EH complete
> Aug 31 17:36:38 server kernel: ata1.00: detaching (SCSI 0:0:0:0)
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] Synchronize Cache(10)
> failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] Stopping disk
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] Start/Stop Unit failed:
> Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> Aug 31 17:36:38 server kernel: Buffer I/O error on dev sda1, logical block
> 488378352, async page read
> Aug 31 17:36:38 server kernel: scsi 0:0:0:0: rejecting I/O to dead device
> Aug 31 17:36:38 server kernel: blk_update_request: I/O error, dev sda,
> sector 6762624
> Aug 31 17:36:38 server kernel: BTRFS: error (device sda1) in
> btrfs_commit_transaction:2227: errno=-5 IO failure (Error while writing out
> transaction)
> Aug 31 17:36:38 server kernel: BTRFS info (device sda1): forced readonly
> Aug 31 17:36:38 server kernel: BTRFS warning (device sda1): Skipping commit
> of aborted transaction.
> Aug 31 17:36:38 server kernel: [ cut here ]
> Aug 31 17:36:38 server kernel: WARNING: CPU: 1 PID: 159 at
> /build/linux-cRtIym/linux-4.9.30/fs/btrfs/transaction.c:1850
> cleanup_transaction+0x1f0/0x2e0 [btrfs]
> Aug 31 17:36:38 server kernel: BTRFS: Transaction aborted (error -5)
> Aug 31 17:36:38 server kernel: Modules linked in: intel_rapl
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass eeepc_wmi
> asus_wmi crct10dif_pclmul sparse_keymap crc32_pclmul g
>
>


These are hardware problems and aren't related to Btrfs.

>sd 0:0:0:0: [sda] FAILED Result:
> hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 00 61
> 9c 00 00 0a 00 00
> Aug 31 17:36:38 server kernel: blk_update_request: I/O error, dev sda,
> sector 6396928

Bad sector which is failing write. This is fatal, there isn't anything
the block layer or Btrfs (or ext4 or XFS) can do about it. Well,
ext234 do have an option to scan for bad sectors and create a bad
sector map which then can be used at mkfs time, and ext234 will avoid
using those sectors. And also the md driver has a bad sector option
for the same, and does remapping. But XFS and Btrfs don't do that.

If the drive is under warranty, get it swapped out, this is definitely
a warranty covered problem.




-- 
Chris Murphy


bug: btrfs-progs scrub -R flag doesn't show per device stats

2018-08-31 Thread Chris Murphy
btrfs-progs v4.17.1

man btrfs-scrub:

   -R
   print raw statistics per-device instead of a summary


However, on a two device Btrfs volume, -R does not show per device
statistics. See screenshot:

https://drive.google.com/open?id=1xmt_NHGlNJPc8I0F4_OZxgGe9b3quCnD

Additionally, the description of -d and -R doesn't help me distinguish
between the two. -R says "instead of a summary", which suggests -d
will summarize, but that isn't explicitly stated.



-- 
Chris Murphy


Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Chris Murphy
And also, I'll argue this might have been a btrfs-progs bug as well,
depending on what version and command were used. Neither mkfs nor dev
add should be able to use a partition with type code 0x05. At least libblkid
correctly shows that it's 1KiB in size, so really Btrfs should not
succeed at adding this device, it can't put any of the supers in the
correct location.

[chris@f28h ~]$ sudo fdisk -l /dev/loop0
Disk /dev/loop0: 1 GiB, 1073741824 bytes, 2097152 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x7e255cce

Device   Boot  StartEnd Sectors  Size Id Type
/dev/loop0p12048 206847  204800  100M 83 Linux
/dev/loop0p2  206848 411647  204800  100M 83 Linux
/dev/loop0p3  411648 616447  204800  100M 83 Linux
/dev/loop0p4  616448 821247  204800  100M  5 Extended
/dev/loop0p5  618496 821247  202752   99M 83 Linux

[chris@f28h ~]$ sudo kpartx -a /dev/loop0
[chris@f28h ~]$ lsblk
NAMEMAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0 7:00 1G  0 loop
├─loop0p1   253:10   100M  0 part
├─loop0p2   253:20   100M  0 part
├─loop0p3   253:30   100M  0 part
├─loop0p4   253:40 1K  0 part
└─loop0p5   253:5099M  0 part

[chris@f28h ~]$ sudo mkfs.btrfs /dev/loop0p4
btrfs-progs v4.17.1
See http://btrfs.wiki.kernel.org for more information.

probe of /dev/loop0p4 failed, cannot detect existing filesystem.
ERROR: use the -f option to force overwrite of /dev/loop0p4
[chris@f28h ~]$ sudo mkfs.btrfs /dev/loop0p4 -f
btrfs-progs v4.17.1
See http://btrfs.wiki.kernel.org for more information.

ERROR: mount check: cannot open /dev/loop0p4: No such file or directory
ERROR: cannot check mount status of /dev/loop0p4: No such file or directory
[chris@f28h ~]$


I guess that's a good sign in this case?


Chris Murphy


Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Chris Murphy
On Thu, Aug 30, 2018 at 9:21 AM, Alberto Bursi  wrote:
>
> On 8/30/2018 11:13 AM, Pierre Couderc wrote:
>> Trying to install a RAID1 on a debian stretch, I made some mistake and
>> got this, after installing on disk1 and trying to add second disk :
>>
>>
>> root@server:~# fdisk -l
>> Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
>> Units: sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 512 bytes
>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>> Disklabel type: dos
>> Disk identifier: 0x2a799300
>>
>> Device Boot StartEndSectors  Size Id Type
>> /dev/sda1  * 2048 3907028991 3907026944  1.8T 83 Linux
>>
>>
>> Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
>> Units: sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 512 bytes
>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>> Disklabel type: dos
>> Disk identifier: 0x9770f6fa
>>
>> Device Boot StartEndSectors  Size Id Type
>> /dev/sdb1  * 2048 3907029167 3907027120  1.8T  5 Extended
>>
>>
>> And :
>>
>> root@server:~# btrfs fi show
>> Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
>> Total devices 2 FS bytes used 1.10GiB
>> devid1 size 1.82TiB used 4.02GiB path /dev/sda1
>> devid2 size 1.00KiB used 0.00B path /dev/sdb1
>>
>> ...
>>
>> My purpose is a simple RAID1 main fs, with bootable flag on the 2
>> disks in prder to start in degraded mode
>> How to get out ofr that...?
>>
>> Thnaks
>> PC
>
>
> sdb1 is an extended partition, you cannot format an extended partition.
>
> change sdb1 into primary partition or add a logical partition into it.

Ahh, you're correct. There is special treatment of 0x05: it's a
logical container, with the start address actually pointing to the
address where the EBR is. And that EBR's first record contains the
actual logical partition information.

So this represents two bugs in the installer:
1. If there's only one partition on a drive, it should be primary by
default, not extended.
2. But if extended, it must point to an EBR, and the EBR must be
created at that location. Obviously since there is no /dev/sdb2, this
EBR is not present.




-- 
Chris Murphy


Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Chris Murphy
On Thu, Aug 30, 2018 at 3:13 AM, Pierre Couderc  wrote:
> Trying to install a RAID1 on a debian stretch, I made some mistake and got
> this, after installing on disk1 and trying to add second disk  :
>
>
> root@server:~# fdisk -l
> Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disklabel type: dos
> Disk identifier: 0x2a799300
>
> Device Boot StartEndSectors  Size Id Type
> /dev/sda1  * 2048 3907028991 3907026944  1.8T 83 Linux
>
>
> Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disklabel type: dos
> Disk identifier: 0x9770f6fa
>
> Device Boot StartEndSectors  Size Id Type
> /dev/sdb1  * 2048 3907029167 3907027120  1.8T  5 Extended


Extended partition type is not a problem if you're using GRUB as the
bootloader; other bootloaders may not like this. Strictly speaking the
type code 0x05 is incorrect, GRUB ignores type code, as does the
kernel. GRUB also ignores the active bit (boot flag).


>
>
> And :
>
> root@server:~# btrfs fi show
> Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
> Total devices 2 FS bytes used 1.10GiB
> devid1 size 1.82TiB used 4.02GiB path /dev/sda1
> devid2 size 1.00KiB used 0.00B path /dev/sdb1

That's odd; and I know you've moved on from this problem but I would
have liked to see the super for /dev/sdb1 and also the installer log
for what commands were used for partitioning, including mkfs and
device add commands.

For what it's worth, 'btrfs dev add' formats the device being added,
it does not need to be formatted in advance, and also it resizes the
file system properly.



> My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks in
> prder to start in degraded mode

Good luck with this. The Btrfs list archives are full of discussions of
the various limitations of Btrfs raid1. There is no automatic degraded mount for
Btrfs. And if you persistently ask for degraded mount, you run the
risk of other problems if there's merely a delayed discovery of one of
the devices. Once a Btrfs volume is degraded, it does not
automatically resume normal operation just because the formerly
missing device becomes available.

So... this is flat out not suitable for use cases where you need
unattended raid1 degraded boot.



-- 
Chris Murphy


Re: 14Gb of space lost after distro upgrade on BTFS root partition (long thread with logs)

2018-08-28 Thread Chris Murphy
On Tue, Aug 28, 2018 at 1:14 PM, Menion  wrote:
> You are correct, indeed in order to cleanup you need
>
> 1) someone realize that snapshots have been created
> 2) apt-brtfs-snapshot is manually installed on the system
>
> Assuming also that the snapshots created during do-release-upgrade are
> managed for auto cleanup

Ha! I should have read all the emails.

Anyway, good sleuthing. I think it's a good idea to file a bug report
on it, so at the least other people can fix it manually.


-- 
Chris Murphy


Re: 14Gb of space lost after distro upgrade on BTFS root partition (long thread with logs)

2018-08-28 Thread Chris Murphy
On Tue, Aug 28, 2018 at 8:56 AM, Menion  wrote:
> [sudo] password for menion:
> ID  gen top level   path
> --  --- -   
> 257 600627  5   /@
> 258 600626  5   /@home
> 296 599489  5
> /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:29:55
> 297 599489  5
> /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:30:08
> 298 599489  5
> /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:33:30
>
> So, there are snapshots, right?

Yep. So you can use 'sudo btrfs fi du -s <snapshot>' to get a
report on how much exclusive space is being used by each of those
snapshots and I'll bet it all adds up to about 10G or whatever you're
missing.
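
For example, assuming the top level (subvolid=5) ends up mounted
somewhere like /mnt/top (adjust the path to wherever those snapshots
are visible on your system):

sudo btrfs fi du -s /mnt/top/@apt-snapshot-release-upgrade-bionic-*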

>The time stamp is when I have launched
> do-release-upgrade, but it didn't ask anything about snapshot, neither
> I asked for it.

Yep, not sure what's creating them or what the cleanup policy is (if
there is one). So it's worth asking in an Ubuntu forum what these
snapshots are, where they came from, and what cleans them up so you
don't run out of space, or otherwise how to configure it if you want
more space just because.

I mean, it's a neat idea. But also it needs to clean up after itself
if for no other reason than to avoid user confusion :-)


> If it is confirmed, how can I remove the unwanted snapshot, keeping
> the current "visible" filesystem contents
> Sorry, I am still learning BTRFS and I would like to avoid mistakes
> Bye

You can definitely use Btrfs specific tools to get rid of the
snapshots and not piss off Btrfs at all. However, if you delete them
behind the back of the thing that created them in the first place, it
might get pissed off if they just suddenly go missing. Sometimes those
tools want to do the cleanups because it's tracking the snapshots and
what their purpose is. So if they just go away, it's like having the
rug pulled out from under them.

Anyway:

'sudo btrfs sub del <snapshot>' will delete it.


Also, I can't tell you for sure what sort of write amplification Btrfs
contributes in your use case on eMMC compared to F2FS. Btrfs has a
"wandering trees" problem that F2FS doesn't have as big a problem.
It's not a big deal (probably) on other kinds of SSDs like SATA/SAS
and NVMe. But on eMMC? If it were SD Card I'd say you can keep using
Btrfs, and maybe mitigate the wandering trees with compression to
reduce overall writes. But if your eMMC is soldered onto a board, I
might consider F2FS instead. And Btrfs for other things.


-- 
Chris Murphy


Re: DRDY errors are not consistent with scrub results

2018-08-28 Thread Chris Murphy
On Tue, Aug 28, 2018 at 5:04 PM, Cerem Cem ASLAN  wrote:
> What I want to achive is that I want to add the problematic disk as
> raid1 and see how/when it fails and how BTRFS recovers these fails.
> While the party goes on, the main system shouldn't be interrupted
> since this is a production system. For example, I would never expect
> to be ended up with such a readonly state while trying to add a disk
> with "unknown health" to the system. Was it somewhat expected?

I don't know. I also can't tell you how LVM or mdraid behave in the
same situation either though. For sure I've come across bug reports
where underlying devices go read only and the file system falls over
totally and developers shrug and say they can't do anything.

This situation is a little different and difficult. You're starting
out with a one drive setup so the profile is single/DUP or
single/single, and that doesn't change when adding. So the 2nd drive
is actually *mandatory* for a brief period of time before you've made
it raid1 or higher. It's a question for the developers what the design
is, and whether this is a bug: maybe the device being added should be written to
with placeholder supers or even just zeros in all the places for 'dev
add' metadata, and only if that succeeds, to then write real updated
supers to all devices. It's possible the 'dev add' presently writes
updated supers to all devices at the same time, and has a brief period
where the state is fragile and if it fails, it goes read only to
prevent damaging the file system.

Anyway, without a call trace, no idea why it ended up read only. So I
have to speculate.


>
> Although we know that disk is about to fail, it still survives.

That's a very tenuous rationalization: a drive that rejects even a
single write is considered failed by the md driver. Btrfs is still
very tolerant of this, so if it had successfully added and you were
running in production, you should expect to see thousands of write
errors dumped to the kernel log, because Btrfs still never ejects a
bad drive. It keeps trying. And keeps reporting the failures. And all
those errors being logged can end up causing more write demand if the
logs are on the same volume as the failing device, even more errors to
record, and you get an escalating situation with heavy log writing.


> Shouldn't we expect in such a scenario that when system tries to read
> or write some data from/to that BROKEN_DISK and when it recognizes it
> failed, it will try to recover the part of the data from GOOD_DISK and
> try to store that recovered data in some other part of the
> BROKEN_DISK?

Nope. Btrfs can only write supers to fixed locations on the drive,
same as any other file system. Btrfs metadata could possibly go
elsewhere because it doesn't have fixed locations, but Btrfs doesn't
do bad sector tracking. So once it decides metadata goes in location
X, if X reports a write error it will not try to write elsewhere and
insofar as I'm aware ext4 and XFS and LVM and md don't either; md does
have an optional bad block map it will use for tracking bad sectors
and remap to known good sectors. Normally the drive firmware should do
this and when that fails the drive is considered toast for production
purposes.

>Or did I misunderstood the whole thing?

Well in a way this is sorta user sabotage. It's a valid test and I'd
say ideally things should fail safely, rather than fall over. But at
the same time it's not wrong for developers to say: "look if you add a
bad device there's a decent chance we're going face plant and go read
only to avoid causing worse problems, so next time you should qualify
the drive before putting it into production."

I'm willing to bet all the other file system devs would say something
like that. Even if the Btrfs devs think something better could happen,
it's probably not a super high priority.




-- 
Chris Murphy


Re: DRDY errors are not consistent with scrub results

2018-08-28 Thread Chris Murphy
On Tue, Aug 28, 2018 at 12:50 PM, Cerem Cem ASLAN  wrote:
> I've successfully moved everything to another disk. (The only hard
> part was configuring the kernel parameters, as my root partition was
> on LVM which is on LUKS partition. Here are the notes, if anyone
> needs: 
> https://github.com/ceremcem/smith-sync/blob/master/create-bootable-backup.md)
>
> Now I'm seekin for trouble :) I tried to convert my new system (booted
> with new disk) into raid1 coupled with the problematic old disk. To do
> so, I issued:
>
> sudo btrfs device add /dev/mapper/master-root /mnt/peynir/
> /dev/mapper/master-root appears to contain an existing filesystem (btrfs).
> ERROR: use the -f option to force overwrite of /dev/mapper/master-root
> aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ -f
> ERROR: error adding device '/dev/mapper/master-root': Input/output error
> aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/
> sudo: unable to open /var/lib/sudo/ts/aea: Read-only file system
>
> Now I ended up with a readonly file system. Isn't it possible to add a
> device to a running system?

Yes.

The problem is the 2nd error message:

ERROR: error adding device '/dev/mapper/master-root': Input/output error

So you need to look in dmesg to see what Btrfs kernel messages
occurred at that time. I'm gonna guess it's a failed write. You have a
few of those in the smartctl log output. Any time a write failure
happens, the operation is always fatal regardless of the file system.



-- 
Chris Murphy


Re: Scrub aborts due to corrupt leaf

2018-08-28 Thread Chris Murphy
On Tue, Aug 28, 2018 at 7:42 AM, Qu Wenruo  wrote:
>
>
> On 2018/8/28 9:29 PM, Larkin Lowrey wrote:
>> On 8/27/2018 10:12 PM, Larkin Lowrey wrote:
>>> On 8/27/2018 12:46 AM, Qu Wenruo wrote:
>>>>
>>>>> The system uses ECC memory and edac-util has not reported any errors.
>>>>> However, I will run a memtest anyway.
>>>> So it should not be the memory problem.
>>>>
>>>> BTW, what's the current generation of the fs?
>>>>
>>>> # btrfs inspect dump-super  | grep generation
>>>>
>>>> The corrupted leaf has generation 2862, I'm not sure how recent did the
>>>> corruption happen.
>>>
>>> generation  358392
>>> chunk_root_generation   357256
>>> cache_generation358392
>>> uuid_tree_generation358392
>>> dev_item.generation 0
>>>
>>> I don't recall the last time I ran a scrub but I doubt it has been
>>> more than a year.
>>>
>>> I am running 'btrfs check --init-csum-tree' now. Hopefully that clears
>>> everything up.
>>
>> No such luck:
>>
>> Creating a new CRC tree
>> Checking filesystem on /dev/Cached/Backups
>> UUID: acff5096-1128-4b24-a15e-4ba04261edc3
>> Reinitialize checksum tree
>> csum result is 0 for block 2412149436416
>> extent-tree.c:2764: alloc_tree_block: BUG_ON `ret` triggered, value -28
>
> It's ENOSPC, meaning btrfs can't find enough space for the new csum tree
> blocks.

Seems bogus, there's >4TiB unallocated.

>Label: none  uuid: acff5096-1128-4b24-a15e-4ba04261edc3
>Total devices 1 FS bytes used 66.61TiB
>devid1 size 72.77TiB used 68.03TiB path /dev/mapper/Cached-Backups
>
>Data, single: total=67.80TiB, used=66.52TiB
>System, DUP: total=40.00MiB, used=7.41MiB
>Metadata, DUP: total=98.50GiB, used=95.21GiB
>GlobalReserve, single: total=512.00MiB, used=0.00B

Even if all metadata is only csum tree, and ~200GiB needs to be
written, there's plenty of free space for it.



-- 
Chris Murphy


Re: 14Gb of space lost after distro upgrade on BTFS root partition (long thread with logs)

2018-08-28 Thread Chris Murphy
On Tue, Aug 28, 2018 at 3:34 AM, Menion  wrote:
> Hi all
> I have run a distro upgrade on my Ubuntu 16.04 that runs ppa kernel
> 4.17.2 with btrfsprogs 4.17.0
> The root filesystem is BTRFS single created by the Ubuntu Xenial
> installer (so on kernel 4.4.0) on an internal mmc, located in
> /dev/mmcblk0p3
> After the upgrade I have cleaned apt cache and checked the free space,
> the results were odd, following some checks (shrinked), followed by
> more comments:

Do you know if you're using Timeshift? I'm not sure if it's enabled by
default on Ubuntu when using Btrfs, but you may have snapshots.

'sudo btrfs sub list -at /'

That should show all subvolumes (includes snapshots).



> [48479.254106] BTRFS info (device mmcblk0p3): 17 enospc errors during balance

Probably soft enospc errors it was able to work around.


-- 
Chris Murphy


Re: Scrub aborts due to corrupt leaf

2018-08-27 Thread Chris Murphy
On Mon, Aug 27, 2018 at 8:12 PM, Larkin Lowrey
 wrote:
> On 8/27/2018 12:46 AM, Qu Wenruo wrote:
>>
>>
>>> The system uses ECC memory and edac-util has not reported any errors.
>>> However, I will run a memtest anyway.
>>
>> So it should not be the memory problem.
>>
>> BTW, what's the current generation of the fs?
>>
>> # btrfs inspect dump-super  | grep generation
>>
>> The corrupted leaf has generation 2862, I'm not sure how recent did the
>> corruption happen.
>
>
> generation  358392
> chunk_root_generation   357256
> cache_generation358392
> uuid_tree_generation358392
> dev_item.generation 0
>
> I don't recall the last time I ran a scrub but I doubt it has been more than
> a year.
>
> I am running 'btrfs check --init-csum-tree' now. Hopefully that clears
> everything up.


I'd expect --init-csum-tree only recreates the data csum tree, and will
not assume a metadata leaf is correct and just recompute a csum for it.


-- 
Chris Murphy


Re: DRDY errors are not consistent with scrub results

2018-08-27 Thread Chris Murphy
On Mon, Aug 27, 2018 at 6:49 PM, Cerem Cem ASLAN  wrote:
> Thanks for your guidance, I'll get the device replaced first thing in
> the morning.
>
> Here is balance results which I think resulted not too bad:
>
> sudo btrfs balance start /mnt/peynir/
> WARNING:
>
> Full balance without filters requested. This operation is very
> intense and takes potentially very long. It is recommended to
> use the balance filters to narrow down the balanced data.
> Use 'btrfs balance start --full-balance' option to skip this
> warning. The operation will start in 10 seconds.
> Use Ctrl-C to stop it.
> 10 9 8 7 6 5 4 3 2 1
> Starting balance without any filters.
> Done, had to relocate 18 out of 18 chunks
>
> I suppose this means I've not lost any data, but I'm very prone to due
> to previous `smartctl ...` results.


OK so nothing fatal anyway. We'd have to see any kernel messages that
appeared during the balance to see if there were read or write errors,
but presumably any failure means the balance fails so... might get you
by for a while actually.







-- 
Chris Murphy


Re: DRDY errors are not consistent with scrub results

2018-08-27 Thread Chris Murphy
On Mon, Aug 27, 2018 at 6:38 PM, Chris Murphy  wrote:

>> Metadata,single: Size:8.00MiB, Used:0.00B
>>/dev/mapper/master-root 8.00MiB
>>
>> Metadata,DUP: Size:2.00GiB, Used:562.08MiB
>>/dev/mapper/master-root 4.00GiB
>>
>> System,single: Size:4.00MiB, Used:0.00B
>>/dev/mapper/master-root 4.00MiB
>>
>> System,DUP: Size:32.00MiB, Used:16.00KiB
>>/dev/mapper/master-root64.00MiB
>>
>> Unallocated:
>>/dev/mapper/master-root   915.24GiB
>
>
> OK this looks like it maybe was created a while ago, it has these
> empty single chunk items that was common a while back. There is a low
> risk to clean it up, but I still advise backup first:
>
> 'btrfs balance start -mconvert=dup '

You can skip this advice now; it really doesn't matter. But a Btrfs
file system created with relatively recent btrfs-progs shouldn't have
both single and DUP chunks like this one is showing.


-- 
Chris Murphy


Re: DRDY errors are not consistent with scrub results

2018-08-27 Thread Chris Murphy
On Mon, Aug 27, 2018 at 6:05 PM, Cerem Cem ASLAN  wrote:
> Note that I've directly received this reply, not by mail list. I'm not
> sure this is intended or not.

I intended to do Reply to All, but somehow this doesn't always work out
between the user and Gmail; I'm just gonna assume Gmail is being an
asshole again.


> Chris Murphy , 28 Ağu 2018 Sal, 02:25
> tarihinde şunu yazdı:
>>
>> On Mon, Aug 27, 2018 at 4:51 PM, Cerem Cem ASLAN  
>> wrote:
>> > Hi,
>> >
>> > I'm getting DRDY ERR messages which causes system crash on the server:
>> >
>> > # tail -n 40 /var/log/kern.log.1
>> > Aug 24 21:04:55 aea3 kernel: [  939.228059] lxc-bridge: port
>> > 5(vethI7JDHN) entered disabled state
>> > Aug 24 21:04:55 aea3 kernel: [  939.300602] eth0: renamed from vethQ5Y2OF
>> > Aug 24 21:04:55 aea3 kernel: [  939.328245] IPv6: ADDRCONF(NETDEV_UP):
>> > eth0: link is not ready
>> > Aug 24 21:04:55 aea3 kernel: [  939.328453] IPv6:
>> > ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
>> > Aug 24 21:04:55 aea3 kernel: [  939.328474] IPv6:
>> > ADDRCONF(NETDEV_CHANGE): vethI7JDHN: link becomes ready
>> > Aug 24 21:04:55 aea3 kernel: [  939.328491] lxc-bridge: port
>> > 5(vethI7JDHN) entered blocking state
>> > Aug 24 21:04:55 aea3 kernel: [  939.328493] lxc-bridge: port
>> > 5(vethI7JDHN) entered forwarding state
>> > Aug 24 21:04:59 aea3 kernel: [  943.085647] cgroup: cgroup2: unknown
>> > option "nsdelegate"
>> > Aug 24 21:16:15 aea3 kernel: [ 1619.400016] perf: interrupt took too
>> > long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to
>> > 79750
>> > Aug 24 21:17:11 aea3 kernel: [ 1675.515815] perf: interrupt took too
>> > long (3137 > 3132), lowering kernel.perf_event_max_sample_rate to
>> > 63750
>> > Aug 24 21:17:13 aea3 kernel: [ 1677.080837] cgroup: cgroup2: unknown
>> > option "nsdelegate"
>> > Aug 25 22:38:31 aea3 kernel: [92955.512098] usb 4-2: USB disconnect,
>> > device number 2
>> > Aug 26 02:14:21 aea3 kernel: [105906.035038] lxc-bridge: port
>> > 4(vethCTKU4K) entered disabled state
>> > Aug 26 02:15:30 aea3 kernel: [105974.107521] lxc-bridge: port
>> > 4(vethO59BPD) entered disabled state
>> > Aug 26 02:15:30 aea3 kernel: [105974.109991] device vethO59BPD left
>> > promiscuous mode
>> > Aug 26 02:15:30 aea3 kernel: [105974.109995] lxc-bridge: port
>> > 4(vethO59BPD) entered disabled state
>> > Aug 26 02:15:30 aea3 kernel: [105974.710490] lxc-bridge: port
>> > 4(vethBAYODL) entered blocking state
>> > Aug 26 02:15:30 aea3 kernel: [105974.710493] lxc-bridge: port
>> > 4(vethBAYODL) entered disabled state
>> > Aug 26 02:15:30 aea3 kernel: [105974.710545] device vethBAYODL entered
>> > promiscuous mode
>> > Aug 26 02:15:30 aea3 kernel: [105974.710598] IPv6:
>> > ADDRCONF(NETDEV_UP): vethBAYODL: link is not ready
>> > Aug 26 02:15:30 aea3 kernel: [105974.710600] lxc-bridge: port
>> > 4(vethBAYODL) entered blocking state
>> > Aug 26 02:15:30 aea3 kernel: [105974.710601] lxc-bridge: port
>> > 4(vethBAYODL) entered forwarding state
>> > Aug 26 02:16:35 aea3 kernel: [106039.674089] BTRFS: device fsid
>> > 5b844c7a-0cbd-40a7-a8e3-6bc636aba033 devid 1 transid 984 /dev/dm-3
>> > Aug 26 02:17:21 aea3 kernel: [106085.352453] ata4.00: failed command: READ 
>> > DMA
>> > Aug 26 02:17:21 aea3 kernel: [106085.352901] ata4.00: status: { DRDY ERR }
>> > Aug 26 02:18:56 aea3 kernel: [106180.648062] ata4.00: exception Emask
>> > 0x0 SAct 0x0 SErr 0x0 action 0x0
>> > Aug 26 02:18:56 aea3 kernel: [106180.648333] ata4.00: BMDMA stat 0x25
>> > Aug 26 02:18:56 aea3 kernel: [106180.648515] ata4.00: failed command: READ 
>> > DMA
>> > Aug 26 02:18:56 aea3 kernel: [106180.648706] ata4.00: cmd
>> > c8/00:08:80:9c:bb/00:00:00:00:00/e3 tag 0 dma 4096 in
>> > Aug 26 02:18:56 aea3 kernel: [106180.648706]  res
>> > 51/40:00:80:9c:bb/00:00:00:00:00/03 Emask 0x9 (media error)
>> > Aug 26 02:18:56 aea3 kernel: [106180.649380] ata4.00: status: { DRDY ERR }
>> > Aug 26 02:18:56 aea3 kernel: [106180.649743] ata4.00: error: { UNC }
>>
>> Classic case of uncorrectable read error due to sector failure.
>>
>>
>>
>> > Aug 26 02:18:56 aea3 kernel: [106180.779311] ata4.00: configured for 
>> > UDMA/133
>> > Aug 26 02:18:56 aea3 kernel: [106180.779331] sd 3:0:0:0: [sda] tag#0
>> > FAILED Result: hostbyte=DID_OK driverbyte

Re: Device Delete Stalls

2018-08-23 Thread Chris Murphy
And by 4.14 I actually mean 4.14.60 or 4.14.62 (based on the
changelog). I don't think the single patch in 4.14.62 applies to your
situation.


Re: Device Delete Stalls

2018-08-23 Thread Chris Murphy
On Thu, Aug 23, 2018 at 8:04 AM, Stefan Malte Schumacher
 wrote:
> Hallo,
>
> I originally had RAID with six 4TB drives, which was more than 80
> percent full. So now I bought
> a 10TB drive, added it to the Array and gave the command to remove the
> oldest drive in the array.
>
>  btrfs device delete /dev/sda /mnt/btrfs-raid
>
> I kept a terminal with "watch btrfs fi show" open and It showed that
> the size of /dev/sda had been set to zero and that data was being
> redistributed to the other drives. All seemed well, but now the
> process stalls at 8GB being left on /dev/sda/. It also seems that the
> size of the drive has been reset the original value of 3,64TiB.
>
> Label: none  uuid: 1609e4e1-4037-4d31-bf12-f84a691db5d8
> Total devices 7 FS bytes used 8.07TiB
> devid1 size 3.64TiB used 8.00GiB path /dev/sda
> devid2 size 3.64TiB used 2.73TiB path /dev/sdc
> devid3 size 3.64TiB used 2.73TiB path /dev/sdd
> devid4 size 3.64TiB used 2.73TiB path /dev/sde
> devid5 size 3.64TiB used 2.73TiB path /dev/sdf
> devid6 size 3.64TiB used 2.73TiB path /dev/sdg
> devid7 size 9.10TiB used 2.50TiB path /dev/sdb
>
> I see no more btrfs worker processes and no more activity in iotop.
> How do I proceed? I am using a current debian stretch which uses
> Kernel 4.9.0-8 and btrfs-progs 4.7.3-1.
>
> How should I proceed? I have a Backup but would prefer an easier and
> less time-comsuming way out of this mess.

I'd let it keep running as long as you can tolerate it. In the
meantime, update your backups, and keep using the file system
normally, it should be safe to use. The block group migration can
sometimes be slow with "btrfs dev del" compared to the replace
operation, I can't explain why but it might be related to some
combination of file and free space fragmentation as well as number of
snapshots, and just general complexity of what is effectively a
partial balance operation going on.

Next, you could do a sysrq + t, which dumps process state into the
kernel message buffer (which might not be big enough to contain all
the output). If you're using systemd, 'journalctl -k' will have it,
and presumably syslog's messages will too. I can't parse this output,
but a developer might find it useful to see what's going on and
whether it's just plain wrong. Or if it's just slow.
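
Roughly (a sketch, assuming sysrq is enabled via kernel.sysrq):

echo t | sudo tee /proc/sysrq-trigger         # dump task states into the kernel ring buffer
sudo journalctl -k --no-pager | tail -n 1000  # or pull the same thing from dmesg / syslog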

Next, once you get sick of waiting, well, you can force a reboot with
'reboot -f' or 'sysrq + b', but then what's the plan? Sure, you could
just try again, but I don't know that this would give different
results. It's either just slow, or it's a bug. And if it's a bug,
maybe it's fixed in something newer, in which case I'd try a much
newer kernel, 4.14 at the oldest and ideally 4.18.4, at least to
finish off this task.

For what it's worth, the bulk of the delete operation is like a
filtered balance, it's mainly relocating block groups, and that is
supposed to be COW. So it should be safe to do an abrupt reboot. If
you're not writing new information there's no information to lose; the
worst case is Btrfs has a slightly older superblock than the latest
generation for block group relocation and it starts from that point
again. I've done quite a lot of jerkface reboot -f and sysrq + b with
Btrfs and have never broken a file system so far (power failures,
different story) but maybe I'm lucky and I have a bunch of well
behaved devices.



-- 
Chris Murphy


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-11 Thread Chris Murphy
On Fri, Aug 10, 2018 at 9:29 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Chris Murphy posted on Fri, 10 Aug 2018 12:07:34 -0600 as excerpted:
>
>> But whether data is shared or exclusive seems potentially ephemeral, and
>> not something a sysadmin should even be able to anticipate let alone
>> individual users.
>
> Define "user(s)".

The person who is saving their document on a network share, and
they've never heard of Btrfs.


> Arguably, in the context of btrfs tool usage, "user" /is/ the admin,

I'm not talking about btrfs tools. I'm talking about rational,
predictable behavior of a shared folder.

If I try to drop a 1GiB file into my share and I'm denied for lack of
free space, and behind the scenes it's because of a quota limit, I
expect I can delete *any* file(s) amounting to 1GiB of freed space and
then be able to drop that file successfully without error.

But if I'm unwittingly deleting shared files, my quota usage won't go
down, and I still can't save my file. So now I somehow need a secret
incantation to discover only my exclusive files and delete enough of
them in order to save this 1GiB file. It's weird, it's unexpected, and
I think it's a use case failure. Maybe Btrfs quotas aren't meant to
work with Samba or NFS shares. *shrug*



>
> "Regular users" as you use the term, that is the non-admins who just need
> to know how close they are to running out of their allotted storage
> resources, shouldn't really need to care about btrfs tool usage in the
> first place, and btrfs commands in general, including btrfs quota related
> commands, really aren't targeted at them, and aren't designed to report
> the type of information they are likely to find useful.  Other tools will
> be more appropriate.

I'm not talking about any btrfs commands or even the term quota for
regular users. I'm talking about saving a file, being denied, and how
does the user figure out how to free up space?

Anyway, it's a hypothetical scenario. While I have Samba running on a
Btrfs volume with various shares as subvolumes, I don't have quotas
enabled.



-- 
Chris Murphy


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Chris Murphy
duplicated, is out of scope for the
user. And we can't have quotas getting busted all of a sudden because
the sysadmin decides to do -dconvert -mconvert raid1, without
requiring the sysadmin to double everyone's quota before performing
the operation.





>
>>
>>
>> In short: values representing quotas are user-oriented ("the numbers one
>> bought"), not storage-oriented ("the numbers they actually occupy").
>
> Well, if something is not possible or brings so big performance impact,
> there will be no argument on how it should work in the first place.

Yep!

What is the VFS disk quota mechanism, and does Btrfs use it at all? If
not, why not? It seems to me there really should be a high-level,
basic per-directory quota implementation at the VFS layer, with a
single kernel interface as well as a single user space interface,
regardless of the file system. Additional file system specific quota
features can of course have their own tools, but all of this
reinvention of the wheel for basic directory quotas is a mystery to me.


-- 
Chris Murphy


mount shows incorrect subvol when backed by bind mount

2018-08-09 Thread Chris Murphy
I've got another example of bind mounts resulting in confusing
(incorrect) information in the mount command with Btrfs. In this case,
it's Docker using bind mounts.

Full raw version (expires in 7 days)
https://paste.fedoraproject.org/paste/r8tr-3nuvoycwxf0bPUrmA/raw

Relevant portion:

mount shows:

/dev/mmcblk0p3 on /var/lib/docker/containers type btrfs
(rw,noatime,seclabel,compress-force=zstd,ssd,space_cache=v2,subvolid=265,subvol=/root/var/lib/docker/containers)
/dev/mmcblk0p3 on /var/lib/docker/btrfs type btrfs
(rw,noatime,seclabel,compress-force=zstd,ssd,space_cache=v2,subvolid=265,subvol=/root/var/lib/docker/btrfs)

And from the detail fpaste, you can see there is no such subvolume
docker/btrfs or docker/containers - and subvolid=265 is actually for
rootfs.
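
A sketch of how to cross-check it, using the paths above:

findmnt -t btrfs -o TARGET,SOURCE,OPTIONS                        # repeats the bogus subvol= strings
sudo btrfs subvolume list -a /                                   # no docker/btrfs or docker/containers listed
sudo btrfs inspect-internal rootid /var/lib/docker/containers    # the real containing subvolume id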

Anyway, mortals will be confused by this behavior.

-- 
Chris Murphy


Re: Unmountable root partition

2018-07-31 Thread Chris Murphy
On Tue, Jul 31, 2018 at 12:03 PM, Cerem Cem ASLAN  wrote:

> 3. mount -t btrfs /dev/mapper/foo--vg-root /mnt/foo
> Gives the following error:
>
> mount: wrong fs type, bad option, bad superblock on ...
>
> 4. dmesg | tail
> Outputs the following:
>
>
> [17755.840916] sd 3:0:0:0: [sda] tag#0 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [17755.840919] sd 3:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 07 c0 02 00 00
> 02 00
> [17755.840921] blk_update_request: I/O error, dev sda, sector 507906
> [17755.840941] EXT4-fs (dm-4): unable to read superblock


Are you sure this is the output for the command? Because you're
explicitly asking for type btrfs, which fails, and then the kernel
reports EXT4 superblock unreadable. What do you get if you omit -t
btrfs and just let it autodetect?
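
For example (a sketch):

sudo blkid -p /dev/mapper/foo--vg-root         # what signature is actually on the LV
sudo mount /dev/mapper/foo--vg-root /mnt/foo   # let mount/libblkid pick the type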

But yeah, this is an IO error from the device and there's nothing
Btrfs can do about that unless there is DUP or raid1+ metadata
available.

Is it possible this LV was accidentally reformatted ext4?


-- 
Chris Murphy


ssd vs ssd_spread with sdcard or emmc

2018-07-24 Thread Chris Murphy
Hi,

I'm not finding any recent advice for sdcard or eMMC media, both of
which trigger the ssd mount option automatically. I seem to recall ssd
has had some optimizations recently, but haven't heard much about
ssd_spread.

While sdcard and eMMC are rather different, it seems they have two
things in common: they don't have the wear durability of even a
consumer SATA SSD, let alone NVMe, and they both suffer from dog slow
writes. I'm unable to tell that ssd_spread does any better write-wise
on a Samsung EVO+ sdcard. So that leaves wear, and in particular
whether wandering trees are at all affected by ssd_spread?
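
To be concrete, the kind of comparison I mean is just remounting with
each hint and running the same write load (a sketch; the device path
is made up):

sudo mount -o ssd_spread /dev/mmcblk0p1 /mnt
dd if=/dev/zero of=/mnt/testfile bs=1M count=512 conv=fsync
sudo umount /mnt
# repeat with -o ssd and compare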

-- 
Chris Murphy


Re: Healthy amount of free space?

2018-07-18 Thread Chris Murphy
Related on XFS list.

https://www.spinics.net/lists/linux-xfs/msg20722.html


Re: Healthy amount of free space?

2018-07-18 Thread Chris Murphy
On Wed, Jul 18, 2018 at 12:01 PM, Austin S. Hemmelgarn
 wrote:
> On 2018-07-18 13:40, Chris Murphy wrote:
>>
>> On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy 
>> wrote:
>>
>>> I don't know for sure, but based on the addresses reported before and
>>> after dd for the fallocated tmp file, it looks like Btrfs is not using
>>> the originally fallocated addresses for dd. So maybe it is COWing into
>>> new blocks, but is just as quickly deallocating the fallocated blocks
>>> as it goes, and hence doesn't end up in enospc?
>>
>>
>> Previous thread is "Problem with file system" from August 2017. And
>> there's these reproduce steps from Austin which have fallocate coming
>> after the dd.
>>
>>  truncate --size=4G ./test-fs
>>  mkfs.btrfs ./test-fs
>>  mkdir ./test
>>  mount -t auto ./test-fs ./test
>>  dd if=/dev/zero of=./test/test bs=65536 count=32768
>>  fallocate -l 2147483650 ./test/test && echo "Success!"
>>
>>
>> My test Btrfs is 2G not 4G, so I'm cutting the values of dd and
>> fallocate in half.
>>
>> [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000
>> 1000+0 records in
>> 1000+0 records out
>> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s
>> [chris@f28s btrfs]$ sync
>> [chris@f28s btrfs]$ df -h
>> FilesystemSize  Used Avail Use% Mounted on
>> /dev/mapper/vg-btrfstest  2.0G 1018M  1.1G  50% /mnt/btrfs
>> [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp
>>
>>
>> Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over
>> it, this fails, but I kinda expect that because there's only 1.1G free
>> space. But maybe that's what you're saying is the bug, it shouldn't
>> fail?
>
> Yes, you're right, I had things backwards (well, kind of, this does work on
> ext4 and regular XFS, so it arguably should work here).

I guess I'm confused about what it even means to fallocate over a file
with in-use blocks unless either the -d or -p option is used. And from
the man page, I don't grok the distinction between -d and -p either.
But based on their descriptions I'd expect both to work without
ENOSPC.
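
To make that concrete (a sketch going just by the man page, reusing
the reproducer's path):

fallocate -d ./test/test                  # --dig-holes: scan for zeroed runs and deallocate them
fallocate -p -o 0 -l 4096 ./test/test     # --punch-hole: deallocate the given range, file size kept

Neither should need to allocate new space, so neither should hit ENOSPC.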

-- 
Chris Murphy

