Hi everybody,
This is my first ever e-mail to a Linux mailing list, so please forgive me if I
make a stupid mistake (and please tell me how to do better :-).
For my new home backend (AD, shares, TV server, etc.), I recently set up a
couple of Linux (Debian Buster) VMs on the Xen hypervisor. Xen and the Dom0
and DomU VMs are located on an SSD with LVM. For data storage on a spinning hard
drive (home directories, media, ...), I wanted to try out BTRFS, mainly because
of the snapshot capabilities (and because I am curious). As my old hard drive
was showing signs of decay, I bought a new SATA hard drive (Seagate IronWolf
Pro 10 TB).
I connected the hard drive, booted up, logged into the Dom0, created a single
partition on the drive, set up BTRFS in that partition, created some
subvolumes, and copied my data from the old hard drive to the subvolumes. I
then configured Xen to pass through the complete partition containing BTRFS to
the VM with the file server (Samba). I then logged into the DomU containing the
Samba server, mounted the subvolumes, and created the Samba shares. That went
well without any problems.
After some time (one or two days), I started to notice parent transid failures,
which grew in number. There was no hardware or power supply issue that I am
aware of. Of course, I am concerned about my data, so I did some research but
could not pinpoint the problem. With my limited knowledge of BTRFS, I had
several suspects and tried to address each of them:
a) Mistakes during initial setup of BTRFS: I redid the BTRFS setup from scratch
and copied the data again. Same result: After some time, there were parent
transid failures again, this time accompanied by other errors (checksum).
b) Memory issues: I have ECC memory, so that seemed unlikely. I nevertheless
ran Memtest86+, which detected no memory issues in two passes.
c) Drive issues: I then thought that perhaps the Seagate hard drive was not
getting along well with the setup (there were some older posts regarding write
cache issues with certain types of drives/firmwares). The SMART data of the hard
drive was o.k. So I bought a different hard drive (Toshiba Enterprise Capacity,
12 TB) and redid the BTRFS setup from scratch. Same result: Parent transid
failures and ctree failed (see dmesg from Dom0 below).
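To put a number on how the parent transid failures grow, the kernel log could
be summarized per block. The following is only a minimal sketch: it assumes the
messages have the form "parent transid verify failed on <block> wanted <N>
found <M>", and the sample log lines, device name, and numbers are made up for
illustration, not taken from my system.

```python
import re
from collections import Counter

# Assumed shape of the kernel message; adjust if the log looks different.
TRANSID_RE = re.compile(
    r"BTRFS (?:error|warning) \(device (?P<dev>[^)]+)\): "
    r"parent transid verify failed on (?P<block>\d+) "
    r"wanted (?P<wanted>\d+) found (?P<found>\d+)"
)

# Hypothetical sample lines, for illustration only.
SAMPLE_LOG = """\
[ 1200.000001] BTRFS error (device sda1): parent transid verify failed on 1000 wanted 50 found 48
[ 1200.000002] BTRFS error (device sda1): parent transid verify failed on 1000 wanted 50 found 48
[ 1300.000003] BTRFS error (device sda1): parent transid verify failed on 2000 wanted 51 found 47
[ 1300.000004] some unrelated kernel message
"""


def summarize_transid_failures(log_text):
    """Return (total failures, Counter of failures per block address)."""
    per_block = Counter()
    for line in log_text.splitlines():
        match = TRANSID_RE.search(line)
        if match:
            per_block[int(match.group("block"))] += 1
    return sum(per_block.values()), per_block


if __name__ == "__main__":
    total, per_block = summarize_transid_failures(SAMPLE_LOG)
    print(f"{total} failures on {len(per_block)} distinct blocks")
```

Feeding it saved dmesg output from different days would show whether new
blocks are affected or the same ones are reported repeatedly.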
This brings me to the current situation. I have no idea where these issues come
from and what to do to prevent them. So far, there is no actual risk of data
loss as I have all data backed up. If there is no solution for my problem, I am
considering switching to LVM for data storage as well, but I am a little bit
stubborn and want to give BTRFS a fair chance.
Some additional thoughts I had:
d) Perhaps the complete setup (Xen, VMs, passing the partition through, Samba
share) is flawed?
e) Perhaps it is wrong to mount the BTRFS root first in the Dom0 and then
access the subvolumes in the DomU?
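To look into suspect e), one could check in the Dom0 whether the partition is
still mounted there while the DomU is using it. This is a rough sketch only: it
assumes the usual /proc/mounts layout (device in the first field, mount point in
the second), and the sample text and the mount point /mnt/btrfs-root are
hypothetical.

```python
def mount_points_of(device, mounts_text):
    """Return the mount points at which `device` appears in /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == device:
            points.append(fields[1])
    return points


# Hypothetical sample, for illustration only.
SAMPLE_MOUNTS = """\
/dev/mapper/vg0-root / ext4 rw,relatime 0 0
/dev/sda1 /mnt/btrfs-root btrfs rw,relatime,subvolid=5,subvol=/ 0 0
"""

if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            text = f.read()
    except OSError:
        # Fall back to the sample when /proc/mounts is not available.
        text = SAMPLE_MOUNTS
    print(mount_points_of("/dev/sda1", text))
```

An empty list in the Dom0 while the DomU is running would mean the partition
is not mounted in both places at once.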
I have the following questions:
1. Where can these issues come from?
2. What can I do to prevent these issues?
3. If I can stick with BTRFS (in case questions 1 and 2 can be answered), can I
rescue the current setup (and how), or should I scrap it and start again?
I appreciate any advice.
Best regards,
Paul Leiber
---
System information:
Linux x 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 GNU/Linux
btrfs-progs v4.20.1
Label: '' uuid: ----
Total devices 1 FS bytes used 3.54TiB
devid 1 size 10.91TiB used 3.57TiB path /dev/sda1
Data, single: total=3.56TiB, used=3.53TiB
System, DUP: total=8.00MiB, used=400.00KiB
Metadata, DUP: total=5.00GiB, used=3.81GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
dmesg from the Dom0, which controls the hardware (and the hard drive):
[0.806054] pci :00:17.0: reg 0x24: [mem 0xf7a37000-0xf7a377ff]
[0.806223] pci :00:17.0: PME# supported from D3hot
[0.806487] pci :00:1b.0: [8086:a169] type 01 class 0x060400
[0.806831] pci :00:1b.0: PME# supported from D0 D3hot D3cold
[0.806871] pci :00:1b.0: Intel SPT PCH root port ACS workaround enabled
[0.807091] pci :00:1b.3: [8086:a16a] type 01 class 0x060400
[0.807433] pci :00:1b.3: PME# supported from D0 D3hot D3cold
[0.807473] pci :00:1b.3: Intel SPT PCH root port ACS workaround enabled
[0.807706] pci :00:1d.0: [8086:a118] type 01 class 0x060400
[0.808048] pci :00:1d.0: PME# supported from D0 D3hot D3cold
[0.808088] pci :00:1d.0: Intel SPT PCH root port ACS workaround enabled
[0.808324] pci :00:1f.0: [8086:a149] type 00 class 0x060100
[0.808781] pci :00:1f.2: [8086:a121] type 00 class 0x058000
[0.808840] pci :00:1f.2: reg 0x10: [mem 0xf7a3-0xf7a33fff]
[0.809252] pci :00:1f.4: [8086:a123] type 00 class 0x0c0500
[0.809331] pci :00:1f.4: reg 0x10: [mem 0xf7a36000-0xf7a360ff 64bit]
[0.809422] pci :00:1f.4: reg 0x20: [io 0xf040-