Device Delete Stalls - Conclusion
Hello everybody,

First, let me thank everybody for their advice. What I did was close
the terminal with the device-delete process running in it and fire it
up again. It took about 5 minutes of intensive I/O usage, then the data
was redistributed and /dev/sda removed from the list of drives. I am
currently running a scrub, which I hope will find and correct any
errors caused by the forceful abort of the device delete process.

Many people suggested it: next time I replace one of the older 4TB
drives with a bigger one, I will use "replace" directly instead of
"add/delete".

Yours sincerely
Stefan
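For anyone finding this thread later, a minimal sketch of checking on
the scrub mentioned above (the mount point is the one from this thread;
adjust to your own setup):

    # start a scrub in the background (already running in this case)
    btrfs scrub start /mnt/btrfs-raid

    # check progress and any error counts found so far
    btrfs scrub status /mnt/btrfs-raid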
Re: Device Delete Stalls
And by 4.14 I actually mean 4.14.60 or 4.14.62 (based on the changelog). I don't think the single patch in 4.14.62 applies to your situation.
Re: Device Delete Stalls
On Thu, Aug 23, 2018 at 8:04 AM, Stefan Malte Schumacher wrote:
> Hallo,
>
> I originally had a RAID with six 4TB drives, which was more than 80
> percent full. So now I bought a 10TB drive, added it to the array,
> and gave the command to remove the oldest drive in the array:
>
> btrfs device delete /dev/sda /mnt/btrfs-raid
>
> I kept a terminal with "watch btrfs fi show" open, and it showed that
> the size of /dev/sda had been set to zero and that data was being
> redistributed to the other drives. All seemed well, but now the
> process stalls with 8GB left on /dev/sda. It also seems that the size
> of the drive has been reset to the original value of 3.64TiB.
>
> Label: none  uuid: 1609e4e1-4037-4d31-bf12-f84a691db5d8
>         Total devices 7  FS bytes used 8.07TiB
>         devid  1 size 3.64TiB used 8.00GiB path /dev/sda
>         devid  2 size 3.64TiB used 2.73TiB path /dev/sdc
>         devid  3 size 3.64TiB used 2.73TiB path /dev/sdd
>         devid  4 size 3.64TiB used 2.73TiB path /dev/sde
>         devid  5 size 3.64TiB used 2.73TiB path /dev/sdf
>         devid  6 size 3.64TiB used 2.73TiB path /dev/sdg
>         devid  7 size 9.10TiB used 2.50TiB path /dev/sdb
>
> I see no more btrfs worker processes and no more activity in iotop.
> I am using a current Debian stretch, which uses kernel 4.9.0-8 and
> btrfs-progs 4.7.3-1.
>
> How should I proceed? I have a backup but would prefer an easier and
> less time-consuming way out of this mess.

I'd let it keep running as long as you can tolerate it. In the
meantime, update your backups and keep using the file system normally;
it should be safe to use. The block group migration can sometimes be
slow with "btrfs dev del" compared to the replace operation. I can't
explain why, but it might be related to some combination of file and
free space fragmentation, the number of snapshots, and the general
complexity of what is effectively a partial balance operation.

Next, you could do a sysrq + t, which dumps process state into the
kernel message buffer (which might not be big enough to contain the
output). If you're using systemd, the journal -k will have it, and
presumably syslog's messages will have it. I can't parse this output,
but a developer might find it useful for seeing what's going on,
whether it's just plain wrong, or whether it's just slow.

Once you get sick of waiting, you can force a reboot with 'reboot -f'
or 'sysrq + b', but then what's the plan? Sure, you could just try
again, but I don't know that this would give different results. It's
either just slow, or it's a bug. And if it's a bug, maybe it's fixed in
something newer, in which case I'd try a much newer kernel: 4.14 at the
oldest, and ideally 4.18.4, at least to finish off this task.

For what it's worth, the bulk of the delete operation is like a
filtered balance: it's mainly relocating block groups, and that is
supposed to be COW. So it should be safe to do an abrupt reboot. If
you're not writing new information, there's no information to lose; the
worst case is that Btrfs has a slightly older superblock than the
latest generation for block group relocation, and it starts from that
point again. I've done quite a lot of jerkface reboot -f and sysrq + b
with Btrfs and have never broken a file system so far (power failures,
different story), but maybe I'm lucky and have a bunch of well-behaved
devices.

--
Chris Murphy
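A sketch of the sysrq + t step described above, assuming a root shell
and that the magic-sysrq interface is available (these are standard
Linux interfaces, nothing btrfs-specific):

    # enable all sysrq functions if they are currently restricted
    echo 1 > /proc/sys/kernel/sysrq

    # 't' dumps every task's state into the kernel message buffer
    echo t > /proc/sysrq-trigger

    # read the dump back; the ring buffer can wrap, so prefer the
    # journal on systemd machines
    journalctl -k | less
    dmesg | less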
Re: Device Delete Stalls
On 2018-08-23 10:04, Stefan Malte Schumacher wrote:
> Hallo,
>
> I originally had a RAID with six 4TB drives, which was more than 80
> percent full. So now I bought a 10TB drive, added it to the array,
> and gave the command to remove the oldest drive in the array:
>
> btrfs device delete /dev/sda /mnt/btrfs-raid
>
> I kept a terminal with "watch btrfs fi show" open, and it showed that
> the size of /dev/sda had been set to zero and that data was being
> redistributed to the other drives. All seemed well, but now the
> process stalls with 8GB left on /dev/sda. It also seems that the size
> of the drive has been reset to the original value of 3.64TiB.
>
> Label: none  uuid: 1609e4e1-4037-4d31-bf12-f84a691db5d8
>         Total devices 7  FS bytes used 8.07TiB
>         devid  1 size 3.64TiB used 8.00GiB path /dev/sda
>         devid  2 size 3.64TiB used 2.73TiB path /dev/sdc
>         devid  3 size 3.64TiB used 2.73TiB path /dev/sdd
>         devid  4 size 3.64TiB used 2.73TiB path /dev/sde
>         devid  5 size 3.64TiB used 2.73TiB path /dev/sdf
>         devid  6 size 3.64TiB used 2.73TiB path /dev/sdg
>         devid  7 size 9.10TiB used 2.50TiB path /dev/sdb
>
> I see no more btrfs worker processes and no more activity in iotop.
> I am using a current Debian stretch, which uses kernel 4.9.0-8 and
> btrfs-progs 4.7.3-1.
>
> How should I proceed? I have a backup but would prefer an easier and
> less time-consuming way out of this mess.

Not exactly what you asked for, but I do have some advice on how to
avoid this situation in the future: if at all possible, use `btrfs
device replace` instead of an add/delete cycle.

The replace operation requires two things. First, you have to be able
to connect the new device to the system while all the old ones, except
the device you are removing, are present. Second, the new device has to
be at least as big as the old one. Assuming both conditions are met and
you can use replace, it's generally much faster and a lot more reliable
than an add/delete cycle (especially when the array is nearly full).
This is because replace just copies the data that's on the old device
directly (or rebuilds it directly if the old device is no longer
present or is corrupted), whereas the add/delete method implicitly
re-balances the entire array, which takes a long time and may fail if
the array is mostly full.

Now, as far as what's actually going on here, I'm unfortunately not
quite sure, and therefore I'm really not the best person to be giving
advice on how to fix it. I will comment that having info on the
allocations for all the devices (not just /dev/sda) would be useful in
debugging, but even with that I don't know that I personally can help.
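In concrete terms, the replace workflow would look something like this
(a sketch; /dev/sdh is a placeholder for the hypothetical new,
equal-or-larger drive, and the mount point is the one from this
thread):

    # one-step swap: copy devid 1's data directly onto the new drive
    btrfs replace start /dev/sda /dev/sdh /mnt/btrfs-raid

    # replace runs in the background; check on it with:
    btrfs replace status /mnt/btrfs-raid

    # if the old drive is failing, -r reads from it only when no
    # other good copy exists
    btrfs replace start -r /dev/sda /dev/sdh /mnt/btrfs-raid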
Device Delete Stalls - Addition
Hello everybody,

I think this might be useful:

root@mars:~# btrfs dev usage /mnt/btrfs-raid/
/dev/sda, ID: 1
   Device size:             3.64TiB
   Device slack:              0.00B
   Data,RAID1:              7.00GiB
   Metadata,RAID1:          1.00GiB
   Unallocated:             3.63TiB

Yours sincerely
Stefan
Device Delete Stalls
Hallo,

I originally had a RAID with six 4TB drives, which was more than 80
percent full. So now I bought a 10TB drive, added it to the array, and
gave the command to remove the oldest drive in the array:

btrfs device delete /dev/sda /mnt/btrfs-raid

I kept a terminal with "watch btrfs fi show" open, and it showed that
the size of /dev/sda had been set to zero and that data was being
redistributed to the other drives. All seemed well, but now the process
stalls with 8GB left on /dev/sda. It also seems that the size of the
drive has been reset to the original value of 3.64TiB.

Label: none  uuid: 1609e4e1-4037-4d31-bf12-f84a691db5d8
        Total devices 7  FS bytes used 8.07TiB
        devid  1 size 3.64TiB used 8.00GiB path /dev/sda
        devid  2 size 3.64TiB used 2.73TiB path /dev/sdc
        devid  3 size 3.64TiB used 2.73TiB path /dev/sdd
        devid  4 size 3.64TiB used 2.73TiB path /dev/sde
        devid  5 size 3.64TiB used 2.73TiB path /dev/sdf
        devid  6 size 3.64TiB used 2.73TiB path /dev/sdg
        devid  7 size 9.10TiB used 2.50TiB path /dev/sdb

I see no more btrfs worker processes and no more activity in iotop. I
am using a current Debian stretch, which uses kernel 4.9.0-8 and
btrfs-progs 4.7.3-1.

How should I proceed? I have a backup but would prefer an easier and
less time-consuming way out of this mess.

Yours
Stefan
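One way to tell "slow" from "stuck" in a situation like this (a sketch;
the exact wording of the messages varies between kernel versions) is to
watch the kernel log, since the kernel normally logs each block group
relocation while a device delete is making progress:

    # relocation messages should keep appearing while the delete
    # is still moving data
    dmesg | grep -i 'relocating block group' | tail

    # or follow the kernel log live
    dmesg -w | grep -i btrfs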