Re: Disk "failed" while doing scrub
Dāvis Mosāns posted on Tue, 14 Jul 2015 04:54:27 +0300 as excerpted:

> 2015-07-13 11:12 GMT+03:00 Duncan <1i5t5.dun...@cox.net>:
>> You say five disks, but nowhere in your post do you mention what raid
>> mode you were using, neither do you post btrfs filesystem show and
>> btrfs filesystem df, as suggested on the wiki and which list that
>> information.
>
> Sorry, I forgot. I'm running Arch Linux 4.0.7, with btrfs-progs v4.1.
> Using RAID1 for metadata and single for data, with features
> big_metadata, extended_iref, mixed_backref, no_holes, skinny_metadata,
> and mounted with noatime,compress=zlib,space_cache,autodefrag

Thanks.  FWIW, pretty similar here, but running gentoo, now with
btrfs-progs v4.1.1 and the mainline 4.2-rc1+ kernel.

BTW, note that space_cache has been the default for quite some time now.
I've never actually manually mounted with space_cache on any of my
filesystems over several years, yet they all report it when I check
/proc/mounts, etc.  So if you're adding that manually, you can kill that
option and save the commandline/fstab space. =:^)

> Label: 'Data'  uuid: 1ec5b839-acc6-4f70-be9d-6f9e6118c71c
>         Total devices 5 FS bytes used 7.16TiB
>         devid    1 size 2.73TiB used 2.35TiB path /dev/sdc
>         devid    2 size 1.82TiB used 1.44TiB path /dev/sdd
>         devid    3 size 1.82TiB used 1.44TiB path /dev/sde
>         devid    4 size 1.82TiB used 1.44TiB path /dev/sdg
>         devid    5 size 931.51GiB used 539.01GiB path /dev/sdh
>
> Data, single: total=7.15TiB, used=7.15TiB
> System, RAID1: total=8.00MiB, used=784.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=16.00GiB, used=14.37GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B

And note that you can easily and quickly remove those empty single-mode
system and metadata chunks, which are an artifact of the way mkfs.btrfs
works, using balance filters.  btrfs balance start -mprofiles=single ...
should do it.
They're actually working on mkfs.btrfs patches right now to fix it not
to do that.  There are active patch and testing threads discussing it.
Hopefully for btrfs-progs v4.2.  (4.1.1 has the patches for
single-device and prep work for multi-device, according to the
changelog.)

>>> Because filesystem still mounts, I assume I should do "btrfs device
>>> delete /dev/sdd /mntpoint" and then restore damaged files from
>>> backup.
>>
>> You can try a replace, but with a failing drive still connected,
>> people report mixed results.  It's likely to fail as it can't read
>> certain blocks to transfer them to the new device.
>
> As I understand, device delete will copy data from that disk and
> distribute it across the rest of the disks, while btrfs replace will
> copy to a new disk which must be at least the size of the disk I'm
> replacing.

Sorry.  You wrote delete, I read replace.  How'd I do that? =:^(

You are absolutely correct.  Delete would be better here.  I guess I had
just been reading a thread discussing the problems I mentioned with
replace, and saw what I expected to see, not what you actually wrote.

>> There's no such partial-file-with-null-fill tool shipped just yet.

> From journal I have only 14 files mentioned where errors occurred.
> Now 13 files of them don't throw any errors and their SHAs match my
> backups, so they're fine.

Good.  I was going on the assumption that the questionable device was in
much worse shape than that.

> And actually btrfs does allow to copy/read that one damaged file, only
> I get an I/O error when trying to read data from those broken sectors

Good, and good to know.  Thanks. =:^)

> best and correct way to recover a file is using ddrescue

I was just going to mention ddrescue. =:^)

> $ du -m /tmp/damaged_file
> 6251    /tmp/damaged_file
>
> so basically only like 8K bytes are unrecoverable from this file.
> Probably a tool could be created that could get even more data back,
> knowing about btrfs.
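The null-fill tool wished for above is simple in outline. Here is a minimal sketch of the core loop in C, with a hypothetical read_block() callback standing in for the actual device reads (no btrfs awareness here, just the zero-fill-on-error behavior that ddrescue provides); fake_read is likewise a stand-in "disk" so the logic can be exercised without failing hardware:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-block reader: returns 0 on success, -1 on a media
 * error.  In a real tool this would be a pread() on the device. */
typedef int (*read_block_fn)(size_t idx, unsigned char *buf,
                             size_t block_size, void *ctx);

/* Copy nblocks into out, zero-filling any block that cannot be read so
 * that file offsets stay intact.  Returns how many blocks were bad. */
static size_t copy_with_null_fill(read_block_fn read_block, void *ctx,
                                  unsigned char *out, size_t nblocks,
                                  size_t block_size)
{
    size_t bad = 0;
    for (size_t i = 0; i < nblocks; i++) {
        unsigned char *dst = out + i * block_size;
        if (read_block(i, dst, block_size, ctx) != 0) {
            memset(dst, 0, block_size); /* null-fill, don't skip */
            bad++;
        }
    }
    return bad;
}

/* Stand-in "disk" with one unreadable block (index 2), for testing. */
static int fake_read(size_t idx, unsigned char *buf, size_t block_size,
                     void *ctx)
{
    (void)ctx;
    if (idx == 2)
        return -1;                     /* simulated media error */
    memset(buf, 0xAB, block_size);     /* readable data */
    return 0;
}
```

The important property is that the bad region is replaced rather than dropped, so everything after it stays at the right offset.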
>
>> There /is/, however, a command that can be used to either regenerate
>> or zero-out the checksum tree.  See btrfs check --init-csum-tree.
>
> Seems you can't specify a path/file for it, and it's quite a
> destructive action if you only want data about one specific file.

Yes.  It's whole-filesystem-all-or-nothing, unfortunately. =:^(

> I did scrub a second time and this time there aren't that many
> uncorrectable errors and also there are no csum_errors, so
> --init-csum-tree is useless here I think.

Agreed.

> Most likely previously scrub got that many errors because it still
> continued for a bit even if disk didn't respond.

Yes.

> scrub status [...]
>         read_errors: 2
>         csum_errors: 0
>         verify_errors: 0
>         no_csum: 89600
>         csum_discards: 656214
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 2
>         unverified_errors: 0
>         corrected_errors: 0
>         last_physical: 2590041112576

OK, that matches up with
[PATCH] Revert "btrfs-progs: mkfs: create only desired block groups for single device"
This reverts commit 5f8232e5c8f0b0de0ef426274911385b0e877392.

This commit causes a regression:

---
$ mkfs.btrfs -f /dev/sda6
$ btrfsck /dev/sda6
Checking filesystem on /dev/sda6
UUID: 2ebb483c-1986-4610-802a-c6f3e6ab4b76
checking extents
Chunk[256, 228, 0]: length(4194304), offset(0), type(2) mismatch with
block group[0, 192, 4194304]: offset(4194304), objectid(0), flags(34)
Chunk[256, 228, 4194304]: length(8388608), offset(4194304), type(4)
mismatch with block group[4194304, 192, 8388608]: offset(8388608),
objectid(4194304), flags(36)
Block group[0, 4194304] (flags = 34) didn't find the relative chunk.
Block group[4194304, 8388608] (flags = 36) didn't find the relative
chunk.
...
---

The commit has the following bugs causing the problem:

1) A typo forgets to add meta/data_profile for alloc_chunk.
   meta/data_profile is only added when allocating a block group, but
   not the chunk.

2) The type of the first system chunk is impossible to modify yet.
   The type of the first chunk and its stripe is hard coded into the
   make_btrfs() function.  So even if we try to modify the type of the
   block group, we are unable to change the type of the first chunk,
   causing the chunk type mismatch problem.

The 1st bug can be fixed quite easily, but the second cannot.
The good news is, the last patch, "btrfs-progs: mkfs: Cleanup temporary
chunk to avoid strange balance behavior.", from my patchset can handle
it quite well alone, so just revert the patch.

A new bug fix for btrfsck (err is 0 even when the chunk/extent tree is
corrupted) and new test cases for mkfs will follow soon.
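The mismatch btrfsck reports above is, at bottom, an equality check: a chunk's type bits must match the flags of its paired block group, and here the DUP profile bit got into one but not the other. A small model of that check follows (the flag values are the usual on-disk BTRFS_BLOCK_GROUP_* bits; this is an illustration of the rule, not btrfsck's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* On-disk block group / chunk type bits (subset). */
#define BG_DATA     (1ULL << 0)
#define BG_SYSTEM   (1ULL << 1)
#define BG_METADATA (1ULL << 2)
#define BG_DUP      (1ULL << 5)

/* A chunk and the block group that covers it must agree exactly; any
 * difference (here: missing profile bits) is the fsck error above. */
static int chunk_bg_mismatch(uint64_t chunk_type, uint64_t bg_flags)
{
    return chunk_type != bg_flags;
}
```

In the transcript, flags(34) is SYSTEM|DUP and flags(36) is METADATA|DUP, while the chunks carry the bare type(2)/type(4) — exactly the "typo forgets to add meta/data_profile" bug in point 1.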
Signed-off-by: Qu Wenruo
---
 mkfs.c | 34 +++----
 1 file changed, 7 insertions(+), 27 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index ee8a3cb..afecf00 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -59,9 +59,8 @@ struct mkfs_allocation {
 	u64 system;
 };
 
-static int create_metadata_block_groups(struct btrfs_root *root,
-		u64 metadata_profile, int mixed,
-		struct mkfs_allocation *allocation)
+static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
+		struct mkfs_allocation *allocation)
 {
 	struct btrfs_trans_handle *trans;
 	u64 bytes_used;
@@ -74,7 +73,6 @@ static int create_metadata_block_groups(struct btrfs_root *root,
 	root->fs_info->system_allocs = 1;
 	ret = btrfs_make_block_group(trans, root, bytes_used,
-				     metadata_profile |
 				     BTRFS_BLOCK_GROUP_SYSTEM,
 				     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
 				     0, BTRFS_MKFS_SYSTEM_GROUP_SIZE);
@@ -93,7 +91,6 @@ static int create_metadata_block_groups(struct btrfs_root *root,
 	}
 	BUG_ON(ret);
 	ret = btrfs_make_block_group(trans, root, 0,
-				     metadata_profile |
 				     BTRFS_BLOCK_GROUP_METADATA |
 				     BTRFS_BLOCK_GROUP_DATA,
 				     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
@@ -110,7 +107,6 @@ static int create_metadata_block_groups(struct btrfs_root *root,
 	}
 	BUG_ON(ret);
 	ret = btrfs_make_block_group(trans, root, 0,
-				     metadata_profile |
 				     BTRFS_BLOCK_GROUP_METADATA,
 				     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
 				     chunk_start, chunk_size);
@@ -126,7 +122,7 @@ err:
 }
 
 static int create_data_block_groups(struct btrfs_trans_handle *trans,
-		struct btrfs_root *root, u64 data_profile, int mixed,
+		struct btrfs_root *root, int mixed,
 		struct mkfs_allocation *allocation)
 {
 	u64 chunk_start = 0;
@@ -143,7 +139,6 @@ static int create_data_block_groups(struct btrfs_trans_handle *trans,
 	}
 	BUG_ON(ret);
 	ret = btrfs_make_block_group(trans, root, 0,
-				     data_profile |
 				     BTRFS_BLOCK_GROUP_DATA,
 				     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
 				     chunk_start, chunk_size);
@@ -1337,8 +1332,6 @@ int main(int ac, char **av)
 	u64 alloc_start = 0;
 	u64 metadata_profile = 0;
 	u64 data_profile = 0;
-	u64 default_metadata_profile = 0;
-	u64 default_data_profile = 0;
 	u32 nodesize = max_t(u32, sysconf(_SC_PAGESIZE),
 			BTRFS_MKFS_DEFAULT_NODE_SIZE);
 	u32 sectorsize = 4096;
@@ -1697,19 +1690,7 @@ int main(int ac, char **av)
 	}
 	root->fs_info->alloc_start = alloc_start;
-	if (dev_cnt == 0) {
-		default_metadata_profile = metadata_profile;
-		default_
Re: Disk "failed" while doing scrub
2015-07-13 11:12 GMT+03:00 Duncan <1i5t5.dun...@cox.net>:
> You say five disks, but nowhere in your post do you mention what raid
> mode you were using, neither do you post btrfs filesystem show and
> btrfs filesystem df, as suggested on the wiki and which list that
> information.

Sorry, I forgot. I'm running Arch Linux 4.0.7, with btrfs-progs v4.1.
Using RAID1 for metadata and single for data, with features
big_metadata, extended_iref, mixed_backref, no_holes, skinny_metadata,
and mounted with noatime,compress=zlib,space_cache,autodefrag

Label: 'Data'  uuid: 1ec5b839-acc6-4f70-be9d-6f9e6118c71c
        Total devices 5 FS bytes used 7.16TiB
        devid    1 size 2.73TiB used 2.35TiB path /dev/sdc
        devid    2 size 1.82TiB used 1.44TiB path /dev/sdd
        devid    3 size 1.82TiB used 1.44TiB path /dev/sde
        devid    4 size 1.82TiB used 1.44TiB path /dev/sdg
        devid    5 size 931.51GiB used 539.01GiB path /dev/sdh

Data, single: total=7.15TiB, used=7.15TiB
System, RAID1: total=8.00MiB, used=784.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=16.00GiB, used=14.37GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

>> Because filesystem still mounts, I assume I should do "btrfs device
>> delete /dev/sdd /mntpoint" and then restore damaged files from
>> backup.
>
> You can try a replace, but with a failing drive still connected,
> people report mixed results.  It's likely to fail as it can't read
> certain blocks to transfer them to the new device.

As I understand, device delete will copy data from that disk and
distribute it across the rest of the disks, while btrfs replace will
copy to a new disk which must be at least the size of the disk I'm
replacing.  Assuming the other existing disks are good, why would
replace be preferable over delete?  Because delete could fail, but
replace not?

> There's no such partial-file-with-null-fill tool shipped just yet.
> Those files normally simply trigger errors trying to read them,
> because btrfs won't let you at them if the checksum doesn't verify.

From journal I have only 14 files mentioned where errors occurred. Now
13 files of them don't throw any errors and their SHAs match my
backups, so they're fine.

And actually btrfs does allow to copy/read that one damaged file, only
I get an I/O error when trying to read data from those broken sectors:

kernel: drivers/scsi/mvsas/mv_sas.c 1863:Release slot [0] tag[0], task [88011c8c9900]:
kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 0001, slot [0].
kernel: sas: sas_ata_task_done: SAS error 8a
kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
kernel: sas: ata9: end_device-7:2: cmd error handler
kernel: sas: ata7: end_device-7:0: dev error handler
kernel: sas: ata14: end_device-7:7: dev error handler
kernel: ata9.00: exception Emask 0x0 SAct 0x4000 SErr 0x0 action 0x0
kernel: ata9.00: failed command: READ FPDMA QUEUED
kernel: ata9.00: cmd 60/00:00:00:33:a1/0f:00:ab:00:00/40 tag 14 ncq 1966080 in
        res 41/40:00:48:40:a1/00:0f:ab:00:00/00 Emask 0x409 (media error)
kernel: ata9.00: status: { DRDY ERR }
kernel: ata9.00: error: { UNC }
kernel: ata9.00: configured for UDMA/133
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
kernel: sd 7:0:2:0: [sdd] tag#0 Sense Key : 0x3 [current] [descriptor]
kernel: sd 7:0:2:0: [sdd] tag#0 ASC=0x11 ASCQ=0x4
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ab a1 33 00 00 0f 00 00
kernel: blk_update_request: I/O error, dev sdd, sector 2879471688
kernel: ata9: EH complete
kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1

but all other sectors can be copied fine

$ du -m ./damaged_file
6250    ./damaged_file
$ cp ./damaged_file /tmp/
cp: error reading ‘damaged_file’: Input/output error
$ du -m /tmp/damaged_file
4335    /tmp/damaged_file

cp copies the first part of the file correctly, and I verified that
both the start of the file (first 4336M) and the end
of the file (last 1890M) SHAs match the backup:

$ head -c 4336M ./damaged_file | sha256sum
e81b20bfa7358c9f5a0ed165bffe43185abc59e35246e52a7be1d43e6b7e040d  -
$ head -c 4337M ./damaged_file | sha256sum
head: error reading ‘./damaged_file’: Input/output error
$ tail -c 1890M ./damaged_file | sha256sum
941568f4b614077858cb8c8dd262bb431bf4c45eca936af728ecffc95619cb60  -
$ tail -c 1891M ./damaged_file | sha256sum
tail: error reading ‘./damaged_file’: Input/output error

With dd I can also copy almost all of the file, only with the noerror
option it excludes those regions from the target file rather than
filling them with nulls, so this isn't good for recovery:

$ dd conv=noerror if=damaged_file of=/tmp/damaged_file
dd: error reading ‘damaged_file’: Input/output error
8880328+0 records in
8880328+0 records out
4546727936 bytes (4,5 GB) copied, 69,7282 s, 65,2 MB/s
dd: error reading ‘damaged_file’: Input/output error
8930824+0 records in
8930824+0 records out
4572581888 bytes (4,6 GB) copied, 113,648 s, 40,2 MB/s
12801720+0
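The head -c 4336M/4337M probing above is a manual binary search for the end of the readable prefix. A sketch of the same search in C, with a hypothetical readable() probe standing in for "head -c off file succeeds"; the constant in the stand-in probe is borrowed from the first dd error point above, purely for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical probe: returns 1 if bytes [0, off) are all readable. */
typedef int (*probe_fn)(uint64_t off, void *ctx);

/* Binary search on [lo, hi], with lo known readable and hi known not:
 * returns the largest fully readable prefix length. */
static uint64_t largest_readable_prefix(probe_fn readable, void *ctx,
                                        uint64_t lo, uint64_t hi)
{
    while (hi - lo > 1) {
        uint64_t mid = lo + (hi - lo) / 2;
        if (readable(mid, ctx))
            lo = mid;   /* prefix up to mid still reads fine */
        else
            hi = mid;   /* error lies at or before mid */
    }
    return lo;
}

/* Stand-in probe: pretend the first unreadable byte sits right after
 * offset 4546727936 (where the first dd error was reported above). */
static int fake_probe(uint64_t off, void *ctx)
{
    (void)ctx;
    return off <= 4546727936ULL;
}
```

About 33 probes pin down the exact byte where a multi-gigabyte file stops being readable, instead of guessing offsets by hand.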
Re: Can't mount btrfs volume on rbd
Thanks a lot, Steve!

With this binary dump, we can find out the cause of your problem and
make btrfsck handle and repair it.

Furthermore, this provides a good hint on what's going wrong in the
kernel.

I'll start investigating this right now.

Thanks,
Qu

Steve Dainard wrote on 2015/07/13 13:22 -0700:

Hi Qu,

I ran into this issue again, without pacemaker involved, so I'm really
not sure what is triggering this. There is no content at all on this
disk; basically it was created with a btrfs filesystem, mounted, and
now, some reboots later (and possibly hard resets), it won't mount,
with a stale file handle error.

I've DD'd the 10G disk and tarballed it to 10MB, I'll send it to you in
another email so the attachment doesn't spam the list.

Thanks,
Steve

On Mon, Jun 15, 2015 at 6:27 PM, Qu Wenruo wrote:

Steve Dainard wrote on 2015/06/15 09:19 -0700:

Hi Qu,

# btrfs --version
btrfs-progs v4.0.1
# btrfs check /dev/rbd30
Checking filesystem on /dev/rbd30
UUID: 1bb22a03-bc25-466f-b078-c66c6f6a6d28
checking extents
cmds-check.c:3735: check_owner_ref: Assertion `rec->is_root` failed.
btrfs[0x41aee6]
btrfs[0x423f5d]
btrfs[0x424c99]
btrfs[0x4258f6]
btrfs(cmd_check+0x14a3)[0x42893d]
btrfs(main+0x15d)[0x409c71]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f29ce437af5]
btrfs[0x409829]

# btrfs-image /dev/rbd30 rbd30.image -c9
# btrfs-image -r rbd30.image rbd30.image.2
# mount rbd30.image.2 temp
mount: mount /dev/loop0 on /mnt/temp failed: Stale file handle

OK, my assumptions are all wrong.

I'd better check the debug-tree output more carefully.

BTW, is rbd30 the block device which you took the debug-tree output
from?

If so, would you please do a dd dump of it and send it to me?
If it contains important/secret info, just forget this.

Maybe I can improve the btrfsck tool to fix it.
I have a suspicion this was caused by pacemaker starting
ceph/filesystem resources on two nodes at the same time; I haven't been
able to replicate the issue after hard poweroff if ceph/btrfs are not
being controlled by pacemaker.

Did you mean mounting the same device on different systems?

Thanks,
Qu

Thanks for your help.

On Mon, Jun 15, 2015 at 1:06 AM, Qu Wenruo wrote:

The debug result seems valid.
So I'm afraid the problem is not in btrfs.

Would you please try the following 2 things to eliminate btrfs
problems?

1) btrfsck from 4.0.1 on the rbd

If the assert still happens, please upload the image of the volume (dd
image), to help us improve btrfs-progs.

2) btrfs-image dump and rebuild the fs in another place.

# btrfs-image <dev> <dump_file> -c9
# btrfs-image -r <dump_file> <new_image>
# mount <new_image> <mnt>

This will dump all metadata from <dev> to <dump_file>, and then use
<dump_file> to rebuild an image called <new_image>.

If <new_image> can be mounted, then the metadata in the RBD device is
completely OK, and we can conclude the problem is not caused by btrfs
(maybe ceph?).

BTW, all the commands are recommended to be executed on the device
which you got the debug info from.
As it's a small and almost empty device, command execution should be
quite fast on it.

Thanks,
Qu

On 2015年06月13日 00:09, Steve Dainard wrote:

Hi Qu,

I have another volume with the same error; btrfs-debug-tree output from
btrfs-progs 4.0.1 is here: http://pastebin.com/k3R3bngE

I'm not sure how to interpret the output, but the exit status is 0, so
it looks like btrfs doesn't think there's an issue with the file
system.

I get the same mount error with options ro,recovery.
On Fri, Jun 12, 2015 at 12:23 AM, Qu Wenruo wrote:

-------- Original Message --------
Subject: Can't mount btrfs volume on rbd
From: Steve Dainard
To:
Date: 2015年06月11日 23:26

Hello,

I'm getting an error when attempting to mount a volume on a host that
was forcibly powered off:

# mount /dev/rbd4 climate-downscale-CMIP5/
mount: mount /dev/rbd4 on /mnt/climate-downscale-CMIP5 failed: Stale
file handle

/var/log/messages:
Jun 10 15:31:07 node1 kernel: rbd4: unknown partition table

# parted /dev/rbd4 print
Model: Unknown (unknown)
Disk /dev/rbd4: 36.5TB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system  Flags
 1      0.00B  36.5TB  36.5TB  btrfs

# btrfs check --repair /dev/rbd4
enabling repair mode
Checking filesystem on /dev/rbd4
UUID: dfe6b0c8-2866-4318-abc2-e1e75c891a5e
checking extents
cmds-check.c:2274: check_owner_ref: Assertion `rec->is_root` failed.
btrfs[0x4175cc]
btrfs[0x41b873]
btrfs[0x41c3fe]
btrfs[0x41dc1d]
btrfs[0x406922]

OS: CentOS 7.1
btrfs-progs: 3.16.2

The btrfs-progs seems quite old, and the above btrfsck error seems
quite possibly related to the old version.

Would you please upgrade btrfs-progs to 4.0 and see what will happen?
Hope it can give better info.

BTW, it's a good idea to call btrfs-debug-tree /dev/rbd4 to see the
output.

Thanks,
Qu

Ceph: version: 0.94.1/CentOS 7.1

I haven't found any references to 'stale file handle' on btrfs. The
underlying block device is a ceph rbd, so I've posted to both lists for
any feedback.

Also once I ref
Re: [GIT PULL] More btrfs bug fixes
On Sun, Jul 12, 2015 at 02:50:47AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana
>
> Hi Chris,
>
> Please consider the following changes for the kernel 4.2 release. All
> these patches have been available on the mailing list for some time.
>
> One of the patches is a fix for a regression in the delayed references
> code that landed in 4.2-rc1. Two of them are for issues reported by
> users on the list and IRC recently (which I've cc'ed for stable), and
> the final one is just a missing update of an inode's on-disk size
> after truncating a file if the no_holes feature is enabled, which I
> found some time ago.
>
> I have rebased them on top of your current integration-4.2 branch,
> re-tested them, and incorporated any tags people have added through
> the mailing list (Reviewed-by, Acked-by).

Thanks Filipe, I've pulled these in along with a few more.  I'll test
overnight and push out in the morning.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can't mount btrfs volume on rbd
Hi Qu,

I ran into this issue again, without pacemaker involved, so I'm really
not sure what is triggering this. There is no content at all on this
disk; basically it was created with a btrfs filesystem, mounted, and
now, some reboots later (and possibly hard resets), it won't mount,
with a stale file handle error.

I've DD'd the 10G disk and tarballed it to 10MB, I'll send it to you in
another email so the attachment doesn't spam the list.

Thanks,
Steve

On Mon, Jun 15, 2015 at 6:27 PM, Qu Wenruo wrote:
>
> Steve Dainard wrote on 2015/06/15 09:19 -0700:
>>
>> Hi Qu,
>>
>> # btrfs --version
>> btrfs-progs v4.0.1
>> # btrfs check /dev/rbd30
>> Checking filesystem on /dev/rbd30
>> UUID: 1bb22a03-bc25-466f-b078-c66c6f6a6d28
>> checking extents
>> cmds-check.c:3735: check_owner_ref: Assertion `rec->is_root` failed.
>> btrfs[0x41aee6]
>> btrfs[0x423f5d]
>> btrfs[0x424c99]
>> btrfs[0x4258f6]
>> btrfs(cmd_check+0x14a3)[0x42893d]
>> btrfs(main+0x15d)[0x409c71]
>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f29ce437af5]
>> btrfs[0x409829]
>>
>> # btrfs-image /dev/rbd30 rbd30.image -c9
>> # btrfs-image -r rbd30.image rbd30.image.2
>> # mount rbd30.image.2 temp
>> mount: mount /dev/loop0 on /mnt/temp failed: Stale file handle
>
> OK, my assumptions are all wrong.
>
> I'd better check the debug-tree output more carefully.
>
> BTW, is rbd30 the block device which you took the debug-tree output
> from?
>
> If so, would you please do a dd dump of it and send it to me?
> If it contains important/secret info, just forget this.
>
> Maybe I can improve the btrfsck tool to fix it.
>
>> I have a suspicion this was caused by pacemaker starting
>> ceph/filesystem resources on two nodes at the same time; I haven't
>> been able to replicate the issue after hard poweroff if ceph/btrfs
>> are not being controlled by pacemaker.
>
> Did you mean mounting the same device on different systems?
>
> Thanks,
> Qu
>
>> Thanks for your help.
>>
>> On Mon, Jun 15, 2015 at 1:06 AM, Qu Wenruo
>> wrote:
>>>
>>> The debug result seems valid.
>>> So I'm afraid the problem is not in btrfs.
>>>
>>> Would you please try the following 2 things to eliminate btrfs
>>> problems?
>>>
>>> 1) btrfsck from 4.0.1 on the rbd
>>>
>>> If the assert still happens, please upload the image of the volume
>>> (dd image), to help us improve btrfs-progs.
>>>
>>> 2) btrfs-image dump and rebuild the fs in another place.
>>>
>>> # btrfs-image <dev> <dump_file> -c9
>>> # btrfs-image -r <dump_file> <new_image>
>>> # mount <new_image> <mnt>
>>>
>>> This will dump all metadata from <dev> to <dump_file>, and then use
>>> <dump_file> to rebuild an image called <new_image>.
>>>
>>> If <new_image> can be mounted, then the metadata in the RBD device
>>> is completely OK, and we can conclude the problem is not caused by
>>> btrfs (maybe ceph?).
>>>
>>> BTW, all the commands are recommended to be executed on the device
>>> which you got the debug info from.
>>> As it's a small and almost empty device, command execution should
>>> be quite fast on it.
>>>
>>> Thanks,
>>> Qu
>>>
>>> On 2015年06月13日 00:09, Steve Dainard wrote:
>>>>
>>>> Hi Qu,
>>>>
>>>> I have another volume with the same error; btrfs-debug-tree output
>>>> from btrfs-progs 4.0.1 is here: http://pastebin.com/k3R3bngE
>>>>
>>>> I'm not sure how to interpret the output, but the exit status is
>>>> 0, so it looks like btrfs doesn't think there's an issue with the
>>>> file system.
>>>>
>>>> I get the same mount error with options ro,recovery.
On Fri, Jun 12, 2015 at 12:23 AM, Qu Wenruo wrote:
>
> -------- Original Message --------
> Subject: Can't mount btrfs volume on rbd
> From: Steve Dainard
> To:
> Date: 2015年06月11日 23:26
>
>> Hello,
>>
>> I'm getting an error when attempting to mount a volume on a host
>> that was forcibly powered off:
>>
>> # mount /dev/rbd4 climate-downscale-CMIP5/
>> mount: mount /dev/rbd4 on /mnt/climate-downscale-CMIP5 failed: Stale
>> file handle
>>
>> /var/log/messages:
>> Jun 10 15:31:07 node1 kernel: rbd4: unknown partition table
>>
>> # parted /dev/rbd4 print
>> Model: Unknown (unknown)
>> Disk /dev/rbd4: 36.5TB
>> Sector size (logical/physical): 512B/512B
>> Partition Table: loop
>> Disk Flags:
>>
>> Number  Start  End     Size    File system  Flags
>>  1      0.00B  36.5TB  36.5TB  btrfs
>>
>> # btrfs check --repair /dev/rbd4
>> enabling repair mode
>> Checking filesystem on /dev/rbd4
>> UUID: dfe6b0c8-2866-4318-abc2-e1e75c891a5e
>> checking extents
>> cmds-check.c:2274: check_owner_ref: Assertion `rec->is_root` failed.
>> btrfs[0x4175cc]
>> btrfs[0x41b873]
>> btrfs[0x41c3fe]
>> btrfs[0x41dc1d]
>> btrfs[0x406922]
>>
>> OS: CentOS 7.1
>> btrfs-progs: 3.16.2
>
> The btrfs-progs seems quite old, and the above btrfsck error seems
> quite possibly related to the old version.
>
> Would you please upgrade btrfs-prog
Re: Wiki suggestions
On Mon, 13 Jul 2015 19:21:54 +0200, Marc Joliet wrote:
> OK, I'll make the changes then (sans kernel log).

Just a heads up: I accepted the terms of service, but the link goes to
a non-existent wiki page.

-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who
know we don't" - Bjarne Stroustrup
Re: Wiki suggestions
On Mon, 13 Jul 2015 18:30:09 +0200, David Sterba wrote:
> On Mon, Jul 13, 2015 at 01:18:27PM +0200, Marc Joliet wrote:
>> On Mon, 13 Jul 2015 06:56:17 +0000 (UTC), Duncan
>> <1i5t5.dun...@cox.net> wrote:
>>
>>> Marc Joliet posted on Sun, 12 Jul 2015 14:26:04 +0200 as excerpted:
>>>
>>>> I hope it's not out of place, but I have a few suggestions for the
>>>> Wiki:
>>>
>>> Just in case it wasn't obvious... The wiki is open to user editing.
>>> You can, if you like, get an account and make the changes yourself.
>>> =:^)
>>>
>>> Of course, it's understandable if your reaction to web and wiki
>>> technologies is similar to mine; newsgroups and mailing lists (in
>>> my case via gmane.org's list2news service, so they too are
>>> presented as newsgroups) are your primary domain, and you tend to
>>> treat the web as read-only, so you rarely reply on a web forum, let
>>> alone edit a wiki. I've never gotten a wiki account here for that
>>> reason, either, or I'd have probably gone ahead and made the
>>> suggested changes...
>>>
>>> But with a bit of luck someone with an existing (or even new)
>>> account will be along to make the changes...
>>
>> It's partially a "read-only" habit, but it's also that I'm just not
>> confident in deciding whether those actually *are* good suggestions,
>> or put differently: it's the public face of btrfs, and I don't want
>> to accidentally do something to "ruin" it (to use some hyperbole).
>
> All your suggestions are good; adding more articles/videos/talks
> should be easy as there's a section for that already. The news
> section is mostly written by me, but if you keep your entries
> consistent with the rest then it's ok.
>
> There are a few people who watch over new wiki edits and fix/enhance
> them if needed. You can't do too much damage unless you really want
> to.

OK, I'll make the changes then (sans kernel log).
-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who
know we don't" - Bjarne Stroustrup
Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
On Mon, Jul 13, 2015 at 06:55:29PM +0200, Alex Lyakas wrote:
> Filipe,
> Thanks for the explanation. Those reasons were not so obvious for me.
>
> Would it make sense not to COW the block in case-1, if we are mounted
> with "notreelog"? Or, perhaps, to check that the block does not
> belong to a log tree?

Hi Alex,

The crc rules are the most important, we have to make sure the block
isn't changed while it is in flight.  Also, think about something like
this:

transaction writes block A, puts a pointer to it in the btree,
generation Y
transaction rewrites block A, same generation Y

Later on, we try to read block A again.  We find it has the correct
crc and the correct generation number, but the contents are actually
wrong.

> The second case is more difficult. One problem is that the
> BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a
> block due to memory pressure (this is what I see happening), we
> complete the writeback, release the extent buffer, and the pages are
> evicted from the page cache of btree_inode. After some time we read
> the block again (because we want to modify it in the same
> transaction), but its header is already marked as
> BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at this point it
> should be safe to avoid COW, we will re-COW.
>
> Would it make sense to have some runtime-only mechanism to lock out
> the writeback for an eb? I.e., if we know that an eb is not under
> writeback, and writeback is locked out from starting, we can redirty
> the block without COW. Then we allow the writeback to start when it
> wants to.
>
> In one of my test runs, btrfs had 6.4GB of metadata (before
> raid-induced overhead), but during a particular transaction a total
> of 10GB of metadata (again, before raid-induced overhead) was written
> to disk. (This is the total of all ebs having
> header->generation==curr_transid, not only during commit of the
> transaction). This particular run was with "notreelog".
>
> Machine had 8GB of RAM.
> Linux allows the btree_inode to grow its page cache up to ~6.9GB
> (judging by btree_inode->i_mapping->nrpages). But even though the
> used amount of metadata is less than that, this re-COW'ing of
> already-COW'ed blocks seems to cause page-cache thrashing...

Interesting.  We've addressed this in the past with changes to the
writepage(s) callback for the btree, basically skipping
memory-pressure-related writeback if there isn't that much dirty.
There is a lot of room to improve those decisions, like preferring to
write leaves over nodes, especially full leaves that are not likely to
change again.

-chris
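The condition quoted from should_cow_block() in this thread boils down to a two-part predicate: modify in place only if the block was allocated in the running transaction and writeback has not yet been triggered for it. A simplified model of just that decision (the WRITTEN bit value here is a stand-in, and the real function checks several more conditions):

```c
#include <assert.h>
#include <stdint.h>

#define HEADER_FLAG_WRITTEN (1ULL << 0) /* stand-in for the real bit */

struct eb_model {
    uint64_t generation; /* transid that last COW'ed this block */
    uint64_t flags;
};

/* Returns 1 if the block must be COW'ed, 0 if it may be rewritten in
 * place, mirroring the quoted generation/WRITTEN checks. */
static int must_cow(const struct eb_model *buf, uint64_t transid)
{
    if (buf->generation == transid &&
        !(buf->flags & HEADER_FLAG_WRITTEN))
        return 0;
    return 1;
}
```

The second case below is exactly Alex's re-COW scenario: same transaction, but the WRITTEN flag already set by memory-pressure writeback forces another COW.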
Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
Filipe,

Thanks for the explanation. Those reasons were not so obvious for me.

Would it make sense not to COW the block in case-1, if we are mounted
with "notreelog"? Or, perhaps, to check that the block does not belong
to a log tree?

The second case is more difficult. One problem is that the
BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
due to memory pressure (this is what I see happening), we complete the
writeback, release the extent buffer, and the pages are evicted from
the page cache of btree_inode. After some time we read the block again
(because we want to modify it in the same transaction), but its header
is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
this point it should be safe to avoid COW, we will re-COW.

Would it make sense to have some runtime-only mechanism to lock out
the writeback for an eb? I.e., if we know that an eb is not under
writeback, and writeback is locked out from starting, we can redirty
the block without COW. Then we allow the writeback to start when it
wants to.

In one of my test runs, btrfs had 6.4GB of metadata (before
raid-induced overhead), but during a particular transaction a total of
10GB of metadata (again, before raid-induced overhead) was written to
disk. (This is the total of all ebs having
header->generation==curr_transid, not only during commit of the
transaction). This particular run was with "notreelog".

Machine had 8GB of RAM. Linux allows the btree_inode to grow its page
cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages). But
even though the used amount of metadata is less than that, this
re-COW'ing of already-COW'ed blocks seems to cause page-cache
thrashing...

Thanks,
Alex.

On Mon, Jul 13, 2015 at 11:27 AM, Filipe David Manana wrote:
> On Sun, Jul 12, 2015 at 6:15 PM, Alex Lyakas wrote:
>> Greetings,
>> Looking at the code of should_cow_block(), I see:
>>
>> if (btrfs_header_generation(buf) == trans->transid &&
>>     !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
>> ...
>> So if the extent buffer has been written to disk, and now is changed
>> again in the same transaction, we insist on COW'ing it. Can anybody
>> explain why COW is needed in this case? The transaction has not
>> committed yet, so what is the danger of rewriting to the same
>> location on disk? My understanding was that a tree block needs to be
>> COW'ed at most once in the same transaction. But I see that this is
>> not the case.
>
> That logic is there, as far as I can see, for at least 2 obvious
> reasons:
>
> 1) fsync/log trees. All extent buffers (tree blocks) of a log tree
> have the same transaction id/generation, and you can have multiple
> fsyncs (log transaction commits) per transaction, so you need to
> ensure consistency. If we skipped the COWing in the example below,
> you would get an inconsistent log tree at log replay time when the fs
> is mounted:
>
> transaction N start
>
>    fsync inode A start
>    creates tree block X
>    flush X to disk
>    write a new superblock
>    fsync inode A end
>
>    fsync inode B start
>    skip COW of X because its generation == current transaction id
>    and modify it in place
>    flush X to disk
>
> === crash ===
>
>    write a new superblock
>    fsync inode B end
>
> transaction N commit
>
> 2) The flag BTRFS_HEADER_FLAG_WRITTEN is set not when the block is
> written to disk but instead when we trigger writeback for it. So
> while the writeback is ongoing we want to make sure the block's
> content isn't concurrently modified (we don't keep the eb write
> locked to allow concurrent reads during the writeback).
>
> All tree blocks that don't belong to a log tree are normally written
> only at the end of a transaction commit. But often, due to memory
> pressure for example, the VM can call the writepages() callback of
> the btree inode to force dirty tree blocks to be written to disk
> before the transaction commit.
>
>> I am asking because I am doing some profiling of btrfs metadata work under
>> heavy loads, and I see that sometimes btrfs COW's almost twice more tree
>> blocks than the total metadata size.
>>
>> Thanks,
>> Alex.
>
> --
> Filipe David Manana,
>
> "Reasonable men adapt themselves to the world.
> Unreasonable men adapt the world to themselves.
> That's why all progress depends on unreasonable men."

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Wiki suggestions
On Mon, Jul 13, 2015 at 01:18:27PM +0200, Marc Joliet wrote:
> On Mon, 13 Jul 2015 06:56:17 +0000 (UTC), Duncan <1i5t5.dun...@cox.net> wrote:
> > Marc Joliet posted on Sun, 12 Jul 2015 14:26:04 +0200 as excerpted:
> >
> > > I hope it's not out of place, but I have a few suggestions for the Wiki:
> >
> > Just in case it wasn't obvious... The wiki is open to user editing. You
> > can, if you like, get an account and make the changes yourself. =:^)
> >
> > Of course, it's understandable if your reaction to web and wiki
> > technologies is similar to mine; newsgroups and mailing lists (in my case
> > via gmane.org's list2news service, so they too are presented as
> > newsgroups) are your primary domain, and you tend to treat the web as
> > read-only, so you rarely reply on a web forum, let alone edit a wiki. I've
> > never gotten a wiki account here for that reason, either, or I'd have
> > probably gone ahead and made the suggested changes...
> >
> > But with a bit of luck someone with an existing (or even new) account
> > will be along to make the changes...
>
> It's partially a "read-only" habit, but it's also that I'm just not confident
> in deciding whether those actually *are* good suggestions, or put differently:
> it's the public face of btrfs, and I don't want to accidentally do something to
> "ruin" it (to use some hyperbole).

All your suggestions are good; adding more articles/videos/talks should be easy as there's a section for that already. The news section is mostly written by me, but if you keep your entries consistent with the rest then it's ok. There are a few people who watch over new wiki edits and fix/enhance them if needed. You can't do too much damage unless you really want to.
Re: Btrfs read-only after btrfs-convert from Ext4 & workaround
On Jul 12, 2015 at 2026 -0600, Chris Murphy appeared and said:
> On Sun, Jul 12, 2015 at 7:23 PM, René Pfeiffer wrote:
> >...
> > Output of uname, btrfs, and the dmesg log is attached. Let me know if you
> > need anything else. The old Btrfs is still on another disk, and I can
> > extract information from it.
>
> If you can run 'btrfs check' on it (without repair) using btrfs-progs
> 4.0, and 3.19.1, and report the results of each, that would be really
> useful.

Here we go.

Best,
René.
--
)\._.,--,'``. fL Let GNU/Linux work for you while you take a nap.
/, _.. \ _\ (`._ ,. R. Pfeiffer + http://web.luchs.at/
`._.-(,_..'--(,_..'`-.;.' - System administration + Consulting + Teaching -
Got mail delivery problems? http://web.luchs.at/information/blockedmail.php

Checking filesystem on /dev/mapper/oldcrypt
UUID: 703fc8b4-b2b9-470b-af2f-9aae9536c2fb
checking extents
checking free space cache
There is no free space entry for 163242479616-163242483712
There is no free space entry for 163242479616-167537278976
cache appears valid but isnt 162168569856
found 358532382931 bytes used err is -22
total csum bytes: 346644792
total tree bytes: 3568107520
total fs tree bytes: 3061465088
total extent tree bytes: 50102272
btree space waste bytes: 987626752
file data blocks allocated: 355459522560
 referenced 354990657536
btrfs-progs v3.19.1

Checking filesystem on /dev/mapper/oldcrypt
UUID: 703fc8b4-b2b9-470b-af2f-9aae9536c2fb
checking extents
checking free space cache
block group 162168569856 has wrong amount of free space
failed to load free space cache for block group 162168569856
checking fs roots
root 5 inode 39321856 errors 200, dir isize wrong
root 5 inode 40898635 errors 200, dir isize wrong
found 358532382931 bytes used err is 1
total csum bytes: 346644792
total tree bytes: 3568107520
total fs tree bytes: 3061465088
total extent tree bytes: 50102272
btree space waste bytes: 987626752
file data blocks allocated: 355459522560
 referenced 354990657536
btrfs-progs v4.0
[GIT PULL] More btrfs bug fixes
From: Filipe Manana

Hi Chris,

Please consider the following changes for the kernel 4.2 release. All these patches have been available on the mailing list for some time.

One of the patches is a fix for a regression in the delayed references code that landed in 4.2-rc1. Two of them are for issues reported by users on the list and IRC recently (which I've cc'ed for stable), and the final one is just a missing update of an inode's on-disk size after truncating a file if the no_holes feature is enabled, which I found some time ago.

I have rebased them on top of your current integration-4.2 branch, re-tested them and incorporated any tags people have added through the mailing list (Reviewed-by, Acked-by).

Thanks.

The following changes since commit 9689457b5b0a2b69874c421a489d3fb50ca76b7b:

  Btrfs: fix wrong check for btrfs_force_chunk_alloc() (2015-07-01 17:17:22 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git integration-4.2

for you to fetch changes up to cffc3374e567ef42954f3c7070b3fa83f20f9684:

  Btrfs: fix order by which delayed references are run (2015-07-11 22:36:44 +0100)

Filipe Manana (4):
  Btrfs: fix shrinking truncate when the no_holes feature is enabled
  Btrfs: fix memory leak in the extent_same ioctl
  Btrfs: fix list transaction->pending_ordered corruption
  Btrfs: fix order by which delayed references are run

 fs/btrfs/extent-tree.c | 13 +
 fs/btrfs/inode.c | 5 ++---
 fs/btrfs/ioctl.c | 4 +++-
 fs/btrfs/transaction.c | 4 ++--
 4 files changed, 20 insertions(+), 6 deletions(-)

--
2.1.3
kernel crash - btrfs check shows "extent buffer leak"; Suggestions?
Last time something happened and I poked at it myself, I ended up ruining the pool, so I thought I'd ask here before doing anything. I'm not sure if this really indicates that anything needs doing or not. The filesystem will mount like normal. It doesn't look like the core dump was written anywhere, but I've never actually looked for it before. I'm still Googling where it might be.

[root@san01 ~]# btrfs check /dev/sdi
Checking filesystem on /dev/sdi
UUID: 6848df32-bd2a-49b8-b3b9-40038f98ef8a
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
checking quota groups
Counts for qgroup id: 352 are different
our: referenced 699122253824 referenced compressed 699122253824
disk: referenced 699151855616 referenced compressed 699151855616
diff: referenced -29601792 referenced compressed -29601792
our: exclusive 1279471616 exclusive compressed 1279471616
disk: exclusive 1279471616 exclusive compressed 1279471616
Counts for qgroup id: 844 are different
our: referenced 699130273792 referenced compressed 699130273792
disk: referenced 699159875584 referenced compressed 699159875584
diff: referenced -29601792 referenced compressed -29601792
our: exclusive 81920 exclusive compressed 81920
disk: exclusive 81920 exclusive compressed 81920
found 875663138891 bytes used err is 0
total csum bytes: 790806028
total tree bytes: 1950498816
total fs tree bytes: 436994048
total extent tree bytes: 526237696
btree space waste bytes: 439941098
file data blocks allocated: 981391929344
 referenced 1086835916800
btrfs-progs v4.1
extent buffer leak: start 3338550263808 len 16384
extent buffer leak: start 3338550165504 len 16384
extent buffer leak: start 3100998254592 len 16384
extent buffer leak: start 3100998270976 len 16384
extent buffer leak: start 3100998287360 len 16384
extent buffer leak: start 3100998303744 len 16384
extent buffer leak: start 3100998320128 len 16384
extent buffer leak: start 3100998336512 len 16384
extent buffer leak: start 3100998352896 len 16384
extent buffer leak: start 3100998369280 len 16384
extent buffer leak: start 3338550149120 len 16384
extent buffer leak: start 3338550132736 len 16384
extent buffer leak: start 2756246339584 len 16384
extent buffer leak: start 2756284366848 len 16384
extent buffer leak: start 3339485298688 len 16384
extent buffer leak: start 3339485347840 len 16384
extent buffer leak: start 3339485429760 len 16384
extent buffer leak: start 3339485446144 len 16384
extent buffer leak: start 3339485462528 len 16384
extent buffer leak: start 3339485528064 len 16384
extent buffer leak: start 333948558 len 16384
extent buffer leak: start 3339485560832 len 16384
extent buffer leak: start 3339488018432 len 16384
extent buffer leak: start 3339489361920 len 16384
extent buffer leak: start 3339504140288 len 16384
extent buffer leak: start 3339504156672 len 16384
extent buffer leak: start 3339504435200 len 16384
extent buffer leak: start 3339504467968 len 16384
extent buffer leak: start 3339504484352 len 16384
extent buffer leak: start 3339505778688 len 16384
extent buffer leak: start 3339505811456 len 16384
extent buffer leak: start 3339507105792 len 16384
extent buffer leak: start 3339507187712 len 16384
extent buffer leak: start 3339507204096 len 16384
extent buffer leak: start 3339518394368 len 16384
Re: slowdown after one week
On 13.07.2015 at 13:20, Austin S Hemmelgarn wrote:
> On 2015-07-11 02:46, Stefan Priebe wrote:
>> Hi,
>>
>> while using a 40TB btrfs partition for VM backups, I see a massive
>> slowdown after around one week.
>>
>> The backup task usually takes 2-3 hours. After one week it takes 20
>> hours. If I umount and remount the btrfs volume, it takes 2-3 hours again.
>>
>> Kernel 4.1.1
>>
> I've been seeing similar (although much less drastic) slowdowns over
> time myself, pretty much since I started using BTRFS (IIRC, sometime
> around 3.16). If you're not constantly writing to that backup volume,
> you might want to consider setting up automounting for it.

Yes, but that's awful; it's a bug. It would be very nice if someone involved in btrfs development could comment on this one.
[RFC] Add an option to disable automatic chunk reclamation.
Since upgrading to a kernel with the automatic chunk reclamation patches, I've noticed a number of issues with BTRFS that all seem to either be caused by, or be further exacerbated by, this 'feature'. The four big issues I've seen regarding it are:

1. TRIM/DISCARD support is broken as a (partial?) result of this.
2. It appears to expose underlying issues with the defrag code (stuff ending up more fragmented after defrag).
3. Since upgrading to a kernel with this patch, most of the BTRFS filesystems I have that are very rewrite-heavy have gotten very noticeably slower than they were beforehand (and this goes away when I run them on a kernel without auto-reclaim).
4. All of my filesystems are experiencing seemingly non-deterministic delays around most large-scale VFS-level operations (e.g., deleting or relocating lots of files).

While I understand that this feature does solve (at least partially) a very real issue with BTRFS, there were a number of people who never had this issue to begin with because we ran regular balance operations on our filesystems. Based on this, I would like to propose that some method be provided to disable auto-reclaim. Personally, I would prefer to leave it on as the default and have a mount option to disable it.
Re: Did btrfs filesystem defrag just make things worse?
On 2015-07-11 11:24, Duncan wrote:
> I'm not a coder, only a list regular and btrfs user, and I'm not sure on
> this, but there have been several reports of this nature on the list
> recently, and I have a theory. Maybe the devs can step in and either
> confirm or shoot it down.

While I am a coder, I'm not a BTRFS developer, so what I say below may still be incorrect.

[...trimmed for brevity...]

> Of course during normal use, files get deleted as well, thereby clearing
> space in existing chunks. But this space will be fragmented, with a mix
> of unallocated extents and still-remaining files. The allocator will, I
> /believe/ (this is where people who can actually read the code come in),
> try to use up space in existing chunks before allocating additional
> space, possibly subject to some reasonable extent minimum size, below
> which btrfs will simply allocate another chunk.

AFAICT, this is in fact the case.

> 1) Prioritize reduced fragmentation, at the expense of higher data chunk
> allocation. In the extreme, this would mean always choosing to allocate
> a new chunk and use it if the file (or remainder of the file not yet
> defragged) was larger than the largest free extent in existing data
> chunks.
>
> The problem with this is that over time, the number of partially used
> data chunks goes up as new ones are allocated to defrag into, but
> sub-1 GiB files that are already defragged are left where they are. Of
> course a balance can help here, by combining multiple partial chunks
> into fewer full chunks, but unless a balance is run...
>
> 2) Prioritize chunk utilization, at the expense of leaving some
> fragmentation, despite massive amounts of unallocated space.
>
> This is what I've begun to suspect defrag does. With a bunch of free but
> fragmented space in existing chunks, defrag could actually increase
> fragmentation, as the space in existing chunks is so fragmented that a
> rewrite is forced to use more, smaller extents, because that's all there
> is free, until another chunk is allocated.
> As I mentioned above for normal file allocation, it's quite possible
> that there's some minimum extent size (greater than the bare minimum
> 4 KiB block size) where the allocator will give up and allocate a new
> data chunk, but if so, perhaps this size needs bumping upward, as it
> seems a bit low today.

If I'm reading the code correctly, defrag does indeed try to avoid allocating a new chunk if at all possible.

> Meanwhile, there are a number of exacerbating factors to consider as well.
>
> * Snapshots and other shared references lock extents in place.
>
> Defrag doesn't touch anything but the subvolume it's actually pointed at
> for the defrag. Other subvolumes and shared-reference files will
> continue to keep the extents they reference locked in place. And COW
> will rewrite blocks of a file, but the old reference extent remains
> locked until all references to it are cleared -- the entire file (or at
> least all blocks that were in that extent) must be rewritten, and no
> snapshots or other references to it remain, before it can be freed.
>
> For a few kernel cycles btrfs had snapshot-aware-defrag, but that
> implementation didn't scale well at all, so it was disabled until it
> could be rewritten, and that rewrite hasn't occurred yet. So
> snapshot-aware-defrag remains disabled, and defrag only works on the
> subvolume it's actually pointed at.
>
> As a result, if defrag rewrites a snapshotted file, it actually doubles
> the space that file takes, as it makes a new copy, breaking the
> reference link between it and the copy in the snapshot. Of course, with
> the space not freed up, this will, over time, tend to fragment space
> that is freed even more heavily.

To mitigate this, one can run offline data deduplication (duperemove is the tool I'd suggest for this), although there are caveats to doing that as well.

> * Chunk reclamation. This is the relatively new development that I think
> is triggering the surge in defrag-not-defragging reports we're seeing now.
> Until quite recently, btrfs could allocate new chunks, but it couldn't,
> on its own, deallocate empty chunks. What tended to happen over time was
> that people would find all the filesystem space taken up by empty or
> mostly empty data chunks, and btrfs would start spitting ENOSPC errors
> when it needed to allocate new metadata chunks but couldn't, as all the
> space was in empty data chunks. A balance could fix it, often relatively
> quickly with a -dusage=0 or -dusage=10 filter or the like, but it was a
> manual process; btrfs wouldn't do it on its own.
>
> Recently the devs (mostly) fixed that, and btrfs will automatically
> reclaim entirely empty chunks on its own now. It still doesn't reclaim
> partially empty chunks automatically; a manual rebalance must still be
> used to combine multiple partially empty chunks into fewer full chunks;
> but it does well enough to make the previous problem pretty rare -- we
> don't see the hundreds of GiB of empty data chunks allocated any more,
> like we used to.
>
> Which fixed the one problem, but if my theory is co
Re: slowdown after one week
On 2015-07-11 02:46, Stefan Priebe wrote:
> Hi,
>
> while using a 40TB btrfs partition for VM backups, I see a massive
> slowdown after around one week.
>
> The backup task usually takes 2-3 hours. After one week it takes 20
> hours. If I umount and remount the btrfs volume, it takes 2-3 hours again.
>
> Kernel 4.1.1

I've been seeing similar (although much less drastic) slowdowns over time myself, pretty much since I started using BTRFS (IIRC, sometime around 3.16). If you're not constantly writing to that backup volume, you might want to consider setting up automounting for it.
Re: Wiki suggestions
On Mon, 13 Jul 2015 06:56:17 +0000 (UTC), Duncan <1i5t5.dun...@cox.net> wrote:
> Marc Joliet posted on Sun, 12 Jul 2015 14:26:04 +0200 as excerpted:
>
> > I hope it's not out of place, but I have a few suggestions for the Wiki:
>
> Just in case it wasn't obvious... The wiki is open to user editing. You
> can, if you like, get an account and make the changes yourself. =:^)
>
> Of course, it's understandable if your reaction to web and wiki
> technologies is similar to mine; newsgroups and mailing lists (in my case
> via gmane.org's list2news service, so they too are presented as
> newsgroups) are your primary domain, and you tend to treat the web as
> read-only, so you rarely reply on a web forum, let alone edit a wiki. I've
> never gotten a wiki account here for that reason, either, or I'd have
> probably gone ahead and made the suggested changes...
>
> But with a bit of luck someone with an existing (or even new) account
> will be along to make the changes...

It's partially a "read-only" habit, but it's also that I'm just not confident in deciding whether those actually *are* good suggestions, or put differently: it's the public face of btrfs, and I don't want to accidentally do something to "ruin" it (to use some hyperbole).

However, if somebody gives me the go-ahead, I might just edit the wiki myself (though I don't know enough to be able to edit the kernel news entry ;-) ).

--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
On Sun, Jul 12, 2015 at 6:15 PM, Alex Lyakas wrote:
> Greetings,
> Looking at the code of should_cow_block(), I see:
>
> if (btrfs_header_generation(buf) == trans->transid &&
>     !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
> ...
>
> So if the extent buffer has been written to disk, and now is changed again
> in the same transaction, we insist on COW'ing it. Can anybody explain why
> COW is needed in this case? The transaction has not committed yet, so what
> is the danger of rewriting to the same location on disk? My understanding
> was that a tree block needs to be COW'ed at most once in the same
> transaction. But I see that this is not the case.

That logic is there, as far as I can see, for at least 2 obvious reasons:

1) fsync/log trees. All extent buffers (tree blocks) of a log tree have the same transaction id/generation, and you can have multiple fsyncs (log transaction commits) per transaction, so you need to ensure consistency. If we skipped the COWing in the example below, you would get an inconsistent log tree at log replay time when the fs is mounted:

transaction N start

    fsync inode A start
    creates tree block X
    flush X to disk
    write a new superblock
    fsync inode A end

    fsync inode B start
    skip COW of X because its generation == current transaction id and
    modify it in place
    flush X to disk

    == crash ===

    write a new superblock
    fsync inode B end

transaction N commit

2) The flag BTRFS_HEADER_FLAG_WRITTEN is set not when the block is written to disk but instead when we trigger writeback for it. So while the writeback is ongoing we want to make sure the block's content isn't concurrently modified (we don't keep the eb write locked to allow concurrent reads during the writeback).

All tree blocks that don't belong to a log tree are normally written only at the end of a transaction commit.
But often, e.g. due to memory pressure, the VM can call the writepages() callback of the btree inode to force dirty tree blocks to be written to disk before the transaction commit.

> I am asking because I am doing some profiling of btrfs metadata work under
> heavy loads, and I see that sometimes btrfs COW's almost twice more tree
> blocks than the total metadata size.
>
> Thanks,
> Alex.

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
Re: Disk "failed" while doing scrub
Dāvis Mosāns posted on Mon, 13 Jul 2015 09:26:05 +0300 as excerpted:

> Short version: while doing scrub on 5 disk btrfs filesystem, /dev/sdd
> "failed" and also had some error on other disk (/dev/sdh)

You say five disks, but nowhere in your post do you mention what raid mode you were using, nor do you post btrfs filesystem show and btrfs filesystem df output, as suggested on the wiki, which would list that information.

FWIW, btrfs defaults for a multi-device filesystem are raid1 metadata, raid0 data. If you didn't specify a raid level at mkfs time, it's very likely that's what you're using. The scrub results seem to support this, as if the data had been raid1 or raid10, nearly all the errors should have been correctable by pulling from the second copy. And raid5/6 should have been able to recover from parity, tho this mode is new enough that it's still not recommended, as the chances of bugs and thus failure to work properly are much higher.

So you really should have been using raid1/10 if you wanted device failure tolerance, but you didn't say, and if you're using defaults as seems reasonably likely, your data was raid0, and thus it's likely many/most files are either gone or damaged beyond repair.

(As it happens, I have a number of btrfs raid1 data/metadata on a pair of partitioned ssds, with each btrfs on a corresponding partition on both of them, with one of the ssds developing bad sectors and basically slowly failing. But the other member of the raid1 pair is solid and I have backups, as well as a spare I can replace the failing one with when I decide it's time, so I've been letting the bad one stick around due as much as anything to morbid curiosity, watching it slowly fail. So I know exactly how scrub on btrfs raid1 behaves in a bad-sector case, pulling the copy from the good device to overwrite the bad copy with, triggering the device's sector remapping in the process.
Despite all the read errors, they've all been correctable, because I'm using raid1 for both data and metadata.)

> Because filesystem still mounts, I assume I should do "btrfs device
> delete /dev/sdd /mntpoint" and then restore damaged files from backup.

You can try a replace, but with a failing drive still connected, people report mixed results. It's likely to fail as it can't read certain blocks to transfer them to the new device.

With raid1 or better, physically disconnecting the failing device and doing a device delete missing (or replace missing, but AFAIK this doesn't work with released versions and I'm not sure if it's even in integration yet, but there are patches on-list that should make it work) can work. With raid0/single, you can mount with a missing device if you use degraded,ro, but obviously that'll only let you try to copy files off, and you'll likely not have a lot of luck with raid0, with files missing, but a bit more luck with single.

In the likely raid0/single case, your best bet is probably to try copying off what you can, and/or restoring from backups. See the discussion below.

> Are all affected files listed in journal? there's messages about "x
> callbacks suppressed" so I'm not sure and if there aren't how to get
> full list of damaged files?
> Also I wonder if there are any tools to recover partial file fragments
> and reconstruct file? (where missing fragments filled with nulls)
> I assume that there's no point in running "btrfs check
> --check-data-csum" because scrub already does check that?

There's no such partial-file with null-fill tools shipped just yet. Those files normally simply trigger errors trying to read them, because btrfs won't let you at them if the checksum doesn't verify.

There /is/, however, a command that can be used to either regenerate or zero-out the checksum tree. See btrfs check --init-csum-tree.
Current versions recalculate the csums, older versions (btrfsck as that was before btrfs check) simply zeroed it out. Then you can read the file despite bad checksums, tho you'll still get errors if the block physically cannot be read. There's also btrfs restore, which works on the unmounted filesystem without actually writing to it, copying the files it can read to a new location, which of course has to be a filesystem with enough room to restore the files to, altho it's possible to tell restore to do only specific subdirs, for instance. What I'd recommend depends on how complete and how recent your backup is. If it's complete and recent enough, probably the easiest thing is to simply blow away the bad filesystem and start over, recovering from the backup to a new filesystem. If there's files you'd like to get back that weren't backed up or where the backup is old, since the filesystem is mountable, I'd probably copy everything off it I could. Then, I'd try restore, letting it restore to the same location I had copied to, but NOT using the --overwrite option, so it only wrote any files it could restore that the copy wasn
Re: Can't remove missing device
On 10 July 2015 at 06:05, None None wrote: > According to dmesg sda returns bad data but the smart values for it seem fine. > # smartctl -a /dev/sda ... > SMART Self-test log structure revision number 1 > No self-tests have been logged. [To run self-tests, use: smartctl -t] Run smartctl -t long /dev/sda -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html