Re: can't read superblock (but could mount)
Quoting Chris Mason:

> On Fri, Feb 10, 2012 at 05:18:42PM -0500, Chris Mason wrote:
> > Ok, step one:
> >
> > Pull down the dangerdonteveruse branch of btrfs-progs:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git dangerdonteveruse
> >
> > Run btrfs-debug-tree -r /dev/sda1 and send the output here please.
>
> Sorry, that's btrfs-debug-tree -R /dev/sda1

# ./btrfs-debug-tree -R /dev/sda1
root tree: 10229936128 level 1
chunk tree: 10364125184 level 1
extent tree key (EXTENT_TREE ROOT_ITEM 0) 10229944320 level 3
device tree key (DEV_TREE ROOT_ITEM 0) 10192654336 level 0
fs tree key (FS_TREE ROOT_ITEM 0) 10103791616 level 3
checksum tree key (CSUM_TREE ROOT_ITEM 0) 10103156736 level 2
data reloc tree key (DATA_RELOC_TREE ROOT_ITEM 0) 10027970560 level 0
btrfs root backup slot 0
	tree root gen 157090 block 10229907456
	extent root gen 157090 block 10229911552
	chunk root gen 124887 block 10364125184
	device root gen 124887 block 10192654336
	csum root gen 157032 block 10103156736
	fs root gen 157032 block 10103791616
	15491051520 used 40020631552 total 1 devices
btrfs root backup slot 1
	tree root gen 157091 block 10229948416
	extent root gen 157091 block 10229952512
	chunk root gen 124887 block 10364125184
	device root gen 124887 block 10192654336
	csum root gen 157032 block 10103156736
	fs root gen 157032 block 10103791616
	15491051520 used 40020631552 total 1 devices
btrfs root backup slot 2
	tree root gen 157092 block 10229907456
	extent root gen 157092 block 10229911552
	chunk root gen 124887 block 10364125184
	device root gen 124887 block 10192654336
	csum root gen 157032 block 10103156736
	fs root gen 157032 block 10103791616
	15491051520 used 40020631552 total 1 devices
btrfs root backup slot 3
	tree root gen 157093 block 10229936128
	extent root gen 157093 block 10229944320
	chunk root gen 124887 block 10364125184
	device root gen 124887 block 10192654336
	csum root gen 157032 block 10103156736
	fs root gen 157032 block 10103791616
	15491051520 used 40020631552 total 1 devices
total bytes 40020631552
bytes used 15491051520
uuid 9e9886fc-3e60-4c59-a246-727662769ee2
Btrfs Btrfs v0.19

But why is it, then, that the bootloader apparently is able to load (at least) the kernel (and initrd) from the partition?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
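Given a `-R` dump like the one above, the backup slot worth trying first is the one with the highest tree root generation. A small helper of my own (an illustration, not part of btrfs-progs) that picks it out of the text:

```python
# Sketch: pick the backup root slot with the newest tree-root generation
# from `btrfs-debug-tree -R` output. The slot line layout is inferred
# from the dump posted in this thread.
import re

def newest_backup_slot(dump: str):
    """Return (slot, tree_root_gen, block) for the highest generation, or None."""
    best = None
    pattern = r"btrfs root backup slot (\d+)\s+tree root gen (\d+) block (\d+)"
    for m in re.finditer(pattern, dump):
        slot, gen, block = (int(g) for g in m.groups())
        if best is None or gen > best[1]:
            best = (slot, gen, block)
    return best

sample = """btrfs root backup slot 0 tree root gen 157090 block 10229907456
btrfs root backup slot 1 tree root gen 157091 block 10229948416
btrfs root backup slot 2 tree root gen 157092 block 10229907456
btrfs root backup slot 3 tree root gen 157093 block 10229936128"""

print(newest_backup_slot(sample))  # (3, 157093, 10229936128)
```

In the dump above, slot 3 carries the newest tree root (gen 157093), which matches the root tree block 10229936128 that the superblock itself points at.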
Re: btrfs-raid questions I couldn't find an answer to on the wiki
Phillip Susi posted on Fri, 10 Feb 2012 14:45:43 -0500 as excerpted:

> On 1/31/2012 12:55 AM, Duncan wrote:
>> Thanks! I'm on grub2 as well. It's still masked on gentoo, but I
>> recently unmasked and upgraded to it, taking advantage of the fact that
>> I have two two-spindle md/raid-1s for /boot and its backup to test and
>> upgrade one of them first, then the other only when I was satisfied
>> with the results on the first set. I'll be using a similar strategy
>> for the btrfs upgrades, only most of my md/raid-1s are 4-spindle, with
>> two sets, working and backup, and I'll upgrade one set first.
>
> Why do you want to have a separate /boot partition? Unless you can't
> boot without it, having one just makes things more complex/problematic.
> If you do have one, I agree that it is best to keep it ext4, not btrfs.

For a proper picture of the situation, understand that I don't have an initr*; I build everything I need into the kernel and have module loading disabled, and I keep /boot unmounted except when I'm actually installing an upgrade or reconfiguring.

Having a separate /boot means that I can keep it unmounted, and thus free from possible random corruption or accidental partial /boot tree overwrite or deletion, most of the time. It also means that I can emerge (build from sources using the gentoo ebuild script provided for the purpose, and install to the live system) a new grub without fear of corrupting what I actually boot from -- the grub system installation and the boot installation remain separate.

A separate /boot is also more robust in terms of filesystem corruption -- if something goes wrong with my rootfs, I can simply boot its backup, from a separate /boot that will not have been corrupted. Similarly, if something goes wrong with /boot (or the BIOS partition), I can switch drives in the BIOS and boot from the backup /boot, then load my usual rootfs.

Since I'm working with four drives, and both the working /boot and backup /boot are two-spindle md/raid1, one on one pair, one on the other, I have both hardware redundancy via the second spindle of the raid1, and admin-fatfinger redundancy via the backup. The rootfs and its backup, however, are both on quad-spindle md/raid1s, thus giving me four separate physical copies each of rootfs and its backup. Because the disk points at a single bootloader, if /boot were on rootfs, all four copies would point to either the working rootfs or the backup rootfs, and would update together, so I'd lose the ability to fall back to the backup /boot.

(Note that I developed the backup /boot policy and solution back on legacy-grub. Grub2 is rather more flexible, particularly with a reasonably roomy GPT BIOS partition, and since each BIOS partition is installed individually, in theory, if a grub2 update failed, I could point the BIOS at a disk I hadn't installed the BIOS partition update to yet, boot to the limited grub rescue-mode shell, and point it at the /boot in the backup rootfs to load the normal-mode shell, menu, and additional grub2 modules as necessary. However, being able to access a full normal-mode-shell grub2 on the backup /boot, instead of having to resort to the grub2 rescue-mode shell to reach the backup rootfs, does have its benefits.)

One of the nice things about grub2 normal mode is that it allows (directory and plain-text-file) browsing of pretty much anything it has a module for, anywhere on the system. That's a nice thing to be able to do, but it too is much more robust if /boot isn't part of rootfs, and thus isn't likely to be damaged if the rootfs is. The ability to boot to grub2 and retrieve vital information (even if limited to plain-text file storage) from a system without a working rootfs is a very nice ability to have!

So you see, a separate /boot really does have its uses. =:^)

>> Meanwhile, you're right about subvolumes. I'd not try them on a btrfs
>> /boot, either. (I don't really see the use case for it, for a separate
>> /boot, tho there's certainly a case for a /boot subvolume on a btrfs
>> root, for people doing that.)
>
> The Ubuntu installer creates two subvolumes by default when you install
> on btrfs: one named @, mounted on /, and one named @home, mounted on
> /home. Grub2 handles this well since the subvols have names in the
> default root, so grub just refers to /@/boot instead of /boot, and so
> on. The apt-btrfs-snapshot package makes apt automatically snapshot the
> root subvol so you can revert after an upgrade. This seamlessly causes
> grub to go back to the old boot menu without the new kernels too, since
> it goes back to reading the old grub.cfg in the reverted root subvol.

Thanks for that "real world" example. Subvolumes and particularly snapshots can indeed be quite useful, but I'd be rather leery of having all that on the same master filesystem. Lose it and you've lost everything, snapshots or no snapshots, if there aren't bootable backups somewhere.
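For readers unfamiliar with the @/@home scheme discussed above, it boils down to subvol= mount options; an illustrative /etc/fstab fragment (the UUID here is a placeholder, not from the thread):

```
# Ubuntu-style btrfs subvolume layout (illustrative)
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /      btrfs  defaults,subvol=@      0  1
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /home  btrfs  defaults,subvol=@home  0  2
```

Both entries reference the same filesystem; only the subvol= option selects which tree appears at each mount point, which is why a snapshot-and-revert of @ also reverts the grub.cfg that grub reads from /@/boot.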
Hot data Tracking
What happened to the hot data tracking feature in btrfs? There are a lot of old patches from Aug 2010, but it looks like the feature has been completely removed from the current version of btrfs. Is this feature still on the roadmap?
Re: can't read superblock (but could mount)
On Fri, Feb 10, 2012 at 05:18:42PM -0500, Chris Mason wrote:
> On Fri, Feb 10, 2012 at 08:30:51PM +0100, bt...@nentwig.biz wrote:
> > Hi!
> >
> > I used to have arch linux running on 1 btrfs partition (sda1, incl. /boot).
> > When switching to 3.2.5 recently the system fails to boot:
> >
> > (after udevd)
> > /etc/rc.sysinit: line 15: 117 Bus error mountpoint -q /proc
> > and so on, no idea.
> >
> > It used to boot with 3.2.4, but
> >
> > 1) I obviously had some corruption in the tree; when I tried to delete a
> > certain file I hit e.g. a "kernel BUG at fs/btrfs/extent-tree.c" message.
> >
> > 2) Even while running 3.2.4 I was unable to mount the partition from a
> > parallel gentoo or live USB install, and I still am:
> >
> > # mount /dev/sda1 /mnt/arch/
> > mount: /dev/sda1: can't read superblock
> >
> > The strange thing is: when trying to boot from the partition, the
> > boot loader (syslinux) is obviously still able to load the kernel
> > from that partition.
> >
> > Tried btrfs-zero-log and some other desperate things. Result: I can
> > now actually execute btrfsck, which previously used to fail:
>
> Ok, step one:
>
> Pull down the dangerdonteveruse branch of btrfs-progs:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git dangerdonteveruse
>
> Run btrfs-debug-tree -r /dev/sda1 and send the output here please.

Sorry, that's btrfs-debug-tree -R /dev/sda1

-chris
Re: subvolume info in /proc/mounts
> On Sat, Feb 04, 2012 at 08:40:51PM +0200, Nikos Voutsinas wrote:
>> There was a patch to include the subvolume mount option into
>> /proc/mounts.
>> Did that make it into the kernel?
>> If not, what is the formal way to find out which subvolume is mounted?
>
> See /proc/self/mountinfo -- there is an 'fs root' column; for bind
> mounts and btrfs subvolumes the subvolume path is there ('/' for
> normal mounts). findmnt(8) shows the path in the SOURCE column, for
> example /dev/sda1[/subvolume].

Thank you Karel, at least the list now has an answer on how to find the mounted subvolume. For a moment I thought that no one had noticed this part of the question. In any production scenario, the first time you try to explore btrfs subvolume capabilities it becomes obvious that this feature is missing, and it is strange that it has been overlooked for such a long time.

Nikos
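Karel's /proc/self/mountinfo suggestion is easy to script. This sketch of mine (not an official tool) pulls the "fs root" field, which is field 4 of each mountinfo line per proc(5), and maps mount points to the subvolume (or bind-mount root) backing them:

```python
# Sketch: map mount points to their 'fs root' (btrfs subvolume path or
# bind-mount root) by parsing /proc/self/mountinfo-style text.
# mountinfo fields: id parent major:minor root mountpoint options ...

def mount_roots(mountinfo_text: str) -> dict:
    roots = {}
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # fields[3] is the root within the filesystem, fields[4] the mount point
        roots[fields[4]] = fields[3]
    return roots

# Hypothetical mountinfo lines for an Ubuntu-style @/@home layout:
sample = ("31 1 0:26 /@ / rw,relatime shared:1 - btrfs /dev/sda1 rw\n"
          "42 31 0:26 /@home /home rw,relatime shared:2 - btrfs /dev/sda1 rw")
print(mount_roots(sample))  # {'/': '/@', '/home': '/@home'}
```

On a live system you would pass `open("/proc/self/mountinfo").read()` instead of the sample text.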
Re: can't read superblock (but could mount)
On Fri, Feb 10, 2012 at 08:30:51PM +0100, bt...@nentwig.biz wrote:
> Hi!
>
> I used to have arch linux running on 1 btrfs partition (sda1, incl. /boot).
> When switching to 3.2.5 recently the system fails to boot:
>
> (after udevd)
> /etc/rc.sysinit: line 15: 117 Bus error mountpoint -q /proc
> and so on, no idea.
>
> It used to boot with 3.2.4, but
>
> 1) I obviously had some corruption in the tree; when I tried to delete a
> certain file I hit e.g. a "kernel BUG at fs/btrfs/extent-tree.c" message.
>
> 2) Even while running 3.2.4 I was unable to mount the partition from a
> parallel gentoo or live USB install, and I still am:
>
> # mount /dev/sda1 /mnt/arch/
> mount: /dev/sda1: can't read superblock
>
> The strange thing is: when trying to boot from the partition, the
> boot loader (syslinux) is obviously still able to load the kernel
> from that partition.
>
> Tried btrfs-zero-log and some other desperate things. Result: I can
> now actually execute btrfsck, which previously used to fail:

Ok, step one:

Pull down the dangerdonteveruse branch of btrfs-progs:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git dangerdonteveruse

Run btrfs-debug-tree -r /dev/sda1 and send the output here please.

This block with the bad transid is from your FS root. We'll need to find a root that matches. But we should be able to patch things up!

-chris
Re: btrfs unmountable after failed suspend
On Thu, Feb 09, 2012 at 06:54:42PM -0600, Chester wrote:
> Output from btrfs-debug-tree for that specific block in dmesg,
> also available here ( http://pastebin.com/AgdvS5JM ) in case the
> lines wrap and look ugly:
>
> leaf 653297209344 items 43 free space 673 generation 332442 owner 2
> fs uuid 0f5b2f4f-1aa0-4e6f-b904-e5b4d4588144
> chunk uuid 27536f0d-993b-4da3-85eb-1c9b08c435cb
> 	item 7 key (653284831232 EXTENT_ITEM 4096) itemoff 3551 itemsize 51
> 		extent refs 1 gen 332218 flags 2
> 		tree block key (654686961664 a8 4096) level 0
> 		tree block backref root 2
> 	item 8 key (653284818944 EXTENT_ITEM 4096) itemoff 3500 itemsize 51
> 		extent refs 1 gen 332218 flags 2
> 		tree block key (654687121408 a8 4096) level 0
> 		tree block backref root 2

Ok, it's worth pointing out that you're just one bit away from proper ordering here. While I'm testing out this code to fix key ordering, could you please run memtest86 on your machine? The other fields all look correct.

-chris
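Chris's diagnosis (nearly ordered keys, so run memtest86) rests on the idea that a flipped RAM bit can turn an in-order key into an out-of-order one. A quick way to check whether two values differ in a single bit is to XOR them and count the set bits; this small helper is my own illustration, not btrfs-progs code:

```python
# Sketch: count the bit positions in which two integers differ.
# A result of 1 is the signature of a single-bit memory flip.

def differing_bits(a: int, b: int) -> int:
    """Number of bit positions where a and b differ (popcount of a XOR b)."""
    return bin(a ^ b).count("1")

print(differing_bits(0b1000, 0b0000))  # 1 -- a true single-bit flip
print(differing_bits(0b1010, 0b0110))  # 2 -- two bits differ
print(differing_bits(42, 42))          # 0 -- identical values
```

Running such a check over the byte offsets of adjacent out-of-order keys in a dump is a cheap way to decide whether bad RAM is a plausible culprit before reaching for repair tools.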
[PATCH] btrfs: honor umask when creating subvol root
Set the subvol root inode permissions based on the current umask.
---
 fs/btrfs/inode.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..b88e71a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6696,8 +6696,10 @@ int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
 	int err;
 	u64 index = 0;
 
-	inode = btrfs_new_inode(trans, new_root, NULL, "..", 2, new_dirid,
-				new_dirid, S_IFDIR | 0700, &index);
+	inode = btrfs_new_inode(trans, new_root, NULL, "..", 2,
+				new_dirid, new_dirid,
+				S_IFDIR | (~current_umask() & S_IRWXUGO),
+				&index);
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 	inode->i_op = &btrfs_dir_inode_operations;
-- 
1.7.8.4
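The mode computation in the patch can be sanity-checked in userspace. This Python snippet mirrors the patched expression `S_IFDIR | (~current_umask() & S_IRWXUGO)` (an illustration of the arithmetic only, not the kernel code):

```python
# Sketch: the subvol-root mode computed from a given umask, mirroring
# S_IFDIR | (~current_umask() & S_IRWXUGO) from the patch above.
import stat

def subvol_root_mode(umask: int) -> int:
    S_IRWXUGO = 0o777  # rwx bits for user, group, and other
    return stat.S_IFDIR | (~umask & S_IRWXUGO)

# With the common umask 022 the subvol root becomes 0755 (drwxr-xr-x)
# instead of the hard-coded 0700 the old code used.
print(oct(subvol_root_mode(0o022) & 0o777))  # 0o755
print(oct(subvol_root_mode(0o077) & 0o777))  # 0o700
```

So a user with the default umask gets a world-readable subvolume root, while a restrictive umask still yields 0700, matching the old behavior.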
Re: [LSF/MM TOPIC] COWing writeback pages
On Fri, Feb 10, 2012 at 12:49:50PM -0800, Sage Weil wrote:
> On Fri, 10 Feb 2012, Josef Bacik wrote:
> > On Fri, Feb 10, 2012 at 11:25:27AM -0800, Sage Weil wrote:
> > > Hi everyone,
> > >
> > > The takeaway from the 'stable pages' discussions in the last few workshops
> > > was that pages under writeback should remain locked so that subsequent
> > > writers don't touch them while they are en route to the disk. This
> > > prevents bad checksums and DIF/DIX type failures (whereas previously we
> > > didn't really care whether old or new data reached the disk).
> > >
> > > The fear is/was that anyone subsequently modifying the page will have to
> > > wait for writeback io to complete before continuing. I seem to remember
> > > somebody (Martin?) saying that in practice, under "real" workloads, that
> > > doesn't actually happen, so don't worry about it. (Does anyone remember
> > > the details of what testing led to that conclusion?)
> > >
> > > Anyway, we are seeing what looks like an analogous problem with btrfs,
> > > where operations sometimes block waiting for writeback of the btree pages.
> > > Although the 'keep rewriting the same page' pattern may not be prevalent
> > > in normal file workloads, it does seem to happen with the btrfs btree.
> > >
> > > The obvious solution seems to be to COW the page if it is under writeback
> > > and we want to remodify it. Presumably that can be done just in btrfs, to
> > > address the btrfs-specific symptoms we're hitting, but I'm interested in
> > > hearing from other folks about whether it's more generally useful VM
> > > functionality for other filesystems and other workloads.
> > >
> > > Unfortunately, we haven't been able to pinpoint the exact scenarios under
> > > which this triggers under btrfs. We regularly see long stalls for
> > > metadata operations (create() and similar metadata-only operations) that
> > > block after btrfs_commit_transaction has "finished" the previous
> > > transaction and is doing
> > >
> > > 	return filemap_write_and_wait(btree_inode->i_mapping);
> > >
> > > What we're less clear about is when btrfs will modify the in-memory page
> > > in place (and thus wait) versus COWing the page... still digging into
> > > this now.
> >
> > Heh so I'm working on this now, specifically in the heavy create()
> > workload, and I've just about got it nailed down. A lot of this problem
> > is because we rely on normal pagecache for our metadata, so I'm copying
> > xfs and creating our own caching.
> >
> > The thing is, since we have an inode hanging out with normal pagecache
> > pages we can have multiple people trying to write out dirty pages in our
> > inode at the same time, and since it goes through our normal write path
> > we'll end up in this case where we're waiting on writeback for pages we
> > won't actually end up writing out. My code will fix this, if we're
> > talking about the same problem ;).
>
> Oh, I hadn't thought of that... that sounds like a similar but slightly
> different problem, since it probably wouldn't correlate with the
> filemap_write_and_wait(). As long as we don't have a btree update waiting
> on btree writeback, though, both problems should be addressed.

Oh yeah, that problem is taken care of. IO is completely separate from updating: we set the BUF_WRITTEN flag in the header right before writing out, so the thing will be COW'ed if anybody tries to modify it while it's in flight; they won't have to wait or anything. Of course, now that I think about it, that's what should be happening today anyway, so I'm confused about what you are seeing.

> In any case, we're definitely interested in checking out the code when
> it's ready to share!

Well, I've been committing my progress to my git tree so you can check it out, but what's there won't work at all. What I have (and will commit shortly) works pretty well, provided you don't do anything with the tree-log; for some reason I'm screwing something up there and it's crashing :).

Thanks,

Josef
Re: [LSF/MM TOPIC] COWing writeback pages
On Fri, 10 Feb 2012, Josef Bacik wrote:
> On Fri, Feb 10, 2012 at 11:25:27AM -0800, Sage Weil wrote:
> > Hi everyone,
> >
> > The takeaway from the 'stable pages' discussions in the last few workshops
> > was that pages under writeback should remain locked so that subsequent
> > writers don't touch them while they are en route to the disk. This
> > prevents bad checksums and DIF/DIX type failures (whereas previously we
> > didn't really care whether old or new data reached the disk).
> >
> > The fear is/was that anyone subsequently modifying the page will have to
> > wait for writeback io to complete before continuing. I seem to remember
> > somebody (Martin?) saying that in practice, under "real" workloads, that
> > doesn't actually happen, so don't worry about it. (Does anyone remember
> > the details of what testing led to that conclusion?)
> >
> > Anyway, we are seeing what looks like an analogous problem with btrfs,
> > where operations sometimes block waiting for writeback of the btree pages.
> > Although the 'keep rewriting the same page' pattern may not be prevalent
> > in normal file workloads, it does seem to happen with the btrfs btree.
> >
> > The obvious solution seems to be to COW the page if it is under writeback
> > and we want to remodify it. Presumably that can be done just in btrfs, to
> > address the btrfs-specific symptoms we're hitting, but I'm interested in
> > hearing from other folks about whether it's more generally useful VM
> > functionality for other filesystems and other workloads.
> >
> > Unfortunately, we haven't been able to pinpoint the exact scenarios under
> > which this triggers under btrfs. We regularly see long stalls for
> > metadata operations (create() and similar metadata-only operations) that
> > block after btrfs_commit_transaction has "finished" the previous
> > transaction and is doing
> >
> > 	return filemap_write_and_wait(btree_inode->i_mapping);
> >
> > What we're less clear about is when btrfs will modify the in-memory page
> > in place (and thus wait) versus COWing the page... still digging into
> > this now.
>
> Heh so I'm working on this now, specifically in the heavy create()
> workload, and I've just about got it nailed down. A lot of this problem
> is because we rely on normal pagecache for our metadata so I'm copying
> xfs and creating our own caching.
>
> The thing is since we have an inode hanging out with normal pagecache
> pages we can have multiple people trying to write out dirty pages in our
> inode at the same time, and since it goes through our normal write path
> we'll end up in this case where we're waiting on writeback for pages we
> won't actually end up writing out. My code will fix this, if we're
> talking about the same problem ;).

Oh, I hadn't thought of that... that sounds like a similar but slightly different problem, since it probably wouldn't correlate with the filemap_write_and_wait(). As long as we don't have a btree update waiting on btree writeback, though, both problems should be addressed.

In any case, we're definitely interested in checking out the code when it's ready to share!

sage
Re: [LSF/MM TOPIC] COWing writeback pages
On Fri, Feb 10, 2012 at 11:25:27AM -0800, Sage Weil wrote:
> Hi everyone,
>
> The takeaway from the 'stable pages' discussions in the last few workshops
> was that pages under writeback should remain locked so that subsequent
> writers don't touch them while they are en route to the disk. This
> prevents bad checksums and DIF/DIX type failures (whereas previously we
> didn't really care whether old or new data reached the disk).
>
> The fear is/was that anyone subsequently modifying the page will have to
> wait for writeback io to complete before continuing. I seem to remember
> somebody (Martin?) saying that in practice, under "real" workloads, that
> doesn't actually happen, so don't worry about it. (Does anyone remember
> the details of what testing led to that conclusion?)
>
> Anyway, we are seeing what looks like an analogous problem with btrfs,
> where operations sometimes block waiting for writeback of the btree pages.
> Although the 'keep rewriting the same page' pattern may not be prevalent
> in normal file workloads, it does seem to happen with the btrfs btree.
>
> The obvious solution seems to be to COW the page if it is under writeback
> and we want to remodify it. Presumably that can be done just in btrfs, to
> address the btrfs-specific symptoms we're hitting, but I'm interested in
> hearing from other folks about whether it's more generally useful VM
> functionality for other filesystems and other workloads.
>
> Unfortunately, we haven't been able to pinpoint the exact scenarios under
> which this triggers under btrfs. We regularly see long stalls for
> metadata operations (create() and similar metadata-only operations) that
> block after btrfs_commit_transaction has "finished" the previous
> transaction and is doing
>
> 	return filemap_write_and_wait(btree_inode->i_mapping);
>
> What we're less clear about is when btrfs will modify the in-memory page
> in place (and thus wait) versus COWing the page... still digging into
> this now.

Heh so I'm working on this now, specifically in the heavy create() workload, and I've just about got it nailed down. A lot of this problem is because we rely on normal pagecache for our metadata, so I'm copying xfs and creating our own caching.

The thing is, since we have an inode hanging out with normal pagecache pages we can have multiple people trying to write out dirty pages in our inode at the same time, and since it goes through our normal write path we'll end up in this case where we're waiting on writeback for pages we won't actually end up writing out. My code will fix this, if we're talking about the same problem ;).

Thanks,

Josef
Re: btrfs-raid questions I couldn't find an answer to on the wiki
On 1/31/2012 12:55 AM, Duncan wrote:
> Thanks! I'm on grub2 as well. It's still masked on gentoo, but
> I recently unmasked and upgraded to it, taking advantage of the
> fact that I have two two-spindle md/raid-1s for /boot and its
> backup to test and upgrade one of them first, then the other only
> when I was satisfied with the results on the first set. I'll be
> using a similar strategy for the btrfs upgrades, only most of my
> md/raid-1s are 4-spindle, with two sets, working and backup, and
> I'll upgrade one set first.

Why do you want to have a separate /boot partition? Unless you can't boot without it, having one just makes things more complex/problematic. If you do have one, I agree that it is best to keep it ext4, not btrfs.

> Meanwhile, you're right about subvolumes. I'd not try them on a
> btrfs /boot, either. (I don't really see the use case for it, for
> a separate /boot, tho there's certainly a case for a /boot
> subvolume on a btrfs root, for people doing that.)

The Ubuntu installer creates two subvolumes by default when you install on btrfs: one named @, mounted on /, and one named @home, mounted on /home. Grub2 handles this well: since the subvols have names in the default root, grub just refers to /@/boot instead of /boot, and so on. The apt-btrfs-snapshot package makes apt automatically snapshot the root subvol so you can revert after an upgrade. This seamlessly causes grub to go back to the old boot menu, without the new kernels, too, since it goes back to reading the old grub.cfg in the reverted root subvol.

I have a radically different suggestion you might consider rebuilding your system with. Partition each disk into only two partitions: one for bios_grub, and one for everything else (or just use MBR and skip the bios_grub partition). Give the second partitions to mdadm to make a raid10 array out of. If you use a 2x far or 2x offset layout instead of the default near layout, you will have an array that can still handle any 2 of the 4 drives failing, will have twice the capacity of a 4-way mirror, almost the same sequential read throughput as a 4-way raid0, and about twice the write throughput of a 4-way mirror. Partition that array up and put your filesystems on it.
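The capacity claim in the raid10 suggestion above is simple arithmetic: md raid10 stores a fixed number of copies of each chunk, regardless of near/far/offset layout (the layout affects performance, not capacity). A tiny sketch of my own for checking such layouts:

```python
# Sketch: usable capacity of an md raid10-style layout that keeps
# `copies` replicas of every chunk across `drives` equal-size drives.

def usable_capacity(drives: int, drive_size_gb: float, copies: int) -> float:
    """Total usable space in GB; the rest is consumed by the extra copies."""
    return drives * drive_size_gb / copies

# Four hypothetical 1000 GB drives:
print(usable_capacity(4, 1000, copies=2))  # 2000.0 -- raid10 with 2 copies
print(usable_capacity(4, 1000, copies=4))  # 1000.0 -- a 4-way mirror
```

This matches the email's claim that the 2-copy raid10 yields twice the capacity of a 4-way mirror on the same four drives.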
can't read superblock (but could mount)
Hi!

I used to have arch linux running on 1 btrfs partition (sda1, incl. /boot). When switching to 3.2.5 recently the system fails to boot:

(after udevd)
/etc/rc.sysinit: line 15: 117 Bus error mountpoint -q /proc
and so on, no idea.

It used to boot with 3.2.4, but

1) I obviously had some corruption in the tree; when I tried to delete a certain file I hit e.g. a "kernel BUG at fs/btrfs/extent-tree.c" message.

2) Even while running 3.2.4 I was unable to mount the partition from a parallel gentoo or live USB install, and I still am:

# mount /dev/sda1 /mnt/arch/
mount: /dev/sda1: can't read superblock

The strange thing is: when trying to boot from the partition, the boot loader (syslinux) is obviously still able to load the kernel from that partition.

Tried btrfs-zero-log and some other desperate things. Result: I can now actually execute btrfsck, which previously used to fail:

# btrfsck /dev/sda1
Extent back ref already exists for 9872289792 parent 0 root 5
leaf parent key incorrect 9872289792
bad block 9872289792
ref mismatch on [9872289792 4096] extent item 1, found 2
Incorrect global backref count on 9872289792 found 1 wanted 2
backpointer mismatch on [9872289792 4096]
owner ref check failed [9872289792 4096]
ref mismatch on [9889067008 4096] extent item 1, found 0
Backref 9889067008 root 5 not referenced
Incorrect global backref count on 9889067008 found 1 wanted 0
backpointer mismatch on [9889067008 4096]
owner ref check failed [9889067008 4096]
ref mismatch on [37163102208 65536] extent item 1, found 0
Incorrect local backref count on 37163102208 root 5 owner 3360937 offset 0 found 0 wanted 1
backpointer mismatch on [37163102208 65536]
owner ref check failed [37163102208 65536]
ref mismatch on [37163814912 36864] extent item 0, found 1
Backref 37163814912 root 5 owner 3360939 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 37163814912 root 5 owner 3360939 offset 0 found 1 wanted 0
backpointer mismatch on [37163814912 36864]
found 15491117056 bytes used err is 1
total csum bytes: 14764500
total tree bytes: 366956544
total fs tree bytes: 317714432
btree space waste bytes: 90628933
file data blocks allocated: 16182484992
 referenced 17813028864
Btrfs Btrfs v0.19-dirty

However this doesn't seem to fix anything; I can run it over and over again with the same output.

btrfs-show does recognize the partition...

# btrfs-show /dev/sda1
**
** WARNING: this program is considered deprecated
** Please consider to switch to the btrfs utility
**
failed to read /dev/sde: No medium found
Label: none  uuid: 9e9886fc-3e60-4c59-a246-727662769ee2
	Total devices 1 FS bytes used 14.43GB
	devid    1 size 37.27GB used 34.52GB path /dev/sda1
Btrfs Btrfs v0.19-dirty

...while device scan does not:

# btrfs device scan /dev/sda1
Scanning for Btrfs filesystems in '/dev/sda1'

Finally, dmesg after a mount attempt:

[88124.390308] Btrfs detected SSD devices, enabling SSD mode
[88124.392354] parent transid verify failed on 9872289792 wanted 152893 found 120351
[88124.392357] parent transid verify failed on 9872289792 wanted 152893 found 120351
[88124.392359] parent transid verify failed on 9872289792 wanted 152893 found 120351
[88124.392360] parent transid verify failed on 9872289792 wanted 152893 found 120351
[88124.392361] parent transid verify failed on 9872289792 wanted 152893 found 120351
[88124.392370] BTRFS: inode 3392566 still on the orphan list
[88124.392372] btrfs: could not do orphan cleanup -5
[88124.688187] btrfs: open_ctree failed

Any chance to rescue the data?

thx
tcn
[LSF/MM TOPIC] COWing writeback pages
Hi everyone, The takeaway from the 'stable pages' discussions in the last few workshops was that pages under writeback should remain locked so that subsequent writers don't touch them while they are en route to the disk. This prevents bad checksums and DIF/DIX type failures (whereas previously we didn't really care whether old or new data reached the disk). The fear is/was that anyone subsequently modifying the page will have to wait for writeback io to complete before continuing. I seem to remember somebody (Martin?) saying that in practice, under "real" workloads, that doesn't actually happen, so don't worry about it. (Does anyone remember the details of what testing led to that conclusion?) Anyway, we are seeing what looks like an analogous problem with btrfs, where operations sometimes block waiting for writeback of the btree pages. Although the 'keep rewriting the same page' pattern may not be prevalent in normal file workloads, it does seem to happen with the btrfs btree. The obvious solution seems to be to COW the page if it is under writeback and we want to remodify it. Presumably that can be done just in btrfs, to address the btrfs-specific symptoms we're hitting, but I'm interested in hearing from other folks about whether it's more generally useful VM functionality for other filesystems and other workloads. Unfortunately, we haven't been able to pinpoint the exact scenarios under which this triggers under btrfs. We regularly see long stalls for metadata operations (create() and similar metadata-only operations) that block after btrfs_commit_transaction has "finished" the previous transaction and is doing return filemap_write_and_wait(btree_inode->i_mapping); What we're less clear about is when btrfs will modify the in-memory page in place (and thus wait) versus COWing the page... still digging into this now. 
It seems like there is a btrfs-specific question about exactly what is going on and why, which isn't super-relevant for LSF/MM (except that we'll all be there). However, my suspicion is that the solution will be generally applicable to other filesystems, and that the tests that led us to believe that "normal" workloads aren't affected by locked writeback pages would inform which path to take in solving our specific btrfs problem.

sage
Re: Packed small files
On 1/31/2012 11:46 AM, Hugo Mills wrote:
> So you're looking at a minimum of 413 bytes of metadata overhead
> for an inline file, plus the length of the filename.
>
> Also note that the file is stored in the metadata, so by default
> it's stored with DUP or RAID-1 replication (even if data is set to
> be "single"). This means that you'll actually use up twice this
> amount of space on the disks, unless you create the FS with
> metadata set to "single".
>
> I don't know how these figures compare with other filesystems. My
> entirely uneducated guess is that they're probably comparable,
> with the exception of the DUP effect.

On ext4 you are looking at 256 bytes for the inode, name length + a few bytes for the directory entry, another few bytes for the hashed directory entry, and a whole 4k block to hold the data, so ~4300 bytes (+ name length) of overhead to store a 64-byte file.
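The back-of-envelope arithmetic above can be written down explicitly. This is a rough model built only from the figures quoted in this thread; the function names and the exact dirent byte counts are illustrative assumptions, not filesystem API.

```c
#include <assert.h>

/* ~413 bytes of btrfs metadata per inline file plus the name, doubled
 * when metadata is DUP/RAID-1 (the default), per Hugo's figures. */
static long btrfs_inline_overhead(long name_len, int dup_metadata)
{
    long per_copy = 413 + name_len;
    return dup_metadata ? 2 * per_copy : per_copy;
}

/* ext4 per Phillip's figures: 256-byte inode, name + ~8 bytes for the
 * dirent, ~8 bytes for the hashed entry, and one whole 4k data block. */
static long ext4_small_file_overhead(long name_len)
{
    return 256 + (name_len + 8) + 8 + 4096;
}
```

Even with DUP doubling, the btrfs inline case comes out well under the ext4 figure for tiny files, because ext4 always burns a full 4k block for the data.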
Re: btrfs support for efficient SSD operation (data blocks alignment)
Hi Martin,

On Wednesday, 8 February 2012, Martin wrote:
> My understanding is that for x86 architecture systems, btrfs only
> allows a sector size of 4kB for a HDD/SSD. That is fine for the
> present HDDs assuming the partitions are aligned to a 4kB boundary for
> that device.
>
> However for SSDs...
>
> I'm using for example a 60GByte SSD that has:
>
> 8kB page size;
> 16kB logical to physical mapping chunk size;
> 2MB erase block size;
> 64MB cache.
>
> And the sector size reported to Linux 3.0 is the default 512 bytes!
>
> My first thought is to try formatting with a sector size of 16kB to
> align with the SSD logical mapping chunk size. This is to avoid SSD
> write amplification. Also, the data transfer performance for that
> device is near maximum for writes with a blocksize of 16kB and above.
> Yet, btrfs supports a 4kByte page/sector size only at present...

The thing is, as far as I know the better SSDs, and even the dumber ones, have quite some intelligence in the firmware. And at least for me it's not clear what the firmware of my Intel SSD 320 does on its own, and whether any of my optimization attempts even matter.

So I am not sure whether reasoning about a single write operation of, say, 4 KB or 2 KB in isolation even makes sense. I bet often several processes write data at once, so there is a larger amount of data to write. What is not clear to me is whether the SSD will combine several write requests into a single mapping chunk or erase block, or combine them into the already-erased space of an erase block. I would bet at least the better SSDs do. So even when, from the OS point of view, in a simplistic example, one write of 1 MB goes to LBA 4 and one write of 1 MB goes to LBA 8, the SSD might still use a single erase block and place the writes next to each other. As far as I understand, SSDs do COW to spread writes evenly across erase blocks.
As far as I further understand, from a seek-time point of view the exact location where a write request is placed does not matter at all. So to me it looks perfectly sane for an SSD firmware to combine writes as it sees fit. And SSDs that carry capacitors, like the above-mentioned Intel SSD, may even cache writes for a while to wait for further requests. The article on write amplification on Wikipedia gives me a glimpse of the complexity involved¹.

Yes, I set stripe-width as well on my Ext4 filesystem, but frankly I am not even sure whether this has any positive effect, except maybe sparing the SSD controller firmware some reshuffling work.

So from my current point of view, most of what you wrote IMHO is more important for really dumb flash - the kind some kernel developers would really like to see, so that most of the logic could be put into the kernel and be easily modifiable: JBOF - just a bunch of flash cells with an interface to access them directly. But for now, AFAIK, most consumer-grade SSDs just provide a SATA interface and hide the internals. So an optimization for one kind or one brand of SSDs may not be suitable for another one. There are PCI Express models, but these probably aren't dumb either. And then there is the idea of auto commit memory (ACM) by Fusion-io, which just makes a part of the virtual address space persistent.

So it's a question of where to put the intelligence. For current SSDs it seems the intelligence is really near the storage medium, and then IMHO it makes sense to even reduce the intelligence on the Linux side.

[1] http://en.wikipedia.org/wiki/Write_amplification

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
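One way to make the alignment argument in this thread concrete is to count how many logical-to-physical mapping chunks (or erase blocks) a single write touches: a misaligned write that straddles a boundary forces the firmware to read-modify-write two units instead of one. A small illustrative helper, using the sizes from the 60 GB SSD example above (the function name is made up):

```c
#include <assert.h>

/* Number of fixed-size units (mapping chunks, erase blocks, ...) that a
 * write of `len` bytes starting at byte `offset` overlaps. */
static long chunks_touched(long offset, long len, long chunk)
{
    long first = offset / chunk;
    long last  = (offset + len - 1) / chunk;
    return last - first + 1;
}
```

With 16 kB mapping chunks, a 4 kB write that is aligned stays within one chunk, while the same write placed across a chunk boundary dirties two; the same reasoning scales up to 2 MB erase blocks.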
Re: [PATCH 0/4] Generic O_SYNC AIO DIO handling
Jan Kara writes:
> Hi Jeff,
>
> these patches implement a generic way of handling O_SYNC AIO DIO. They work
> for all filesystems except for ext4 and xfs. Thus together with your patches,
> all filesystems should handle O_SYNC AIO DIO correctly. I've tested ext3,
> btrfs, and xfs (to check that I didn't break anything when the generic code
> is unused) and things seem to work fine. Will you add these patches to your
> series please? Thanks.

Thanks, Jan! I'll add them in and give them some testing. I should be ready to repost the series early next week.

Cheers,
Jeff
[PATCH 3/4] gfs2: Use generic handlers of O_SYNC AIO DIO
Use generic handlers to queue fsync() when AIO DIO is completed for an O_SYNC file.

Signed-off-by: Jan Kara
---
 fs/gfs2/aops.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 501e5cb..9c381ff 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -1034,7 +1034,7 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb,
 	rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
 				  offset, nr_segs, gfs2_get_block_direct,
-				  NULL, NULL, 0);
+				  NULL, NULL, DIO_SYNC_WRITES);
 out:
 	gfs2_glock_dq_m(1, &gh);
 	gfs2_holder_uninit(&gh);
-- 
1.7.1
[PATCH 1/4] vfs: Handle O_SYNC AIO DIO in generic code properly
Provide VFS helpers for handling O_SYNC AIO DIO writes. Filesystems wanting to use the helpers have to pass DIO_SYNC_WRITES to __blockdev_direct_IO. Then, if they don't use a direct IO end_io handler, generic code takes care of everything else. Otherwise their end_io handler is passed a struct dio_sync_io_work pointer as the 'private' argument, and they have to call generic_dio_end_io() to finish their AIO DIO. Generic code then takes care to call generic_write_sync() from a workqueue context when the AIO DIO is completed.

Since all filesystems using blockdev_direct_IO() need O_SYNC AIO DIO handling and the generic one is enough for them, make blockdev_direct_IO() pass the DIO_SYNC_WRITES flag.

Signed-off-by: Jan Kara
---
 fs/direct-io.c     |  128 ++--
 fs/super.c         |    2 +
 include/linux/fs.h |   13 +-
 3 files changed, 138 insertions(+), 5 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 4a588db..79aa531 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -38,6 +38,8 @@
 #include
 #include
+#include
+
 /*
  * How many user pages to map in one call to get_user_pages(). This determines
  * the size of a structure in the slab cache
@@ -112,6 +114,15 @@ struct dio_submit {
 	unsigned tail;			/* last valid page + 1 */
 };
 
+/* state needed for final sync and completion of O_SYNC AIO DIO */
+struct dio_sync_io_work {
+	struct kiocb *iocb;
+	loff_t offset;
+	ssize_t len;
+	int ret;
+	struct work_struct work;
+};
+
 /* dio_state communicated between submission path and end_io */
 struct dio {
 	int flags;			/* doesn't change */
@@ -134,6 +145,7 @@ struct dio {
 	/* AIO related stuff */
 	struct kiocb *iocb;		/* kiocb */
 	ssize_t result;			/* IO result */
+	struct dio_sync_io_work *sync_work;	/* work used for O_SYNC AIO */
 
 	/*
 	 * pages[] (and any fields placed after it) are not zeroed out at
@@ -261,6 +273,45 @@ static inline struct page *dio_get_page(struct dio *dio,
 }
 
 /**
+ * generic_dio_end_io() - generic dio ->end_io handler
+ * @iocb: iocb of finishing DIO
+ * @offset: the byte offset in the file of the completed operation
+ * @bytes: length of the completed operation
+ * @work: work to queue for O_SYNC AIO DIO, NULL otherwise
+ * @ret: error code if IO failed
+ * @is_async: is this AIO?
+ *
+ * This is a generic callback to be called when direct IO is finished. It
+ * handles update of number of outstanding DIOs for an inode, completion
+ * of async iocb and queueing of work if we need to call fsync() because
+ * io was O_SYNC.
+ */
+void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
+			struct dio_sync_io_work *work, int ret, bool is_async)
+{
+	struct inode *inode = iocb->ki_filp->f_dentry->d_inode;
+
+	if (!is_async) {
+		inode_dio_done(inode);
+		return;
+	}
+
+	/*
+	 * If we need to sync file, we offload completion to workqueue
+	 */
+	if (work) {
+		work->ret = ret;
+		work->offset = offset;
+		work->len = bytes;
+		queue_work(inode->i_sb->s_dio_flush_wq, &work->work);
+	} else {
+		aio_complete(iocb, ret, 0);
+		inode_dio_done(inode);
+	}
+}
+EXPORT_SYMBOL(generic_dio_end_io);
+
+/**
  * dio_complete() - called when all DIO BIO I/O has been completed
  * @offset: the byte offset in the file of the completed operation
  *
@@ -302,12 +353,22 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is
 		ret = transferred;
 
 	if (dio->end_io && dio->result) {
+		void *private;
+
+		if (dio->sync_work)
+			private = dio->sync_work;
+		else
+			private = dio->private;
 		dio->end_io(dio->iocb, offset, transferred,
-			    dio->private, ret, is_async);
+			    private, ret, is_async);
 	} else {
-		if (is_async)
-			aio_complete(dio->iocb, ret, 0);
-		inode_dio_done(dio->inode);
+		/* No IO submitted? Skip syncing... */
+		if (!dio->result && dio->sync_work) {
+			kfree(dio->sync_work);
+			dio->sync_work = NULL;
+		}
+		generic_dio_end_io(dio->iocb, offset, transferred,
+				   dio->sync_work, ret, is_async);
 	}
 
 	return ret;
@@ -1064,6 +1125,41 @@ static inline int drop_refcount(struct dio *dio)
 }
 
 /*
+ * Work performed from workqueue when AIO DIO is finished.
+ */
+static void dio_aio_sync_work(struct work_struct *work)
+{
+	struct dio_sync_io_work *sync_work =
+		container_of(work, struct dio_sync_io_work, wo
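For readers following along, the completion logic of generic_dio_end_io() in the patch above boils down to a three-way decision: synchronous DIO just drops the inode's outstanding-DIO count; async O_SYNC DIO is offloaded to a workqueue so fsync can run in process context; plain async DIO completes the iocb immediately. A userspace model of just that decision (the enum and function are illustrative, not part of the patch):

```c
#include <assert.h>
#include <stddef.h>

/* What generic_dio_end_io() does, reduced to the branch structure. */
enum dio_action {
    DIO_DONE_ONLY,        /* sync DIO: inode_dio_done() and return   */
    DIO_QUEUE_SYNC_WORK,  /* O_SYNC AIO: queue_work(s_dio_flush_wq)  */
    DIO_COMPLETE_NOW,     /* plain AIO: aio_complete() + dio_done()  */
};

static enum dio_action dio_end_io_action(int is_async, const void *sync_work)
{
    if (!is_async)
        return DIO_DONE_ONLY;
    if (sync_work)
        return DIO_QUEUE_SYNC_WORK;
    return DIO_COMPLETE_NOW;
}
```

The key design point is the middle branch: aio_complete() must not be called until generic_write_sync() has run, and that sync cannot run from bio completion context, hence the workqueue.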
[PATCH 2/4] ocfs2: Use generic handlers of O_SYNC AIO DIO
Use generic handlers to queue fsync() when AIO DIO is completed for an O_SYNC file.

Signed-off-by: Jan Kara
---
 fs/ocfs2/aops.c |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 78b68af..3d14c2b 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -593,9 +593,7 @@ static void ocfs2_dio_end_io(struct kiocb *iocb,
 	level = ocfs2_iocb_rw_locked_level(iocb);
 	ocfs2_rw_unlock(inode, level);
 
-	if (is_async)
-		aio_complete(iocb, ret, 0);
-	inode_dio_done(inode);
+	generic_dio_end_io(iocb, offset, bytes, private, ret, is_async);
 }
 
 /*
@@ -642,7 +640,7 @@ static ssize_t ocfs2_direct_IO(int rw,
 	return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
 				    iov, offset, nr_segs,
 				    ocfs2_direct_IO_get_blocks,
-				    ocfs2_dio_end_io, NULL, 0);
+				    ocfs2_dio_end_io, NULL, DIO_SYNC_WRITES);
 }
 
 static void ocfs2_figure_cluster_boundaries(struct ocfs2_super *osb,
-- 
1.7.1
[PATCH 0/4] Generic O_SYNC AIO DIO handling
Hi Jeff,

these patches implement a generic way of handling O_SYNC AIO DIO. They work for all filesystems except for ext4 and xfs. Thus together with your patches, all filesystems should handle O_SYNC AIO DIO correctly. I've tested ext3, btrfs, and xfs (to check that I didn't break anything when the generic code is unused) and things seem to work fine. Will you add these patches to your series please? Thanks.

								Honza
[PATCH 4/4] btrfs: Use generic handlers of O_SYNC AIO DIO
Use generic handlers to queue fsync() when AIO DIO is completed for an O_SYNC file. Although we use our own bio->end_io function, we call dio_end_io() from it and thus, because we don't set any specific dio->end_io function, generic code ends up calling generic_dio_end_io(), which is all we need for proper O_SYNC AIO DIO handling.

Signed-off-by: Jan Kara
---
 fs/btrfs/inode.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..68add6e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6221,7 +6221,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
 	ret = __blockdev_direct_IO(rw, iocb, inode,
 		   BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev,
 		   iov, offset, nr_segs, btrfs_get_blocks_direct, NULL,
-		   btrfs_submit_direct, 0);
+		   btrfs_submit_direct, DIO_SYNC_WRITES);
 
 	if (ret < 0 && ret != -EIOCBQUEUED) {
 		clear_extent_bit(&BTRFS_I(inode)->io_tree, offset,
-- 
1.7.1
[PATCH] Btrfs: fix scrub statistics report
Fix errors being counted multiple times in the scrub statistics.

Signed-off-by: Stefan Behrens
---
 fs/btrfs/ioctl.h |   30 --
 fs/btrfs/scrub.c |  115 --
 2 files changed, 85 insertions(+), 60 deletions(-)

diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 4f69028..48b926c 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -49,17 +49,20 @@ struct btrfs_ioctl_vol_args_v2 {
  * result of a finished scrub, a canceled scrub or a progress inquiry
  */
 struct btrfs_scrub_progress {
-	__u64 data_extents_scrubbed;	/* # of data extents scrubbed */
-	__u64 tree_extents_scrubbed;	/* # of tree extents scrubbed */
+	__u64 data_extents_scrubbed;	/* # of 4k data extents scrubbed */
+	__u64 tree_extents_scrubbed;	/* # of 4k tree extents scrubbed */
 	__u64 data_bytes_scrubbed;	/* # of data bytes scrubbed */
 	__u64 tree_bytes_scrubbed;	/* # of tree bytes scrubbed */
-	__u64 read_errors;		/* # of read errors encountered (EIO) */
-	__u64 csum_errors;		/* # of failed csum checks */
-	__u64 verify_errors;		/* # of occurences, where the metadata
-					 * of a tree block did not match the
-					 * expected values, like generation or
-					 * logical */
-	__u64 no_csum;			/* # of 4k data block for which no csum
+	__u64 read_errors;		/* # of 4k data blocks which encountered
+					 * read errors (EIO) */
+	__u64 csum_errors;		/* # of 4k data blocks which failed csum
+					 * checks */
+	__u64 verify_errors;		/* # of 4k data blocks, where the
+					 * metadata of a tree block did not
+					 * match the expected values, like
+					 * generation or logical and the
+					 * checksum was not incorrect */
+	__u64 no_csum;			/* # of 4k data blocks for which no csum
 					 * is present, probably the result of
 					 * data written with nodatasum */
 	__u64 csum_discards;		/* # of csum for which no data was found
@@ -68,10 +71,11 @@ struct btrfs_scrub_progress {
 	__u64 malloc_errors;		/* # of internal kmalloc errors. These
 					 * will likely cause an incomplete
 					 * scrub */
-	__u64 uncorrectable_errors;	/* # of errors where either no intact
-					 * copy was found or the writeback
-					 * failed */
-	__u64 corrected_errors;		/* # of errors corrected */
+	__u64 uncorrectable_errors;	/* # of 4k data blocks with errors where
+					 * either no intact copy was found or
+					 * the writeback failed */
+	__u64 corrected_errors;		/* # of 4k data blocks with corrected
+					 * errors */
 	__u64 last_physical;		/* last physical address scrubbed. In
 					 * case a scrub was aborted, this can
 					 * be used to restart the scrub */
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 9770cc5..21ea2ab 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -48,10 +48,11 @@ struct scrub_dev;
 static void scrub_bio_end_io(struct bio *bio, int err);
 static void scrub_checksum(struct btrfs_work *work);
 static int scrub_checksum_data(struct scrub_dev *sdev,
-			       struct scrub_page *spag, void *buffer);
+			       struct scrub_page *spag, void *buffer,
+			       int modify_stats);
 static int scrub_checksum_tree_block(struct scrub_dev *sdev,
 				     struct scrub_page *spag, u64 logical,
-				     void *buffer);
+				     void *buffer, int modify_stats);
 static int scrub_checksum_super(struct scrub_bio *sbio, void *buffer);
 static int scrub_fixup_check(struct scrub_bio *sbio, int ix);
 static void scrub_fixup_end_io(struct bio *bio, int err);
@@ -555,7 +556,8 @@ out:
  * recheck_error gets called for every page in the bio, even though only
  * one may be bad
  */
-static int scrub_recheck_error(struct scrub_bio *sbio, int ix)
+static int scrub_recheck_error(struct scrub_bio *sbio, int ix,
+			       int modify_stats)
 {
 	struct scrub_dev *sdev = sbio->sdev;
 	u64 sector = (sbio->physical + ix * PAGE_SIZE) >> 9;
@@ -575,9 +577,11
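The shape of the fix is easy to model in isolation: the checksum helpers run both on the initial scrub pass and again from the recheck/fixup path, and without a guard each bad block bumps the counters once per call. The new modify_stats argument makes only the first pass count. A toy sketch (names illustrative, not the kernel code):

```c
#include <assert.h>

/* Minimal model of the double-counting bug and its fix. */
struct scrub_stats {
    long csum_errors;
};

/* Called on the first pass with modify_stats=1 and from the recheck
 * path with modify_stats=0; returns whether the block was bad. */
static int check_block(struct scrub_stats *st, int block_is_bad,
                       int modify_stats)
{
    if (block_is_bad && modify_stats)
        st->csum_errors++;
    return block_is_bad;
}
```

Before the patch, the recheck call effectively ran with the increment unguarded, so one bad block could show up as two or more errors in the ioctl statistics.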
Re: [PATCH] mkfs: Handle creation of filesystem larger than the first device
On Wed 08-02-12 22:05:26, Phillip Susi wrote:
> On 02/08/2012 06:20 PM, Jan Kara wrote:
> > Thanks for your reply. I admit I was not sure what exactly the size argument
> > should be. So after looking into the code for a while I figured it should
> > be the total size of the filesystem - or put differently, the size of the
> > virtual block address space in the filesystem. Thus when a filesystem has
> > more devices (or the admin wants to add more devices later), it can be
> > larger than the first device. But I'm not really a btrfs developer so I
> > might be wrong and of course feel free to fix the issue as you deem fit.
>
> The size of the fs is the total size of the individual disks. When you
> limit the size, you limit the size of a disk, not the whole fs. IIRC,
> mkfs initializes the fs on the first disk, which is why it was using that
> size as the size of the whole fs, and then adds the other disks after
> (which then add their size to the total fs size).

OK, I missed that btrfs_add_to_fsid() increases the total size of the filesystem. So now I agree with you. A new patch is attached. Thanks for your review.

> It might be nice if
> mkfs could take sizes for each disk, but it only seems to take one size
> for the initial disk.

Yes, but I don't see a realistic usecase, so I don't think it's really worth the work.

								Honza
-- 
Jan Kara
SUSE Labs, CR

From e5f46872232520310c56327593c02ef6a7f5ea33 Mon Sep 17 00:00:00 2001
From: Jan Kara
Date: Fri, 10 Feb 2012 11:44:44 +0100
Subject: [PATCH] mkfs: Handle creation of filesystem larger than the first device

mkfs does not properly check the requested size of the filesystem. Thus if the requested size is larger than the first device, it happily creates a filesystem larger than the device it resides on, which results in 'attempt to access beyond end of device' messages from the kernel. So verify the specified filesystem size against the size of the first device.
CC: David Sterba
Signed-off-by: Jan Kara
---
 mkfs.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index e3ced19..3afe9eb 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1282,6 +1282,10 @@ int main(int ac, char **av)
 		ret = btrfs_prepare_device(fd, file, zero_end,
 					   &dev_block_count, &mixed);
 		if (block_count == 0)
 			block_count = dev_block_count;
+		else if (block_count > dev_block_count) {
+			fprintf(stderr, "%s is smaller than requested size\n",
+				file);
+			exit(1);
+		}
 	} else {
 		ac = 0;
 		file = av[optind++];
-- 
1.7.1
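The validation the patch adds can be isolated as a tiny helper (a sketch mirroring the block_count / dev_block_count check; validate_size is a made-up name, not mkfs code): a requested size of 0 means "use the whole device", and anything larger than the device must be rejected.

```c
#include <assert.h>

/* 0 = success and *out holds the effective size; -1 = requested size
 * exceeds the first device, matching the error path the patch adds. */
static int validate_size(unsigned long long requested,
                         unsigned long long dev_size,
                         unsigned long long *out)
{
    if (requested == 0)
        requested = dev_size;              /* default: whole device */
    else if (requested > dev_size)
        return -1;   /* "... is smaller than requested size" */
    *out = requested;
    return 0;
}
```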
Re: [PATCH/RFC] Btrfs: Add conditional ENOSPC debugging.
Hi Mitch,

having this patch on the list is a good idea. I have two remarks, just in case it will be included sometime:

On 09.02.2012 22:38, Mitch Harder wrote:
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index fe4cd0f..31b717f 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -656,8 +656,18 @@ static int btrfs_delayed_inode_reserve_metadata(
>  	 * EAGAIN to make us stop the transaction we have, so return
>  	 * ENOSPC instead so that btrfs_dirty_inode knows what to do.
>  	 */
> +#ifdef BTRFS_DEBUG_ENOSPC
> +	if (unlikely(ret == -EAGAIN)) {
> +		ret = -ENOSPC;
> +		if (printk_ratelimit())

From linux/printk.h:

/*
 * Please don't use printk_ratelimit(), because it shares ratelimiting state
 * with all other unrelated printk_ratelimit() callsites. Instead use
 * printk_ratelimited() or plain old __ratelimit().
 */

printk_ratelimited() seems the right choice, here.

> +			printk(KERN_WARNING
> +			       "btrfs: ENOSPC set in "
> +			       "btrfs_delayed_inode_reserve_metadata\n");
> +	}
> +#else
>  	if (ret == -EAGAIN)
>  		ret = -ENOSPC;
> +#endif

I don't like the #if placements throughout the patch. Copying all the conditions is error prone (especially when changes are made later on). I'd rather leave all the conditions as they stand (possibly adding the unlikely() macro if you wish). Same for the following return statement or assignment. Simply put printk_ratelimited() into the #if ... #endif.

-Jan
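The point about shared ratelimit state can be demonstrated with a toy model (illustrative only; the real kernel uses struct ratelimit_state with an interval and burst, and printk_ratelimited() gives each callsite its own state): when one global budget is shared, an unrelated noisy caller can silence the message you actually care about.

```c
#include <assert.h>

/* Toy budget-based ratelimit. printk_ratelimit() behaves like every
 * callsite sharing ONE of these; printk_ratelimited() behaves like
 * each callsite owning its own. Numbers are made up. */
struct toy_ratelimit {
    int budget;   /* messages still allowed in this window */
};

/* Returns 1 if the message may be printed, 0 if suppressed. */
static int toy_allow(struct toy_ratelimit *rs)
{
    if (rs->budget <= 0)
        return 0;
    rs->budget--;
    return 1;
}
```

With a shared state, two prints from a noisy caller exhaust the budget and a third, unrelated caller is suppressed; with per-callsite state each caller keeps its own budget, which is exactly why the header comment steers people toward printk_ratelimited().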