Re: can't read superblock (but could mount)

2012-02-10 Thread btrfs


Quoting Chris Mason :


On Fri, Feb 10, 2012 at 05:18:42PM -0500, Chris Mason wrote:

Ok, step one:

Pull down the dangerdonteveruse branch of btrfs-progs:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git  
dangerdonteveruse


Run btrfs-debug-tree -r /dev/sda1 and send the output here please.


Sorry, that's btrfs-debug-tree -R /dev/sda1


# ./btrfs-debug-tree -R /dev/sda1
root tree: 10229936128 level 1
chunk tree: 10364125184 level 1
extent tree key (EXTENT_TREE ROOT_ITEM 0) 10229944320 level 3
device tree key (DEV_TREE ROOT_ITEM 0) 10192654336 level 0
fs tree key (FS_TREE ROOT_ITEM 0) 10103791616 level 3
checksum tree key (CSUM_TREE ROOT_ITEM 0) 10103156736 level 2
data reloc tree key (DATA_RELOC_TREE ROOT_ITEM 0) 10027970560 level 0
btrfs root backup slot 0
tree root gen 157090 block 10229907456
extent root gen 157090 block 10229911552
chunk root gen 124887 block 10364125184
device root gen 124887 block 10192654336
csum root gen 157032 block 10103156736
fs root gen 157032 block 10103791616
15491051520 used 40020631552 total 1 devices
btrfs root backup slot 1
tree root gen 157091 block 10229948416
extent root gen 157091 block 10229952512
chunk root gen 124887 block 10364125184
device root gen 124887 block 10192654336
csum root gen 157032 block 10103156736
fs root gen 157032 block 10103791616
15491051520 used 40020631552 total 1 devices
btrfs root backup slot 2
tree root gen 157092 block 10229907456
extent root gen 157092 block 10229911552
chunk root gen 124887 block 10364125184
device root gen 124887 block 10192654336
csum root gen 157032 block 10103156736
fs root gen 157032 block 10103791616
15491051520 used 40020631552 total 1 devices
btrfs root backup slot 3
tree root gen 157093 block 10229936128
extent root gen 157093 block 10229944320
chunk root gen 124887 block 10364125184
device root gen 124887 block 10192654336
csum root gen 157032 block 10103156736
fs root gen 157032 block 10103791616
15491051520 used 40020631552 total 1 devices
total bytes 40020631552
bytes used 15491051520
uuid 9e9886fc-3e60-4c59-a246-727662769ee2
Btrfs Btrfs v0.19


But how is it that the bootloader apparently is able to load (at least)
the kernel (and initrd) from the partition?






Re: btrfs-raid questions I couldn't find an answer to on the wiki

2012-02-10 Thread Duncan
Phillip Susi posted on Fri, 10 Feb 2012 14:45:43 -0500 as excerpted:

> On 1/31/2012 12:55 AM, Duncan wrote:
>> Thanks!  I'm on grub2 as well.  It's still masked on gentoo, but I
>> recently unmasked and upgraded to it, taking advantage of the fact that
>> I have two two-spindle md/raid-1s for /boot and its backup to test and
>> upgrade one of them first, then the other only when I was satisfied
>> with the results on the first set.  I'll be using a similar strategy
>> for the btrfs upgrades, only most of my md/raid-1s are 4-spindle, with
>> two sets, working and backup, and I'll upgrade one set first.
> 
> Why do you want to have a separate /boot partition?  Unless you can't
> boot without it, having one just makes things more complex/problematic. 
> If you do have one, I agree that it is best to keep it ext4 not btrfs.

For a proper picture of the situation, understand that I don't have an 
initr*, I build everything I need into the kernel and have module loading 
disabled, and I keep /boot unmounted except when I'm actually installing 
an upgrade or reconfiguring.

Having a separate /boot means that I can keep it unmounted and thus free 
from possible random corruption or accidental partial /boot tree 
overwrite or deletion, most of the time.  It also means that I can emerge 
(build from sources using the gentoo ebuild script provided for the 
purpose, and install to the live system) a new grub without fear of 
corrupting what I actually boot from -- the grub system installation and 
boot installation remain separate.

A separate /boot is also more robust in terms of file system corruption 
-- if something goes wrong with my rootfs, I can simply boot its backup, 
from a separate /boot that will not have been corrupted.  Similarly, if 
something goes wrong with /boot (or the bios partition), I can switch 
drives in the BIOS and boot from the backup /boot, then load my usual 
rootfs.

Since I'm working with four drives, and both the working /boot and 
backup /boot are two-spindle md/raid1, one on one pair, one on the other, 
I have both hardware redundancy via the second spindle of the raid1, and 
admin-fatfinger redundancy via the backup.  However, the rootfs and its 
backup are both on quad-spindle md/raid1s, thus giving me four separate 
physical copies each of rootfs and its backup.  Because the disk points 
at a single bootloader, if /boot is on rootfs, all four would point to 
either the working rootfs or the backup rootfs, and would update 
together, so I'd lose the ability to fall back to the backup /boot.

(Note that I developed the backup /boot policy and solution back on 
legacy-grub.  Grub2 is rather more flexible, particularly with a 
reasonably roomy GPT BIOS partition, and since each BIOS partition is 
installed individually, in theory, if a grub2 update failed, I could 
point the BIOS at a disk I hadn't installed the BIOS partition update to 
yet, boot to the limited grub rescue-mode-shell, and point it at the 
/boot in the backup rootfs to load the normal-mode-shell, menu, and 
additional grub2 modules as necessary.  However, being able to access a 
full normal-mode-shell grub2 on the backup /boot instead of having to 
resort to the grub2 rescue-mode-shell to reach the backup rootfs, does 
have its benefits.)

One of the nice things about grub2 normal-mode is that it allows 
(directory and plain text file) browsing of pretty much anything it has a 
module for, anywhere on the system.  That's a nice thing to be able to 
do, but it too is much more robust if /boot isn't part of rootfs, and 
thus, isn't likely to be damaged if the rootfs is.  The ability to boot 
to grub2 and retrieve vital information (even if limited to plain-text 
file storage) from a system without a working rootfs is a very nice 
ability to have! 

So you see, a separate /boot really does have its uses. =:^)

>> Meanwhile, you're right about subvolumes.  I'd not try them on a btrfs
>> /boot, either.  (I don't really see the use case for it, for a separate
>> /boot, tho there's certainly a case for a /boot subvolume on a btrfs
>> root, for people doing that.)
> 
> The Ubuntu installer creates two subvolumes by default when you install
> on btrfs: one named @, mounted on /, and one named @home, mounted on
> /home.  Grub2 handles this well since the subvols have names in the
> default root, so grub just refers to /@/boot instead of /boot, and so
> on.  The apt-btrfs-snapshot package makes apt automatically snapshot the
> root subvol so you can revert after an upgrade.  This seamlessly causes
> grub to go back to the old boot menu without the new kernels too, since
> it goes back to reading the old grub.cfg in the reverted root subvol.

Thanks for that "real world" example.  Subvolumes and particularly 
snapshots can indeed be quite useful, but I'd be rather leery of having 
all that on the same master filesystem.  Lose it and you've lost 
everything, snapshots or no snapshots, if there aren't bootable backups 
somewhere.

Hot data Tracking

2012-02-10 Thread Timo Witte
What happened to the hot data tracking feature in btrfs? There are a lot
of old patches from August 2010, but it looks like the feature has been
completely removed from the current version of btrfs. Is this feature
still on the roadmap?





Re: can't read superblock (but could mount)

2012-02-10 Thread Chris Mason
On Fri, Feb 10, 2012 at 05:18:42PM -0500, Chris Mason wrote:
> On Fri, Feb 10, 2012 at 08:30:51PM +0100, bt...@nentwig.biz wrote:
> > Hi!
> > 
> > I used to have arch linux running on 1 btrfs partition (sda1, incl. /boot).
> > When switching to 3.2.5 recently the system fails to boot:
> > 
> > (after udevd)
> > /etc/rc.sysinit: line 15: 117 Bus error  mountpoint -q /proc
> > and so on, no idea.
> > 
> > It used to boot with 3.2.4, but
> > 
> > 1) I obviously had some corruption in the tree, when I tried to delete a
> > certain file I hit e.g. "kernel BUG at fs/btrfs/extent-tree.c" message.
> > 
> > 2) Even while running 3.2.4 I was unable to mount the partition from a
> > parallel gentoo or live USB install and I still am:
> > 
> > # mount /dev/sda1 /mnt/arch/
> > mount: /dev/sda1: can't read superblock
> > 
> > The strange thing is: when trying to boot from the partition the
> > boot loader (syslinux) is
> > obviously still able to load the kernel from that partition.
> > 
> > Tried btrfs-zero-log and some other desperate things. Result: I can
> > now actually
> > execute btrfsck which previously used to fail:
> 
> Ok, step one:
> 
> Pull down the dangerdonteveruse branch of btrfs-progs:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git 
> dangerdonteveruse
> 
> Run btrfs-debug-tree -r /dev/sda1 and send the output here please.

Sorry, that's btrfs-debug-tree -R /dev/sda1

-chris


Re: subvolume info in /proc/mounts

2012-02-10 Thread Nikos Voutsinas
> On Sat, Feb 04, 2012 at 08:40:51PM +0200, Nikos Voutsinas wrote:
>> There was a patch to include the subvolume mount option into
>> /proc/mounts.
>> Did that make it into the kernel?
>> If not, what is the formal way to find out which subvolume is mounted?
>
> See /proc/self/mountinfo; there is an 'fs root' column, which for bind
> mounts and btrfs subvolumes shows the path within the filesystem (and
> '/' for normal mounts). findmnt(8) uses that path in the SOURCE column,
> for example /dev/sda1[/subvolume].

Thank you Karel, at least the list now has an answer on how to find the
mounted subvolume. For a moment I thought that no one had noticed this
part of the question.

In any production scenario, the first time you try to explore btrfs
subvolume capabilities it becomes obvious that this feature is missing, and
it is strange that it has been overlooked for such a long time.

Nikos





Re: can't read superblock (but could mount)

2012-02-10 Thread Chris Mason
On Fri, Feb 10, 2012 at 08:30:51PM +0100, bt...@nentwig.biz wrote:
> Hi!
> 
> I used to have arch linux running on 1 btrfs partition (sda1, incl. /boot).
> When switching to 3.2.5 recently the system fails to boot:
> 
> (after udevd)
> /etc/rc.sysinit: line 15: 117 Bus error  mountpoint -q /proc
> and so on, no idea.
> 
> It used to boot with 3.2.4, but
> 
> 1) I obviously had some corruption in the tree, when I tried to delete a
> certain file I hit e.g. "kernel BUG at fs/btrfs/extent-tree.c" message.
> 
> 2) Even while running 3.2.4 I was unable to mount the partition from a
> parallel gentoo or live USB install and I still am:
> 
> # mount /dev/sda1 /mnt/arch/
> mount: /dev/sda1: can't read superblock
> 
> The strange thing is: when trying to boot from the partition the
> boot loader (syslinux) is
> obviously still able to load the kernel from that partition.
> 
> Tried btrfs-zero-log and some other desperate things. Result: I can
> now actually
> execute btrfsck which previously used to fail:

Ok, step one:

Pull down the dangerdonteveruse branch of btrfs-progs:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git 
dangerdonteveruse

Run btrfs-debug-tree -r /dev/sda1 and send the output here please.

This block with bad transid is from your FS root.  We'll need to find a root
that matches.  But we should be able to patch things up!

-chris


Re: btrfs unmountable after failed suspend

2012-02-10 Thread Chris Mason
On Thu, Feb 09, 2012 at 06:54:42PM -0600, Chester wrote:
> Output for btrfs-debug-tree for that specific block in dmesg
> 
> And available here ( http://pastebin.com/AgdvS5JM ) in case the lines
> wrap and look ugly
> 
> leaf 653297209344 items 43 free space 673 generation 332442 owner 2
> fs uuid 0f5b2f4f-1aa0-4e6f-b904-e5b4d4588144
> chunk uuid 27536f0d-993b-4da3-85eb-1c9b08c435cb
>   item 7 key (653284831232 EXTENT_ITEM 4096) itemoff 3551 itemsize 51
>   extent refs 1 gen 332218 flags 2
>   tree block key (654686961664 a8 4096) level 0
>   tree block backref root 2
>   item 8 key (653284818944 EXTENT_ITEM 4096) itemoff 3500 itemsize 51
>   extent refs 1 gen 332218 flags 2
>   tree block key (654687121408 a8 4096) level 0
>   tree block backref root 2

Ok, it's worth pointing out that you're just one bit away from proper
ordering here.  While I'm testing out this code to fix key ordering,
could you please run memtest86 on your machine?

The other fields all look correct.

-chris



[PATCH] btrfs: honor umask when creating subvol root

2012-02-10 Thread Florian Albrechtskirchinger
Set the subvol root inode permissions based on the current umask.
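
For anyone wondering what the new expression evaluates to, here is a small
user-space sketch of the same arithmetic (the umask value 022 and the use of
0777 in place of the kernel's S_IRWXUGO are just illustrative assumptions):

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	mode_t um = 022;                          /* example umask */
	mode_t mode = S_IFDIR | (~um & 0777);     /* what the patch computes */

	printf("%o\n", (unsigned)(mode & 0777));  /* prints 755 */
	return 0;
}

So with a typical umask of 022 the new subvolume root ends up 0755 instead of
the previously hard-coded 0700.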
---
 fs/btrfs/inode.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..b88e71a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6696,8 +6696,10 @@ int btrfs_create_subvol_root(struct btrfs_trans_handle 
*trans,
int err;
u64 index = 0;
 
-   inode = btrfs_new_inode(trans, new_root, NULL, "..", 2, new_dirid,
-   new_dirid, S_IFDIR | 0700, &index);
+   inode = btrfs_new_inode(trans, new_root, NULL, "..", 2,
+   new_dirid, new_dirid,
+   S_IFDIR | (~current_umask() & S_IRWXUGO),
+   &index);
if (IS_ERR(inode))
return PTR_ERR(inode);
inode->i_op = &btrfs_dir_inode_operations;
-- 
1.7.8.4



Re: [LSF/MM TOPIC] COWing writeback pages

2012-02-10 Thread Josef Bacik
On Fri, Feb 10, 2012 at 12:49:50PM -0800, Sage Weil wrote:
> On Fri, 10 Feb 2012, Josef Bacik wrote:
> > On Fri, Feb 10, 2012 at 11:25:27AM -0800, Sage Weil wrote:
> > > Hi everyone,
> > > 
> > > The takeaway from the 'stable pages' discussions in the last few 
> > > workshops 
> > > was that pages under writeback should remain locked so that subsequent 
> > > writers don't touch them while they are en route to the disk.  This 
> > > prevents bad checksums and DIF/DIX type failures (whereas previously we 
> > > didn't really care whether old or new data reached the disk).
> > > 
> > > The fear is/was that anyone subsequently modifying the page will have to 
> > > wait for writeback io to complete before continuing.  I seem to remember 
> > > somebody (Martin?) saying that in practice, under "real" workloads, that 
> > > doesn't actually happen, so don't worry about it.  (Does anyone remember 
> > > the details of what testing led to that conclusion?)
> > > 
> > > Anyway, we are seeing what looks like an analogous problem with btrfs, 
> > > where operations sometimes block waiting for writeback of the btree 
> > > pages.  
> > > Although the 'keep rewriting the same page' pattern may not be prevalent 
> > > in normal file workloads, it does seem to happen with the btrfs btree.
> > > 
> > > The obvious solution seems to be to COW the page if it is under writeback 
> > > and we want to remodify it.  Presumably that can be done just in btrfs, 
> > > to 
> > > address the btrfs-specific symptoms we're hitting, but I'm interested in 
> > > hearing from other folks about whether it's more generally useful VM 
> > > functionality for other filesystems and other workloads.
> > > 
> > > Unfortunately, we haven't been able to pinpoint the exact scenarios under 
> > > which this triggers under btrfs.  We regularly see long stalls for 
> > > metadata operations (create() and similar metadata-only operations) that 
> > > block after btrfs_commit_transaction has "finished" the previous 
> > > transaction and is doing
> > > 
> > >   return filemap_write_and_wait(btree_inode->i_mapping);
> > > 
> > > What we're less clear about is when btrfs will modify the in-memory page 
> > > in place (and thus wait) versus COWing the page... still digging into 
> > > this 
> > > now.
> > > 
> > 
> > Heh so I'm working on this now, specifically in the heavy create() 
> > workload, and
> > I've just about got it nailed down.  A lot of this problem is because we 
> > rely on
> > normal pagecache for our metadata so I'm copying xfs and creating our own
> > caching.
> > 
> > The thing is since we have an inode hanging out with normal pagecache pages 
> > we
> > can have multiple people trying to write out dirty pages in our inode at the
> > same time, and since it goes through our normal write path we'll end up in 
> > this
> > case where we're waiting on writeback for pages we won't actually end up 
> > writing
> > out.  My code will fix this, if we're talking about the same problem ;).
> 
> Oh, I hadn't thought of that... that sounds like a similar but slightly 
> different problem, since it probably wouldn't correlate with the 
> filemap_write_and_wait().  As long as we don't have a btree update waiting 
> on btree writeback, though, both problems should be addressed.
> 

Oh yeah, that problem is taken care of; IO is completely separate from updating.
We set the BUF_WRITTEN flag in the header right before writing out, so the
thing will be COW'ed if anybody tries to modify it while it's in flight; they
won't have to wait or anything.  Of course, now that I think about it, that's
what should be happening today anyway, so I'm confused about what you are seeing.

> In any case, we're definitely interested in checking out the code when 
> it's ready to share!
> 

Well, I've been committing my progress to my git tree so you can check it out,
but what's there won't work at all, and what I have (and will commit shortly)
works pretty well provided you don't do anything with the tree-log; for some
reason I'm screwing something up there and it's crashing :).  Thanks,

Josef


Re: [LSF/MM TOPIC] COWing writeback pages

2012-02-10 Thread Sage Weil
On Fri, 10 Feb 2012, Josef Bacik wrote:
> On Fri, Feb 10, 2012 at 11:25:27AM -0800, Sage Weil wrote:
> > Hi everyone,
> > 
> > The takeaway from the 'stable pages' discussions in the last few workshops 
> > was that pages under writeback should remain locked so that subsequent 
> > writers don't touch them while they are en route to the disk.  This 
> > prevents bad checksums and DIF/DIX type failures (whereas previously we 
> > didn't really care whether old or new data reached the disk).
> > 
> > The fear is/was that anyone subsequently modifying the page will have to 
> > wait for writeback io to complete before continuing.  I seem to remember 
> > somebody (Martin?) saying that in practice, under "real" workloads, that 
> > doesn't actually happen, so don't worry about it.  (Does anyone remember 
> > the details of what testing led to that conclusion?)
> > 
> > Anyway, we are seeing what looks like an analogous problem with btrfs, 
> > where operations sometimes block waiting for writeback of the btree pages.  
> > Although the 'keep rewriting the same page' pattern may not be prevalent 
> > in normal file workloads, it does seem to happen with the btrfs btree.
> > 
> > The obvious solution seems to be to COW the page if it is under writeback 
> > and we want to remodify it.  Presumably that can be done just in btrfs, to 
> > address the btrfs-specific symptoms we're hitting, but I'm interested in 
> > hearing from other folks about whether it's more generally useful VM 
> > functionality for other filesystems and other workloads.
> > 
> > Unfortunately, we haven't been able to pinpoint the exact scenarios under 
> > which this triggers under btrfs.  We regularly see long stalls for 
> > metadata operations (create() and similar metadata-only operations) that 
> > block after btrfs_commit_transaction has "finished" the previous 
> > transaction and is doing
> > 
> > return filemap_write_and_wait(btree_inode->i_mapping);
> > 
> > What we're less clear about is when btrfs will modify the in-memory page 
> > in place (and thus wait) versus COWing the page... still digging into this 
> > now.
> > 
> 
> Heh so I'm working on this now, specifically in the heavy create() workload, 
> and
> I've just about got it nailed down.  A lot of this problem is because we rely 
> on
> normal pagecache for our metadata so I'm copying xfs and creating our own
> caching.
> 
> The thing is since we have an inode hanging out with normal pagecache pages we
> can have multiple people trying to write out dirty pages in our inode at the
> same time, and since it goes through our normal write path we'll end up in 
> this
> case where we're waiting on writeback for pages we won't actually end up 
> writing
> out.  My code will fix this, if we're talking about the same problem ;).

Oh, I hadn't thought of that... that sounds like a similar but slightly 
different problem, since it probably wouldn't correlate with the 
filemap_write_and_wait().  As long as we don't have a btree update waiting 
on btree writeback, though, both problems should be addressed.

In any case, we're definitely interested in checking out the code when 
it's ready to share!

sage


Re: [LSF/MM TOPIC] COWing writeback pages

2012-02-10 Thread Josef Bacik
On Fri, Feb 10, 2012 at 11:25:27AM -0800, Sage Weil wrote:
> Hi everyone,
> 
> The takeaway from the 'stable pages' discussions in the last few workshops 
> was that pages under writeback should remain locked so that subsequent 
> writers don't touch them while they are en route to the disk.  This 
> prevents bad checksums and DIF/DIX type failures (whereas previously we 
> didn't really care whether old or new data reached the disk).
> 
> The fear is/was that anyone subsequently modifying the page will have to 
> wait for writeback io to complete before continuing.  I seem to remember 
> somebody (Martin?) saying that in practice, under "real" workloads, that 
> doesn't actually happen, so don't worry about it.  (Does anyone remember 
> the details of what testing led to that conclusion?)
> 
> Anyway, we are seeing what looks like an analogous problem with btrfs, 
> where operations sometimes block waiting for writeback of the btree pages.  
> Although the 'keep rewriting the same page' pattern may not be prevalent 
> in normal file workloads, it does seem to happen with the btrfs btree.
> 
> The obvious solution seems to be to COW the page if it is under writeback 
> and we want to remodify it.  Presumably that can be done just in btrfs, to 
> address the btrfs-specific symptoms we're hitting, but I'm interested in 
> hearing from other folks about whether it's more generally useful VM 
> functionality for other filesystems and other workloads.
> 
> Unfortunately, we haven't been able to pinpoint the exact scenarios under 
> which this triggers under btrfs.  We regularly see long stalls for 
> metadata operations (create() and similar metadata-only operations) that 
> block after btrfs_commit_transaction has "finished" the previous 
> transaction and is doing
> 
>   return filemap_write_and_wait(btree_inode->i_mapping);
> 
> What we're less clear about is when btrfs will modify the in-memory page 
> in place (and thus wait) versus COWing the page... still digging into this 
> now.
> 

Heh so I'm working on this now, specifically in the heavy create() workload, and
I've just about got it nailed down.  A lot of this problem is because we rely on
normal pagecache for our metadata so I'm copying xfs and creating our own
caching.

The thing is since we have an inode hanging out with normal pagecache pages we
can have multiple people trying to write out dirty pages in our inode at the
same time, and since it goes through our normal write path we'll end up in this
case where we're waiting on writeback for pages we won't actually end up writing
out.  My code will fix this, if we're talking about the same problem ;).
Thanks,

Josef


Re: btrfs-raid questions I couldn't find an answer to on the wiki

2012-02-10 Thread Phillip Susi

On 1/31/2012 12:55 AM, Duncan wrote:
> Thanks!  I'm on grub2 as well.  It's still masked on gentoo, but
> I recently unmasked and upgraded to it, taking advantage of the
> fact that I have two two-spindle md/raid-1s for /boot and its
> backup to test and upgrade one of them first, then the other only
> when I was satisfied with the results on the first set.  I'll be
> using a similar strategy for the btrfs upgrades, only most of my
> md/raid-1s are 4-spindle, with two sets, working and backup, and
> I'll upgrade one set first.

Why do you want to have a separate /boot partition?  Unless you can't
boot without it, having one just makes things more
complex/problematic.  If you do have one, I agree that it is best to
keep it ext4 not btrfs.

> Meanwhile, you're right about subvolumes.  I'd not try them on a
> btrfs /boot, either.  (I don't really see the use case for it, for
> a separate /boot, tho there's certainly a case for a /boot
> subvolume on a btrfs root, for people doing that.)

The Ubuntu installer creates two subvolumes by default when you
install on btrfs: one named @, mounted on /, and one named @home,
mounted on /home.  Grub2 handles this well since the subvols have
names in the default root, so grub just refers to /@/boot instead of
/boot, and so on.  The apt-btrfs-snapshot package makes apt
automatically snapshot the root subvol so you can revert after an
upgrade.  This seamlessly causes grub to go back to the old boot menu
without the new kernels too, since it goes back to reading the old
grub.cfg in the reverted root subvol.

I have a radically different suggestion you might consider when rebuilding
your system.  Partition each disk into only two partitions: one
for bios_grub, and one for everything else (or just use MBR and skip
the bios_grub partition).  Give the second partitions to mdadm to
make a raid10 array out of.  If you use a 2x far and 2x offset instead
of the default near layout, you will have an array that can still
handle any 2 of the 4 drives failing, will have twice the capacity of
a 4 way mirror, almost the same sequential read throughput of a 4 way
raid0, and about twice the write throughput of a 4 way mirror.
Partition that array up and put your filesystems on it.



can't read superblock (but could mount)

2012-02-10 Thread btrfs

Hi!

I used to have arch linux running on 1 btrfs partition (sda1, incl. /boot).
When switching to 3.2.5 recently the system fails to boot:

(after udevd)
/etc/rc.sysinit: line 15: 117 Bus error  mountpoint -q /proc
and so on, no idea.

It used to boot with 3.2.4, but

1) I obviously had some corruption in the tree, when I tried to delete a
certain file I hit e.g. "kernel BUG at fs/btrfs/extent-tree.c" message.

2) Even while running 3.2.4 I was unable to mount the partition from a
parallel gentoo or live USB install and I still am:

# mount /dev/sda1 /mnt/arch/
mount: /dev/sda1: can't read superblock

The strange thing is: when trying to boot from the partition the boot
loader (syslinux) is obviously still able to load the kernel from that
partition.

Tried btrfs-zero-log and some other desperate things. Result: I can now
actually execute btrfsck, which previously used to fail:

# btrfsck /dev/sda1
Extent back ref already exists for 9872289792 parent 0 root 5
leaf parent key incorrect 9872289792
bad block 9872289792
ref mismatch on [9872289792 4096] extent item 1, found 2
Incorrect global backref count on 9872289792 found 1 wanted 2
backpointer mismatch on [9872289792 4096]
owner ref check failed [9872289792 4096]
ref mismatch on [9889067008 4096] extent item 1, found 0
Backref 9889067008 root 5 not referenced
Incorrect global backref count on 9889067008 found 1 wanted 0
backpointer mismatch on [9889067008 4096]
owner ref check failed [9889067008 4096]
ref mismatch on [37163102208 65536] extent item 1, found 0
Incorrect local backref count on 37163102208 root 5 owner 3360937  
offset 0 found 0 wanted 1

backpointer mismatch on [37163102208 65536]
owner ref check failed [37163102208 65536]
ref mismatch on [37163814912 36864] extent item 0, found 1
Backref 37163814912 root 5 owner 3360939 offset 0 num_refs 0 not found  
in extent tree
Incorrect local backref count on 37163814912 root 5 owner 3360939  
offset 0 found 1 wanted 0

backpointer mismatch on [37163814912 36864]
found 15491117056 bytes used err is 1
total csum bytes: 14764500
total tree bytes: 366956544
total fs tree bytes: 317714432
btree space waste bytes: 90628933
file data blocks allocated: 16182484992
 referenced 17813028864
Btrfs Btrfs v0.19-dirty

However, this doesn't seem to fix anything. I can run it over and over
again with the same output.


btrfs-show does recognize the partition...

# btrfs-show /dev/sda1
**
** WARNING: this program is considered deprecated
** Please consider to switch to the btrfs utility
**
failed to read /dev/sde: No medium found
Label: none  uuid: 9e9886fc-3e60-4c59-a246-727662769ee2
Total devices 1 FS bytes used 14.43GB
devid1 size 37.27GB used 34.52GB path /dev/sda1

Btrfs Btrfs v0.19-dirty


...while device scan does not:

# btrfs device scan /dev/sda1
Scanning for Btrfs filesystems in '/dev/sda1'


Finally dmesg after a mount attempt:

[88124.390308] Btrfs detected SSD devices, enabling SSD mode
[88124.392354] parent transid verify failed on 9872289792 wanted  
152893 found 120351
[88124.392357] parent transid verify failed on 9872289792 wanted  
152893 found 120351
[88124.392359] parent transid verify failed on 9872289792 wanted  
152893 found 120351
[88124.392360] parent transid verify failed on 9872289792 wanted  
152893 found 120351
[88124.392361] parent transid verify failed on 9872289792 wanted  
152893 found 120351

[88124.392370] BTRFS: inode 3392566 still on the orphan list
[88124.392372] btrfs: could not do orphan cleanup -5
[88124.688187] btrfs: open_ctree failed


Any chance to rescue the data?

thx
tcn



[LSF/MM TOPIC] COWing writeback pages

2012-02-10 Thread Sage Weil
Hi everyone,

The takeaway from the 'stable pages' discussions in the last few workshops 
was that pages under writeback should remain locked so that subsequent 
writers don't touch them while they are en route to the disk.  This 
prevents bad checksums and DIF/DIX type failures (whereas previously we 
didn't really care whether old or new data reached the disk).

The fear is/was that anyone subsequently modifying the page will have to 
wait for writeback io to complete before continuing.  I seem to remember 
somebody (Martin?) saying that in practice, under "real" workloads, that 
doesn't actually happen, so don't worry about it.  (Does anyone remember 
the details of what testing led to that conclusion?)

Anyway, we are seeing what looks like an analogous problem with btrfs, 
where operations sometimes block waiting for writeback of the btree pages.  
Although the 'keep rewriting the same page' pattern may not be prevalent 
in normal file workloads, it does seem to happen with the btrfs btree.

The obvious solution seems to be to COW the page if it is under writeback 
and we want to remodify it.  Presumably that can be done just in btrfs, to 
address the btrfs-specific symptoms we're hitting, but I'm interested in 
hearing from other folks about whether it's more generally useful VM 
functionality for other filesystems and other workloads.
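
(To make the idea concrete, here is a toy user-space model of the decision we
have in mind; it is only a sketch of the technique, not actual btrfs or VM
code, and all the names are made up:)

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct buffer {
	unsigned char *data;
	size_t len;
	bool under_writeback;	/* true while the old contents are in flight */
};

/*
 * Return a buffer that is safe to modify: the original if no IO is in
 * flight, otherwise a private copy so the in-flight data stays stable.
 */
struct buffer *buffer_for_modify(struct buffer *b)
{
	struct buffer *copy;

	if (!b->under_writeback)
		return b;			/* modify in place */

	copy = malloc(sizeof(*copy));
	if (!copy)
		return NULL;
	copy->data = malloc(b->len);
	if (!copy->data) {
		free(copy);
		return NULL;
	}
	memcpy(copy->data, b->data, b->len);	/* COW: work on the copy */
	copy->len = b->len;
	copy->under_writeback = false;
	return copy;			/* caller redirects references here */
}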

Unfortunately, we haven't been able to pinpoint the exact scenarios under 
which this triggers under btrfs.  We regularly see long stalls for 
metadata operations (create() and similar metadata-only operations) that 
block after btrfs_commit_transaction has "finished" the previous 
transaction and is doing

return filemap_write_and_wait(btree_inode->i_mapping);

What we're less clear about is when btrfs will modify the in-memory page 
in place (and thus wait) versus COWing the page... still digging into this 
now.

It seems like there is a btrfs-specific question about exactly what is 
going on and why, which isn't super-relevant for LSF/MM (except that we'll 
all be there).  However, my suspicion is that the solution will be 
generally applicable to other filesystems, and that the tests that led us 
to believe that "normal" workloads aren't affected by locked writeback 
pages would inform which path to take in solving our specific btrfs 
problem.

sage



Re: Packed small files

2012-02-10 Thread Phillip Susi

On 1/31/2012 11:46 AM, Hugo Mills wrote:
> So you're looking at a minimum of 413 bytes of metadata overhead 
> for an inline file, plus the length of the filename.
> 
> Also note that the file is stored in the metadata, so by default 
> it's stored with DUP or RAID-1 replication (even if data is set to
> be "single"). This means that you'll actually use up twice this
> amount of space on the disks, unless you create the FS with
> metadata set to "single".
> 
> I don't know how these figures compare with other filesystems. My 
> entirely uneducated guess is that they're probably comparable,
> with the exception of the DUP effect.

On ext4 you are looking at 256 bytes for the inode, name length + a
few bytes for the directory entry, another few bytes for the hashed
directory entry, and a whole 4k block to hold the data, so ~4300 bytes
( + name length ) of overhead to store a 64 byte file.



Re: btrfs support for efficient SSD operation (data blocks alignment)

2012-02-10 Thread Martin Steigerwald
Hi Martin,

Am Mittwoch, 8. Februar 2012 schrieb Martin:
> My understanding is that for x86 architecture systems, btrfs only
> allows a sector size of 4kB for a HDD/SSD. That is fine for the
> present HDDs assuming the partitions are aligned to a 4kB boundary for
> that device.
> 
> However for SSDs...
> 
> I'm using for example a 60GByte SSD that has:
> 
> 8kB page size;
> 16kB logical to physical mapping chunk size;
> 2MB erase block size;
> 64MB cache.
> 
> And the sector size reported to Linux 3.0 is the default 512 bytes!
> 
> 
> My first thought is to try formatting with a sector size of 16kB to
> align with the SSD logical mapping chunk size. This is to avoid SSD
> write amplification. Also, the data transfer performance for that
> device is near maximum for writes with a blocksize of 16kB and above.
> Yet, btrfs supports a 4kByte page/sector size only at present...

The thing is, as far as I know the better SSDs and even the dumber ones have 
quite some intelligence in the firmware. And at least for me it's not clear 
what the firmware of my Intel SSD 320 does on its own and whether any 
of my optimization attempts even matter.

So I am not sure whether just thinking about one write operation of, say, 4 
KB or 2 KB in isolation even makes sense. I bet several processes often 
write data at once, so there is a larger amount of data to write.

What is not clear to me is whether the SSD will combine several write 
requests into a single mapping chunk or erase block, or combine them into 
the already erased space of an erase block. I would bet at least the 
better SSDs do it. So even when, from the OS point of view, in a 
simplistic example, one write of 1 MB goes to LBA 4 and one write of 1 
MB to LBA 8, the SSD might still just use a single erase block and 
combine the writes next to each other. As far as I understand, SSDs do COW 
to spread writes evenly across erase blocks. As far as I further 
understand, from a seek time point of view the exact location where to put 
a write request does not matter at all. So to me it looks perfectly sane 
for an SSD firmware to combine writes as it sees fit. And SSDs that carry 
capacitors, like the above-mentioned Intel SSD 320, may even cache writes 
for a while to wait for further requests.

The article on write amplification on Wikipedia gives me a glimpse of the 
complexity involved[1]. Yes, I set stripe-width as well on my ext4 
filesystem, but frankly I am not even sure whether this has any 
positive effect, except maybe sparing the SSD controller firmware some 
reshuffling work.

So from my current point of view, most of what you wrote IMHO is more 
important for really dumb flash. As I understood it, that is something some 
kernel developers would really like to see, so that most of the logic could 
be put into the kernel and be easily modifiable: JBOF, just a bunch of flash 
cells with an interface to access them directly. But for now, AFAIK, most 
consumer grade SSDs just provide a SATA interface and hide the internals. So 
an optimization for one kind or one brand of SSD may not be suitable for 
another.

There are PCI Express models, but these probably aren't dumb either. And 
then there is the idea of auto commit memory (ACM) by Fusion-io, which just 
makes a part of the virtual address space persistent.

So it's a question of where to put the intelligence. For current SSDs it 
seems the intelligence is really near the storage medium, and then IMHO it 
makes sense to even reduce the intelligence on the Linux side.

[1] http://en.wikipedia.org/wiki/Write_amplification

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: [PATCH 0/4] Generic O_SYNC AIO DIO handling

2012-02-10 Thread Jeff Moyer
Jan Kara  writes:

>   Hi Jeff,
>
>   these patches implement a generic way of handling O_SYNC AIO DIO. They work
> for all filesystems except for ext4 and xfs. Thus together with your patches,
> all filesystems should handle O_SYNC AIO DIO correctly. I've tested ext3,
> btrfs, and xfs (to check that I didn't break anything when the generic code
> is unused) and things seem to work fine. Will you add these patches to your
> series please? Thanks.

Thanks, Jan!  I'll add them in and give them some testing.  I should be
ready to repost the series early next week.

Cheers,
Jeff


[PATCH 3/4] gfs2: Use generic handlers of O_SYNC AIO DIO

2012-02-10 Thread Jan Kara
Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC
file.

Signed-off-by: Jan Kara 
---
 fs/gfs2/aops.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 501e5cb..9c381ff 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -1034,7 +1034,7 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb,
 
rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
  offset, nr_segs, gfs2_get_block_direct,
- NULL, NULL, 0);
+ NULL, NULL, DIO_SYNC_WRITES);
 out:
gfs2_glock_dq_m(1, &gh);
gfs2_holder_uninit(&gh);
-- 
1.7.1



[PATCH 1/4] vfs: Handle O_SYNC AIO DIO in generic code properly

2012-02-10 Thread Jan Kara
Provide VFS helpers for handling O_SYNC AIO DIO writes. A filesystem wanting to
use the helpers has to pass DIO_SYNC_WRITES to __blockdev_direct_IO. Then, if
it doesn't use a direct IO end_io handler, generic code takes care of everything
else. Otherwise its end_io handler is passed a struct dio_sync_io_work pointer
as the 'private' argument and it has to call generic_dio_end_io() to finish
the AIO DIO. Generic code then takes care of calling generic_write_sync() from
a workqueue context when the AIO DIO is completed.

Since all filesystems using blockdev_direct_IO() need O_SYNC aio dio handling
and the generic one is enough for them, make blockdev_direct_IO() pass
DIO_SYNC_WRITES flag.
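
For a filesystem that does use its own end_io handler, the resulting pattern
is roughly the following (only a sketch with made-up myfs_* names; the ocfs2
conversion in this series is the real example):

static void myfs_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
			    void *private, int ret, bool is_async)
{
	/* filesystem-specific unlocking / cleanup goes here */

	/*
	 * 'private' is the struct dio_sync_io_work pointer set up by the
	 * generic DIO code (or NULL); generic_dio_end_io() completes the
	 * iocb and queues the fsync work for O_SYNC AIO DIO.
	 */
	generic_dio_end_io(iocb, offset, bytes, private, ret, is_async);
}

	/* ... and the submission path asks for the behaviour explicitly: */
	ret = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
				   offset, nr_segs, myfs_get_block,
				   myfs_dio_end_io, NULL, DIO_SYNC_WRITES);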

Signed-off-by: Jan Kara 
---
 fs/direct-io.c |  128 ++--
 fs/super.c |2 +
 include/linux/fs.h |   13 +-
 3 files changed, 138 insertions(+), 5 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 4a588db..79aa531 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -38,6 +38,8 @@
 #include 
 #include 
 
+#include 
+
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
  * the size of a structure in the slab cache
@@ -112,6 +114,15 @@ struct dio_submit {
unsigned tail;  /* last valid page + 1 */
 };
 
+/* state needed for final sync and completion of O_SYNC AIO DIO */
+struct dio_sync_io_work {
+   struct kiocb *iocb;
+   loff_t offset;
+   ssize_t len;
+   int ret;
+   struct work_struct work;
+};
+
 /* dio_state communicated between submission path and end_io */
 struct dio {
int flags;  /* doesn't change */
@@ -134,6 +145,7 @@ struct dio {
/* AIO related stuff */
struct kiocb *iocb; /* kiocb */
ssize_t result; /* IO result */
+   struct dio_sync_io_work *sync_work; /* work used for O_SYNC AIO */
 
/*
 * pages[] (and any fields placed after it) are not zeroed out at
@@ -261,6 +273,45 @@ static inline struct page *dio_get_page(struct dio *dio,
 }
 
 /**
+ * generic_dio_end_io() - generic dio ->end_io handler
+ * @iocb: iocb of finishing DIO
+ * @offset: the byte offset in the file of the completed operation
+ * @bytes: length of the completed operation
+ * @work: work to queue for O_SYNC AIO DIO, NULL otherwise
+ * @ret: error code if IO failed
+ * @is_async: is this AIO?
+ *
+ * This is generic callback to be called when direct IO is finished. It
+ * handles update of number of outstanding DIOs for an inode, completion
+ * of async iocb and queueing of work if we need to call fsync() because
+ * io was O_SYNC.
+ */
+void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
+   struct dio_sync_io_work *work, int ret, bool is_async)
+{
+   struct inode *inode = iocb->ki_filp->f_dentry->d_inode;
+
+   if (!is_async) {
+   inode_dio_done(inode);
+   return;
+   }
+
+   /*
+* If we need to sync file, we offload completion to workqueue
+*/
+   if (work) {
+   work->ret = ret;
+   work->offset = offset;
+   work->len = bytes;
+   queue_work(inode->i_sb->s_dio_flush_wq, &work->work);
+   } else {
+   aio_complete(iocb, ret, 0);
+   inode_dio_done(inode);
+   }
+}
+EXPORT_SYMBOL(generic_dio_end_io);
+
+/**
  * dio_complete() - called when all DIO BIO I/O has been completed
  * @offset: the byte offset in the file of the completed operation
  *
@@ -302,12 +353,22 @@ static ssize_t dio_complete(struct dio *dio, loff_t 
offset, ssize_t ret, bool is
ret = transferred;
 
if (dio->end_io && dio->result) {
+   void *private;
+
+   if (dio->sync_work)
+   private = dio->sync_work;
+   else
+   private = dio->private;
dio->end_io(dio->iocb, offset, transferred,
-   dio->private, ret, is_async);
+   private, ret, is_async);
} else {
-   if (is_async)
-   aio_complete(dio->iocb, ret, 0);
-   inode_dio_done(dio->inode);
+   /* No IO submitted? Skip syncing... */
+   if (!dio->result && dio->sync_work) {
+   kfree(dio->sync_work);
+   dio->sync_work = NULL;
+   }
+   generic_dio_end_io(dio->iocb, offset, transferred,
+  dio->sync_work, ret, is_async);
}
 
return ret;
@@ -1064,6 +1125,41 @@ static inline int drop_refcount(struct dio *dio)
 }
 
 /*
+ * Work performed from workqueue when AIO DIO is finished.
+ */
+static void dio_aio_sync_work(struct work_struct *work)
+{
+   struct dio_sync_io_work *sync_work =
+   container_of(work, struct dio_sync_io_work, wo

[PATCH 2/4] ocfs2: Use generic handlers of O_SYNC AIO DIO

2012-02-10 Thread Jan Kara
Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC
file.

Signed-off-by: Jan Kara 
---
 fs/ocfs2/aops.c |6 ++
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 78b68af..3d14c2b 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -593,9 +593,7 @@ static void ocfs2_dio_end_io(struct kiocb *iocb,
level = ocfs2_iocb_rw_locked_level(iocb);
ocfs2_rw_unlock(inode, level);
 
-   if (is_async)
-   aio_complete(iocb, ret, 0);
-   inode_dio_done(inode);
+   generic_dio_end_io(iocb, offset, bytes, private, ret, is_async);
 }
 
 /*
@@ -642,7 +640,7 @@ static ssize_t ocfs2_direct_IO(int rw,
return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
iov, offset, nr_segs,
ocfs2_direct_IO_get_blocks,
-   ocfs2_dio_end_io, NULL, 0);
+   ocfs2_dio_end_io, NULL, DIO_SYNC_WRITES);
 }
 
 static void ocfs2_figure_cluster_boundaries(struct ocfs2_super *osb,
-- 
1.7.1



[PATCH 0/4] Generic O_SYNC AIO DIO handling

2012-02-10 Thread Jan Kara

  Hi Jeff,

  these patches implement a generic way of handling O_SYNC AIO DIO. They work
for all filesystems except for ext4 and xfs. Thus together with your patches,
all filesystems should handle O_SYNC AIO DIO correctly. I've tested ext3,
btrfs, and xfs (to check that I didn't break anything when the generic code
is unused) and things seem to work fine. Will you add these patches to your
series please? Thanks.

Honza


[PATCH 4/4] btrfs: Use generic handlers of O_SYNC AIO DIO

2012-02-10 Thread Jan Kara
Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC
file. Although we use our own bio->end_io function, we call dio_end_io()
from it and thus, because we don't set any specific dio->end_io function,
generic code ends up calling generic_dio_end_io(), which is all we need
for proper O_SYNC AIO DIO handling.

Signed-off-by: Jan Kara 
---
 fs/btrfs/inode.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..68add6e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6221,7 +6221,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
ret = __blockdev_direct_IO(rw, iocb, inode,
   BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev,
   iov, offset, nr_segs, btrfs_get_blocks_direct, NULL,
-  btrfs_submit_direct, 0);
+  btrfs_submit_direct, DIO_SYNC_WRITES);
 
if (ret < 0 && ret != -EIOCBQUEUED) {
clear_extent_bit(&BTRFS_I(inode)->io_tree, offset,
-- 
1.7.1



[PATCH] Btrfs: fix scrub statistics report

2012-02-10 Thread Stefan Behrens
Fix errors being counted multiple times.

Signed-off-by: Stefan Behrens 
---
 fs/btrfs/ioctl.h |   30 --
 fs/btrfs/scrub.c |  115 --
 2 files changed, 85 insertions(+), 60 deletions(-)

diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 4f69028..48b926c 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -49,17 +49,20 @@ struct btrfs_ioctl_vol_args_v2 {
  * result of a finished scrub, a canceled scrub or a progress inquiry
  */
 struct btrfs_scrub_progress {
-   __u64 data_extents_scrubbed;/* # of data extents scrubbed */
-   __u64 tree_extents_scrubbed;/* # of tree extents scrubbed */
+   __u64 data_extents_scrubbed;/* # of 4k data data extents scrubbed */
+   __u64 tree_extents_scrubbed;/* # of 4k data tree extents scrubbed */
__u64 data_bytes_scrubbed;  /* # of data bytes scrubbed */
__u64 tree_bytes_scrubbed;  /* # of tree bytes scrubbed */
-   __u64 read_errors;  /* # of read errors encountered (EIO) */
-   __u64 csum_errors;  /* # of failed csum checks */
-   __u64 verify_errors;/* # of occurences, where the metadata
-* of a tree block did not match the
-* expected values, like generation or
-* logical */
-   __u64 no_csum;  /* # of 4k data block for which no csum
+   __u64 read_errors;  /* # of 4k data blocks which encountered
+* read errors (EIO) */
+   __u64 csum_errors;  /* # of 4k data blocks which failed csum
+* checks */
+   __u64 verify_errors;/* # of 4k data blocks, where the
+* metadata of a tree block did not
+* match the expected values, like
+* generation or logical and the
+* checksum was not incorrect */
+   __u64 no_csum;  /* # of 4k data blocks for which no csum
 * is present, probably the result of
 * data written with nodatasum */
__u64 csum_discards;/* # of csum for which no data was found
@@ -68,10 +71,11 @@ struct btrfs_scrub_progress {
__u64 malloc_errors;/* # of internal kmalloc errors. These
 * will likely cause an incomplete
 * scrub */
-   __u64 uncorrectable_errors; /* # of errors where either no intact
-* copy was found or the writeback
-* failed */
-   __u64 corrected_errors; /* # of errors corrected */
+   __u64 uncorrectable_errors; /* # of 4k data blocks with errors where
+* either no intact copy was found or
+* the writeback failed */
+   __u64 corrected_errors; /* # of 4k data blocks with corrected
+* errors */
__u64 last_physical;/* last physical address scrubbed. In
 * case a scrub was aborted, this can
 * be used to restart the scrub */
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 9770cc5..21ea2ab 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -48,10 +48,11 @@ struct scrub_dev;
 static void scrub_bio_end_io(struct bio *bio, int err);
 static void scrub_checksum(struct btrfs_work *work);
 static int scrub_checksum_data(struct scrub_dev *sdev,
-  struct scrub_page *spag, void *buffer);
+  struct scrub_page *spag, void *buffer,
+  int modify_stats);
 static int scrub_checksum_tree_block(struct scrub_dev *sdev,
 struct scrub_page *spag, u64 logical,
-void *buffer);
+void *buffer, int modify_stats);
 static int scrub_checksum_super(struct scrub_bio *sbio, void *buffer);
 static int scrub_fixup_check(struct scrub_bio *sbio, int ix);
 static void scrub_fixup_end_io(struct bio *bio, int err);
@@ -555,7 +556,8 @@ out:
  * recheck_error gets called for every page in the bio, even though only
  * one may be bad
  */
-static int scrub_recheck_error(struct scrub_bio *sbio, int ix)
+static int scrub_recheck_error(struct scrub_bio *sbio, int ix,
+  int modify_stats)
 {
struct scrub_dev *sdev = sbio->sdev;
u64 sector = (sbio->physical + ix * PAGE_SIZE) >> 9;
@@ -575,9 +577,11 @

Re: [PATCH] mkfs: Handle creation of filesystem larger than the first device

2012-02-10 Thread Jan Kara
On Wed 08-02-12 22:05:26, Phillip Susi wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 02/08/2012 06:20 PM, Jan Kara wrote:
> >   Thanks for your reply. I admit I was not sure what exactly size argument
> > should be. So after looking into the code for a while I figured it should
> > be a total size of the filesystem - or differently it should be size of
> > virtual block address space in the filesystem. Thus when filesystem has
> > more devices (or admin wants to add more devices later), it can be larger
> > than the first device. But I'm not really a btrfs developper so I might be
> > wrong and of course feel free to fix the issue as you deem fit.
> 
> The size of the fs is the total size of the individual disks.  When you
> limit the size, you limit the size of a disk, not the whole fs.  IIRC,
> mkfs initializes the fs on the first disk, which is why it was using that
> size as the size of the whole fs, and then adds the other disks after (
> which then add their size to the total fs size ).
  OK, I missed that btrfs_add_to_fsid() increases the total size of the
filesystem. So now I agree with you. New patch is attached. Thanks for your
review.

> It might be nice if
> mkfs could take sizes for each disk, but it only seems to take one size
> for the initial disk.
  Yes, but I don't see a realistic use case so I don't think it's really
worth the work.

Honza
-- 
Jan Kara 
SUSE Labs, CR
From e5f46872232520310c56327593c02ef6a7f5ea33 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Fri, 10 Feb 2012 11:44:44 +0100
Subject: [PATCH] mkfs: Handle creation of filesystem larger than the first device

mkfs does not properly check the requested size of the filesystem. Thus if the
requested size is larger than the first device, it happily creates a filesystem
larger than the device it resides on, which results in 'attempt to access
beyond end of device' messages from the kernel. So verify the specified
filesystem size against the size of the first device.

CC: David Sterba 
Signed-off-by: Jan Kara 
---
 mkfs.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index e3ced19..3afe9eb 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1282,6 +1282,10 @@ int main(int ac, char **av)
 		ret = btrfs_prepare_device(fd, file, zero_end, &dev_block_count, &mixed);
 		if (block_count == 0)
 			block_count = dev_block_count;
+		else if (block_count > dev_block_count) {
+			fprintf(stderr, "%s is smaller than requested size\n", file);
+			exit(1);
+		}
 	} else {
 		ac = 0;
 		file = av[optind++];
-- 
1.7.1



Re: [PATCH/RFC] Btrfs: Add conditional ENOSPC debugging.

2012-02-10 Thread Jan Schmidt
Hi Mitch,

having this patch on the list is a good idea. I've two remarks, just in
case it gets included at some point:

On 09.02.2012 22:38, Mitch Harder wrote:
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index fe4cd0f..31b717f 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -656,8 +656,18 @@ static int btrfs_delayed_inode_reserve_metadata(
>* EAGAIN to make us stop the transaction we have, so return
>* ENOSPC instead so that btrfs_dirty_inode knows what to do.
>*/
> +#ifdef BTRFS_DEBUG_ENOSPC
> + if (unlikely(ret == -EAGAIN)) {
> + ret = -ENOSPC;
> + if (printk_ratelimit())

From linux/printk.h:
/*
 * Please don't use printk_ratelimit(), because it shares ratelimiting state
 * with all other unrelated printk_ratelimit() callsites.  Instead use
 * printk_ratelimited() or plain old __ratelimit().
 */

printk_ratelimited() seems the right choice, here.

> + printk(KERN_WARNING
> +"btrfs: ENOSPC set in "
> +"btrfs_delayed_inode_reserve_metadata\n");
> + }
> +#else
>   if (ret == -EAGAIN)
>   ret = -ENOSPC;
> +#endif

I don't like the #if placements throughout the patch. Copying all the
conditions is error prone (especially when changes are made later on).

I'd rather leave all the conditions as they stand (possibly adding the
unlikely() macro if you wish). Same for the following return statement
or assignment. Simply put printk_ratelimited() into the #if ... #endif.
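
Something along these lines is what I mean (untested, just to illustrate the
suggestion):

	if (unlikely(ret == -EAGAIN)) {
		ret = -ENOSPC;
#ifdef BTRFS_DEBUG_ENOSPC
		printk_ratelimited(KERN_WARNING
			"btrfs: ENOSPC set in btrfs_delayed_inode_reserve_metadata\n");
#endif
	}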

-Jan