Re: [PATCH 0/5] btrfs: snapshot deletion via readahead

2012-04-12 Thread Arne Jansen
On 13.04.2012 05:40, Liu Bo wrote:
> On 04/12/2012 11:54 PM, Arne Jansen wrote:
>> This patchset reimplements snapshot deletion with the help of the readahead
>> framework. For this, callbacks are added to the framework. The main idea is
>> to traverse many snapshots at once and read many branches at once. This way
>> readahead gets many requests at once (currently about 5), giving it the
>> chance to order disk accesses properly. On a single disk, the effect is
>> currently spoiled by sync operations that still take place, mainly checksum
>> deletion. The most benefit can be gained with multiple devices, as all
>> devices can be fully utilized. It scales quite well with the number of
>> devices. For more details see the commit messages of the individual patches
>> and the source code comments.
>>
>> How it is tested:
>> I created a test volume using David Sterba's stress-subvol-git-aging.sh. It
>> checks out random versions of the kernel git tree, creating a snapshot from
>> it from time to time and checking out other versions there, and so on. In the
>> end the fs had 80 subvols with various degrees of sharing between them. The
>> following tests were conducted on it:
>>  - delete a subvol using droptree and check the fs with btrfsck afterwards
>>    for consistency
>>  - delete all subvols and verify with btrfs-debug-tree that the extent
>>    allocation tree is clean
>>  - delete 70 subvols, and in parallel empty the other 10 with rm -rf to get
>>    a good pressure on locking
>>  - add various degrees of memory pressure to the previous test to get pages
>>    to expire early from page cache
>>  - enable all relevant kernel debugging options during all tests
>>
>> The performance gain on a single drive was about 20%, on 8 drives about 600%.
>> It depends heavily on the maximum parallelism of the readahead, which is
>> currently hardcoded to about 5. This number is subject to 2 factors, the
>> available RAM and the size of the saved state for a commit. As the full state
>> has to be saved on commit, a large parallelism leads to a large state.
>>
>> Based on this I'll see if I can add delayed checksum deletions and running
>> the delayed refs via readahead, to gain a maximum ordering of I/O ops.
>>
> 
> Hi Arne,
> 
> Can you please show us some use cases for this, or can we get some extra
> benefits from it? :)

The case I'm most concerned with is having large filesystems (like 20x3T)
with thousands of users on it in thousands of subvolumes and making
hourly snapshots. Creating these snapshots is relatively easy, getting rid
of them is not.
But there are already reports where deleting a single snapshot can take
several days. So we really need a huge speedup here, and this is only
the beginning :)

-Arne
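
The batching idea from the quoted cover letter can be illustrated in
userspace. This is only a sketch under stated assumptions: struct read_req,
cmp_req and order_batch are hypothetical names, not taken from the patchset
or from the btrfs readahead code, and the real framework works on tree
blocks rather than on a plain array. It shows why collecting several
requests before issuing them helps: a batch can be sorted by device and
on-disk offset, which a walk that issues one read at a time cannot do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical request record: which device, and where on it. */
struct read_req {
	int dev;          /* device index */
	uint64_t offset;  /* byte offset on that device */
};

/* Order by device first, then by ascending offset within a device. */
static int cmp_req(const void *a, const void *b)
{
	const struct read_req *x = a;
	const struct read_req *y = b;

	if (x->dev != y->dev)
		return x->dev < y->dev ? -1 : 1;
	if (x->offset != y->offset)
		return x->offset < y->offset ? -1 : 1;
	return 0;
}

/* Sort a collected batch so each device sees its reads in offset order. */
static void order_batch(struct read_req *reqs, size_t n)
{
	assert(reqs != NULL);
	qsort(reqs, n, sizeof(*reqs), cmp_req);
}
```

With only about 5 requests in flight, as the cover letter describes, the
batch is small, so the gain on a single disk is limited; with several
devices each one still gets its own ordered stream.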

> 
> thanks,
> liubo
> 
>> This patchset is also available at
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/arne/linux-btrfs.git droptree
>>
>> Arne Jansen (5):
>>   btrfs: extend readahead interface
>>   btrfs: add droptree inode
>>   btrfs: droptree structures and initialization
>>   btrfs: droptree implementation
>>   btrfs: use droptree for snapshot deletion
>>
>>  fs/btrfs/Makefile           |    2 +-
>>  fs/btrfs/btrfs_inode.h      |    4 +
>>  fs/btrfs/ctree.h            |   78 ++-
>>  fs/btrfs/disk-io.c          |   19 +
>>  fs/btrfs/droptree.c         | 1916 +++
>>  fs/btrfs/free-space-cache.c |  131 +++-
>>  fs/btrfs/free-space-cache.h |   32 +
>>  fs/btrfs/inode.c            |    3 +-
>>  fs/btrfs/reada.c            |  494 +---
>>  fs/btrfs/scrub.c            |   29 +-
>>  fs/btrfs/transaction.c      |   35 +-
>>  11 files changed, 2592 insertions(+), 151 deletions(-)
>>  create mode 100644 fs/btrfs/droptree.c
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: Boot speed/mount time regression with 3.4.0-rc2

2012-04-12 Thread cwillu
> dmesg and fstab attached as requested.

Need dmesg after you've hit alt-sysrq-w a couple times during the slow period.


Re: [PATCH 4/5] btrfs: droptree implementation

2012-04-12 Thread Arne Jansen
On 13.04.2012 04:53, Tsutomu Itoh wrote:
> Hi, Arne,
> 
> (2012/04/13 0:54), Arne Jansen wrote:
>> This is an implementation of snapshot deletion using the readahead
>> framework. Multiple snapshots can be deleted at once and the trees
>> are not enumerated sequentially but in parallel in many branches.
>> This way readahead can reorder the request to better utilize all
>> disks. For a more detailed description see inline comments.
>>
>> Signed-off-by: Arne Jansen
>> ---

[snip]

>> +/*
>> + * read the saved state from the droptree inode and prepare everything so
>> + * it gets started by droptree_restart
>> + */
>> +static int droptree_read_state(struct btrfs_fs_info *fs_info,
>> +			       struct inode *inode,
>> +			       struct reada_control *top_rc,
>> +			       struct list_head *droplist)
>> +{
>> +	struct io_ctl io_ctl;
>> +	u32 version;
>> +	u64 generation;
>> +	struct droptree_node **stack;
>> +	int ret = 0;
>> +
>> +	stack = kmalloc(sizeof(*stack) * BTRFS_MAX_LEVEL, GFP_NOFS);
>> +	if (!stack)
>> +		return -ENOMEM;
>> +
>> +	io_ctl_init(&io_ctl, inode, fs_info->tree_root);
>> +	io_ctl.check_crcs = 0;
>> +	io_ctl_prepare_pages(&io_ctl, inode, 1);
>> +
>> +	version = io_ctl_get_u32(&io_ctl);
>> +	if (version != DROPTREE_VERSION) {
>> +		printk(KERN_ERR "btrfs: snapshot deletion state has been saved "
>> +				"with a different version, ignored\n");
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +	/* FIXME generation is currently not needed */
>> +	generation = io_ctl_get_u64(&io_ctl);
>> +
>> +	while (1) {
>> +		struct btrfs_key key;
>> +		int ret;
>> +		struct btrfs_root *del_root;
>> +		struct droptree_root *dr;
>> +		int level;
>> +		int max_level;
>> +		struct droptree_node *root_dn;
>> +
>> +		key.objectid = io_ctl_get_u64(&io_ctl);
>> +		if (key.objectid == 0)
>> +			break;
>> +
>> +		key.type = BTRFS_ROOT_ITEM_KEY;
>> +		key.offset = io_ctl_get_u64(&io_ctl);
>> +		max_level = level = io_ctl_get_u8(&io_ctl);
>> +
>> +		BUG_ON(level < 0 || level >= BTRFS_MAX_LEVEL); /* incons. fs */
>> +		del_root = btrfs_read_fs_root_no_radix(fs_info->tree_root,
>> +							&key);
>> +		if (IS_ERR(del_root)) {
>> +			ret = PTR_ERR(del_root);
>> +			BUG(); /* inconsistent fs */
>> +		}
>> +
>> +		root_dn = droptree_alloc_node(NULL);
>> +		/*
>> +		 * FIXME in this phase it should still be possible to undo
>> +		 * everything and return a failure. Same goes for the allocation
>> +		 * failures below
>> +		 */
>> +		BUG_ON(!root_dn); /* can't back out */
>> +		dr = droptree_add_droproot(droplist, root_dn, del_root);
>> +		BUG_ON(!dr); /* can't back out */
>> +
>> +		stack[level] = root_dn;
>> +
>> +		while (1) {
>> +			u64 start;
>> +			u64 len;
>> +			u64 nritems;
>> +			u32 *map;
>> +			int n;
>> +			int i;
>> +			int parent_slot;
>> +			struct droptree_node *dn;
>> +
>> +			parent_slot = io_ctl_get_u16(&io_ctl);
>> +			if (parent_slot == DROPTREE_STATE_GO_UP) {
>> +				++level;
>> +				BUG_ON(level > max_level); /* incons. fs */
>> +				continue;
>> +			}
>> +			if (parent_slot == DROPTREE_STATE_GO_DOWN) {
>> +				--level;
>> +				BUG_ON(level < 0); /* incons. fs */
>> +				continue;
>> +			}
>> +			if (parent_slot == DROPTREE_STATE_END)
>> +				break;
>> +			start = io_ctl_get_u64(&io_ctl);
>> +			if (start == 0)
>> +				break;
>> +
>> +			len = io_ctl_get_u64(&io_ctl);
>> +			nritems = io_ctl_get_u16(&io_ctl);
>> +			n = DT_BITS_TO_U32(nritems);
>> +			BUG_ON(n > 99); /* incons. fs */
>> +			BUG_ON(n == 0); /* incons. fs */
>> +
>> +			map = kmalloc(n * sizeof(u32), GFP_NOFS);
>> +			BUG_ON(!map); /* can't back out */
>> +
>> +			for (i = 0; i < n; ++i)
>> +				map[i] = io_ctl_get_u32(&io_ctl);
>> +
>> +			if (level == max_level) {
>> +				/* only for root node */
>> +				dn = stack[level];
>> +   

Re: Boot speed/mount time regression with 3.4.0-rc2

2012-04-12 Thread Ahmet Inan
On Thu, Apr 12, 2012 at 4:23 PM, Josef Bacik  wrote:
> On Thu, Apr 12, 2012 at 09:37:48AM -0400, Josef Bacik wrote:
>> On Thu, Apr 12, 2012 at 11:22:51AM +0200, Ahmet Inan wrote:
>> > On Wed, Apr 11, 2012 at 7:04 PM, Josef Bacik  wrote:
>> > > On Wed, Apr 11, 2012 at 05:26:29PM +0200, Ahmet Inan wrote:
>> > >> On Tue, Apr 10, 2012 at 5:16 PM, Josef Bacik  wrote:
>> > >> > On Mon, Apr 09, 2012 at 05:20:46PM -0400, Calvin Walton wrote:
>> > >> >> On Mon, 2012-04-09 at 16:54 -0400, Josef Bacik wrote:
>> > >> >> > On Mon, Apr 09, 2012 at 01:10:04PM -0400, Calvin Walton wrote:
>> > >> >> > > On Mon, 2012-04-09 at 11:53 -0400, Calvin Walton wrote:
>> > >> >> > > > Hi,
>> > >> >> > > >
>> > >> >> > > > I have a system that's using a dracut-generated initramfs to
>> > >> >> > > > mount a btrfs root. After upgrading to kernel 3.4.0-rc2 to
>> > >> >> > > > test it out, I've noticed that the process of mounting the
>> > >> >> > > > root filesystem takes much longer with 3.4.0-rc2 than it did
>> > >> >> > > > with 3.3.1 - nearly 30 seconds slower!
>> > >> >>
>> > >> >> > > And the bisect results are in:
>> > >> >> > > 285ff5af6ce358e73f53b55c9efadd4335f4c2ff is the first bad commit
>> > >> >> > > commit 285ff5af6ce358e73f53b55c9efadd4335f4c2ff
>> > >> >> > > Author: Josef Bacik 
>> > >> >> > > Date:   Fri Jan 13 15:27:45 2012 -0500
>> > >> >> > >
>> > >> >> > >     Btrfs: remove the ideal caching code
>> > >> >> >
>> > >> >> > Ok can you give this a whirl?  You are going to have to
>> > >> >> > boot/reboot a few times to let the cache get re-generated again
>> > >> >> > to make sure it's taken effect, but hopefully this will help
>> > >> >> > out.  Thanks,
>> > >> >>
>> > >> >> Unfortunately, it doesn't seem to help. Even after 3 or 4 reboots
>> > >> >> with this patch applied I'm still seeing the same delay.
>> > >> >>
>> > >> >
>> > >> > Ok drop that previous patch and give this one a whirl, it helped on
>> > >> > my laptop. This is only half of the problem AFAICS, but it's the
>> > >> > easier half to fix. In the meantime I need to lock down why we're
>> > >> > not writing out cache for a bunch of block groups, but that's
>> > >> > trickier since the messages I need are spit out while I'm shutting
>> > >> > down, so I need to get creative.  Let me know if/how much this
>> > >> > helps.  Thanks,
>> > >>
>> > >> i have tried your patch and my system still needs several minutes to
>> > >> boot until it can be used.
>> > >> Also tried to reboot several times - it doesn't look like it's getting
>> > >> better.
>> > >> The last thing the system does when its shutting down is a read-only
>> > >> remount of "/" so no umount.
>> > >> Booting was much faster before i pulled for-linus a few weeks ago but
>> > >> i couldn't find the time to bisect it yet ..
>> > >>
>> > >> please also look at the attached dmesg.txt.
>> > >> this is a core i3 system with 2x2TB BTRFS RAID1 and lots of
>> > >> home directories and snapshots.
>> > >>
>> > >> I'm going to test this patch on twenty more computers but with
>> > >> smaller HDDs and less files and see if it helps to speed up their
>> > >> boot times.
>> > >>
>> > >
>> > > Ok looks like you are running into a different problem.  Could you maybe
>> > > run bootchart and upload the resulting png somewhere so I can look and
>> > > see what all is running while you boot?  Thanks,
>> >
>> > http://aam.mathematik.uni-freiburg.de/IAM/homepages/ainan/bootchart.png
>> >
>> > i have tried your patch now on the twenty more computers i mentioned and
>> > still it takes a minute to remount rw "/" on those, even after several
>> > reboots.
>> >
>>
>> Oops responding to the whole list this time..
>>
>> Um ouch your system appears to not be doing anything for like 300 seconds but
>> sitting there.  Can you hook up a console and capture sysrq+w while that's
>> going on?  Also you are mounting with -o space_cache right?  Can I see your
>> dmesg to make sure it's doing what it's supposed to?  Thanks,
>>
>
> Ok you don't actually have space_cache enabled it looks like, make sure to add
> space_cache to your fstab so it gets enabled, and then reboot a few times to
> make sure everything gets cached right and then it should help.  Thanks,

now i did enable space_cache in fstab and rebooted 4 times,
still no improvement:

http://aam.mathematik.uni-freiburg.de/IAM/homepages/ainan/bootchart_space_cache.png

is it vital to put this space_cache option to the boot argument as well?
mounting "/" readonly in initramfs and booting to it (until remount "/" rw)
is quite fast.

dmesg and fstab attached as requested.

Ahmet


dmesg
Description: Binary data


fstab
Description: Binary data


Re: [PATCH 0/5] btrfs: snapshot deletion via readahead

2012-04-12 Thread cwillu
On Thu, Apr 12, 2012 at 9:40 PM, Liu Bo  wrote:
> On 04/12/2012 11:54 PM, Arne Jansen wrote:
>> This patchset reimplements snapshot deletion with the help of the readahead
>> framework. For this, callbacks are added to the framework. The main idea is
>> to traverse many snapshots at once and read many branches at once. This way
>> readahead gets many requests at once (currently about 5), giving it the
>> chance to order disk accesses properly. On a single disk, the effect is
>> currently spoiled by sync operations that still take place, mainly checksum
>> deletion. The most benefit can be gained with multiple devices, as all
>> devices can be fully utilized. It scales quite well with the number of
>> devices. For more details see the commit messages of the individual patches
>> and the source code comments.
>>
>> How it is tested:
>> I created a test volume using David Sterba's stress-subvol-git-aging.sh. It
>> checks out random versions of the kernel git tree, creating a snapshot from
>> it from time to time and checking out other versions there, and so on. In the
>> end
>> the fs had 80 subvols with various degrees of sharing between them. The
>> following tests were conducted on it:
>>  - delete a subvol using droptree and check the fs with btrfsck afterwards
>>    for consistency
>>  - delete all subvols and verify with btrfs-debug-tree that the extent
>>    allocation tree is clean
>>  - delete 70 subvols, and in parallel empty the other 10 with rm -rf to get
>>    a good pressure on locking
>>  - add various degrees of memory pressure to the previous test to get pages
>>    to expire early from page cache
>>  - enable all relevant kernel debugging options during all tests
>>
>> The performance gain on a single drive was about 20%, on 8 drives about 600%.
>> It depends heavily on the maximum parallelism of the readahead, which is
>> currently hardcoded to about 5. This number is subject to 2 factors, the
>> available RAM and the size of the saved state for a commit. As the full state
>> has to be saved on commit, a large parallelism leads to a large state.
>>
>> Based on this I'll see if I can add delayed checksum deletions and running
>> the delayed refs via readahead, to gain a maximum ordering of I/O ops.
>>
>
> Hi Arne,
>
> Can you please show us some use cases for this, or can we get some extra
> benefits from it? :)
>
> thanks,
> liubo

Expiring old backups routinely takes days to complete due to how long
it takes snapshot deletion to finish.  This makes maximizing the
number of retained backups, or even simply ensuring that we have
enough space for the current backup, quite difficult.


Re: [PATCH 0/5] btrfs: snapshot deletion via readahead

2012-04-12 Thread Liu Bo
On 04/12/2012 11:54 PM, Arne Jansen wrote:
> This patchset reimplements snapshot deletion with the help of the readahead
> framework. For this, callbacks are added to the framework. The main idea is
> to traverse many snapshots at once and read many branches at once. This way
> readahead gets many requests at once (currently about 5), giving it the
> chance to order disk accesses properly. On a single disk, the effect is
> currently spoiled by sync operations that still take place, mainly checksum
> deletion. The most benefit can be gained with multiple devices, as all devices
> can be fully utilized. It scales quite well with the number of devices.
> For more details see the commit messages of the individual patches and the
> source code comments.
> 
> How it is tested:
> I created a test volume using David Sterba's stress-subvol-git-aging.sh. It
> checks out random versions of the kernel git tree, creating a snapshot from it
> from time to time and checking out other versions there, and so on. In the end
> the fs had 80 subvols with various degrees of sharing between them. The
> following tests were conducted on it:
>  - delete a subvol using droptree and check the fs with btrfsck afterwards
>    for consistency
>  - delete all subvols and verify with btrfs-debug-tree that the extent
>    allocation tree is clean
>  - delete 70 subvols, and in parallel empty the other 10 with rm -rf to get
>    a good pressure on locking
>  - add various degrees of memory pressure to the previous test to get pages
>    to expire early from page cache
>  - enable all relevant kernel debugging options during all tests
> 
> The performance gain on a single drive was about 20%, on 8 drives about 600%.
> It depends heavily on the maximum parallelism of the readahead, which is
> currently hardcoded to about 5. This number is subject to 2 factors, the
> available RAM and the size of the saved state for a commit. As the full state
> has to be saved on commit, a large parallelism leads to a large state.
> 
> Based on this I'll see if I can add delayed checksum deletions and running
> the delayed refs via readahead, to gain a maximum ordering of I/O ops.
> 

Hi Arne,

Can you please show us some use cases for this, or can we get some extra
benefits from it? :)

thanks,
liubo

> This patchset is also available at
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/arne/linux-btrfs.git droptree
> 
> Arne Jansen (5):
>   btrfs: extend readahead interface
>   btrfs: add droptree inode
>   btrfs: droptree structures and initialization
>   btrfs: droptree implementation
>   btrfs: use droptree for snapshot deletion
> 
>  fs/btrfs/Makefile           |    2 +-
>  fs/btrfs/btrfs_inode.h      |    4 +
>  fs/btrfs/ctree.h            |   78 ++-
>  fs/btrfs/disk-io.c          |   19 +
>  fs/btrfs/droptree.c         | 1916 +++
>  fs/btrfs/free-space-cache.c |  131 +++-
>  fs/btrfs/free-space-cache.h |   32 +
>  fs/btrfs/inode.c            |    3 +-
>  fs/btrfs/reada.c            |  494 +---
>  fs/btrfs/scrub.c            |   29 +-
>  fs/btrfs/transaction.c      |   35 +-
>  11 files changed, 2592 insertions(+), 151 deletions(-)
>  create mode 100644 fs/btrfs/droptree.c
> 


Re: Wiki update request: source repo page Was: [PATCH] Btrfs: use i_version instead of our own sequence

2012-04-12 Thread Duncan
Hugo Mills posted on Thu, 12 Apr 2012 22:55:46 +0100 as excerpted:

> On Thu, Apr 12, 2012 at 09:41:17PM +, Duncan wrote:

>> 4) The instructions appear to assume a kernel module and initr* based
>> setup.  What about people who configure and build a custom monolithic
>> kernel, with module loading disabled?
> 
> Then in general, they're stuffed.
> 
> If you want to mount a multi-device filesystem, you have to run
> btrfs dev scan before it's mounted. If that filesystem is your root
> filesystem, then you have to do it before root is mounted. This requires
> an initramfs/initrd.
> 
> It is possible to supply a full list of explicit device names for
> the root FS to the kernel at boot time with the device= mount parameter,
> but this is unreliable at best. We certainly had a very hard time
> getting it to work last time the issue came up on IRC.
> 
> The general advice is -- use a single-device root filesystem, or an
> initramfs. These are simple, supported, and will generally get good
> help. Any other configuration will cause you to be told to use an
> initramfs. So far, I've not heard any concrete reason why one shouldn't
> be used except "ooh, I don't understand them, and they're scary!".

FWIW, device names appear to be reasonably stable, here.  Stable enough 
that I currently have this built into the kernel as part of my kernel 
command line:

md=3,/dev/sda6,/dev/sdb6,/dev/sdc6,/dev/sdd6 root=/dev/md3p1

When I need to override that to mount the primary backup/recovery root, 
this as part of grub2's linux line extends/overrides the kernel builtin:

md=9,/dev/sda12,/dev/sdb12,/dev/sdc12,/dev/sdd12 root=/dev/md9p1

When I boot from thumbdrive or otherwise might trigger device reordering, 
grub's interactivity allows me to find the correct mds and substitute 
device names as appropriate.  And yes, if you're wondering,
init=/bin/bash is tested and known to work, too. =:^)

I don't see why btrfs would have additional kernel device naming or 
finding problems that md doesn't already have.
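
For illustration, the md= values shown above are a number followed by a
comma-separated list of member devices. The following userspace sketch
splits a value of exactly that shape; it is not the kernel's parser, it
ignores every other md= form, and struct md_arg, parse_md_arg and
MD_MAX_DEVS are hypothetical names made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MD_MAX_DEVS 16  /* arbitrary limit for this sketch */

struct md_arg {
	long minor;                 /* array number, e.g. 3 for md3 */
	int ndevs;                  /* number of member devices found */
	char *devs[MD_MAX_DEVS];    /* pointers into the caller's buffer */
};

/*
 * Split "N,dev0,dev1,..." in place. 'value' is the text after "md=" and
 * is modified. Returns 0 on success, -1 on malformed input.
 */
static int parse_md_arg(char *value, struct md_arg *out)
{
	char *end;
	char *tok;
	char *save = NULL;

	assert(value != NULL && out != NULL);
	out->ndevs = 0;

	tok = strtok_r(value, ",", &save);
	if (!tok)
		return -1;
	out->minor = strtol(tok, &end, 10);
	if (end == tok || *end != '\0' || out->minor < 0)
		return -1;

	while ((tok = strtok_r(NULL, ",", &save)) != NULL) {
		if (out->ndevs == MD_MAX_DEVS)
			return -1;
		out->devs[out->ndevs++] = tok;
	}
	return out->ndevs > 0 ? 0 : -1;
}
```

The point of the comparison in the message is that such a list names
devices by path, so it stays valid only as long as device enumeration
order does not change between boots.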

So while I'd agree that multi-device noinitr* btrfs builtin might not be 
appropriate as a general distro-wide solution, it does seem quite 
reasonable (here) for sysadmins familiar enough with their own systems to 
have custom-built no-module-loading kernels in general, to be able to do 
the same with btrfs.

That's one of a couple reasons I don't use lvm2, as well.  Both lvm2 and 
an initr* add complexity and thus recovery failure risk due to admin fat-
fingering or failure to anticipate and test all permutations of failure 
mode, for little or no gain in my current deployments.  Because lvm2 
requires an initr* to handle root, it's TWO such layers of additional 
complexity to test the failure modes for and be prepared to deal with at 
recovery time.  The added complexity and risk is simply not a reasonable 
tradeoff, for me, and I sleep better with a tested confidence in my 
disaster recovery abilities.  =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Btrfs Array Recovery

2012-04-12 Thread Duncan
Duncan posted on Thu, 12 Apr 2012 21:51:49 + as excerpted:

> Travis Shivers posted on Thu, 12 Apr 2012 16:25:49 -0500 as excerpted:
> 
>> The first time I try and mount it, it fails, but logs this in dmesg:
>> (http://pastebin.com/YwAsdjhs)
> 
> I get this 404:  Unknown paste ID.

FWIW, seems my client was including the closing parenthesis in the URL.  
Manually deleting that, it works.  But the point about attachments as 
opposed to pastebins remains valid.
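
The failure described here, a closing parenthesis swallowed into the URL,
is easy to reproduce and to work around. A minimal sketch follows;
trim_url_tail is a hypothetical helper and the set of characters it strips
is a guess, so a URL that legitimately ends in one of them would be
damaged.

```c
#include <assert.h>
#include <string.h>

/*
 * Strip trailing characters that commonly cling to a URL copied out of
 * running text, such as a closing parenthesis or a full stop. Modifies
 * 'url' in place and returns it.
 */
static char *trim_url_tail(char *url)
{
	size_t len;

	assert(url != NULL);
	len = strlen(url);
	while (len > 0 && strchr(")>].,;", url[len - 1]) != NULL)
		url[--len] = '\0';
	return url;
}
```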

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: kernel BUG at fs/btrfs/extent_io.c:1890!

2012-04-12 Thread Francesco Cepparo
On Thu, Apr 12, 2012 at 6:20 PM, Josef Bacik  wrote:
> On Thu, Apr 12, 2012 at 02:15:25PM -0400, Chris Mason wrote:
>> On Thu, Apr 12, 2012 at 02:08:37PM -0400, Josef Bacik wrote:
>> > On Wed, Apr 11, 2012 at 11:59:43PM +, Francesco Cepparo wrote:
>> > > I tried your patch but unfortunately the kernel still gives me the
>> > > same error message :(
>> >
>> > Weird, will you apply this patch on top of the one I sent you and send me
>> > the dmesg when it panics again?  Thanks,
>> >
>> > Josef
>> >
>> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> > index 2a3ddd2..51efb58 100644
>> > --- a/fs/btrfs/disk-io.c
>> > +++ b/fs/btrfs/disk-io.c
>> > @@ -652,6 +652,8 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
>> >
>> >     eb = (struct extent_buffer *)page->private;
>> >     set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
>> > +   WARN_ON(!failed_mirror);
>> > +   printk(KERN_ERR "io error, failed mirror %d\n");
>>                                                     ^
>>
>>                                                   , failed_mirror
>>
>
> pfft compiling debug patches before sending them out is for losers,
>
> Josef
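
The correction Chris points out in the quote amounts to supplying
failed_mirror as the argument for the %d conversion. A userspace sketch of
the corrected line, with snprintf standing in for printk; format_io_error
is a hypothetical wrapper written only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Userspace stand-in for the debug line: format the message into 'buf'
 * with failed_mirror supplied for the %d conversion. Returns the value
 * of snprintf().
 */
static int format_io_error(char *buf, size_t len, int failed_mirror)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "io error, failed mirror %d\n",
			failed_mirror);
}
```

In userspace, gcc's -Wformat (part of -Wall) warns when a conversion has no
matching argument, which is the kind of slip a test compile catches.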

I applied your second patch on top of the first one, but the dmesg
output stays the same. Before you ask, I'm sure I'm compiling the
kernel correctly, as putting the WARN_ON(!failed_mirror) inside the if
(!failed_mirror) on line 391 correctly prints the warnings... I'm not
sure whether the warnings generated in that place are of any interest,
but showing them anyway can't hurt:

[   87.041600] device fsid 0a6e2f08-5bfe-434c-ae27-f8670bef9a1c devid 1 transid 168138 /dev/sda6
[   87.041944] btrfs: disk space caching is enabled
[   87.417258] parent transid verify failed on 195091890176 wanted 168040 found 168229
[   87.418315] parent transid verify failed on 195091890176 wanted 168040 found 168229
[   87.418356] failed mirror was 0
[   87.418367] ------------[ cut here ]------------
[   87.418387] WARNING: at fs/btrfs/disk-io.c:394 btree_read_extent_buffer_pages.constprop.111+0x13a/0x160()
[   87.418416] Hardware name: P5Q SE/R
[   87.418428] Modules linked in: fuse ext4 crc16 jbd2 mbcache rt73usb
rt2x00usb snd_hda_codec_hdmi crc_itu_t rt2x00lib usbhid uvcvideo hid
videobuf2_vmalloc videobuf2_memops videobuf2_core arc4 stv0299
snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm
snd_page_alloc snd_timer snd rtl8180 eeprom_93cx6 mac80211 budget_av
budget_core cfg80211 ttpci_eeprom radeon saa7146_vv i2c_algo_bit atl1e
soundcore rfkill saa7146 drm_kms_helper dvb_core videobuf_dma_sg
videobuf_core videodev evdev media ttm drm psmouse serio_raw iTCO_wdt
iTCO_vendor_support button coretemp intel_agp i2c_i801 microcode
intel_gtt processor i2c_core asus_atk0110 autofs4 uhci_hcd ehci_hcd
usbcore usb_common sd_mod ahci sr_mod cdrom libahci pata_marvell
libata scsi_mod
[   87.418835] Pid: 735, comm: mount Not tainted 3.4.0-rc2-mainline #5
[   87.418854] Call Trace:
[   87.418867]  [] warn_slowpath_common+0x7f/0xc0
[   87.418887]  [] warn_slowpath_null+0x1a/0x20
[   87.418907]  [] btree_read_extent_buffer_pages.constprop.111+0x13a/0x160
[   87.418934]  [] read_tree_block+0x3a/0x50
[   87.418953]  [] read_block_for_search.isra.33+0x1f3/0x3a0
[   87.418975]  [] ? generic_bin_search.constprop.35+0x6b/0x180
[   87.418998]  [] btrfs_search_slot+0x3ec/0x900
[   87.419018]  [] ? verify_parent_transid+0x160/0x160
[   87.419039]  [] btrfs_read_block_groups+0xdf/0x660
[   87.419060]  [] ? update_space_info+0x199/0x1f0
[   87.419080]  [] open_ctree+0x1392/0x1ac0
[   87.420210]  [] ? disk_name+0x61/0xc0
[   87.421352]  [] btrfs_mount+0x5b6/0x6a0
[   87.422469]  [] ? pcpu_alloc+0x8bb/0x9d0
[   87.423598]  [] ? ida_get_new_above+0x218/0x2a0
[   87.424752]  [] mount_fs+0x43/0x1b0
[   87.425869]  [] ? __alloc_percpu+0x10/0x20
[   87.426988]  [] vfs_kern_mount+0x70/0x100
[   87.428123]  [] do_kern_mount+0x54/0x110
[   87.429219]  [] do_mount+0x26a/0x850
[   87.430320]  [] ? strndup_user+0x5b/0x80
[   87.431401]  [] sys_mount+0x8d/0xe0
[   87.432473]  [] system_call_fastpath+0x16/0x1b
[   87.433554] ---[ end trace 63bca69dcc9ebeb7 ]---
[   87.434641] io error, failed mirror 0
[   87.435988] parent transid verify failed on 195091890176 wanted 168040 found 168229
[   87.437144] failed mirror was 0
[   87.438252] ------------[ cut here ]------------
[   87.439380] WARNING: at fs/btrfs/disk-io.c:394 btree_read_extent_buffer_pages.constprop.111+0x13a/0x160()
[   87.440551] Hardware name: P5Q SE/R
[   87.441722] Modules linked in: fuse ext4 crc16 jbd2 mbcache rt73usb
rt2x00usb snd_hda_codec_hdmi crc_itu_t rt2x00lib usbhid uvcvideo hid
videobuf2_vmalloc videobuf2_memops videobuf2_core arc4 stv0299
snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm
snd_page_alloc snd_timer snd rtl8180 eeprom_93cx6 mac80211 budget_av
budget_core cfg80211 ttpci_eeprom radeon saa7146_vv i2c_algo_bit atl1e
soundcore rfkill saa7146 drm_kms_helper dvb_core videobuf_dma_sg

Re: Wiki update request: source repo page Was: [PATCH] Btrfs: use i_version instead of our own sequence

2012-04-12 Thread Hugo Mills
On Thu, Apr 12, 2012 at 09:41:17PM +, Duncan wrote:
> Josef Bacik posted on Thu, 12 Apr 2012 09:31:07 -0400 as excerpted:
> 
> >> BTW.
> >> 1. where is BTRFS devel git tree?
> >> 2. when this is coming to mainline?
> >> 
> >> 
> > There's a bunch, my personal tree with just my patches is here
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work.git
> > 
> > a tree with all outstanding mailinglist patches is here
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git
> > 
> > and Chris's tree which is where all things go through to get to mainline
> > is here
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
> > 
> > It will probably be in the next merge window.  Thanks,
> 
> 
> Could this list be added to the btrfs wiki, source repositories page?

   Well, it _is_ a wiki... Knock yourself out.

> http://btrfs.ipv5.de/index.php?title=Btrfs_source_repositories
> 
> While there, please review the dkms information:
[snip dkms]
> 2) Is Chris's tree STILL based on old 2.6.32 without further updates 
> except to btrfs?  If so, the link to it in the earlier btrfs kernel 
> module git repository section should probably have a BIG WARNING TO THAT 
> EFFECT, instead of simply saying it downloads a complete Linux kernel 
> tree.

   No, it's generally based on some recent linux-kernel (usually not
more than one revision out). Possibly some instructions on using git
merge to combine Chris's tree with Linus's would be useful (although I
think it's generally assumed that if you're using git to pull some
arbitrary repo to build from that you know how to drive git to that
degree anyway).

> 3) Further down there's a step that says Patch version script, noting 
> 2.6.27, which is older still.  Has cmason merged that patch?
> 
> 4) The instructions appear to assume a kernel module and initr* based
> setup.  What about people who configure and build a custom monolithic 
> kernel, with module loading disabled?

   Then in general, they're stuffed.

   If you want to mount a multi-device filesystem, you have to run
btrfs dev scan before it's mounted. If that filesystem is your root
filesystem, then you have to do it before root is mounted. This
requires an initramfs/initrd.

   It is possible to supply a full list of explicit device names for
the root FS to the kernel at boot time with the device= mount
parameter, but this is unreliable at best. We certainly had a very
hard time getting it to work last time the issue came up on IRC.

   The general advice is -- use a single-device root filesystem, or an
initramfs. These are simple, supported, and will generally get good
help. Any other configuration will cause you to be told to use an
initramfs. So far, I've not heard any concrete reason why one
shouldn't be used except "ooh, I don't understand them, and they're
scary!".

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- This year,  I'm giving up Lent. --- 




Re: Btrfs Array Recovery

2012-04-12 Thread Duncan
Travis Shivers posted on Thu, 12 Apr 2012 16:25:49 -0500 as excerpted:

> The first time I try and mount it, it fails, but logs this in dmesg:
> (http://pastebin.com/YwAsdjhs)

I get this 404:  Unknown paste ID.

In general, pastebins aren't particularly well suited for a list like 
this where archived messages may be googled by someone else looking for 
an answer, sometimes years later.  The pastebin is generally long since 
gone, by then.  AFAIK, plain text attachments are allowed and 
recommended.  I'm not sure about images, tho.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Wiki update request: source repo page Was: [PATCH] Btrfs: use i_version instead of our own sequence

2012-04-12 Thread Duncan
Josef Bacik posted on Thu, 12 Apr 2012 09:31:07 -0400 as excerpted:

>> BTW.
>> 1. where is BTRFS devel git tree?
>> 2. when this is coming to mainline?
>> 
>> 
> There's a bunch, my personal tree with just my patches is here
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work.git
> 
> a tree with all outstanding mailinglist patches is here
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git
> 
> and Chris's tree which is where all things go through to get to mainline
> is here
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
> 
> It will probably be in the next merge window.  Thanks,


Could this list be added to the btrfs wiki, source repositories page?

http://btrfs.ipv5.de/index.php?title=Btrfs_source_repositories

While there, please review the dkms information:

0) At least a paragraph actually describing what dkms is/does would be 
extremely useful.  A link to another page on the topic or to an external 
dkms resource for more information is probably in order as well.

1) Near the top of the dkms section, under "You have a very recent 
kernel", the for instance says dkms doesn't work with recent kernels, but 
then backporting is mentioned.  So you want to use it if you have a very 
recent kernel, but it doesn't work with recent kernels and backporting is 
needed?  WTF?

2) Is Chris's tree STILL based on old 2.6.32 without further updates 
except to btrfs?  If so, the link to it in the earlier btrfs kernel 
module git repository section should probably have a BIG WARNING TO THAT 
EFFECT, instead of simply saying it downloads a complete Linux kernel 
tree.

3) Further down there's a step that says Patch version script, noting 
2.6.27, which is older still.  Has cmason merged that patch?

4) The instructions appear to assume a kernel-module and initr*-based
setup.  What about people who configure and build a custom monolithic
kernel, with module loading disabled?

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Btrfs Array Recovery

2012-04-12 Thread Travis Shivers
A few months ago, my btrfs storage array became corrupted because of a
power failure. A while ago, I made this thread to try and resolve the
problem. (http://www.digipedia.pl/usenet/thread/11904/15955/) You can
find detailed information about the problem in the previous thread,
but I am happy to provide any other details. That thread didn't really
get anywhere in solving my problem. It ended around 2 months ago with me
waiting for a patch that would allow me to mount my corrupted array.

While I have been waiting, I have tried several things. One of the
things that I tried was installing the latest Linux kernel (3.4-rc1)
with the btrfs integrity checking enabled. I read about this module
here: (http://lwn.net/Articles/466493/) With the option compiled in, I
have had severe mounting problems. I can only try to mount the array
once before it does some strange things. The first time I try and
mount it, it fails, but logs this in dmesg:
(http://pastebin.com/YwAsdjhs) It looks like there is a bug in this
integrity checking code. Every mount attempt after the first just hangs
and neither returns nor logs anything in dmesg. Even trying to mount
the drive without the integrity checking enabled hangs in the same way.

I have also grabbed the latest version of btrfs-progs since I saw that
btrfsck could now repair some corruptions. I built the utilities and
executed btrfsck. This is the result of the command:
(http://pastebin.com/CEyvy17r) I saw that there was an error occurring
in the code at line 1864, so I commented out that line which had the
text: BUG_ON(rec->is_root);
I then recompiled the utilities and executed btrfsck again and got
this: (http://pastebin.com/ihYmuCAm) I also tried btrfsck with the
repair option with these results: (http://pastebin.com/gnrStyqh)

Another thing that I have experimented with is btrfs-restore. I have
been somewhat successful in using this tool to restore the files. The
main problem that I have is that it cannot restore over half of the
files on the array and just writes an empty file with a size of 0 for
each of them. It does restore the other half of the array perfectly.
For the files that it cannot restore, it returns a code of -3. Here is
an example of a file that this tool is unable to restore:
(http://pastebin.com/Rg5a0xdG) I read more about this tool here:
(http://btrfs.ipv5.de/index.php?title=Restore) I tried this tool with
the '-u 1' and '-u 2' flags, which did not help. I do not think that
half of my array is corrupted, since it was just a power failure and
the drive is also mirrored, which should provide some redundancy.


[PATCH] Btrfs: Make free_ipath() deal gracefully with NULL pointers

2012-04-12 Thread Jesper Juhl
Make free_ipath() behave like most other freeing functions in the
kernel and gracefully do nothing when passed a NULL pointer.

Besides making the behaviour consistent with functions such as
kfree(), vfree(), btrfs_free_path() etc., it also fixes a real NULL
deref issue in fs/btrfs/ioctl.c::btrfs_ioctl_ino_to_path(). In that
function we have this code:

	...
	ipath = init_ipath(size, root, path);
	if (IS_ERR(ipath)) {
		ret = PTR_ERR(ipath);
		ipath = NULL;
		goto out;
	}
	...
out:
	btrfs_free_path(path);
	free_ipath(ipath);
	...

If we ever take the true branch of that 'if' statement we'll end up
passing a NULL pointer to free_ipath() which will subsequently
dereference it and we'll go "Boom" :-(
This patch will avoid that.

Signed-off-by: Jesper Juhl 
---
 fs/btrfs/backref.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index f4e9074..b332ff0 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1414,6 +1414,8 @@ struct inode_fs_paths *init_ipath(s32 total_bytes, struct btrfs_root *fs_root,
 
 void free_ipath(struct inode_fs_paths *ipath)
 {
+	if (!ipath)
+		return;
 	kfree(ipath->fspath);
 	kfree(ipath);
 }
-- 
1.7.10


-- 
Jesper Juhl       http://www.chaosbits.net/
Don't top-post http://www.catb.org/jargon/html/T/top-post.html
Plain text mails only, please.
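
The NULL-tolerant free pattern from the patch above can be tried out in
userspace. This is only a minimal sketch: struct fs_paths, alloc_paths()
and free_paths() are hypothetical stand-ins for the kernel's
struct inode_fs_paths, init_ipath() and free_ipath(), and malloc()/free()
stand in for the kernel allocators.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct inode_fs_paths. */
struct fs_paths {
	char *fspath;	/* separately allocated buffer owned by the struct */
};

/* Allocate the struct and its buffer; returns NULL on failure. */
static struct fs_paths *alloc_paths(size_t size)
{
	struct fs_paths *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->fspath = malloc(size);
	if (!p->fspath) {
		free(p);
		return NULL;
	}
	return p;
}

/* Like free(): passing NULL is a no-op, so error paths can call it blindly. */
static void free_paths(struct fs_paths *p)
{
	if (!p)
		return;
	free(p->fspath);
	free(p);
}
```

With the check in place, a shared "out:" label can call free_paths()
whether or not the allocation succeeded, which is the situation in
btrfs_ioctl_ino_to_path() described above.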



Re: [PATCH] fs: make i_generation a u64

2012-04-12 Thread Josef Bacik
On Thu, Apr 12, 2012 at 03:42:17PM -0400, Ted Ts'o wrote:
> On Wed, Apr 11, 2012 at 04:42:48PM -0400, Josef Bacik wrote:
> > Btrfs stores generation numbers as 64bit numbers, which means we have to
> > carry around a u64 in our incore inode in addition to setting i_generation.
> > So convert to a u64 so btrfs can kill its incore generation.  Thanks,
> > 
> > Signed-off-by: Josef Bacik 
> 
> Why is btrfs using a 64-bit generation number, out of curiosity?  The
> only user of the inode generation number as far as I can tell is NFS,
> and even NFSv4 is using a 32-bit generation number
> 

It's just tied to our transaction IDs, which are 64-bit; no super awesome
reason or anything.  Thanks,

Josef


Re: [PATCH] fs: make i_generation a u64

2012-04-12 Thread Ted Ts'o
On Wed, Apr 11, 2012 at 04:42:48PM -0400, Josef Bacik wrote:
> Btrfs stores generation numbers as 64bit numbers, which means we have to
> carry around a u64 in our incore inode in addition to setting i_generation.
> So convert to a u64 so btrfs can kill its incore generation.  Thanks,
> 
> Signed-off-by: Josef Bacik 

Why is btrfs using a 64-bit generation number, out of curiosity?  The
only user of the inode generation number as far as I can tell is NFS,
and even NFSv4 is using a 32-bit generation number

- Ted


Re: [PATCH 3/3] btrfs: extended inode refs

2012-04-12 Thread Jan Schmidt
On 12.04.2012 19:59, Jan Schmidt wrote:
>> -static int iterate_irefs(u64 inum, struct btrfs_root *fs_root,
>> -struct btrfs_path *path,
>> -iterate_irefs_t *iterate, void *ctx)
>> +static int iterate_inode_refs(u64 inum, struct btrfs_root *fs_root,
>> +  struct btrfs_path *path,
>> +  iterate_irefs_t *iterate, void *ctx)
> 
> This function must not call free_extent_buffer(eb) in line 1306 after
> applying your patch set (immediately before the break). Second, I think
> we'd better add a blocking read lock on eb after incrementing its
> refcount, because we need the current content to stay as it is. Neither
> is part of your patches, but it might be easier if you make that
> bugfix change as a 3/4 patch within your set and turn this one into 4/4.
> If you don't like that, I'll send a separate patch for it. Don't miss
> the unlock if you do it ;-)

FYI: There are more read locks missing in the current version of
backref.c. So I made a bugfix patch myself which I'll test and send
tomorrow.

-Jan


Re: kernel BUG at fs/btrfs/extent_io.c:1890!

2012-04-12 Thread Josef Bacik
On Thu, Apr 12, 2012 at 02:15:25PM -0400, Chris Mason wrote:
> On Thu, Apr 12, 2012 at 02:08:37PM -0400, Josef Bacik wrote:
> > On Wed, Apr 11, 2012 at 11:59:43PM +, Francesco Cepparo wrote:
> > > I tried your patch but unfortunately the kernel still gives me the
> > > same error message :(
> >  
> > Weird, will you apply this patch on top of the one I sent you and send me the
> > dmesg when it panics again?  Thanks,
> > 
> > Josef
> > 
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 2a3ddd2..51efb58 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -652,6 +652,8 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
> >  
> > eb = (struct extent_buffer *)page->private;
> > set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
> > +   WARN_ON(!failed_mirror);
> > +   printk(KERN_ERR "io error, failed mirror %d\n");
> ^
> 
>   , failed_mirror
> 

pfft compiling debug patches before sending them out is for losers,

Josef


Re: kernel BUG at fs/btrfs/extent_io.c:1890!

2012-04-12 Thread Chris Mason
On Thu, Apr 12, 2012 at 02:08:37PM -0400, Josef Bacik wrote:
> On Wed, Apr 11, 2012 at 11:59:43PM +, Francesco Cepparo wrote:
> > I tried your patch but unfortunately the kernel still gives me the
> > same error message :(
>  
> Weird, will you apply this patch on top of the one I sent you and send me the
> dmesg when it panics again?  Thanks,
> 
> Josef
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 2a3ddd2..51efb58 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -652,6 +652,8 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
>  
>   eb = (struct extent_buffer *)page->private;
>   set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
> + WARN_ON(!failed_mirror);
> + printk(KERN_ERR "io error, failed mirror %d\n");
^

, failed_mirror

-chris
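
The correction pointed out above is that the format string contains one
%d but the call passes no matching argument. A minimal userspace sketch of
the corrected call follows; format_io_error() is a hypothetical helper,
and snprintf() stands in for printk().

```c
#include <stdio.h>

/*
 * Userspace stand-in for the debug printk. The format string has one %d,
 * so exactly one int argument must follow it.
 */
static int format_io_error(char *buf, size_t len, int failed_mirror)
{
	return snprintf(buf, len, "io error, failed mirror %d\n",
			failed_mirror);
}
```

Leaving the argument out is undefined behaviour in C. Compilers that
check printf-style format strings would normally warn about it, which
is presumably why compiling the debug patch would have caught it.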


Re: kernel BUG at fs/btrfs/extent_io.c:1890!

2012-04-12 Thread Josef Bacik
On Wed, Apr 11, 2012 at 11:59:43PM +, Francesco Cepparo wrote:
> I tried your patch but unfortunately the kernel still gives me the
> same error message :(
 
Weird, will you apply this patch on top of the one I sent you and send me the
dmesg when it panics again?  Thanks,

Josef

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2a3ddd2..51efb58 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -652,6 +652,8 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 
 	eb = (struct extent_buffer *)page->private;
 	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	WARN_ON(!failed_mirror);
+	printk(KERN_ERR "io error, failed mirror %d\n");
 	eb->failed_mirror = failed_mirror;
 	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
 		btree_readahead_hook(root, eb, eb->start, -EIO);


Re: [PATCH 3/3] btrfs: extended inode refs

2012-04-12 Thread Jan Schmidt
On 05.04.2012 22:09, Mark Fasheh wrote:
> The iterate_irefs in backref.c is used to build path components from inode
> refs. I had to add a 2nd iterate function callback to handle extended refs.
> 
> Both iterate callbacks eventually converge upon iref_to_path() which I was
> able to keep as one function with some small code to abstract away
> differences in the two disk structures.
> 
> Signed-off-by: Mark Fasheh 
> ---
>  fs/btrfs/backref.c |  200 ++--
>  fs/btrfs/backref.h |4 +-
>  2 files changed, 165 insertions(+), 39 deletions(-)
> 
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 0436c12..f2b8952 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -22,6 +22,7 @@
>  #include "ulist.h"
>  #include "transaction.h"
>  #include "delayed-ref.h"
> +#include "tree-log.h"

I mentioned that in the 2/3 review mail. I'd rather get rid of this
#include here.

>  
>  /*
>   * this structure records all encountered refs on the way up to the root
> @@ -858,62 +859,75 @@ static int inode_ref_info(u64 inum, u64 ioff, struct 
> btrfs_root *fs_root,
>  }
>  
>  /*
> - * this iterates to turn a btrfs_inode_ref into a full filesystem path. elements
> - * of the path are separated by '/' and the path is guaranteed to be
> - * 0-terminated. the path is only given within the current file system.
> - * Therefore, it never starts with a '/'. the caller is responsible to provide
> - * "size" bytes in "dest". the dest buffer will be filled backwards. finally,
> - * the start point of the resulting string is returned. this pointer is within
> - * dest, normally.
> - * in case the path buffer would overflow, the pointer is decremented further
> - * as if output was written to the buffer, though no more output is actually
> - * generated. that way, the caller can determine how much space would be
> - * required for the path to fit into the buffer. in that case, the returned
> - * value will be smaller than dest. callers must check this!
> + * Given the parent objectid and name/name_len pairs of an inode ref
> + * (any version) this iterates to turn that information into a
> + * full filesystem path. elements of the path are separated by '/' and
> + * the path is guaranteed to be 0-terminated. the path is only given
> + * within the current file system.  Therefore, it never starts with a
> + * '/'. the caller is responsible to provide "size" bytes in
> + * "dest". the dest buffer will be filled backwards. finally, the
> + * start point of the resulting string is returned. this pointer is
> + * within dest, normally.  in case the path buffer would overflow, the
> + * pointer is decremented further as if output was written to the
> + * buffer, though no more output is actually generated. that way, the
> + * caller can determine how much space would be required for the path
> + * to fit into the buffer. in that case, the returned value will be
> + * smaller than dest. callers must check this!

It would shrink the patch if you extended comments in a compatible way.
You make reviewers happy if you don't realign text (or, later, function
parameters) where it's not required.

>   */
>  static char *iref_to_path(struct btrfs_root *fs_root, struct btrfs_path 
> *path,
> - struct btrfs_inode_ref *iref,
> - struct extent_buffer *eb_in, u64 parent,
> - char *dest, u32 size)
> +   int name_len, unsigned long name_off,

name_len should be u32

> +   struct extent_buffer *eb_in, u64 parent,
> +   char *dest, u32 size)
>  {
> - u32 len;
>   int slot;
>   u64 next_inum;
>   int ret;
>   s64 bytes_left = size - 1;
>   struct extent_buffer *eb = eb_in;
>   struct btrfs_key found_key;
> + struct btrfs_inode_ref *iref;
> + struct btrfs_inode_extref *iref2;

iextref

>  
>   if (bytes_left >= 0)
>   dest[bytes_left] = '\0';
>  
>   while (1) {
> - len = btrfs_inode_ref_name_len(eb, iref);
> - bytes_left -= len;
> + bytes_left -= name_len;
>   if (bytes_left >= 0)
>   read_extent_buffer(eb, dest + bytes_left,
> - (unsigned long)(iref + 1), len);
> +name_off, name_len);
>   if (eb != eb_in)
>   free_extent_buffer(eb);
> +
> + /* Ok, we have enough to find any refs to the parent inode. */
>   ret = inode_ref_info(parent, 0, fs_root, path, &found_key);
> - if (ret > 0)
> - ret = -ENOENT;
> - if (ret)
> - break;
>   next_inum = found_key.offset;
> + if (ret == 0) {
> + slot = path->slots[0];
> + eb = path->nodes[0];
> +
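
The "filled backwards" buffer contract described in the iref_to_path()
comment quoted above can be sketched in userspace as follows. This is
only an illustration under assumptions: build_path_backwards() is a
hypothetical function that takes the path components as an in-memory
array (leaf first) instead of walking the tree, and it returns a signed
offset into dest rather than a pointer, to avoid forming out-of-range
pointers in portable C.

```c
#include <string.h>

/*
 * Build "a/b/c" from components given leaf-first ("c", "b", "a"), filling
 * dest from the end. Returns the offset in dest where the string starts.
 * A negative return means the buffer was too small by that many bytes;
 * nothing is written out of bounds in that case.
 */
static long build_path_backwards(const char *const *names, int count,
				 char *dest, long size)
{
	long bytes_left = size - 1;
	int i;

	if (bytes_left >= 0)
		dest[bytes_left] = '\0';

	for (i = 0; i < count; i++) {
		long len = (long)strlen(names[i]);

		/* Keep counting even after overflow so the caller learns
		 * how much space would have been needed. */
		bytes_left -= len;
		if (bytes_left >= 0)
			memcpy(dest + bytes_left, names[i], (size_t)len);
		if (i + 1 < count) {
			--bytes_left;
			if (bytes_left >= 0)
				dest[bytes_left] = '/';
		}
	}
	return bytes_left;
}
```

As in the quoted comment, the caller has to check the result: a
negative value says the path did not fit, and its magnitude says by how
many bytes.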

Re: [PATCH 0/3] btrfs: extended inode refs

2012-04-12 Thread Mark Fasheh
On Thu, Apr 12, 2012 at 12:11:13PM -0400, Chris Mason wrote:
> On Wed, Apr 11, 2012 at 03:11:46PM +0200, Jan Schmidt wrote:
> > Hi Jeff,
> > > 
> > >> An alternative solution to dealing with collisions could be to
> > >> emulate the dir-item insertion code - specifically something like
> > >> insert_with_overflow() which will stuff multiple items under one
> > >> key. I tend to prefer the idea of
> > > 
> > > I vote for this option.
> 
> [ Big patch series, thanks Mark! ]
> 
> I prefer the insert_with_overflow because it makes the deletion case
> less complex.  If we handle collisions with bits in the offset, we have
> to search around in the tree to find the key that was actually used to
> insert the item.
> 
> The insert_with_overflow code uses just one key, at the cost of having
> to search around inside the item.

Yeah this actually turns out to be a bit easier to code as well. I'm taking
this approach.
--Mark

--
Mark Fasheh


Re: [PATCH 0/3] btrfs: extended inode refs

2012-04-12 Thread Chris Mason
On Wed, Apr 11, 2012 at 03:11:46PM +0200, Jan Schmidt wrote:
> Hi Jeff,
> > 
> >> An alternative solution to dealing with collisions could be to
> >> emulate the dir-item insertion code - specifically something like
> >> insert_with_overflow() which will stuff multiple items under one
> >> key. I tend to prefer the idea of
> > 
> > I vote for this option.

[ Big patch series, thanks Mark! ]

I prefer the insert_with_overflow because it makes the deletion case
less complex.  If we handle collisions with bits in the offset, we have
to search around in the tree to find the key that was actually used to
insert the item.

The insert_with_overflow code uses just one key, at the cost of having
to search around inside the item.

-chris
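
The trade-off described above (one key, at the cost of searching inside
the item) can be sketched in userspace. This is not the btrfs
insert_with_overflow code: struct item, item_append(), item_find() and
item_remove() are hypothetical, and the record layout (one length byte
followed by the name) is an assumption made for the example.

```c
#include <string.h>

#define ITEM_CAP 256

/* One tree item: every name whose hash collides is packed under one key. */
struct item {
	unsigned long long key;		/* hypothetical hash-derived key */
	unsigned int used;		/* bytes of data[] in use */
	unsigned char data[ITEM_CAP];	/* packed records: length byte + name */
};

/* Append a name to the item; returns 0, or -1 if it does not fit. */
static int item_append(struct item *it, const char *name)
{
	size_t len = strlen(name);

	if (len > 255 || it->used + 1 + len > ITEM_CAP)
		return -1;
	it->data[it->used] = (unsigned char)len;
	memcpy(it->data + it->used + 1, name, len);
	it->used += 1 + (unsigned int)len;
	return 0;
}

/* Scan inside the item; returns the record's offset or -1 if absent. */
static int item_find(const struct item *it, const char *name)
{
	size_t len = strlen(name);
	unsigned int off = 0;

	while (off < it->used) {
		unsigned int rec_len = it->data[off];

		if (rec_len == len &&
		    memcmp(it->data + off + 1, name, len) == 0)
			return (int)off;
		off += 1 + rec_len;
	}
	return -1;
}

/* Delete by closing the gap; the key of the item never changes. */
static int item_remove(struct item *it, const char *name)
{
	int off = item_find(it, name);
	unsigned int start, rec;

	if (off < 0)
		return -1;
	start = (unsigned int)off;
	rec = 1 + it->data[start];
	memmove(it->data + start, it->data + start + rec,
		it->used - start - rec);
	it->used -= rec;
	return 0;
}
```

Deletion here only needs the one key plus a scan of the item's payload,
which is the simplification argued for above; with collision bits in the
offset, the delete path would first have to find which key was used.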



[PATCH 4/5] btrfs: droptree implementation

2012-04-12 Thread Arne Jansen
This is an implementation of snapshot deletion using the readahead
framework. Multiple snapshots can be deleted at once and the trees
are not enumerated sequentially but in parallel in many branches.
This way readahead can reorder the requests to better utilize all
disks. For a more detailed description see inline comments.

Signed-off-by: Arne Jansen 
---
 fs/btrfs/Makefile   |    2 +-
 fs/btrfs/droptree.c | 1916 +++
 2 files changed, 1917 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 0c4fa2b..620d7c8 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -8,7 +8,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
-  reada.o backref.o ulist.o
+  reada.o backref.o ulist.o droptree.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/droptree.c b/fs/btrfs/droptree.c
new file mode 100644
index 000..9bc9c23
--- /dev/null
+++ b/fs/btrfs/droptree.c
@@ -0,0 +1,1916 @@
+/*
+ * Copyright (C) 2011 STRATO.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#include "ctree.h"
+#include "transaction.h"
+#include "disk-io.h"
+#include "locking.h"
+#include "free-space-cache.h"
+#include "print-tree.h"
+
+/*
+ * This implements snapshot deletions with the use of the readahead framework.
+ * The fs-tree is not traversed in sequential (key-) order, but by descending
+ * into multiple paths at once. In addition, up to DROPTREE_MAX_ROOTS snapshots
+ * will be deleted in parallel.
+ * The basic principle is as follows:
+ * When a tree node is found, first its refcnt and flags are fetched (via
+ * readahead) from the extent allocation tree. Simplified, if the refcnt
+ * is > 1, other trees also reference this node and we also have to drop our
+ * ref to it and are done. If on the other hand the refcnt is 1, it's our
+ * node and we have to free it (the same holds for data extents). So we fetch
+ * the actual node (via readahead) and add all nodes/extents it points to to
+ * the queue to again fetch refcnts for them.
+ * While the case where refcnt > 1 looks like the easy one, there's one special
+ * case to take into account: If the node still uses non-shared refs and we
+ * are the owner of the node, we're not allowed to just drop our ref, as it
+ * would leave the node with an unresolvable backref (it points to our tree).
+ * So we first have to convert the refs to shared refs, and not only for this
+ * node, but for its full subtree. We can stop descending if we encounter a
+ * node of which we are not owner.
+ * One big difference to the old snapshot deletion code that sequentially walks
+ * the tree is that we can't represent the current deletion state with a single
+ * key. As we delete in several paths simultaneously, we have to record all
+ * loose ends in the state. To not get an exponentially growing state, we don't
+ * delete the refs top-down, but bottom-up. In addition, we restrict the
+ * currently processed nodes per-level globally. This way, we have a bounded
+ * size for the state and can preallocate the needed space before any work
+ * starts. During each transaction commit, we write the state to a special
+ * inode (DROPTREE_INO) recorded in the root tree.
+ * For a commit, all outstanding readahead-requests get cancelled and moved
+ * to a restart queue.
+ *
+ * The central data structure here is the droptree_node (dn). It represents a
+ * file system extent, meaning a tree node, tree leaf or a data extent. The
+ * dn's are organized in a tree corresponding to the disk structure. Each dn
+ * keeps a bitmap that records which of its children are finished. When the
+ * last bit gets set by a child, the freeing of the node is triggered. In
+ * addition, struct droptree_root represents a fs tree to delete. It is mainly
+ * used to keep all roots in a list. The reada_control for this tree is also
+ * recorded here. We don't keep it in the dn's, as it is being freed and re-
+ * created with each transaction commit.
+ */
+
+#define DROPTREE_MAX_ROO
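
The bottom-up completion scheme described in the comment above (each
node keeps a bitmap of finished children, and setting the last bit
triggers freeing of the node) can be sketched in userspace. This is not
the droptree.c implementation: struct dnode and its helpers are
hypothetical, the node is limited to 64 children so that one uint64_t
serves as the bitmap, and freed_counter exists only so the behaviour can
be observed.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified node: at most 64 children, one bit each. */
struct dnode {
	struct dnode *parent;
	unsigned int slot;	/* our index in the parent, below 64 */
	unsigned int nchildren;
	uint64_t done;		/* bit i set => child i is finished */
	int *freed_counter;	/* observation hook: counts freed nodes */
};

static struct dnode *dnode_new(struct dnode *parent, unsigned int slot,
			       unsigned int nchildren, int *counter)
{
	struct dnode *dn = calloc(1, sizeof(*dn));

	if (!dn)
		return NULL;
	dn->parent = parent;
	dn->slot = slot;
	dn->nchildren = nchildren;
	dn->freed_counter = counter;
	return dn;
}

static uint64_t all_done_mask(unsigned int n)
{
	return n >= 64 ? UINT64_MAX : ((UINT64_C(1) << n) - 1);
}

/*
 * Free a finished node and tell its parent. If that completes the parent,
 * the parent is freed as well, so freeing proceeds bottom-up.
 */
static void dnode_finish(struct dnode *dn)
{
	while (dn) {
		struct dnode *parent = dn->parent;
		unsigned int slot = dn->slot;

		if (dn->freed_counter)
			(*dn->freed_counter)++;
		free(dn);
		if (!parent)
			break;
		parent->done |= UINT64_C(1) << slot;
		if (parent->done != all_done_mask(parent->nchildren))
			break;
		dn = parent;
	}
}
```

Because a parent is only released once all of its children have
reported in, the amount of state that is live at any time is bounded by
the nodes currently being processed, which matches the motivation given
in the comment for deleting bottom-up.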

[PATCH 1/5] btrfs: extend readahead interface

2012-04-12 Thread Arne Jansen
This extends the readahead interface with callbacks. The old readahead
behaviour is now moved into a callback that is used by default if no
other callback is given. For a detailed description of the callbacks
see the inline comments in reada.c.
It also fixes some cases where the hook was not called. This was not
a problem with the default callback, as it just cut some branches
from readahead. With the callback mechanism, we want guaranteed
delivery.
This patch also makes readaheads hierarchical. A readahead can have
sub-readaheads. The idea is that the content of one tree can trigger
readaheads to other trees.
Also added is a function to cancel all outstanding requests for a
given readahead and all its sub-readas.
As the interface changes slightly, scrub has been edited to reflect
the changes.

Signed-off-by: Arne Jansen 
---
 fs/btrfs/ctree.h |   37 -
 fs/btrfs/reada.c |  481 ++
 fs/btrfs/scrub.c |   29 ++--
 3 files changed, 420 insertions(+), 127 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8e4457e..52b8a91 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3020,6 +3020,13 @@ int btrfs_scrub_progress(struct btrfs_root *root, u64 
devid,
 struct btrfs_scrub_progress *progress);
 
 /* reada.c */
+#undef READA_DEBUG
+struct reada_extctl;
+struct reada_control;
+typedef void (*reada_cb_t)(struct btrfs_root *root, struct reada_control *rc,
+  u64 wanted_generation, struct extent_buffer *eb,
+  u64 start, int err, struct btrfs_key *top,
+  void *ctx);
 struct reada_control {
 	struct btrfs_root	*root;		/* tree to prefetch */
 	struct btrfs_key	key_start;
@@ -3027,12 +3034,34 @@ struct reada_control {
 	atomic_t		elems;
 	struct kref		refcnt;
 	wait_queue_head_t	wait;
+	struct reada_control	*parent;
+	reada_cb_t		callback;
+#ifdef READA_DEBUG
+	int			not_first;
+#endif
 };
-struct reada_control *btrfs_reada_add(struct btrfs_root *root,
- struct btrfs_key *start, struct btrfs_key *end);
-int btrfs_reada_wait(void *handle);
+struct reada_control *btrfs_reada_alloc(struct reada_control *parent,
+   struct btrfs_root *root,
+   struct btrfs_key *key_start, struct btrfs_key *key_end,
+   reada_cb_t callback);
+int btrfs_reada_add(struct reada_control *parent,
+   struct btrfs_root *root,
+   struct btrfs_key *key_start, struct btrfs_key *key_end,
+   reada_cb_t callback, void *ctx,
+   struct reada_control **rcp);
+int btrfs_reada_wait(struct reada_control *handle);
 void btrfs_reada_detach(void *handle);
 int btree_readahead_hook(struct btrfs_root *root, struct extent_buffer *eb,
 u64 start, int err);
-
+int reada_add_block(struct reada_control *rc, u64 logical,
+  struct btrfs_key *top, int level, u64 generation, void *ctx);
+void reada_control_elem_get(struct reada_control *rc);
+void reada_control_elem_put(struct reada_control *rc);
+void reada_start_machine(struct btrfs_fs_info *fs_info);
+int btrfs_reada_abort(struct btrfs_fs_info *fs_info, struct reada_control *rc);
+
+/* droptree.c */
+int btrfs_droptree_pause(struct btrfs_fs_info *fs_info);
+void btrfs_droptree_continue(struct btrfs_fs_info *fs_info);
+void droptree_drop_list(struct btrfs_fs_info *fs_info, struct list_head *list);
 #endif
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 2373b39..0d88163 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -27,18 +27,18 @@
 #include "volumes.h"
 #include "disk-io.h"
 #include "transaction.h"
-
-#undef DEBUG
+#include "locking.h"
 
 /*
  * This is the implementation for the generic read ahead framework.
  *
  * To trigger a readahead, btrfs_reada_add must be called. It will start
- * a read ahead for the given range [start, end) on tree root. The returned
+ * a readahead for the given range [start, end) on tree root. The returned
  * handle can either be used to wait on the readahead to finish
  * (btrfs_reada_wait), or to send it to the background (btrfs_reada_detach).
+ * If no return pointer is given, the readahead is started in the background.
  *
- * The read ahead works as follows:
+ * The readahead works as follows:
  * On btrfs_reada_add, the root of the tree is inserted into a radix_tree.
  * reada_start_machine will then search for extents to prefetch and trigger
  * some reads. When a read finishes for a node, all contained node/leaf
@@ -52,6 +52,27 @@
  * Any number of readaheads can be started in parallel. The read order will be
  * determined globally, i.e. 2 parallel readaheads will normally finish faster
  * than the 2 started one after another.
+ *
+ * In addition
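
The two ideas from the commit message above, a default callback that is
used when none is given and readaheads that can have sub-readaheads, can
be sketched in userspace. This is an illustration under assumptions, not
the reada.c code: struct ra_ctl and the ra_*() helpers are hypothetical,
a plain int replaces the atomic counter, and propagating the outstanding
count to every ancestor is a guess at one reasonable way to let a parent
wait for its children.

```c
#include <stddef.h>

struct ra_ctl;
typedef void (*ra_cb_t)(struct ra_ctl *rc, int err, void *ctx);

/* Hypothetical, simplified control block (not the kernel's reada_control). */
struct ra_ctl {
	struct ra_ctl *parent;	/* NULL for a top-level readahead */
	ra_cb_t callback;
	int elems;		/* outstanding requests, children included */
};

/* Default behaviour when the caller supplies no callback: nothing extra. */
static void ra_default_cb(struct ra_ctl *rc, int err, void *ctx)
{
	(void)rc;
	(void)err;
	(void)ctx;
}

/* Example callback: counts deliveries in the int that ctx points to. */
static void ra_count_cb(struct ra_ctl *rc, int err, void *ctx)
{
	(void)rc;
	(void)err;
	(*(int *)ctx)++;
}

static void ra_init(struct ra_ctl *rc, struct ra_ctl *parent, ra_cb_t cb)
{
	rc->parent = parent;
	rc->callback = cb ? cb : ra_default_cb;
	rc->elems = 0;
}

/* Account one outstanding request on rc and on every ancestor. */
static void ra_elem_get(struct ra_ctl *rc)
{
	for (; rc; rc = rc->parent)
		rc->elems++;
}

/* A request completed (or failed): deliver it, then drop the accounting. */
static void ra_complete(struct ra_ctl *rc, int err, void *ctx)
{
	struct ra_ctl *p;

	rc->callback(rc, err, ctx);
	for (p = rc; p; p = p->parent)
		p->elems--;
}
```

Note that ra_complete() invokes the callback for errors as well as for
successes; that corresponds to the "guaranteed delivery" requirement in
the commit message, since a callback-driven user has no other way to
learn that a branch was dropped.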

[PATCH 2/5] btrfs: add droptree inode

2012-04-12 Thread Arne Jansen
This adds a new special inode, the droptree inode. It is placed in
the tree root and is used to store the state of snapshot deletion.
Even if multiple snapshots are deleted at once, the full state is
stored within this one inode. After snapshot deletion completes,
the inode is left in place, but truncated to zero.
This patch also exports free_space_cache's io_ctl functions to
droptree and adds functions to store and read u8, u16, u32 and u64
values, as well as byte arrays of arbitrary length.
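
As an illustration of the store/read helpers mentioned above, here is a
minimal userspace sketch of cursor-based serialization of fixed-width
integers and byte arrays. It is not the io_ctl code from the patch:
struct cursor and the cur_*() functions are hypothetical, a flat buffer
replaces the mapped pages, and little-endian byte order is an assumption.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical cursor over a flat buffer (the kernel's io_ctl maps pages). */
struct cursor {
	uint8_t *buf;
	size_t size;
	size_t pos;	/* invariant: pos <= size */
};

/* Append raw bytes; returns 0, or -1 if the buffer would overflow. */
static int cur_put_bytes(struct cursor *c, const void *src, size_t len)
{
	if (len > c->size - c->pos)
		return -1;
	memcpy(c->buf + c->pos, src, len);
	c->pos += len;
	return 0;
}

/* Read raw bytes; returns 0, or -1 if fewer than len bytes remain. */
static int cur_get_bytes(struct cursor *c, void *dst, size_t len)
{
	if (len > c->size - c->pos)
		return -1;
	memcpy(dst, c->buf + c->pos, len);
	c->pos += len;
	return 0;
}

/* Store the low `width` bytes of v, least significant byte first. */
static int cur_put_uint(struct cursor *c, uint64_t v, unsigned int width)
{
	uint8_t tmp[8];
	unsigned int i;

	if (width > 8)
		return -1;
	for (i = 0; i < width; i++)
		tmp[i] = (uint8_t)(v >> (8 * i));
	return cur_put_bytes(c, tmp, width);
}

/* Read a `width`-byte little-endian value into *v. */
static int cur_get_uint(struct cursor *c, uint64_t *v, unsigned int width)
{
	uint8_t tmp[8];
	unsigned int i;

	if (width > 8 || cur_get_bytes(c, tmp, width))
		return -1;
	*v = 0;
	for (i = 0; i < width; i++)
		*v |= (uint64_t)tmp[i] << (8 * i);
	return 0;
}
```

Widths of 1, 2, 4 and 8 correspond to the u8, u16, u32 and u64 cases. A
reader must consume the fields in the same order and with the same
widths as the writer, since the stream carries no type information.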

Signed-off-by: Arne Jansen 
---
 fs/btrfs/btrfs_inode.h      |    4 +
 fs/btrfs/ctree.h            |    6 ++
 fs/btrfs/free-space-cache.c |  131 ---
 fs/btrfs/free-space-cache.h |   32 +++
 fs/btrfs/inode.c            |    3 +-
 5 files changed, 155 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9b9b15f..8abbed4 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -196,6 +196,10 @@ static inline void btrfs_i_size_write(struct inode *inode, u64 size)
 static inline bool btrfs_is_free_space_inode(struct btrfs_root *root,
 					     struct inode *inode)
 {
+	if (BTRFS_I(inode)->location.objectid == BTRFS_DROPTREE_INO_OBJECTID)
+		/* it also lives in the tree_root, but is no free space
+		 * inode */
+		return false;
 	if (root == root->fs_info->tree_root ||
 	    BTRFS_I(inode)->location.objectid == BTRFS_FREE_INO_OBJECTID)
 		return true;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 52b8a91..e187ab9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -116,6 +116,12 @@ struct btrfs_ordered_sum;
  */
 #define BTRFS_FREE_INO_OBJECTID -12ULL
 
+/*
+ * The inode number assigned to the special inode for storing
+ * snapshot deletion progress
+ */
+#define BTRFS_DROPTREE_INO_OBJECTID -13ULL
+
 /* dummy objectid represents multiple objectids */
 #define BTRFS_MULTIPLE_OBJECTIDS -255ULL
 
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index b30242f..7e993b0 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -259,19 +259,8 @@ static int readahead_cache(struct inode *inode)
return 0;
 }
 
-struct io_ctl {
-   void *cur, *orig;
-   struct page *page;
-   struct page **pages;
-   struct btrfs_root *root;
-   unsigned long size;
-   int index;
-   int num_pages;
-   unsigned check_crcs:1;
-};
-
-static int io_ctl_init(struct io_ctl *io_ctl, struct inode *inode,
-  struct btrfs_root *root)
+int io_ctl_init(struct io_ctl *io_ctl, struct inode *inode,
+   struct btrfs_root *root)
 {
memset(io_ctl, 0, sizeof(struct io_ctl));
io_ctl->num_pages = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
@@ -286,12 +275,12 @@ static int io_ctl_init(struct io_ctl *io_ctl, struct inode *inode,
return 0;
 }
 
-static void io_ctl_free(struct io_ctl *io_ctl)
+void io_ctl_free(struct io_ctl *io_ctl)
 {
kfree(io_ctl->pages);
 }
 
-static void io_ctl_unmap_page(struct io_ctl *io_ctl)
+void io_ctl_unmap_page(struct io_ctl *io_ctl)
 {
if (io_ctl->cur) {
kunmap(io_ctl->page);
@@ -300,7 +289,7 @@ static void io_ctl_unmap_page(struct io_ctl *io_ctl)
}
 }
 
-static void io_ctl_map_page(struct io_ctl *io_ctl, int clear)
+void io_ctl_map_page(struct io_ctl *io_ctl, int clear)
 {
WARN_ON(io_ctl->cur);
BUG_ON(io_ctl->index >= io_ctl->num_pages);
@@ -312,7 +301,7 @@ static void io_ctl_map_page(struct io_ctl *io_ctl, int clear)
memset(io_ctl->cur, 0, PAGE_CACHE_SIZE);
 }
 
-static void io_ctl_drop_pages(struct io_ctl *io_ctl)
+void io_ctl_drop_pages(struct io_ctl *io_ctl)
 {
int i;
 
@@ -327,8 +316,8 @@ static void io_ctl_drop_pages(struct io_ctl *io_ctl)
}
 }
 
-static int io_ctl_prepare_pages(struct io_ctl *io_ctl, struct inode *inode,
-   int uptodate)
+int io_ctl_prepare_pages(struct io_ctl *io_ctl, struct inode *inode,
+int uptodate)
 {
struct page *page;
gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
@@ -361,6 +350,108 @@ static int io_ctl_prepare_pages(struct io_ctl *io_ctl, struct inode *inode,
return 0;
 }
 
+void io_ctl_set_bytes(struct io_ctl *io_ctl, void *data, unsigned long len)
+{
+   unsigned long l;
+
+   while (len) {
+   if (io_ctl->cur == NULL)
+   io_ctl_map_page(io_ctl, 1);
+   l = min(len, io_ctl->size);
+   memcpy(io_ctl->cur, data, l);
+   if (len != l) {
+   io_ctl_unmap_page(io_ctl);
+   } else {
+   io_ctl->cur += l;
+   io_ctl->size -= l;
+   }
+   data += l;
+   len -= l;
+   }
+}
+
+void io_ctl_get_bytes(struct io_ct
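
The patch text is cut off here. As a rough illustration of what the commit message describes (storing byte arrays of arbitrary length through a cursor that walks over pages), here is a minimal user-space sketch. It is not the kernel code: struct sketch_ctl, sketch_set_bytes and the tiny page size are made-up names and values for illustration only, and the real io_ctl maps page cache pages instead of using a static array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4   /* tiny on purpose, so page crossings are easy to see */
#define SKETCH_NUM_PAGES 4

/* Hypothetical stand-in for struct io_ctl: a cursor over fixed-size pages. */
struct sketch_ctl {
	unsigned char pages[SKETCH_NUM_PAGES][SKETCH_PAGE_SIZE];
	size_t index;   /* page the cursor is currently in */
	size_t off;     /* offset of the cursor inside that page */
};

/*
 * Copy len bytes to the cursor position, moving on to the next page
 * whenever the current one fills up. Returns 0 on success and -1 once
 * the pages are exhausted (bytes copied before that point stay written).
 */
static int sketch_set_bytes(struct sketch_ctl *c, const void *data, size_t len)
{
	const unsigned char *src = data;

	while (len) {
		size_t room, l;

		if (c->index >= SKETCH_NUM_PAGES)
			return -1;
		room = SKETCH_PAGE_SIZE - c->off;
		l = len < room ? len : room;
		memcpy(&c->pages[c->index][c->off], src, l);
		c->off += l;
		if (c->off == SKETCH_PAGE_SIZE) {   /* page full: advance */
			c->index++;
			c->off = 0;
		}
		src += l;
		len -= l;
	}
	return 0;
}
```

Writing six bytes with a four-byte page leaves the cursor two bytes into the second page, which is the page-crossing case the chunked copy loop exists for.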

[PATCH 5/5] btrfs: use droptree for snapshot deletion

2012-04-12 Thread Arne Jansen
Update btrfs_clean_old_snapshots to make use of droptree. Snapshots
with old backrefs and snapshots whose deletion is already in progress
are deleted with the old code; all other snapshot deletions use
droptree.
Some droptree-related debug code is also added to reada.c.

Signed-off-by: Arne Jansen 
---
 fs/btrfs/disk-io.c |1 +
 fs/btrfs/reada.c   |   13 +
 fs/btrfs/transaction.c |   35 +--
 3 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7b3ddd7..b175cfa 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3041,6 +3041,7 @@ int close_ctree(struct btrfs_root *root)
btrfs_pause_balance(root->fs_info);
 
btrfs_scrub_cancel(root);
+   btrfs_droptree_pause(fs_info);
 
/* wait for any defraggers to finish */
wait_event(fs_info->transaction_wait,
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 0d88163..69a9409 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -1112,6 +1112,19 @@ int btrfs_reada_wait(struct reada_control *rc)
dump_devs(fs_info, atomic_read(&rc->elems) < 10 ? 1 : 0);
printk(KERN_DEBUG "reada_wait on %p: %d elems\n", rc,
atomic_read(&rc->elems));
+   mutex_lock(&fs_info->droptree_lock);
+
+   for (i = 0; i < BTRFS_MAX_LEVEL; ++i) {
+   if (fs_info->droptree_req[i] == 0)
+   continue;
+   printk(KERN_DEBUG "droptree req on level %d: %ld out "
+   "of %ld, queue is %sempty\n",
+   i, fs_info->droptree_req[i],
+   fs_info->droptree_limit[i],
+   list_empty(&fs_info->droptree_queue[i]) ?
+   "" : "not ");
+   }
+   mutex_unlock(&fs_info->droptree_lock);
}
 
dump_devs(fs_info, atomic_read(&rc->elems) < 10 ? 1 : 0);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 04b77e3..2d72a7e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1182,6 +1182,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
spin_unlock(&cur_trans->commit_lock);
wake_up(&root->fs_info->transaction_blocked_wait);
 
+   ret = btrfs_droptree_pause(root->fs_info);
+   BUG_ON(ret);
+
spin_lock(&root->fs_info->trans_lock);
if (cur_trans->list.prev != &root->fs_info->trans_list) {
prev_trans = list_entry(cur_trans->list.prev,
@@ -1363,6 +1366,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
trace_btrfs_transaction_commit(root);
 
btrfs_scrub_continue(root);
+   btrfs_droptree_continue(root->fs_info);
 
if (current->journal_info == trans)
current->journal_info = NULL;
@@ -1381,12 +1385,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 int btrfs_clean_old_snapshots(struct btrfs_root *root)
 {
LIST_HEAD(list);
+   LIST_HEAD(new);
struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_root_item *root_item = &root->root_item;
 
spin_lock(&fs_info->trans_lock);
list_splice_init(&fs_info->dead_roots, &list);
spin_unlock(&fs_info->trans_lock);
 
+   /*
+* in a first pass, pick out all snapshot deletions that have been
+* interrupted from a previous mount on an older kernel that didn't
+* support the droptree version of snapshot deletion. We continue
+* it with the old code. Also deletions of roots from very old
+* filesystems with old-style backrefs will be handled by the old
+* code
+*/
while (!list_empty(&list)) {
root = list_entry(list.next, struct btrfs_root, root_list);
list_del(&root->root_list);
@@ -1394,10 +1408,27 @@ int btrfs_clean_old_snapshots(struct btrfs_root *root)
btrfs_kill_all_delayed_nodes(root);
 
if (btrfs_header_backref_rev(root->node) <
-   BTRFS_MIXED_BACKREF_REV)
+   BTRFS_MIXED_BACKREF_REV) {
btrfs_drop_snapshot(root, NULL, 0, 0);
-   else
+   } else if (btrfs_disk_key_objectid(&root_item->drop_progress)) {
btrfs_drop_snapshot(root, NULL, 1, 0);
+   } else {
+   /* put on list for processing by droptree */
+   list_add_tail(&root->root_list, &new);
+   }
}
+
+   droptree_drop_list(fs_info, &new);
+   while (!list_empty(&new)) {
+   /*
+* if there are any roots left on the list after drop_list,
+* delete them with the old code. This can happen when the
+* fs hasn't got enough space for the droptree 
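
The patch text is cut off here. The first pass in btrfs_clean_old_snapshots sorts each dead root into one of three paths. A minimal sketch of that decision, outside the kernel, could look like the following. struct sketch_root, the enum and sketch_classify are hypothetical names, and SKETCH_MIXED_BACKREF_REV is only a stand-in for BTRFS_MIXED_BACKREF_REV. The sketch reads the progress key from the root being classified.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for BTRFS_MIXED_BACKREF_REV; the value is an assumption. */
#define SKETCH_MIXED_BACKREF_REV 1

enum sketch_drop_method {
	SKETCH_OLD_NO_UPDATE_REF,  /* old code, update_ref = 0 */
	SKETCH_OLD_UPDATE_REF,     /* old code, update_ref = 1 */
	SKETCH_VIA_DROPTREE,       /* put on the list handed to droptree */
};

/* Hypothetical, heavily simplified view of a dead root. */
struct sketch_root {
	int backref_rev;                  /* backref revision of the root node */
	uint64_t drop_progress_objectid;  /* non-zero: a deletion was interrupted */
};

static enum sketch_drop_method sketch_classify(const struct sketch_root *r)
{
	/* very old filesystems with old-style backrefs */
	if (r->backref_rev < SKETCH_MIXED_BACKREF_REV)
		return SKETCH_OLD_NO_UPDATE_REF;
	/* deletion already started by the old code: let it finish there */
	if (r->drop_progress_objectid != 0)
		return SKETCH_OLD_UPDATE_REF;
	return SKETCH_VIA_DROPTREE;
}
```

Only roots in the last class would be queued for droptree. Per the comment above, anything droptree leaves on the list falls back to the old code.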

[PATCH 0/5] btrfs: snapshot deletion via readahead

2012-04-12 Thread Arne Jansen
This patchset reimplements snapshot deletion with the help of the readahead
framework. For this, callbacks are added to the framework. The main idea is
to traverse many snapshots at once and read many branches at once. This way
readahead gets many requests at once (currently about 5), giving it the
chance to order disk accesses properly. On a single disk, the effect is
currently spoiled by sync operations that still take place, mainly checksum
deletion. The most benefit can be gained with multiple devices, as all devices
can be fully utilized. It scales quite well with the number of devices.
For more details see the commit messages of the individual patches and the
source code comments.

How it is tested:
I created a test volume using David Sterba's stress-subvol-git-aging.sh. It
checks out random versions of the kernel git tree, creating a snapshot from it
from time to time and checks out other versions there, and so on. In the end
the fs had 80 subvols with various degrees of sharing between them. The
following tests were conducted on it:
 - delete a subvol using droptree and check the fs with btrfsck afterwards
   for consistency
 - delete all subvols and verify with btrfs-debug-tree that the extent
   allocation tree is clean
 - delete 70 subvols, and in parallel empty the other 10 with rm -rf to get
   a good pressure on locking
 - add various degrees of memory pressure to the previous test to get pages
   to expire early from page cache
 - enable all relevant kernel debugging options during all tests

The performance gain on a single drive was about 20%, on 8 drives about 600%.
It depends heavily on the maximum parallelism of the readahead, which is
currently hardcoded to about 5. This number is subject to 2 factors: the
available RAM and the size of the saved state for a commit. As the full state
has to be saved on commit, a large parallelism leads to a large state.

Based on this I'll see if I can add delayed checksum deletions and running
the delayed refs via readahead, to gain a maximum ordering of I/O ops.

This patchset is also available at

git://git.kernel.org/pub/scm/linux/kernel/git/arne/linux-btrfs.git droptree

Arne Jansen (5):
  btrfs: extend readahead interface
  btrfs: add droptree inode
  btrfs: droptree structures and initialization
  btrfs: droptree implementation
  btrfs: use droptree for snapshot deletion

 fs/btrfs/Makefile   |2 +-
 fs/btrfs/btrfs_inode.h  |4 +
 fs/btrfs/ctree.h|   78 ++-
 fs/btrfs/disk-io.c  |   19 +
 fs/btrfs/droptree.c | 1916 +++
 fs/btrfs/free-space-cache.c |  131 +++-
 fs/btrfs/free-space-cache.h |   32 +
 fs/btrfs/inode.c|3 +-
 fs/btrfs/reada.c|  494 +---
 fs/btrfs/scrub.c|   29 +-
 fs/btrfs/transaction.c  |   35 +-
 11 files changed, 2592 insertions(+), 151 deletions(-)
 create mode 100644 fs/btrfs/droptree.c

-- 
1.7.3.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/5] btrfs: droptree structures and initialization

2012-04-12 Thread Arne Jansen
Add the fs-global state and initialization for snapshot deletion

Signed-off-by: Arne Jansen 
---
 fs/btrfs/ctree.h   |   35 +++
 fs/btrfs/disk-io.c |   18 ++
 2 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e187ab9..8eb0795 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1257,6 +1257,41 @@ struct btrfs_fs_info {
 
/* next backup root to be overwritten */
int backup_root_index;
+
+   /*
+* global state for snapshot deletion via readahead. All fields are
+* protected by droptree_lock. droptree_lock is a mutex and not a
+* spinlock as allocations are done inside and we don't want to use
+* atomic allocations unless we really have to.
+*/
+   struct mutex droptree_lock;
+
+   /*
+* currently running requests (droptree_nodes) for each level and
+* the corresponding limits. It's necessary to limit them to have
+* an upper limit on the state that has to be written with each
+* commit. All nodes exceeding the limit are enqueued to droptree_queue.
+*/
+   long droptree_req[BTRFS_MAX_LEVEL + 1];
+   long droptree_limit[BTRFS_MAX_LEVEL + 1];
+   struct list_head droptree_queue[BTRFS_MAX_LEVEL + 1];
+
+   /*
+* when droptree is paused, all currently running requests are moved
+* to droptree_restart. All nodes in droptree_queue are moved to
+* droptree_requeue
+*/
+   struct list_head droptree_restart;
+   struct list_head droptree_requeue;
+
+   /*
+* synchronization for pause/restart. droptree_rc is the top-level
+* reada_control, used to cancel all running requests
+*/
+   int droptrees_running;
+   int droptree_pause_req;
+   wait_queue_head_t droptree_wait;
+   struct reada_control *droptree_rc;
 };
 
 /*
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b801d29..7b3ddd7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1911,6 +1911,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
int err = -EINVAL;
int num_backups_tried = 0;
int backup_index = 0;
+   int i;
 
extent_root = fs_info->extent_root =
kzalloc(sizeof(struct btrfs_root), GFP_NOFS);
@@ -1989,6 +1990,23 @@ struct btrfs_root *open_ctree(struct super_block *sb,
INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_WAIT);
spin_lock_init(&fs_info->reada_lock);
 
+   /* snapshot deletion state */
+   mutex_init(&fs_info->droptree_lock);
+   fs_info->droptree_pause_req = 0;
+   fs_info->droptrees_running = 0;
+   for (i = 0; i < BTRFS_MAX_LEVEL; ++i) {
+   fs_info->droptree_limit[i] = 100;
+   fs_info->droptree_req[i] = 0;
+   INIT_LIST_HEAD(fs_info->droptree_queue + i);
+   }
+   /* FIXME calculate some sane values, maybe based on avail RAM */
+   fs_info->droptree_limit[0] = 4;
+   fs_info->droptree_limit[1] = 1;
+   fs_info->droptree_limit[2] = 4000;
+   INIT_LIST_HEAD(&fs_info->droptree_restart);
+   INIT_LIST_HEAD(&fs_info->droptree_requeue);
+   init_waitqueue_head(&fs_info->droptree_wait);
+
fs_info->thread_pool_size = min_t(unsigned long,
  num_online_cpus() + 2, 8);
 
-- 
1.7.3.4
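
The comment on droptree_req, droptree_limit and droptree_queue describes a per-level cap on running requests with an overflow queue. A minimal user-space sketch of that bookkeeping, under stated assumptions, is shown below. All names are hypothetical, the queue is reduced to a counter, and locking (droptree_lock in the patch) is left out. How the real droptree.c drains the queue is not shown in this patch, so the completion path here is a guess.

```c
#include <assert.h>

#define SKETCH_MAX_LEVEL 8   /* stand-in for BTRFS_MAX_LEVEL */

/* Hypothetical per-level accounting, modelled on the fields in the patch. */
struct sketch_state {
	long req[SKETCH_MAX_LEVEL];     /* requests currently running */
	long limit[SKETCH_MAX_LEVEL];   /* cap on running requests */
	long queued[SKETCH_MAX_LEVEL];  /* counter standing in for the list */
};

/* Returns 1 if the request may start now, 0 if it had to be queued. */
static int sketch_submit(struct sketch_state *s, int level)
{
	if (s->req[level] < s->limit[level]) {
		s->req[level]++;
		return 1;
	}
	s->queued[level]++;
	return 0;
}

/* A running request finished: hand its slot to a queued one, if any. */
static void sketch_complete(struct sketch_state *s, int level)
{
	if (s->queued[level] > 0)
		s->queued[level]--;   /* queued request takes over the slot */
	else
		s->req[level]--;      /* nothing waiting: free the slot */
}
```

The cap bounds how many requests per level can be in flight, and therefore how much state has to be written out with each commit.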



Re: [PATCH 2/3] btrfs: extended inode refs

2012-04-12 Thread Jan Schmidt
Hi Mark,

While reading 3/3 I stumbled across one more thing in this one:

On 05.04.2012 22:09, Mark Fasheh wrote:
> +int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid,
> +   u64 start_off, struct btrfs_path *path,
> +   struct btrfs_inode_extref **ret_ref, u64 *found_off)
> +{
> + int ret, slot;
> + struct btrfs_key key, found_key;
> + struct btrfs_inode_extref *ref;
> + struct extent_buffer *leaf;
> + struct btrfs_item *item;
> + unsigned long ptr;
>  
> -/*
> - * There are a few corners where the link count of the file can't
> - * be properly maintained during replay.  So, instead of adding
> - * lots of complexity to the log code, we just scan the backrefs
> - * for any file that has been through replay.
> - *
> - * The scan will update the link count on the inode to reflect the
> - * number of back refs found.  If it goes down to zero, the iput
> - * will free the inode.
> - */
> -static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans,
> -struct btrfs_root *root,
> -struct inode *inode)
> + key.objectid = inode_objectid;
> + btrfs_set_key_type(&key, BTRFS_INODE_EXTREF_KEY);
> + key.offset = start_off;
> +
> + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> + if (ret < 0)
> + goto out;
> +
> + while (1) {
> + leaf = path->nodes[0];
> + slot = path->slots[0];
> + if (slot >= btrfs_header_nritems(leaf)) {
> + /*
> +  * If the item at offset is not found,
> +  * btrfs_search_slot will point us to the slot
> +  * where it should be inserted. In our case
> +  * that will be the slot directly before the
> +  * next INODE_REF_KEY_V2 item. In the case
> +  * that we're pointing to the last slot in a
> +  * leaf, we must move one leaf over.
> +  */
> + ret = btrfs_next_leaf(root, path);
> + if (ret) {
> + if (ret >= 1)
> + ret = -ENOENT;
> + break;
> + }
> + continue;
> + }
> +
> + item = btrfs_item_nr(leaf, slot);
> + btrfs_item_key_to_cpu(leaf, &found_key, slot);
> +
> + /*
> +  * Check that we're still looking at an extended ref key for
> +  * this particular objectid. If we have different
> +  * objectid or type then there are no more to be found
> +  * in the tree and we can exit.
> +  */
> + ret = -ENOENT;
> + if (found_key.objectid != inode_objectid)
> + break;
> + if (btrfs_key_type(&found_key) != BTRFS_INODE_EXTREF_KEY)
> + break;
> +
> + ret = 0;
> + ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
> + ref = (struct btrfs_inode_extref *)ptr;
> + *ret_ref = ref;
> + if (found_off)
> + *found_off = found_key.offset + 1;
  ^^^
It's evil to call it "found offset" and then return one larger than the
offset found. No caller would ever expect this.

> + break;
> + }
> +
> +out:
> + return ret;
> +}
> +
> +static int count_inode_extrefs(struct btrfs_root *root,
> +struct inode *inode, struct btrfs_path *path)
> +{
> + int ret;
> + unsigned int nlink = 0;
> + u64 inode_objectid = btrfs_ino(inode);
> + u64 offset = 0;
> + struct btrfs_inode_extref *ref;
> +
> + while (1) {
> + ret = btrfs_find_one_extref(root, inode_objectid, offset, path,
> + &ref, &offset);
> + if (ret)
> + break;
> +
> + nlink++;
> + offset++;

Huh. See? The caller expected to get the offset found from
btrfs_find_one_extref. As it stands you might be missing the very next key.

> + }
> +
> + return nlink;
> +}

-Jan
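
The double increment pointed out above can be reproduced outside the kernel with a sorted array standing in for the tree. This is only a sketch: sketch_find_one and sketch_count are hypothetical names, and the plus_one flag exists solely to compare the two behaviours side by side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Find the first key >= start in a sorted array. On success return 0 and
 * report the offset through found_off. With plus_one set, report one more
 * than the key found, as the patch under review does.
 */
static int sketch_find_one(const uint64_t *keys, size_t n, uint64_t start,
			   uint64_t *found_off, int plus_one)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (keys[i] >= start) {
			*found_off = keys[i] + (plus_one ? 1 : 0);
			return 0;
		}
	}
	return -1;
}

/* Caller modelled on count_inode_extrefs: it increments the offset itself. */
static unsigned int sketch_count(const uint64_t *keys, size_t n, int plus_one)
{
	unsigned int nlink = 0;
	uint64_t offset = 0;

	while (sketch_find_one(keys, n, offset, &offset, plus_one) == 0) {
		nlink++;
		offset++;   /* second increment when plus_one is set */
	}
	return nlink;
}
```

With keys at offsets 5, 6 and 9, the plus_one variant resumes the search at 7 after finding 5, so the key at 6 is never counted. Returning the offset actually found counts all three.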


Re: Boot speed/mount time regression with 3.4.0-rc2

2012-04-12 Thread Josef Bacik
On Thu, Apr 12, 2012 at 09:37:48AM -0400, Josef Bacik wrote:
> On Thu, Apr 12, 2012 at 11:22:51AM +0200, Ahmet Inan wrote:
> > On Wed, Apr 11, 2012 at 7:04 PM, Josef Bacik  wrote:
> > > On Wed, Apr 11, 2012 at 05:26:29PM +0200, Ahmet Inan wrote:
> > >> On Tue, Apr 10, 2012 at 5:16 PM, Josef Bacik  wrote:
> > >> > On Mon, Apr 09, 2012 at 05:20:46PM -0400, Calvin Walton wrote:
> > >> >> On Mon, 2012-04-09 at 16:54 -0400, Josef Bacik wrote:
> > >> >> > On Mon, Apr 09, 2012 at 01:10:04PM -0400, Calvin Walton wrote:
> > >> >> > > On Mon, 2012-04-09 at 11:53 -0400, Calvin Walton wrote:
> > >> >> > > > Hi,
> > >> >> > > >
> > > >> >> > > > I have a system that's using a dracut-generated initramfs to mount a
> > > >> >> > > > btrfs root. After upgrading to kernel 3.4.0-rc2 to test it out, I've
> > > >> >> > > > noticed that the process of mounting the root filesystem takes much
> > > >> >> > > > longer with 3.4.0-rc2 than it did with 3.3.1 - nearly 30 seconds slower!
> > >> >>
> > >> >> > > And the bisect results are in:
> > >> >> > > 285ff5af6ce358e73f53b55c9efadd4335f4c2ff is the first bad commit
> > >> >> > > commit 285ff5af6ce358e73f53b55c9efadd4335f4c2ff
> > >> >> > > Author: Josef Bacik 
> > >> >> > > Date:   Fri Jan 13 15:27:45 2012 -0500
> > >> >> > >
> > >> >> > >     Btrfs: remove the ideal caching code>
> > >> >> >
> > >> >> > Ok can you give this a whirl?  You are going to have to boot/reboot 
> > >> >> > a few times
> > >> >> > to let the cache get re-generated again to make sure it's taken 
> > >> >> > effect, but
> > >> >> > hopefully this will help out.  Thanks,
> > >> >>
> > >> >> Unfortunately, it doesn't seem to help. Even after 3 or 4 reboots with
> > >> >> this patch applied I'm still seeing the same delay.
> > >> >>
> > >> >
> > >> > Ok drop that previous patch and give this one a whirl, it helped on my 
> > >> > laptop.
> > >> > This is only  half of the problem AFAICS, but it's the easier half to 
> > >> > fix, in
> > >> > the meantime I need to lock down why we're not writing out cache for a 
> > >> > bunch of
> > >> > block groups, but thats trickier since the messages I need are spit 
> > >> > out while
> > >> > I'm shutting down, so I need to get creative.  Let me know if/how much 
> > >> > this
> > >> > helps.  Thanks,
> > >>
> > >> i have tried your patch and my system still needs several minutes to boot
> > >> until it can be used.
> > >> Also tried to reboot several times - it doesn't look like its getting 
> > >> better.
> > >> The last thing the system does when its shutting down is a read-only
> > >> remount of "/" so no umount.
> > >> Booting was much faster before i pulled for-linus a few weeks ago but
> > >> i couldn't find the time to bisect it yet ..
> > >>
> > >> please also look at the attached dmesg.txt.
> > >> this is an core i3 system with 2x2TB BTRFS RAID1 and lots of
> > >> home directories and snapshots.
> > >>
> > >> I'm going to test this patch on twenty more computers but with
> > >> smaller HDDs and less files and see if it helps to speed up their
> > >> boot times.
> > >>
> > >
> > > Ok looks like you are running into a different problem.  Could you maybe 
> > > run
> > > bootchart and upload the resulting png somewhere so I can look and see 
> > > what all
> > > is running while you boot?  Thanks,
> > 
> > http://aam.mathematik.uni-freiburg.de/IAM/homepages/ainan/bootchart.png
> > 
> > i have tried your patch now on the twenty more computers i mentioned and
> > still it takes a minute to remount rw "/" on those, even after several 
> > reboots.
> > 
> 
> Oops responding to the whole list this time..
> 
> Um ouch your system appears to not be doing anything for like 300 seconds but
> sitting there.  Can you hook up a console and capture sysrq+w while thats 
> going
> on?  Also you are mounting with -o space_cache right?  Can I see your dmesg to
> make sure it's doing what it's supposed to?  Thanks,
> 

Ok, it looks like you don't actually have space_cache enabled. Make sure to add
space_cache to your fstab so it gets enabled, then reboot a few times so that
everything gets cached properly; after that it should help.  Thanks,

Josef


Re: [PATCH] fs: make i_generation a u64

2012-04-12 Thread Josef Bacik
On Thu, Apr 12, 2012 at 08:46:14AM +0200, Marco Stornelli wrote:
> 2012/4/11 Josef Bacik :
> > Btrfs stores generation numbers as 64bit numbers, which means we have to
> > carry around a u64 in our incore inode in addition to setting i_generation.
> > So convert to a u64 so btrfs can kill its incore generation.  Thanks,
> >
> > Signed-off-by: Josef Bacik 
> > ---
> >  include/linux/fs.h |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 9be896d..40564e0 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -831,7 +831,7 @@ struct inode {
> >                struct cdev             *i_cdev;
> >        };
> >
> > -       __u32                   i_generation;
> > +       u64                     i_generation;
> >
> >  #ifdef CONFIG_FSNOTIFY
> >        __u32                   i_fsnotify_mask; /* all events this inode 
> > cares about */
> > --
> > 1.7.7.6
> 
> This patch can have an impact on several other filesystems. As one example,
> see the ioctl code in ext4. I haven't studied the code, but the ioctl
> returns a long, and on a 32-bit system a long is 4 bytes,
> so how can we return i_generation through the ioctl?

So looking through everybody, I'd have to convert a bunch of cpu_to_le32 to
(u32)cpu_to_le64 and some other such crap, which I'm not terribly interested in
doing. I don't care _that_ much about saving 8 bytes, so I'll just drop this for
now.  Thanks,

Josef
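
The concern in this thread is that a 64-bit generation does not fit through interfaces built around 32 bits. A small sketch of the effect, with hypothetical helper names that are not kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Narrowing a 64-bit generation keeps only the low 32 bits. */
static uint32_t sketch_generation_to_u32(uint64_t generation)
{
	return (uint32_t)generation;
}

/* A 32-bit interface can carry the value unchanged only if this holds. */
static int sketch_fits_u32(uint64_t generation)
{
	return generation <= UINT32_MAX;
}
```

A generation of 0x100000001 would come back as 1 after narrowing, so every place that stores or returns the field in 32 bits would need its own conversion.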


Re: Boot speed/mount time regression with 3.4.0-rc2

2012-04-12 Thread Josef Bacik
On Thu, Apr 12, 2012 at 11:22:51AM +0200, Ahmet Inan wrote:
> On Wed, Apr 11, 2012 at 7:04 PM, Josef Bacik  wrote:
> > On Wed, Apr 11, 2012 at 05:26:29PM +0200, Ahmet Inan wrote:
> >> On Tue, Apr 10, 2012 at 5:16 PM, Josef Bacik  wrote:
> >> > On Mon, Apr 09, 2012 at 05:20:46PM -0400, Calvin Walton wrote:
> >> >> On Mon, 2012-04-09 at 16:54 -0400, Josef Bacik wrote:
> >> >> > On Mon, Apr 09, 2012 at 01:10:04PM -0400, Calvin Walton wrote:
> >> >> > > On Mon, 2012-04-09 at 11:53 -0400, Calvin Walton wrote:
> >> >> > > > Hi,
> >> >> > > >
> >> >> > > > I have a system that's using a dracut-generated initramfs to mount a
> >> >> > > > btrfs root. After upgrading to kernel 3.4.0-rc2 to test it out, I've
> >> >> > > > noticed that the process of mounting the root filesystem takes much
> >> >> > > > longer with 3.4.0-rc2 than it did with 3.3.1 - nearly 30 seconds slower!
> >> >>
> >> >> > > And the bisect results are in:
> >> >> > > 285ff5af6ce358e73f53b55c9efadd4335f4c2ff is the first bad commit
> >> >> > > commit 285ff5af6ce358e73f53b55c9efadd4335f4c2ff
> >> >> > > Author: Josef Bacik 
> >> >> > > Date:   Fri Jan 13 15:27:45 2012 -0500
> >> >> > >
> >> >> > >     Btrfs: remove the ideal caching code>
> >> >> >
> >> >> > Ok can you give this a whirl?  You are going to have to boot/reboot a 
> >> >> > few times
> >> >> > to let the cache get re-generated again to make sure it's taken 
> >> >> > effect, but
> >> >> > hopefully this will help out.  Thanks,
> >> >>
> >> >> Unfortunately, it doesn't seem to help. Even after 3 or 4 reboots with
> >> >> this patch applied I'm still seeing the same delay.
> >> >>
> >> >
> >> > Ok drop that previous patch and give this one a whirl, it helped on my 
> >> > laptop.
> >> > This is only  half of the problem AFAICS, but it's the easier half to 
> >> > fix, in
> >> > the meantime I need to lock down why we're not writing out cache for a 
> >> > bunch of
> >> > block groups, but thats trickier since the messages I need are spit out 
> >> > while
> >> > I'm shutting down, so I need to get creative.  Let me know if/how much 
> >> > this
> >> > helps.  Thanks,
> >>
> >> i have tried your patch and my system still needs several minutes to boot
> >> until it can be used.
> >> Also tried to reboot several times - it doesn't look like its getting 
> >> better.
> >> The last thing the system does when its shutting down is a read-only
> >> remount of "/" so no umount.
> >> Booting was much faster before i pulled for-linus a few weeks ago but
> >> i couldn't find the time to bisect it yet ..
> >>
> >> please also look at the attached dmesg.txt.
> >> this is an core i3 system with 2x2TB BTRFS RAID1 and lots of
> >> home directories and snapshots.
> >>
> >> I'm going to test this patch on twenty more computers but with
> >> smaller HDDs and less files and see if it helps to speed up their
> >> boot times.
> >>
> >
> > Ok looks like you are running into a different problem.  Could you maybe run
> > bootchart and upload the resulting png somewhere so I can look and see what 
> > all
> > is running while you boot?  Thanks,
> 
> http://aam.mathematik.uni-freiburg.de/IAM/homepages/ainan/bootchart.png
> 
> i have tried your patch now on the twenty more computers i mentioned and
> still it takes a minute to remount rw "/" on those, even after several 
> reboots.
> 

Oops responding to the whole list this time..

Um, ouch: your system appears to do nothing but sit there for about 300 seconds.
Can you hook up a console and capture sysrq+w while that's going on?  Also, you
are mounting with -o space_cache, right?  Can I see your dmesg to
make sure it's doing what it's supposed to?  Thanks,

Josef


Re: [PATCH] Btrfs: use i_version instead of our own sequence

2012-04-12 Thread Josef Bacik
On Thu, Apr 12, 2012 at 04:22:26PM +0300, Kasatkin, Dmitry wrote:
> On Mon, Apr 9, 2012 at 6:53 PM, Josef Bacik  wrote:
> > We've been keeping around the inode sequence number in hopes that somebody
> > would use it, but nobody uses it and people actually use i_version which
> > serves the same purpose, so use i_version where we used the incore inode's
> > sequence number and that way the sequence is updated properly across the
> > board, and not just in file write.  Thanks,
> >
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/btrfs_inode.h   |    3 ---
> >  fs/btrfs/delayed-inode.c |    4 ++--
> >  fs/btrfs/file.c          |    1 -
> >  fs/btrfs/inode.c         |    5 ++---
> >  fs/btrfs/super.c         |    2 +-
> >  5 files changed, 5 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> > index 9b9b15f..3771b85 100644
> > --- a/fs/btrfs/btrfs_inode.h
> > +++ b/fs/btrfs/btrfs_inode.h
> > @@ -83,9 +83,6 @@ struct btrfs_inode {
> >         */
> >        u64 generation;
> >
> > -       /* sequence number for NFS changes */
> > -       u64 sequence;
> > -
> >        /*
> >         * transid of the trans_handle that last modified this inode
> >         */
> > diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> > index 03e3748..bcd40c7 100644
> > --- a/fs/btrfs/delayed-inode.c
> > +++ b/fs/btrfs/delayed-inode.c
> > @@ -1706,7 +1706,7 @@ static void fill_stack_inode_item(struct 
> > btrfs_trans_handle *trans,
> >        btrfs_set_stack_inode_nbytes(inode_item, inode_get_bytes(inode));
> >        btrfs_set_stack_inode_generation(inode_item,
> >                                         BTRFS_I(inode)->generation);
> > -       btrfs_set_stack_inode_sequence(inode_item, 
> > BTRFS_I(inode)->sequence);
> > +       btrfs_set_stack_inode_sequence(inode_item, inode->i_version);
> >        btrfs_set_stack_inode_transid(inode_item, trans->transid);
> >        btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
> >        btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
> > @@ -1754,7 +1754,7 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
> >        set_nlink(inode, btrfs_stack_inode_nlink(inode_item));
> >        inode_set_bytes(inode, btrfs_stack_inode_nbytes(inode_item));
> >        BTRFS_I(inode)->generation = 
> > btrfs_stack_inode_generation(inode_item);
> > -       BTRFS_I(inode)->sequence = btrfs_stack_inode_sequence(inode_item);
> > +       inode->i_version = btrfs_stack_inode_sequence(inode_item);
> >        inode->i_rdev = 0;
> >        *rdev = btrfs_stack_inode_rdev(inode_item);
> >        BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 431b565..f0da02b 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1404,7 +1404,6 @@ static ssize_t btrfs_file_aio_write(struct kiocb 
> > *iocb,
> >                mutex_unlock(&inode->i_mutex);
> >                goto out;
> >        }
> > -       BTRFS_I(inode)->sequence++;
> >
> >        start_pos = round_down(pos, root->sectorsize);
> >        if (start_pos > i_size_read(inode)) {
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 7a084fb..7d3dd2f 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -2510,7 +2510,7 @@ static void btrfs_read_locked_inode(struct inode 
> > *inode)
> >
> >        inode_set_bytes(inode, btrfs_inode_nbytes(leaf, inode_item));
> >        BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, 
> > inode_item);
> > -       BTRFS_I(inode)->sequence = btrfs_inode_sequence(leaf, inode_item);
> > +       inode->i_version = btrfs_inode_sequence(leaf, inode_item);
> >        inode->i_generation = BTRFS_I(inode)->generation;
> >        inode->i_rdev = 0;
> >        rdev = btrfs_inode_rdev(leaf, inode_item);
> > @@ -2594,7 +2594,7 @@ static void fill_inode_item(struct btrfs_trans_handle 
> > *trans,
> >
> >        btrfs_set_inode_nbytes(leaf, item, inode_get_bytes(inode));
> >        btrfs_set_inode_generation(leaf, item, BTRFS_I(inode)->generation);
> > -       btrfs_set_inode_sequence(leaf, item, BTRFS_I(inode)->sequence);
> > +       btrfs_set_inode_sequence(leaf, item, inode->i_version);
> >        btrfs_set_inode_transid(leaf, item, trans->transid);
> >        btrfs_set_inode_rdev(leaf, item, inode->i_rdev);
> >        btrfs_set_inode_flags(leaf, item, BTRFS_I(inode)->flags);
> > @@ -6884,7 +6884,6 @@ struct inode *btrfs_alloc_inode(struct super_block 
> > *sb)
> >        ei->root = NULL;
> >        ei->space_info = NULL;
> >        ei->generation = 0;
> > -       ei->sequence = 0;
> >        ei->last_trans = 0;
> >        ei->last_sub_trans = 0;
> >        ei->logged_trans = 0;
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index 54e7ee9..ee1bb31 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -770,7 +770,7 @@ static int btrfs_fill_super(struct super_block *sb,
> >  #ifdef CONFIG_BTRFS_FS_POSIX_ACL
> >      

Re: [PATCH] Btrfs: use i_version instead of our own sequence

2012-04-12 Thread Kasatkin, Dmitry
On Mon, Apr 9, 2012 at 6:53 PM, Josef Bacik  wrote:
> We've been keeping around the inode sequence number in hopes that somebody
> would use it, but nobody uses it and people actually use i_version which
> serves the same purpose, so use i_version where we used the incore inode's
> sequence number and that way the sequence is updated properly across the
> board, and not just in file write.  Thanks,
>
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/btrfs_inode.h   |    3 ---
>  fs/btrfs/delayed-inode.c |    4 ++--
>  fs/btrfs/file.c          |    1 -
>  fs/btrfs/inode.c         |    5 ++---
>  fs/btrfs/super.c         |    2 +-
>  5 files changed, 5 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 9b9b15f..3771b85 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -83,9 +83,6 @@ struct btrfs_inode {
>         */
>        u64 generation;
>
> -       /* sequence number for NFS changes */
> -       u64 sequence;
> -
>        /*
>         * transid of the trans_handle that last modified this inode
>         */
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index 03e3748..bcd40c7 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -1706,7 +1706,7 @@ static void fill_stack_inode_item(struct 
> btrfs_trans_handle *trans,
>        btrfs_set_stack_inode_nbytes(inode_item, inode_get_bytes(inode));
>        btrfs_set_stack_inode_generation(inode_item,
>                                         BTRFS_I(inode)->generation);
> -       btrfs_set_stack_inode_sequence(inode_item, BTRFS_I(inode)->sequence);
> +       btrfs_set_stack_inode_sequence(inode_item, inode->i_version);
>        btrfs_set_stack_inode_transid(inode_item, trans->transid);
>        btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
>        btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
> @@ -1754,7 +1754,7 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
>        set_nlink(inode, btrfs_stack_inode_nlink(inode_item));
>        inode_set_bytes(inode, btrfs_stack_inode_nbytes(inode_item));
>        BTRFS_I(inode)->generation = btrfs_stack_inode_generation(inode_item);
> -       BTRFS_I(inode)->sequence = btrfs_stack_inode_sequence(inode_item);
> +       inode->i_version = btrfs_stack_inode_sequence(inode_item);
>        inode->i_rdev = 0;
>        *rdev = btrfs_stack_inode_rdev(inode_item);
>        BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 431b565..f0da02b 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1404,7 +1404,6 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
>                mutex_unlock(&inode->i_mutex);
>                goto out;
>        }
> -       BTRFS_I(inode)->sequence++;
>
>        start_pos = round_down(pos, root->sectorsize);
>        if (start_pos > i_size_read(inode)) {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 7a084fb..7d3dd2f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2510,7 +2510,7 @@ static void btrfs_read_locked_inode(struct inode *inode)
>
>        inode_set_bytes(inode, btrfs_inode_nbytes(leaf, inode_item));
>        BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, inode_item);
> -       BTRFS_I(inode)->sequence = btrfs_inode_sequence(leaf, inode_item);
> +       inode->i_version = btrfs_inode_sequence(leaf, inode_item);
>        inode->i_generation = BTRFS_I(inode)->generation;
>        inode->i_rdev = 0;
>        rdev = btrfs_inode_rdev(leaf, inode_item);
> @@ -2594,7 +2594,7 @@ static void fill_inode_item(struct btrfs_trans_handle 
> *trans,
>
>        btrfs_set_inode_nbytes(leaf, item, inode_get_bytes(inode));
>        btrfs_set_inode_generation(leaf, item, BTRFS_I(inode)->generation);
> -       btrfs_set_inode_sequence(leaf, item, BTRFS_I(inode)->sequence);
> +       btrfs_set_inode_sequence(leaf, item, inode->i_version);
>        btrfs_set_inode_transid(leaf, item, trans->transid);
>        btrfs_set_inode_rdev(leaf, item, inode->i_rdev);
>        btrfs_set_inode_flags(leaf, item, BTRFS_I(inode)->flags);
> @@ -6884,7 +6884,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
>        ei->root = NULL;
>        ei->space_info = NULL;
>        ei->generation = 0;
> -       ei->sequence = 0;
>        ei->last_trans = 0;
>        ei->last_sub_trans = 0;
>        ei->logged_trans = 0;
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 54e7ee9..ee1bb31 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -770,7 +770,7 @@ static int btrfs_fill_super(struct super_block *sb,
>  #ifdef CONFIG_BTRFS_FS_POSIX_ACL
>        sb->s_flags |= MS_POSIXACL;
>  #endif
> -
> +       sb->s_flags |= MS_I_VERSION;
>        err = open_ctree(sb, fs_devices, (char *)data);
>        if (err) {
>                printk("btrfs: open_ctree failed\n");
> --
> 1.7.5.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux

Re: [PATCH] fs: make i_generation a u64

2012-04-12 Thread Josef Bacik
On Thu, Apr 12, 2012 at 08:46:14AM +0200, Marco Stornelli wrote:
> 2012/4/11 Josef Bacik :
> > Btrfs stores generation numbers as 64bit numbers, which means we have to
> > carry around a u64 in our incore inode in addition to setting i_generation.
> > So convert to a u64 so btrfs can kill its incore generation.  Thanks,
> >
> > Signed-off-by: Josef Bacik 
> > ---
> >  include/linux/fs.h |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 9be896d..40564e0 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -831,7 +831,7 @@ struct inode {
> >                struct cdev             *i_cdev;
> >        };
> >
> > -       __u32                   i_generation;
> > +       u64                     i_generation;
> >
> >  #ifdef CONFIG_FSNOTIFY
> >        __u32                   i_fsnotify_mask; /* all events this inode 
> > cares about */
> > --
> > 1.7.7.6
> 
> This patch could affect several other filesystems. As one example,
> see the ioctl code in ext4. I haven't studied the code in detail,
> but the ioctl returns a long, and on a 32-bit system a long is 4 bytes,
> so how can we return i_generation through the ioctl?
> 

Thanks for pointing this out, I will fix ext4 up and look to see if anybody else
does something similar,

Josef
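
To illustrate the concern raised above, here is a minimal sketch, assuming
the caller can only take a 32-bit value (as with a 4-byte long). The helper
generation_to_u32() is a hypothetical name and is not ext4 or VFS code; it
only shows that a 64-bit generation above UINT32_MAX cannot be handed back
without losing the high bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not ext4 code: hand a 64-bit generation to a caller
 * that can only take 32 bits (as with a 4-byte long).
 * Returns 0 on success, -1 if the value would be truncated.
 */
static int generation_to_u32(uint64_t generation, uint32_t *out)
{
	assert(out != NULL);

	if (generation > UINT32_MAX)
		return -1;	/* the high 32 bits would be lost */

	*out = (uint32_t)generation;
	return 0;
}
```

Whether an interface should fail, truncate, or grow a new 64-bit variant in
this case is a design decision this sketch does not try to settle.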
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] btrfs: extended inode refs

2012-04-12 Thread Jan Schmidt
On 05.04.2012 22:09, Mark Fasheh wrote:
> Teach tree-log.c about extended inode refs. In particular, we have to adjust
> the behavior of inode ref replay as well as log tree recovery to account for
> the existence of extended refs.
> 
> Signed-off-by: Mark Fasheh 
> ---
>  fs/btrfs/tree-log.c |  320 +-
>  fs/btrfs/tree-log.h |4 +
>  2 files changed, 266 insertions(+), 58 deletions(-)
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 966cc74..d69b07a 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -23,6 +23,7 @@
>  #include "disk-io.h"
>  #include "locking.h"
>  #include "print-tree.h"
> +#include "backref.h"

So now tree-log.c includes backref.h and backref.c includes tree-log.h.
This is not a problem by itself, but I'd try to avoid such dependencies.
I'd move find_one_extref over to backref.c to solve this, also because
there it would be closer to inode_ref_info (which does the same for
INODE_REFs).

>  #include "compat.h"
>  #include "tree-log.h"
>  
> @@ -748,6 +749,7 @@ static noinline int backref_in_log(struct btrfs_root *log,
>  {
>   struct btrfs_path *path;
>   struct btrfs_inode_ref *ref;
> + struct btrfs_inode_extref *extref;
   ^^
:-)

>   unsigned long ptr;
>   unsigned long ptr_end;
>   unsigned long name_ptr;
> @@ -764,8 +766,24 @@ static noinline int backref_in_log(struct btrfs_root 
> *log,
>   if (ret != 0)
>   goto out;
>  
> - item_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]);
>   ptr = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
> +
> + if (key->type == BTRFS_INODE_EXTREF_KEY) {
> + extref = (struct btrfs_inode_extref *)ptr;
> +
> + found_name_len = btrfs_inode_extref_name_len(path->nodes[0],
> +  extref);
> + if (found_name_len == namelen) {
> + name_ptr = (unsigned long)&extref->name;
> + ret = memcmp_extent_buffer(path->nodes[0], name,
> +name_ptr, namelen);
> + if (ret == 0)
> + match = 1;
> + }
> + goto out;
> + }
> +
> + item_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]);
>   ptr_end = ptr + item_size;
>   while (ptr < ptr_end) {
>   ref = (struct btrfs_inode_ref *)ptr;
> @@ -786,7 +804,6 @@ out:
>   return match;
>  }
>  
> -
>  /*
>   * replay one inode back reference item found in the log tree.
>   * eb, slot and key refer to the buffer and key found in the log tree.
> @@ -801,15 +818,20 @@ static noinline int add_inode_ref(struct 
> btrfs_trans_handle *trans,
> struct btrfs_key *key)
>  {
>   struct btrfs_inode_ref *ref;
> + struct btrfs_inode_extref *extref;
>   struct btrfs_dir_item *di;
> + struct btrfs_key search_key;

You don't need search_key, just use key from the parameter list as the
code did previously.

>   struct inode *dir;
>   struct inode *inode;
>   unsigned long ref_ptr;
>   unsigned long ref_end;
> - char *name;
> - int namelen;
> + char *name, *victim_name;
> + int namelen, victim_name_len;

split

>   int ret;
>   int search_done = 0;
> + int log_ref_ver = 0;
> + u64 parent_objectid, inode_objectid, ref_index;

split

> + struct extent_buffer *leaf;
>  
>   /*
>* it is possible that we didn't log all the parent directories
> @@ -817,32 +839,56 @@ static noinline int add_inode_ref(struct 
> btrfs_trans_handle *trans,
>* copy the back ref in.  The link count fixup code will take
>* care of the rest
>*/
> - dir = read_one_inode(root, key->offset);
> +
> + if (key->type == BTRFS_INODE_EXTREF_KEY) {
> + log_ref_ver = 1;
^^^
Assigned but never used.

> +
> + ref_ptr = btrfs_item_ptr_offset(eb, slot);
> +
> + /* So that we don't loop back looking for old style log refs. */
> + ref_end = ref_ptr;
> +
> + extref = (struct btrfs_inode_extref *) 
> btrfs_item_ptr_offset(eb, slot);
> + namelen = btrfs_inode_extref_name_len(eb, extref);
> + name = kmalloc(namelen, GFP_NOFS);

kmalloc may fail.
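
A minimal userspace sketch of the missing check, assuming malloc() as a
stand-in for kmalloc(namelen, GFP_NOFS). copy_name() is a hypothetical
helper and not a function from the patch; the point is only that the
allocation result is tested before the buffer is written to.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace sketch: malloc() stands in for kmalloc(namelen, GFP_NOFS).
 * copy_name() is a hypothetical helper, not a function from the patch.
 * Returns 0 and stores the new buffer in *name_ret, or -ENOMEM on failure.
 * The caller owns the buffer and must free it.
 */
static int copy_name(const char *src, size_t namelen, char **name_ret)
{
	char *name;

	assert(src != NULL && name_ret != NULL);
	assert(namelen > 0);	/* assumption: names are never empty */

	name = malloc(namelen);
	if (!name)
		return -ENOMEM;	/* bail out before touching the buffer */

	memcpy(name, src, namelen);
	*name_ret = name;
	return 0;
}
```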

> +
> + read_extent_buffer(eb, name, (unsigned long)&extref->name,
> +namelen);
> +
> + ref_index = btrfs_inode_extref_index(eb, extref);
> + parent_objectid = btrfs_inode_extref_parent(eb, extref);
> + } else {
> + parent_objectid = key->offset;
> +
> + ref_ptr = btrfs_item_ptr_offset(eb, slot);
> + ref_end = ref_ptr + btrfs_item_size_nr(eb, slot);
> +
> + ref = (struct btrfs_inode_ref *)ref_ptr;
> + namelen = b

Re: [PATCH 1/3] btrfs: extended inode refs

2012-04-12 Thread Jan Schmidt
On 05.04.2012 22:09, Mark Fasheh wrote:
> This patch adds basic support for extended inode refs. This includes support
> for link and unlink of the refs, which basically gets us support for rename
> as well.
> 
> Inode creation does not need changing - extended refs are only added after
> the ref array is full.
> 
> Signed-off-by: Mark Fasheh 
> ---
>  fs/btrfs/ctree.h  |   50 +--
>  fs/btrfs/inode-item.c |  244 
> +++--
>  fs/btrfs/inode.c  |   20 ++--
>  3 files changed, 288 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 80b6486..5fc77ee 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -143,6 +143,13 @@ struct btrfs_ordered_sum;
>   */
>  #define BTRFS_NAME_LEN 255
>  
> +/*
> + * Theoretical limit is larger, but we keep this down to a sane
> + * value. That should limit greatly the possibility of collisions on
> + * inode ref items.
> + */
> +#define BTRFS_LINK_MAX 65535U

Do we really want an artificial limit like that?

> +
>  /* 32 bytes in various csum fields */
>  #define BTRFS_CSUM_SIZE 32
>  
> @@ -462,13 +469,16 @@ struct btrfs_super_block {
>  #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS  (1ULL << 2)
>  #define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO  (1ULL << 3)
>  
> +#define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF (1ULL << 6)
> +
>  #define BTRFS_FEATURE_COMPAT_SUPP0ULL
>  #define BTRFS_FEATURE_COMPAT_RO_SUPP 0ULL
>  #define BTRFS_FEATURE_INCOMPAT_SUPP  \
>   (BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF | \
>BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL |\
>BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS |  \
> -  BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO)
> +  BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO |  \
> +  BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
>  
>  /*
>   * A leaf is full of items. offset and size tell us where to find
> @@ -615,6 +625,14 @@ struct btrfs_inode_ref {
>   /* name goes here */
>  } __attribute__ ((__packed__));
>  
> +struct btrfs_inode_extref {
> + __le64 parent_objectid;
> + __le64 index;
> + __le16 name_len;
> + __u8   name[0];
> + /* name goes here */
> +} __attribute__ ((__packed__));
> +
>  struct btrfs_timespec {
>   __le64 sec;
>   __le32 nsec;
> @@ -1400,6 +1418,7 @@ struct btrfs_ioctl_defrag_range_args {
>   */
>  #define BTRFS_INODE_ITEM_KEY 1
>  #define BTRFS_INODE_REF_KEY  12
> +#define BTRFS_INODE_EXTREF_KEY   13
>  #define BTRFS_XATTR_ITEM_KEY 24
>  #define BTRFS_ORPHAN_ITEM_KEY48
>  /* reserve 2-15 close to the inode for later flexibility */
> @@ -1701,6 +1720,13 @@ BTRFS_SETGET_STACK_FUNCS(block_group_flags,
>  BTRFS_SETGET_FUNCS(inode_ref_name_len, struct btrfs_inode_ref, name_len, 16);
>  BTRFS_SETGET_FUNCS(inode_ref_index, struct btrfs_inode_ref, index, 64);
>  
> +/* struct btrfs_inode_extref */
> +BTRFS_SETGET_FUNCS(inode_extref_parent, struct btrfs_inode_extref,
> +parent_objectid, 64);
> +BTRFS_SETGET_FUNCS(inode_extref_name_len, struct btrfs_inode_extref,
> +name_len, 16);
> +BTRFS_SETGET_FUNCS(inode_extref_index, struct btrfs_inode_extref, index, 64);
> +
>  /* struct btrfs_inode_item */
>  BTRFS_SETGET_FUNCS(inode_generation, struct btrfs_inode_item, generation, 
> 64);
>  BTRFS_SETGET_FUNCS(inode_sequence, struct btrfs_inode_item, sequence, 64);
> @@ -2791,12 +2817,12 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle 
> *trans,
>  struct btrfs_root *root,
>  const char *name, int name_len,
>  u64 inode_objectid, u64 ref_objectid, u64 *index);
> -struct btrfs_inode_ref *
> -btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans,
> - struct btrfs_root *root,
> - struct btrfs_path *path,
> - const char *name, int name_len,
> - u64 inode_objectid, u64 ref_objectid, int mod);
> +int btrfs_get_inode_ref_index(struct btrfs_trans_handle *trans,
> +   struct btrfs_root *root,
> +   struct btrfs_path *path,
> +   const char *name, int name_len,
> +   u64 inode_objectid, u64 ref_objectid, int mod,
> +   u64 *ret_index);
>  int btrfs_insert_empty_inode(struct btrfs_trans_handle *trans,
>struct btrfs_root *root,
>struct btrfs_path *path, u64 objectid);
> @@ -2804,6 +2830,16 @@ int btrfs_lookup_inode(struct btrfs_trans_handle 
> *trans, struct btrfs_root
>  *root, struct btrfs_path *path,
>  struct btrfs_key *location, int mod);
>  
> +struct btrfs_inode_extref *
> +btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
> +   struct btrfs_root *root,
> +   struct b

Re: [PATCH RFC] Btrfs: improve space count for files with fragments

2012-04-12 Thread Chris Mason
On Thu, Apr 12, 2012 at 05:05:02PM +0800, Liu Bo wrote:
> Here is a simple scenario:
> 
> $ dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k count=20;sync
> $ btrfs fi df /mnt/btrfs
> 
> we get 20K used, but then
> 
> $ dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k count=4 seek=4 conv=notrunc;sync
> $ btrfs fi df /mnt/btrfs
> 
> we get 24K used.
> 
> Here is the problem: a file with lots of fragments can take up
> nearly twice as much space as its i_size, like:
> 
> 0k               20k
> | --- extent --- |      turned to be on disk    <---  extent --->  <-- A -->
>  | ---- A ---- |                                | ------------- |  | ----- |
>  1k          19k                                       20k       +    18k = 38k
> 
> but what users want is                          <---  extent --->  <-- A -->
>                                                 | - |       | - |  | ----- |
>                                                  1k     +    1k  +    18k = 20k
> 
> 18K is wasted.

Thanks for working on this.  I'd prefer that when we create the bookend
extents, we just split the original (20K extent in your case) as long as
there are no other references.

That would allow us to fully free the parts that are not actually used.
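
A rough sketch of the splitting idea, under stated assumptions: struct range
and split_bookends() are hypothetical names, this is not btrfs code, and the
overwritten range is assumed to lie entirely inside the original extent. It
keeps only the head and tail pieces when nothing else references the extent
and reports how many bytes could then be freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A byte range [start, start + len). */
struct range {
	uint64_t start;
	uint64_t len;
};

/*
 * Hypothetical sketch, not btrfs code: when the middle of an extent is
 * overwritten, keep only the head and tail (bookend) pieces, provided
 * nothing else references the extent. Returns the number of bytes that
 * can be freed; returns 0 and leaves head/tail untouched if it is shared.
 */
static uint64_t split_bookends(struct range extent, struct range overwrite,
			       unsigned int refs,
			       struct range *head, struct range *tail)
{
	uint64_t extent_end = extent.start + extent.len;
	uint64_t overwrite_end = overwrite.start + overwrite.len;

	assert(head != NULL && tail != NULL);
	/* Assumption: the overwrite lies entirely inside the extent. */
	assert(overwrite.start >= extent.start && overwrite_end <= extent_end);

	if (refs > 1)
		return 0;	/* other references still need the old data */

	head->start = extent.start;
	head->len = overwrite.start - extent.start;
	tail->start = overwrite_end;
	tail->len = extent_end - overwrite_end;
	return overwrite.len;
}
```

For the 20K extent with 1K..19K overwritten, this leaves a 1K head, a 1K
tail, and 18K that could be freed.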

-chris


Re: [RFC] [PATCH 2/2] Btrfs: move over to use ->update_time

2012-04-12 Thread Kasatkin, Dmitry
On Thu, Apr 12, 2012 at 2:32 PM, David Sterba  wrote:
> On Thu, Apr 12, 2012 at 02:09:08PM +0300, Kasatkin, Dmitry wrote:
>> Where is it? Can you please point out?
>
> http://permalink.gmane.org/gmane.comp.file-systems.btrfs/16662
>
> the relevant part:
>
> -- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -770,7 +770,7 @@ static int btrfs_fill_super(struct super_block *sb,
>  #ifdef CONFIG_BTRFS_FS_POSIX_ACL
>        sb->s_flags |= MS_POSIXACL;
>  #endif
> -
> +       sb->s_flags |= MS_I_VERSION;
>        err = open_ctree(sb, fs_devices, (char *)data);
>        if (err) {
>                printk("btrfs: open_ctree failed\n");

Fantastic.. I see...


Re: [RFC] [PATCH 2/2] Btrfs: move over to use ->update_time

2012-04-12 Thread David Sterba
On Thu, Apr 12, 2012 at 02:09:08PM +0300, Kasatkin, Dmitry wrote:
> Where is it? Can you please point out?

http://permalink.gmane.org/gmane.comp.file-systems.btrfs/16662

the relevant part:

-- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -770,7 +770,7 @@ static int btrfs_fill_super(struct super_block *sb,
 #ifdef CONFIG_BTRFS_FS_POSIX_ACL
sb->s_flags |= MS_POSIXACL;
 #endif
-
+   sb->s_flags |= MS_I_VERSION;
err = open_ctree(sb, fs_devices, (char *)data);
if (err) {
printk("btrfs: open_ctree failed\n");


Re: [RFC] [PATCH 2/2] Btrfs: move over to use ->update_time

2012-04-12 Thread Kasatkin, Dmitry
On Tue, Apr 10, 2012 at 10:48 AM, David Sterba  wrote:
> On Mon, Apr 09, 2012 at 11:16:05AM -0400, J. Bruce Fields wrote:
>> > Nobody yet, I'm going to send a patch shortly that will support this.  
>> > Thanks,
>>
>> Great.  It would also be far preferable if it was just always on (at
>> least by default) rather than requiring a mount option.
>
> [looks into Josef's patch] it is, no mount option needed.
>

Where is it? Can you please point out?

Thanks.
>
> david


[PATCH v2] Btrfs: change integrity checker to support big blocks

2012-04-12 Thread Stefan Behrens
The integrity checker used to be coded for nodesize == leafsize ==
sectorsize == PAGE_CACHE_SIZE.
This is now changed to support sizes for nodesize and leafsize which are
N * PAGE_CACHE_SIZE.

Signed-off-by: Stefan Behrens 
---
Change v1-v2:
- Cast PAGE_CACHE_SIZE to size_t in line 1251 to get rid of a warning
  during 32 bit compilation.
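
For readers following the change: once a block spans N * PAGE_CACHE_SIZE,
its data can no longer be read through a single pointer, so a copy may have
to cross page boundaries. A minimal userspace sketch of that idea, under
stated assumptions: SKETCH_PAGE_SIZE stands in for PAGE_CACHE_SIZE, and
read_from_pages() is a hypothetical name, not the patch's
btrfsic_read_from_block_data().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: stands in for PAGE_CACHE_SIZE */

/*
 * Hypothetical userspace model: copy len bytes starting at offset out of a
 * block stored as an array of page-sized buffers (compare the datav member
 * added by the patch). The copy is done one page-bounded chunk at a time.
 */
static void read_from_pages(char **datav, size_t num_pages,
			    void *dst, size_t offset, size_t len)
{
	char *out = dst;

	assert(datav != NULL && dst != NULL);
	assert(offset + len <= num_pages * SKETCH_PAGE_SIZE);

	while (len > 0) {
		size_t page = offset / SKETCH_PAGE_SIZE;
		size_t in_page = offset % SKETCH_PAGE_SIZE;
		size_t chunk = SKETCH_PAGE_SIZE - in_page;

		if (chunk > len)
			chunk = len;	/* last chunk may be short */
		memcpy(out, datav[page] + in_page, chunk);
		out += chunk;
		offset += chunk;
		len -= chunk;
	}
}
```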

 fs/btrfs/check-integrity.c |  561 
 1 file changed, 415 insertions(+), 146 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index d986824..23d969c 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -104,8 +104,6 @@
 #define BTRFSIC_BLOCK_STACK_FRAME_MAGIC_NUMBER 20111300
 #define BTRFSIC_TREE_DUMP_MAX_INDENT_LEVEL (200 - 6)   /* in characters,
 * excluding " [...]" */
-#define BTRFSIC_BLOCK_SIZE PAGE_SIZE
-
 #define BTRFSIC_GENERATION_UNKNOWN ((u64)-1)
 
 /*
@@ -211,8 +209,9 @@ struct btrfsic_block_data_ctx {
u64 dev_bytenr; /* physical bytenr on device */
u32 len;
struct btrfsic_dev_state *dev;
-   char *data;
-   struct buffer_head *bh; /* do not use if set to NULL */
+   char **datav;
+   struct page **pagev;
+   void *mem_to_free;
 };
 
 /* This structure is used to implement recursion without occupying
@@ -244,6 +243,8 @@ struct btrfsic_state {
struct btrfs_root *root;
u64 max_superblock_generation;
struct btrfsic_block *latest_superblock;
+   u32 metablock_size;
+   u32 datablock_size;
 };
 
 static void btrfsic_block_init(struct btrfsic_block *b);
@@ -291,8 +292,10 @@ static int btrfsic_process_superblock(struct btrfsic_state 
*state,
 static int btrfsic_process_metablock(struct btrfsic_state *state,
 struct btrfsic_block *block,
 struct btrfsic_block_data_ctx *block_ctx,
-struct btrfs_header *hdr,
 int limit_nesting, int force_iodone_flag);
+static void btrfsic_read_from_block_data(
+   struct btrfsic_block_data_ctx *block_ctx,
+   void *dst, u32 offset, size_t len);
 static int btrfsic_create_link_to_next_block(
struct btrfsic_state *state,
struct btrfsic_block *block,
@@ -319,12 +322,13 @@ static void btrfsic_release_block_ctx(struct 
btrfsic_block_data_ctx *block_ctx);
 static int btrfsic_read_block(struct btrfsic_state *state,
  struct btrfsic_block_data_ctx *block_ctx);
 static void btrfsic_dump_database(struct btrfsic_state *state);
+static void btrfsic_complete_bio_end_io(struct bio *bio, int err);
 static int btrfsic_test_for_metadata(struct btrfsic_state *state,
-const u8 *data, unsigned int size);
+char **datav, unsigned int num_pages);
 static void btrfsic_process_written_block(struct btrfsic_dev_state *dev_state,
- u64 dev_bytenr, u8 *mapped_data,
- unsigned int len, struct bio *bio,
- int *bio_is_patched,
+ u64 dev_bytenr, char **mapped_datav,
+ unsigned int num_pages,
+ struct bio *bio, int *bio_is_patched,
  struct buffer_head *bh,
  int submit_bio_bh_rw);
 static int btrfsic_process_written_superblock(
@@ -376,7 +380,7 @@ static struct btrfsic_dev_state *btrfsic_dev_state_lookup(
 static void btrfsic_cmp_log_and_dev_bytenr(struct btrfsic_state *state,
   u64 bytenr,
   struct btrfsic_dev_state *dev_state,
-  u64 dev_bytenr, char *data);
+  u64 dev_bytenr);
 
 static struct mutex btrfsic_mutex;
 static int btrfsic_is_initialized;
@@ -652,7 +656,7 @@ static int btrfsic_process_superblock(struct btrfsic_state 
*state,
int pass;
 
BUG_ON(NULL == state);
-   selected_super = kmalloc(sizeof(*selected_super), GFP_NOFS);
+   selected_super = kzalloc(sizeof(*selected_super), GFP_NOFS);
if (NULL == selected_super) {
printk(KERN_INFO "btrfsic: error, kmalloc failed!\n");
return -1;
@@ -719,7 +723,7 @@ static int btrfsic_process_superblock(struct btrfsic_state 
*state,
 
num_copies =
btrfs_num_copies(&state->root->fs_info->mapping_tree,
-next_bytenr, PAGE_SIZE);
+next_bytenr, state->metablock_size);
if (state->print_mask & BTRFSIC_PRINT_MASK_NUM_COPIES)
printk(KE

[PATCH] mkfs.btrfs on ARM

2012-04-12 Thread Csaba Tóth

Hi,

   I tried to run the mkfs.btrfs command on an AT91SAM9260 processor
(ARM, at91). Cross-compiling was quite simple, but the binary doesn't work
on the target platform.

mkfs.btrfs stopped with an assertion failure and the filesystem wasn't mountable.

   I looked for a solution on the internet. Other people have noticed this
issue too, but nobody has a patch for it.
I found an interesting piece of information: some ARM processors cannot
access a word at a halfword-aligned location
(http://www.aleph1.co.uk/chapter-10-arm-structured-alignment-faq).
gcc tries to use the correct alignment, but in some situations the
problem remains.


   In btrfs-progs there is a struct named extent_buffer. It has a
data field which can contain a bunch of unaligned structs. gcc can't
access the member fields of these structs correctly.
I found a workaround that solves this problem: if I use memcpy
instead of setting or getting members directly, the problem goes away.
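
The idea can be sketched in isolation as follows. This is only an
illustration under stated assumptions: get_le64_unaligned() and
set_le64_unaligned() are hypothetical names, not btrfs-progs functions, and
unlike the attached patch (which copies into a temporary and then calls
le##bits##_to_cpu) the sketch decodes the little-endian bytes by hand so it
does not depend on any endian helper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helpers, not btrfs-progs functions. Both go through a local
 * byte array with memcpy(), which has no alignment requirement, instead of
 * dereferencing a possibly misaligned uint64_t pointer.
 */
static uint64_t get_le64_unaligned(const unsigned char *buf, size_t offset)
{
	unsigned char raw[8];
	uint64_t val = 0;
	int i;

	assert(buf != NULL);
	memcpy(raw, buf + offset, sizeof(raw));
	/* Little-endian: raw[0] is the least significant byte. */
	for (i = 7; i >= 0; i--)
		val = (val << 8) | raw[i];
	return val;
}

static void set_le64_unaligned(unsigned char *buf, size_t offset, uint64_t val)
{
	unsigned char raw[8];
	int i;

	assert(buf != NULL);
	for (i = 0; i < 8; i++) {
		raw[i] = (unsigned char)(val & 0xff);
		val >>= 8;
	}
	memcpy(buf + offset, raw, sizeof(raw));
}
```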

   I have written a patch which works correctly on my platform.
I have attached this patch to my letter.

Best regards,
   Csaba Tóth, Watt 22 Ltd.


diff -ur btrfs-progs-0.19.orig/ctree.c btrfs-progs-0.19/ctree.c
--- btrfs-progs-0.19.orig/ctree.c	2009-06-11 17:56:15.0 +0100
+++ btrfs-progs-0.19/ctree.c	2012-04-08 20:54:48.012221953 +0100
@@ -722,14 +722,14 @@
 	int mid;
 	int ret;
 	unsigned long offset;
-	struct btrfs_disk_key *tmp;
+	struct btrfs_disk_key tmp;
 
 	while(low < high) {
 		mid = (low + high) / 2;
 		offset = p + mid * item_size;
 
-		tmp = (struct btrfs_disk_key *)(eb->data + offset);
-		ret = btrfs_comp_keys(tmp, key);
+		memcpy(&tmp, eb->data + offset, sizeof(struct btrfs_disk_key));
+		ret = btrfs_comp_keys(&tmp, key);
 
 		if (ret < 0)
 			low = mid + 1;
diff -ur btrfs-progs-0.19.orig/ctree.h btrfs-progs-0.19/ctree.h
--- btrfs-progs-0.19.orig/ctree.h	2009-06-11 17:56:15.0 +0100
+++ btrfs-progs-0.19/ctree.h	2012-04-08 20:25:04.578695193 +0100
@@ -890,14 +890,17 @@
 {	\
 	unsigned long offset = (unsigned long)s;			\
 	type *p = (type *) (eb->data + offset);\
-	return le##bits##_to_cpu(p->member);\
+	u##bits tmp;			\
+	memcpy(&tmp, &(p->member), sizeof(tmp));			\
+	return le##bits##_to_cpu(tmp);	\
 }	\
 static inline void btrfs_set_##name(struct extent_buffer *eb,		\
 type *s, u##bits val)		\
 {	\
 	unsigned long offset = (unsigned long)s;			\
 	type *p = (type *) (eb->data + offset);\
-	p->member = cpu_to_le##bits(val);\
+	u##bits tmp = cpu_to_le##bits(val);\
+	memcpy(&(p->member), &tmp, sizeof(tmp));			\
 }
 
 #define BTRFS_SETGET_STACK_FUNCS(name, type, member, bits)		\



Re: Details about compression and extents

2012-04-12 Thread David Sterba
On Thu, Apr 12, 2012 at 10:27:13AM +0200, Alexander Block wrote:
> 1. How is decided what to compress and what not? After a fast test
> with a 2g image file, I've looked into the extents of that file with
> find-new and it turned out that only some of the first extents were
> compressed. The file was simply copied with cp.

with -o compress mount, a file extent is directly sent to compression
and if this turns out to be inefficient (ie. compressed size is larger
than uncompressed), then the NOCOMPRESS per-inode flag is set and no
further compression is done on that file. Yes this is not ideal and
this should be more adaptive than just bailing out completely after one
incompressible extent.

with -o compress-force the NOCOMPRESS flag is never set, so it may lead
to the other extreme that extents are compressed but it does not reduce
size at all.
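
The two behaviours can be modelled roughly like this. It is a sketch under
stated assumptions, not the kernel code: struct inode_state and
store_compressed() are hypothetical names, the flag is modelled as a plain
int, and with force set the model ignores the flag entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-inode state; models the NOCOMPRESS per-inode flag. */
struct inode_state {
	int nocompress;
};

/*
 * Sketch of the decision described above, not btrfs code.
 * force models -o compress-force; otherwise -o compress is assumed.
 * Returns 1 if the extent is stored compressed, 0 if stored uncompressed.
 */
static int store_compressed(struct inode_state *inode, int force,
			    size_t raw_size, size_t compressed_size)
{
	assert(inode != NULL);

	if (force)
		return 1;	/* always compress, even if nothing is gained */

	if (inode->nocompress)
		return 0;	/* an earlier extent was incompressible */

	if (compressed_size > raw_size) {
		inode->nocompress = 1;	/* give up on this file */
		return 0;
	}
	return 1;
}
```

The third call in a sequence "compressible, incompressible, compressible"
returns 0 in this model, which is the non-adaptive behaviour noted above.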

> 2. I compared the extents of that mentioned file from a non-compressed
> fs and from a compressed fs and see much more and much smaller extents
> in the compressed fs. Does compression affect extent allocation or is
> this just a coincidence?

The extent size is limited to 128k right now, to reduce the performance hit
when doing random reads from the middle. This also means more metadata
is needed for the extents.
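
As a back-of-the-envelope check, assuming every compressed extent covers at
most 128 KiB of file data, a 2 GiB file needs at least 16384 extents. The
helper min_extent_count() below is a hypothetical name used only for this
arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: smallest number of extents needed to cover
 * file_size bytes when no extent may exceed max_extent_size bytes
 * (rounded up).
 */
static uint64_t min_extent_count(uint64_t file_size, uint64_t max_extent_size)
{
	assert(max_extent_size > 0);
	return (file_size + max_extent_size - 1) / max_extent_size;
}
```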

> The source file was in use (VirtualBox was
> running) while I was copying it...if this has too much influence on
> extent allocation then please ignore the whole question.

This should not matter here.

> 3. How large are the blocks that get compressed? If it's dynamic, how
> is decided which size to use?

128k, fixed.

> 4. If there is no maximum on compressed extents size, is the whole
> extent compressed at once or in blocks?

currently in 4k blocks (where 4k == PAGE_CACHE_SIZE), though it also
depends on the compression method used. zlib is able to reuse the
dictionary for another block and the compression ratio is better at the
cost of speed.


david


Re: Boot speed/mount time regression with 3.4.0-rc2

2012-04-12 Thread Ahmet Inan
On Wed, Apr 11, 2012 at 7:04 PM, Josef Bacik  wrote:
> On Wed, Apr 11, 2012 at 05:26:29PM +0200, Ahmet Inan wrote:
>> On Tue, Apr 10, 2012 at 5:16 PM, Josef Bacik  wrote:
>> > On Mon, Apr 09, 2012 at 05:20:46PM -0400, Calvin Walton wrote:
>> >> On Mon, 2012-04-09 at 16:54 -0400, Josef Bacik wrote:
>> >> > On Mon, Apr 09, 2012 at 01:10:04PM -0400, Calvin Walton wrote:
>> >> > > On Mon, 2012-04-09 at 11:53 -0400, Calvin Walton wrote:
>> >> > > > Hi,
>> >> > > >
>> >> > > > I have a system that's using a dracut-generated initramfs to mount a
>> >> > > > btrfs root. After upgrading to kernel 3.4.0-rc2 to test it out, I've
>> >> > > > noticed that the process of mounting the root filesystem takes much
>> >> > > > longer with 3.4.0-rc2 than it did with 3.3.1 - nearly 30 seconds 
>> >> > > > slower!
>> >>
>> >> > > And the bisect results are in:
>> >> > > 285ff5af6ce358e73f53b55c9efadd4335f4c2ff is the first bad commit
>> >> > > commit 285ff5af6ce358e73f53b55c9efadd4335f4c2ff
>> >> > > Author: Josef Bacik 
>> >> > > Date:   Fri Jan 13 15:27:45 2012 -0500
>> >> > >
>> >> > >     Btrfs: remove the ideal caching code>
>> >> >
>> >> > Ok can you give this a whirl?  You are going to have to boot/reboot a 
>> >> > few times
>> >> > to let the cache get re-generated again to make sure it's taken effect, 
>> >> > but
>> >> > hopefully this will help out.  Thanks,
>> >>
>> >> Unfortunately, it doesn't seem to help. Even after 3 or 4 reboots with
>> >> this patch applied I'm still seeing the same delay.
>> >>
>> >
>> > Ok drop that previous patch and give this one a whirl, it helped on my 
>> > laptop.
>> > This is only  half of the problem AFAICS, but it's the easier half to fix, 
>> > in
>> > the meantime I need to lock down why we're not writing out cache for a 
>> > bunch of
>> > block groups, but thats trickier since the messages I need are spit out 
>> > while
>> > I'm shutting down, so I need to get creative.  Let me know if/how much this
>> > helps.  Thanks,
>>
>> i have tried your patch and my system still needs several minutes to boot
>> until it can be used.
>> Also tried to reboot several times - it doesn't look like it's getting better.
>> The last thing the system does when it's shutting down is a read-only
>> remount of "/" so no umount.
>> Booting was much faster before i pulled for-linus a few weeks ago but
>> i couldn't find the time to bisect it yet ..
>>
>> please also look at the attached dmesg.txt.
>> this is an core i3 system with 2x2TB BTRFS RAID1 and lots of
>> home directories and snapshots.
>>
>> I'm going to test this patch on twenty more computers but with
>> smaller HDDs and less files and see if it helps to speed up their
>> boot times.
>>
>
> Ok looks like you are running into a different problem.  Could you maybe run
> bootchart and upload the resulting png somewhere so I can look and see what 
> all
> is running while you boot?  Thanks,

http://aam.mathematik.uni-freiburg.de/IAM/homepages/ainan/bootchart.png

i have tried your patch now on the twenty more computers i mentioned and
still it takes a minute to remount rw "/" on those, even after several reboots.

Ahmet


[PATCH RFC] Btrfs: improve space count for files with fragments

2012-04-12 Thread Liu Bo
Here is a simple scenario:

$ dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k count=20;sync
$ btrfs fi df /mnt/btrfs

we get 20K used, but then

$ dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k count=4 seek=4 conv=notrunc;sync
$ btrfs fi df /mnt/btrfs

we get 24K used.

Here is the problem: a file with lots of fragments can cost nearly double
the space of its i_size, like:

0k               20k
| --- extent --- |     turned to be on disk   <--- extent --->   <-- A -->
  | --- A ---- |                              | ------------ |   | ----- |
  1k           19k                                  20k       +     18k    = 38k

but what users want is   <--- extent --->   <-- A -->
                         | --- |  | --- |   | ----- |
                            1k  +   1k    +    18k    = 20k

18K is wasted.
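
The arithmetic behind both examples can be sketched with a toy model. This is
not btrfs code: `struct usage` and `overwrite_usage` are names made up for
illustration, and the model assumes a single extent with one overwrite that
lies entirely inside it.

```c
#include <stdint.h>

/* Hypothetical result type: what is charged vs. what the file references. */
struct usage {
	uint64_t charged;     /* bytes accounted as used on disk */
	uint64_t referenced;  /* bytes the file actually still points at */
};

/*
 * A file starts as one extent of old_len bytes; ow_len bytes at offset
 * ow_off inside it are then overwritten. With copy-on-write the new data
 * lands in a new extent, and the old extent stays allocated in full as
 * long as any part of it is still referenced.
 * Assumption: ow_off + ow_len <= old_len.
 */
struct usage overwrite_usage(uint64_t old_len, uint64_t ow_off, uint64_t ow_len)
{
	struct usage u;

	if (ow_off == 0 && ow_len >= old_len) {
		/* Old extent is completely replaced, so it is freed as a whole. */
		u.charged = ow_len;
		u.referenced = ow_len;
		return u;
	}

	/* Partial overwrite: old extent remains charged in full. */
	u.charged = old_len + ow_len;
	/* The file references (old_len - ow_len) old bytes plus ow_len new ones. */
	u.referenced = (old_len - ow_len) + ow_len;
	return u;
}
```

Called with (20K, 4K, 4K) this gives 24K charged against 20K referenced, which
is the dd case above; with (20K, 1K, 18K) it gives 38K against 20K, i.e. the
18K of waste.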

With the current backref design, it is really hard to fix this without a format
change.
So my choice is to pick a special inline backref to indicate how much
space the extent costs; the benefit is that it does not need to touch
the backref design too much.

a) When we do a random write on the extent, we update the space of the extent
   by updating the special inline backref.

b) When we free the extent, we use the space recorded in the special
   inline backref to update the space count info.

c) Besides, we must not add the range to the free space in cases like this:

   | --- extent ---|
   | - A - |

   This part, A, includes the _start_ of the extent.
   Our data checksum item uses this _start_ as an index,
   so just leave A's space where it is; do not free it.
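
Rules a) to c) can be sketched as a small standalone model. This is only an
illustration under assumptions, not the code in this patch: `struct
extent_model`, `can_return_range`, `drop_range` and `free_extent` are
hypothetical names, the `space` field stands in for the value kept in the
special inline backref, and the model does not track which ranges were already
dropped.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one data extent. */
struct extent_model {
	uint64_t start;  /* disk bytenr of the extent */
	uint64_t len;    /* original length in bytes */
	uint64_t space;  /* bytes still charged to the extent */
};

/* Rule c): a range may be returned early only if it lies inside the
 * extent and does not include the extent's start. */
bool can_return_range(const struct extent_model *e,
		      uint64_t drop_start, uint64_t drop_len)
{
	uint64_t off;

	if (drop_len == 0 || drop_start < e->start)
		return false;
	off = drop_start - e->start;
	if (off >= e->len || drop_len > e->len - off)
		return false;
	return off != 0;
}

/* Rule a): on a random write, give the overwritten range back and shrink
 * the recorded space. Returns the number of bytes freed now. */
uint64_t drop_range(struct extent_model *e,
		    uint64_t drop_start, uint64_t drop_len)
{
	if (!can_return_range(e, drop_start, drop_len) || drop_len > e->space)
		return 0;
	e->space -= drop_len;
	return drop_len;
}

/* Rule b): on the final free, return only what is still recorded. */
uint64_t free_extent(struct extent_model *e)
{
	uint64_t freed = e->space;

	e->space = 0;
	return freed;
}
```

For a 20K extent, dropping 18K at offset 1K frees 18K right away and leaves 2K
recorded; dropping the first 1K frees nothing because of rule c); the final
free then returns the remaining 2K, so nothing is counted twice.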

NOTE:
   This has a compatibility issue, so please use it only on a freshly created btrfs.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.h       |    4 +
 fs/btrfs/delayed-ref.c |   30 +-
 fs/btrfs/extent-tree.c |  148 +++-
 fs/btrfs/extent_io.c   |   14 +
 fs/btrfs/extent_io.h   |    5 ++
 fs/btrfs/file.c        |  145 +--
 fs/btrfs/inode.c       |   19 +-
 fs/btrfs/relocation.c  |    7 ++
 8 files changed, 334 insertions(+), 38 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5b8ef8e..6d66a23 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2972,6 +2972,10 @@ int btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end,
 extern const struct file_operations btrfs_file_operations;
 int btrfs_drop_extents(struct btrfs_trans_handle *trans, struct inode *inode,
 		       u64 start, u64 end, u64 *hint_byte, int drop_cache);
+int btrfs_return_space_to_free_space(struct btrfs_trans_handle *trans,
+				     struct btrfs_root *root, u64 disk_bytenr,
+				     u64 num_bytes, u64 owner, u64 drop_start,
+				     u64 num_dec);
 int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
 			      struct inode *inode, u64 start, u64 end);
 int btrfs_release_file(struct inode *inode, struct file *file);
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 69f22e3..8a2640a 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -59,6 +59,8 @@ static int comp_data_refs(struct btrfs_delayed_data_ref *ref2,
 			  struct btrfs_delayed_data_ref *ref1)
 {
 	if (ref1->node.type == BTRFS_EXTENT_DATA_REF_KEY) {
+		if (ref2->root == 0 && ref1->root == 0)
+			return 0;
 		if (ref1->root < ref2->root)
 			return -1;
 		if (ref1->root > ref2->root)
@@ -355,6 +357,19 @@ update_existing_ref(struct btrfs_trans_handle *trans,
 		 * reference count
 		 */
 		existing->ref_mod += update->ref_mod;
+
+		/* for ref_root is 0 */
+		{
+			struct btrfs_delayed_data_ref *full_existing_ref;
+			struct btrfs_delayed_data_ref *full_ref;
+
+			full_existing_ref =
+				btrfs_delayed_node_to_data_ref(existing);
+			full_ref = btrfs_delayed_node_to_data_ref(update);
+
+			if (full_existing_ref->root == 0 && full_ref->root == 0)
+				full_existing_ref->offset += full_ref->offset;
+		}
 	}
 }
 
@@ -579,7 +594,10 @@ static noinline void add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	atomic_set(&ref->refs, 1);
 	ref->bytenr = bytenr;
 	ref->num_bytes = num_bytes;
-	ref->ref_mod = 1;
+	if (ref_root != 0)
+		ref->ref_mod = 1;
+	else
+		ref->ref_mod = 0;
 	ref->action = action;
 	ref->is_head = 0;
 	ref->in_tree = 1;
@@ -753,7 +771,15 @@ btrfs_find_delayed_ref_head(struct btrfs_trans_handle *trans, u64 bytenr)
 

Re: Details about compression and extents

2012-04-12 Thread Alexander Block
One question that I forgot.

5. Does the offset in btrfs_file_extent_item point to the compressed
or uncompressed offset inside the extent? I would expect uncompressed
but I could also imagine that it's compressed in case extents are
compressed block-wise.

On Thu, Apr 12, 2012 at 10:27 AM, Alexander Block wrote:
> Hello,
>
> I'm currently trying to understand how compression in btrfs works. I
> could not find any detailed description about it. So here are my
> questions.
>
> 1. How is it decided what to compress and what not? After a quick test
> with a 2G image file, I looked into the extents of that file with
> find-new and it turned out that only some of the first extents were
> compressed. The file was simply copied with cp.
> 2. I compared the extents of that file on a non-compressed
> fs and on a compressed fs and see many more and much smaller extents
> in the compressed fs. Does compression affect extent allocation or is
> this just a coincidence? The source file was in use (VirtualBox was
> running) while I was copying it... if this has too much influence on
> extent allocation then please ignore the whole question.
> 3. How large are the blocks that get compressed? If it's dynamic, how
> is it decided which size to use?
> 4. If there is no maximum on compressed extent size, is the whole
> extent compressed at once or in blocks?
>
> More questions may follow...
>
> Alex.


Details about compression and extents

2012-04-12 Thread Alexander Block
Hello,

I'm currently trying to understand how compression in btrfs works. I
could not find any detailed description about it. So here are my
questions.

1. How is it decided what to compress and what not? After a quick test
with a 2G image file, I looked into the extents of that file with
find-new and it turned out that only some of the first extents were
compressed. The file was simply copied with cp.
2. I compared the extents of that file on a non-compressed
fs and on a compressed fs and see many more and much smaller extents
in the compressed fs. Does compression affect extent allocation or is
this just a coincidence? The source file was in use (VirtualBox was
running) while I was copying it... if this has too much influence on
extent allocation then please ignore the whole question.
3. How large are the blocks that get compressed? If it's dynamic, how
is it decided which size to use?
4. If there is no maximum on compressed extent size, is the whole
extent compressed at once or in blocks?

More questions may follow...

Alex.