Re: What are thoses [btrfs-cache-nnn] kernel threads ?

2011-05-19 Thread Chester
>
> Out of curiosity, why isn't this done automatically as opposed to
> having to mount with the space_cache option?

The space_cache option changes the disk format. Once enabled, it will
be permanent. The mount option gives people the choice of whether they
want to enable space_cache. I've been using it, and it's pretty safe
to use.
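
To check whether a btrfs filesystem is currently mounted with the option, something like the following could work. It is only a sketch: the helper name btrfs_space_cache_mounts and the sample devices are made up for illustration, and it assumes the running kernel lists space_cache among the options in /proc/mounts, which may not hold for every version.

```python
def btrfs_space_cache_mounts(mounts_text):
    """Return mount points of btrfs filesystems whose options include space_cache.

    mounts_text is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        if fstype == "btrfs" and "space_cache" in options.split(","):
            found.append(mountpoint)
    return found


# Hypothetical sample; on a real system pass open("/proc/mounts").read() instead.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /data btrfs rw,relatime,space_cache 0 0
/dev/sdc1 /scratch btrfs rw,relatime 0 0
"""

print(btrfs_space_cache_mounts(SAMPLE))
```

Only the mount whose option list contains the exact word space_cache is reported; the other btrfs mount in the sample is not.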
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What are thoses [btrfs-cache-nnn] kernel threads ?

2011-05-19 Thread Miguel Garrido
On Thu, May 19, 2011 at 2:26 PM, Josef Bacik  wrote:
>
> Yeah so this is a crappy thing about btrfs, we need to cache free space, so we
> have to run these threads to read the extent tree to put together the free
> space cache.  You can get around this by moving to a new kernel and mounting
> with
>
> -o space_cache
>
> This will enable the space caching feature, so you will get those threads
> once, but then every time after that it will be fast and you shouldn't see
> those threads at all.  It's a disk format change, so you only have to mount -o
> space_cache once and then it will be permanent.  Thanks,
>
> Josef

Out of curiosity, why isn't this done automatically as opposed to
having to mount with the space_cache option?


Re: [PATCH 1/9] Btrfs: introduce sub transaction stuff

2011-05-19 Thread liubo
On 05/20/2011 08:23 AM, Chris Mason wrote:
> Excerpts from Liu Bo's message of 2011-05-19 04:11:24 -0400:
>> Introduce a new concept "sub transaction",
>> the relation between transaction and sub transaction is
>>
>> transaction A       ---> transid = x
>>    sub trans a(1)   ---> sub_transid = x+1
>>    sub trans a(2)   ---> sub_transid = x+2
>>      ... ...
>>    sub trans a(n-1) ---> sub_transid = x+n-1
>>    sub trans a(n)   ---> sub_transid = x+n
>> transaction B       ---> transid = x+n+1
>>      ... ...
>>
>> And the most important points are:
>> a) a trans handler's transid now gets its value from sub transid instead of
>>    transid.
>> b) when a transaction commits, transid may not be incremented by 1, but
>>    depends on the biggest sub_transaction of the last neighbour transaction,
>>    i.e.
>>      B->transid = a(n)->transid + 1,
>>      (B->transid - A->transid) >= 1
>> c) we start a new sub transaction after a fsync.
>>
>> We also ship some 'trans->transid' to 'trans->transaction->transid' to
>> ensure btrfs works well and to get rid of WARNings.
>>
>> These are used for the new log code.
> 
> This is exactly what I had in mind.  I need to read it harder and make
> sure it interacts well with the directory logging code, but I love it.
> 
> Thanks!
> 

It's so great that you like it.  :)

But I must NOTE again:
   Due to the bug which patch 8 fixed, the previous performance statistics I
   posted some time ago, like (*SPEED* : 4.7+ Mb/sec), are valueless and cannot
   be used as a basis any more...

Hope that more people can get the patchset tested.

thanks,
liubo

> -chris
> 



Re: [PATCH 1/9] Btrfs: introduce sub transaction stuff

2011-05-19 Thread Chris Mason
Excerpts from Liu Bo's message of 2011-05-19 04:11:24 -0400:
> Introduce a new concept "sub transaction",
> the relation between transaction and sub transaction is
> 
> transaction A       ---> transid = x
>    sub trans a(1)   ---> sub_transid = x+1
>    sub trans a(2)   ---> sub_transid = x+2
>      ... ...
>    sub trans a(n-1) ---> sub_transid = x+n-1
>    sub trans a(n)   ---> sub_transid = x+n
> transaction B       ---> transid = x+n+1
>      ... ...
>
> And the most important points are:
> a) a trans handler's transid now gets its value from sub transid instead of
>    transid.
> b) when a transaction commits, transid may not be incremented by 1, but
>    depends on the biggest sub_transaction of the last neighbour transaction,
>    i.e.
>      B->transid = a(n)->transid + 1,
>      (B->transid - A->transid) >= 1
> c) we start a new sub transaction after a fsync.
> 
> We also ship some 'trans->transid' to 'trans->transaction->transid' to
> ensure btrfs works well and to get rid of WARNings.
> 
> These are used for the new log code.
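
The numbering rules quoted above can be sketched as a small user-space model. This is not the patch's code: the class TransidModel and its methods are hypothetical names, and it assumes one particular reading of the description, namely that each fsync hands out the next sub_transid and a commit starts the next transaction one past the biggest sub_transid.

```python
class TransidModel:
    """Toy model of the numbering in the patch description (not kernel code)."""

    def __init__(self, start):
        self.transid = start       # id of the current transaction
        self.sub_transid = start   # highest id handed out so far

    def fsync(self):
        """Rule c): an fsync starts a new sub transaction with the next id."""
        self.sub_transid += 1
        return self.sub_transid

    def commit(self):
        """Rule b): the next transaction starts one past the biggest sub_transid."""
        previous = self.transid
        self.transid = self.sub_transid + 1
        self.sub_transid = self.transid
        assert self.transid - previous >= 1
        return self.transid


model = TransidModel(start=100)            # transaction A, transid = x = 100
subs = [model.fsync() for _ in range(3)]   # a(1)..a(3)
next_transid = model.commit()              # transaction B, x + n + 1
print(subs, next_transid)
```

With x = 100 and n = 3 the sub transactions get 101, 102 and 103, and transaction B gets 104. With no fsync at all, B is simply A + 1.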

This is exactly what I had in mind.  I need to read it harder and make
sure it interacts well with the directory logging code, but I love it.

Thanks!

-chris


Re: ssd option for USB flash drive?

2011-05-19 Thread cwillu
On Thu, May 19, 2011 at 4:12 PM, Stephane Chazelas
 wrote:
> 2011-05-19 15:54:23 -0600, cwillu:
> [...]
>> Try with the "ssd_spread" mount option.
> [...]
>
> Thanks. I'll try that.
>
>> > I wonder now what credit to give to recommendations like in
>> > http://www.patriotmemory.com/forums/showthread.php?3696-HOWTO-Increase-write-speed-by-aligning-FAT32
>> > http://linux-howto-guide.blogspot.com/2009/10/increase-usb-flash-drive-write-speed.html
>> >
> >> > Doing an apt-get upgrade on that stick takes hours when the same
>> > takes a few minutes on an internal drive.
>>
>> Also, there's a package "libeatmydata" which will provide an
>> "eatmydata" command, which you can prefix your apt-get commands with.
>> This will disable the excessive sync calls that dpkg makes, and should
>> dramatically decrease the time for those sorts of things to finish.
>> Flash as found in thumb drives doesn't have much in the way of crash
>> guarantees anyway, so you're not really giving up much safety.
>
> Thanks. That's very useful indeed.
>
> > Note that if you use that on aptitude/apt-get that means that
> the daemons started/restarted in the process will be affected,
> but it could be all the better in my case.

Heh, that's a thought I hadn't actually considered :p

That shouldn't affect any services that are managed by message
passing, and so really should be limited to those services from
/etc/init.d/ that don't restart themselves (i.e., where the restart
command is implemented by stop + start rather than telling the already
running process to re-execute), or newly installed services that again
are managed via /etc/init.d/.


Re: ssd option for USB flash drive?

2011-05-19 Thread Stephane Chazelas
2011-05-19 15:54:23 -0600, cwillu:
[...]
> Try with the "ssd_spread" mount option.
[...]

Thanks. I'll try that.

> > I wonder now what credit to give to recommendations like in
> > http://www.patriotmemory.com/forums/showthread.php?3696-HOWTO-Increase-write-speed-by-aligning-FAT32
> > http://linux-howto-guide.blogspot.com/2009/10/increase-usb-flash-drive-write-speed.html
> >
> > Doing an apt-get upgrade on that stick takes hours when the same
> > takes a few minutes on an internal drive.
> 
> Also, there's a package "libeatmydata" which will provide an
> "eatmydata" command, which you can prefix your apt-get commands with.
> This will disable the excessive sync calls that dpkg makes, and should
> dramatically decrease the time for those sorts of things to finish.
> Flash as found in thumb drives doesn't have much in the way of crash
> guarantees anyway, so you're not really giving up much safety.

Thanks. That's very useful indeed.

Note that if you use that on aptitude/apt-get that means that
the daemons started/restarted in the process will be affected,
but it could be all the better in my case.

Now, with that eatmydata, I'm thinking of trying qemu-nbd -c
/dev/nbd0 /dev/mapper/original-device with that and have the
rootfs mounted on that /dev/nbd0.

That eatmydata could be a workaround for the problem I was
mentioning at
https://lists.ubuntu.com/archives/ubuntu-server-bugs/2010-June/037846.html

-- 
Stephane


Re: ssd option for USB flash drive?

2011-05-19 Thread cwillu
> [...]
>> aligning logical blocks to erase blocks can give some performance but the
>> only way to make it really fast is not to use USB
> [...]
>
> For something that fits in your pocket and is almost
> universally bootable, there are not so many other options.

An ssd drive in a USB enclosure is about the size of a cell phone,
just a thought.

> I tried changing the alignment on FAT32 and it didn't make
> any difference. Playing with /proc/sys/vm/block_dump, I could see
> chunks of 3, 4, 5 data sectors being written at once regardless
> of the cluster size being used anyway. Interestingly when a user
> process writes to /dev/sdx, block_dump shows 4k writes to
> /dev/sdx only regardless of the size of the user writes while if
> it goes via the filesystem I can see writes of up to 120k. Also,
> I've very little knowledge of what happens at layers below the
> block device (scsi interface, usb-storage, and the device
> controller itself, for instance, I see
> /sys/block/sdi/queue/rotational is 1 for that usb stick, why,
> what does that mean in terms of performance and scheduling of
> read-writes?)

Try with the "ssd_spread" mount option.

> I wonder now what credit to give to recommendations like in
> http://www.patriotmemory.com/forums/showthread.php?3696-HOWTO-Increase-write-speed-by-aligning-FAT32
> http://linux-howto-guide.blogspot.com/2009/10/increase-usb-flash-drive-write-speed.html
>
> Doing an apt-get upgrade on that stick takes hours when the same
> takes a few minutes on an internal drive.

Also, there's a package "libeatmydata" which will provide an
"eatmydata" command, which you can prefix your apt-get commands with.
This will disable the excessive sync calls that dpkg makes, and should
dramatically decrease the time for those sorts of things to finish.
Flash as found in thumb drives doesn't have much in the way of crash
guarantees anyway, so you're not really giving up much safety.
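
As far as I know, eatmydata works by preloading a library that turns fsync() and related calls into no-ops. The effect of per-file syncing can be tried out with a small sketch like this one; the helper write_small_files is a made-up name and the file count and size are arbitrary, so the absolute numbers only mean something relative to each other on the same device.

```python
import os
import tempfile
import time


def write_small_files(directory, count, sync):
    """Write `count` small files; call fsync() after each one when sync is True."""
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, "file%04d" % i)
        with open(path, "wb") as f:
            f.write(b"x" * 4096)
            if sync:
                f.flush()
                os.fsync(f.fileno())  # the kind of call eatmydata disables
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as d_sync, tempfile.TemporaryDirectory() as d_nosync:
    t_sync = write_small_files(d_sync, 200, sync=True)
    t_nosync = write_small_files(d_nosync, 200, sync=False)
    files_written = (len(os.listdir(d_sync)), len(os.listdir(d_nosync)))

print("with fsync:    %.3f s" % t_sync)
print("without fsync: %.3f s" % t_nosync)
```

To measure the flash drive rather than the system's temp directory, pass dir=<mount point of the stick> to TemporaryDirectory.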


Re: ssd option for USB flash drive?

2011-05-19 Thread Stephane Chazelas
2011-05-19 21:04:54 +0200, Hubert Kario:
> On Wednesday 18 of May 2011 00:02:52 Stephane Chazelas wrote:
> > Hiya,
> > 
> > I've not found much detail on what the "ssd" btrfs mount option
> > did. Would it make sense to enable it to a fs on a USB flash
> > drive?
> 
> yes, enabling discard is pointless though (no USB storage supports it AFAIK).
>  
> > I'm using btrfs (over LVM) on a Live Linux USB stick to benefit
> > from btrfs's compression and am trying to improve the
> > performance.
> 
> ssd mode won't improve performance by much (if any).
> 
> You need to remember that USB2.0 is limited to about 20-30MiB/s (depending
> on CPU) so it will be slow no matter what you do

Thanks Hubert for the feedback.

Well, for hard drives over USB, I can get to 40MiB/s read and
write easily. Here, I believe the bottleneck is the flash
memory. With that particular USB flash drive (Corsair Voyager GT
16GB), I can get 25MiB/s sequential read and 17MiB/s sequential
write, but that falls to about 3-5MiB/s for random writes.

[...]
> aligning logical blocks to erase blocks can give some performance but the
> only way to make it really fast is not to use USB
[...]

For something that fits in your pocket and is almost
universally bootable, there are not so many other options.

I tried changing the alignment on FAT32 and it didn't make
any difference. Playing with /proc/sys/vm/block_dump, I could see
chunks of 3, 4, 5 data sectors being written at once regardless
of the cluster size being used anyway. Interestingly when a user
process writes to /dev/sdx, block_dump shows 4k writes to
/dev/sdx only regardless of the size of the user writes while if
it goes via the filesystem I can see writes of up to 120k. Also,
I've very little knowledge of what happens at layers below the
block device (scsi interface, usb-storage, and the device
controller itself, for instance, I see
/sys/block/sdi/queue/rotational is 1 for that usb stick, why,
what does that mean in terms of performance and scheduling of
read-writes?)
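
For reference, the rotational flag can be read programmatically with a sketch like the one below. The helper is_rotational is a hypothetical name and the demo runs against a fake directory tree so that it works anywhere; on a real system the default root /sys/block applies. A value of 1 means the kernel treats the device as a rotating disk. My understanding is that this is the default unless the driver or a udev rule marks the device as non-rotational, which would explain seeing 1 for a USB stick.

```python
import os
import tempfile


def is_rotational(device, sysfs_root="/sys/block"):
    """Return True/False from <sysfs_root>/<device>/queue/rotational, or None if absent."""
    path = os.path.join(sysfs_root, device, "queue", "rotational")
    try:
        with open(path) as f:
            return f.read().strip() == "1"
    except OSError:
        return None


# Self-contained demo against a fake sysfs tree (hypothetical layout for testing).
with tempfile.TemporaryDirectory() as fake_sysfs:
    queue_dir = os.path.join(fake_sysfs, "sdi", "queue")
    os.makedirs(queue_dir)
    with open(os.path.join(queue_dir, "rotational"), "w") as f:
        f.write("1\n")
    demo_present = is_rotational("sdi", sysfs_root=fake_sysfs)
    demo_missing = is_rotational("sdz", sysfs_root=fake_sysfs)

print(demo_present, demo_missing)
```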

I wonder now what credit to give to recommendations like in
http://www.patriotmemory.com/forums/showthread.php?3696-HOWTO-Increase-write-speed-by-aligning-FAT32
http://linux-howto-guide.blogspot.com/2009/10/increase-usb-flash-drive-write-speed.html

Doing an apt-get upgrade on that stick takes hours when the same
takes a few minutes on an internal drive.

If I boot a kvm virtual machine on that USB stick with a disk
cache mode of "unsafe", that is, writes are hardly ever flushed
to underlying storage, then that becomes lightning fast (at the
expense of possibly losing data in case of host failure, but I'm
not too worried about that), and flushing writes to the device
upon VM shutdown only takes a couple of minutes.

So I figured that if I could make sure writing to the flash
device is asynchronous (and reads privileged), that would help.

There's probably some solutions with aufs or some fuse
solutions, but I thought there might be some solution in btrfs
or some standard core layers usually underneath it.

-- 
Stephane


Re: [RFC][PATCH 1/2] vfs: allow /proc/pid/maps to return a custom device

2011-05-19 Thread Mark Fasheh
On Sat, May 14, 2011 at 08:06:04PM -0700, Eric W. Biederman wrote:
> Mark Fasheh  writes:
> 
> > This patch introduces a callback in the super_operations structure,
> > 'get_maps_dev' which is then used by procfs to query which device to return
> > for reporting in /proc/[PID]/maps.
> 
> No.
> 
> It may make sense to call the vfs stat method.  But introducing an extra
> vfs operations for this seems like a maintenance nightmare.

Yeah I'm not thrilled with the extra method either. My concern with using
->getattr is whether it's too heavy since that implies potential disk /
network i/o.
--Mark

--
Mark Fasheh


Re: ssd option for USB flash drive?

2011-05-19 Thread Hubert Kario
Sorry, looks like the list mailer doesn't like S/MIME messages.

On Thursday 19 of May 2011 21:04:54 Hubert Kario wrote:
> On Wednesday 18 of May 2011 00:02:52 Stephane Chazelas wrote:
> > Hiya,
> > 
> > I've not found much detail on what the "ssd" btrfs mount option
> > did. Would it make sense to enable it to a fs on a USB flash
> > drive?
> 
> yes, enabling discard is pointless though (no USB storage supports it
> AFAIK).
> 
> > I'm using btrfs (over LVM) on a Live Linux USB stick to benefit
> > from btrfs's compression and am trying to improve the
> > performance.
> 
> ssd mode won't improve performance by much (if any).
> 
> You need to remember that USB2.0 is limited to about 20-30MiB/s (depending
> on CPU) so it will be slow no matter what you do
> 
> > Would anybody have any recommendation on how to improve
> > performance there? Like what would be the best way to
> > enable/increase writeback buffer or any way to make sure writes
> > are delayed and asynchronous? Would disabling read-ahead help?
> > (at which level would it be done?). Any other tip (like
> > disabling atime, aligning blocks/extents, figure out erase block
> > sizes if relevant...)?
> 
> aligning logical blocks to erase blocks can give some performance but the
> only way to make it really fast is not to use USB
> 
> > Many thanks in advance,
> > Stephane

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl


Re: [PATCH 1/2] fs: add a DCACHE_NEED_LOOKUP flag for d_flags

2011-05-19 Thread Josef Bacik
On Thu, May 19, 2011 at 01:03:18PM -0600, Andreas Dilger wrote:
> On May 19, 2011, at 11:58, Josef Bacik wrote:
> > Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order
> > for readdir, but unfortunately in the case of anything that stats the files in
> > order that readdir spits back (like oh say ls) that means we still have to do
> > the normal lookup of the file, which means looking up our other index and then
> > looking up the inode.  What I want is a way to create dummy dentries when we
> > find them in readdir so that when ls or anything else subsequently does a
> > stat(), we already have the location information in the dentry and can go
> > straight to the inode itself.  The lookup stuff just assumes that if it finds a
> > dentry it is done, it doesn't perform a lookup.  So add a DCACHE_NEED_LOOKUP
> > flag so that the lookup code knows it still needs to run i_op->lookup() on the
> > parent to get the inode for the dentry.  I have tested this with btrfs and I
> > went from something that looks like this
> > 
> > http://people.redhat.com/jwhiter/ls-noreada.png
> > 
> > To this
> > 
> > http://people.redhat.com/jwhiter/ls-good.png
> > 
> > That's a savings of 1300 seconds, or 22 minutes.  That is a significant
> > savings.
> > Thanks,
> 
> This comment should probably mention the number of files being tested, in order
> to make a 1300s savings meaningful.  Similarly, it would be better to provide
> the absolute times of tests in case these URLs disappear in the future.
> 
> "That reduces the time to do "ls -l" on a 1M file directory from 2181s to 855s."
> 

Good point, I will include that in my next posting.

> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/namei.c             |   48
> >  include/linux/dcache.h |    1 +
> >  2 files changed, 49 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/namei.c b/fs/namei.c
> > index e3c4f11..a1bff4f 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -1198,6 +1198,29 @@ static struct dentry *d_alloc_and_lookup(struct dentry *parent,
> >  }
> >
> >  /*
> > + * We already have a dentry, but require a lookup to be performed on the parent
> > + * directory to fill in d_inode. Returns the new dentry, or ERR_PTR on error.
> > + * parent->d_inode->i_mutex must be held. d_lookup must have verified that no
> > + * child exists while under i_mutex.
> > + */
> > +static struct dentry *d_inode_lookup(struct dentry *parent, struct dentry *dentry,
> > +				     struct nameidata *nd)
> > +{
> > +	struct inode *inode = parent->d_inode;
> > +	struct dentry *old;
> > +
> > +	/* Don't create child dentry for a dead directory. */
> > +	if (unlikely(IS_DEADDIR(inode)))
> > +		return ERR_PTR(-ENOENT);
> > +
> > +	old = inode->i_op->lookup(inode, dentry, nd);
> > +	if (unlikely(old)) {
> > +		dput(dentry);
> > +		dentry = old;
> > +	}
> > +	return dentry;
> > +}
> > +/*
> >   *  It's more convoluted than I'd like it to be, but... it's still fairly
> >   *  small and for now I'd prefer to have fast path as straight as possible.
> >   *  It _is_ time-critical.
> > @@ -1236,6 +1259,13 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
> >  				goto unlazy;
> >  			}
> >  		}
> > +		if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
> > +			if (nameidata_dentry_drop_rcu(nd, dentry))
> > +				return -ECHILD;
> > +			dput(dentry);
> > +			dentry = NULL;
> > +			goto retry;
> > +		}
> >  		path->mnt = mnt;
> >  		path->dentry = dentry;
> >  		if (likely(__follow_mount_rcu(nd, path, inode, false)))
> > @@ -1250,6 +1280,12 @@ unlazy:
> >  		}
> >  	} else {
> >  		dentry = __d_lookup(parent, name);
> > +		if (unlikely(!dentry))
> > +			goto retry;
> > +		if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
> > +			dput(dentry);
> > +			dentry = NULL;
> > +		}
> >  	}
> >
> >  retry:
> > @@ -1268,6 +1304,18 @@ retry:
> >  			/* known good */
> >  			need_reval = 0;
> >  			status = 1;
> > +		} else if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
> > +			struct dentry *old;
> > +
> > +			dentry->d_flags &= ~DCACHE_NEED_LOOKUP;
> > +			dentry = d_inode_lookup(parent, dentry, nd);
> 
> Would it make sense to keep DCACHE_NEED_LOOKUP set in d_flags until _after_
> the call to d_inode_lookup()?  That way the filesystem can positively know
> it is doing the inode lookup from d_fsdata, instead of just inferring it
> from the presence of d_fsdata?  It is already the filesystem that is setting
> D

Re: ssd option for USB flash drive?

2011-05-19 Thread Hubert Kario


smime.p7m
Description: S/MIME encrypted message


Re: [PATCH 1/2] fs: add a DCACHE_NEED_LOOKUP flag for d_flags

2011-05-19 Thread Andreas Dilger
On May 19, 2011, at 11:58, Josef Bacik wrote:
> Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order
> for readdir, but unfortunately in the case of anything that stats the files in
> order that readdir spits back (like oh say ls) that means we still have to do
> the normal lookup of the file, which means looking up our other index and then
> looking up the inode.  What I want is a way to create dummy dentries when we
> find them in readdir so that when ls or anything else subsequently does a
> stat(), we already have the location information in the dentry and can go
> straight to the inode itself.  The lookup stuff just assumes that if it finds a
> dentry it is done, it doesn't perform a lookup.  So add a DCACHE_NEED_LOOKUP
> flag so that the lookup code knows it still needs to run i_op->lookup() on the
> parent to get the inode for the dentry.  I have tested this with btrfs and I
> went from something that looks like this
> 
> http://people.redhat.com/jwhiter/ls-noreada.png
> 
> To this
> 
> http://people.redhat.com/jwhiter/ls-good.png
> 
> That's a savings of 1300 seconds, or 22 minutes.  That is a significant
> savings.
> Thanks,

This comment should probably mention the number of files being tested, in order
to make a 1300s savings meaningful.  Similarly, it would be better to provide
the absolute times of tests in case these URLs disappear in the future.

"That reduces the time to do "ls -l" on a 1M file directory from 2181s to 855s."
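
For what it's worth, the arithmetic behind that sentence can be checked with a few lines. This is only a worked calculation; it assumes the 2181s and 855s figures are the measured times and that "1M" means one million files.

```python
FILES = 1_000_000   # "1M file directory" from the suggested wording
BEFORE_S = 2181     # seconds without the patch
AFTER_S = 855       # seconds with the patch

saved_s = BEFORE_S - AFTER_S
saved_min = saved_s / 60
speedup = BEFORE_S / AFTER_S
per_file_before_ms = BEFORE_S / FILES * 1000
per_file_after_ms = AFTER_S / FILES * 1000

print("saved: %d s (%.1f min), speedup %.2fx" % (saved_s, saved_min, speedup))
print("per file: %.3f ms -> %.3f ms" % (per_file_before_ms, per_file_after_ms))
```

That gives 1326 seconds saved (about 22.1 minutes), in line with the "1300 seconds, or 22 minutes" in the patch description, and roughly a 2.55x speedup.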

> Signed-off-by: Josef Bacik 
> ---
> fs/namei.c |   48 
> include/linux/dcache.h |1 +
> 2 files changed, 49 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index e3c4f11..a1bff4f 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1198,6 +1198,29 @@ static struct dentry *d_alloc_and_lookup(struct dentry *parent,
>  }
>
>  /*
> + * We already have a dentry, but require a lookup to be performed on the parent
> + * directory to fill in d_inode. Returns the new dentry, or ERR_PTR on error.
> + * parent->d_inode->i_mutex must be held. d_lookup must have verified that no
> + * child exists while under i_mutex.
> + */
> +static struct dentry *d_inode_lookup(struct dentry *parent, struct dentry *dentry,
> +				     struct nameidata *nd)
> +{
> +	struct inode *inode = parent->d_inode;
> +	struct dentry *old;
> +
> +	/* Don't create child dentry for a dead directory. */
> +	if (unlikely(IS_DEADDIR(inode)))
> +		return ERR_PTR(-ENOENT);
> +
> +	old = inode->i_op->lookup(inode, dentry, nd);
> +	if (unlikely(old)) {
> +		dput(dentry);
> +		dentry = old;
> +	}
> +	return dentry;
> +}
> +/*
>   *  It's more convoluted than I'd like it to be, but... it's still fairly
>   *  small and for now I'd prefer to have fast path as straight as possible.
>   *  It _is_ time-critical.
> @@ -1236,6 +1259,13 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
>  				goto unlazy;
>  			}
>  		}
> +		if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
> +			if (nameidata_dentry_drop_rcu(nd, dentry))
> +				return -ECHILD;
> +			dput(dentry);
> +			dentry = NULL;
> +			goto retry;
> +		}
>  		path->mnt = mnt;
>  		path->dentry = dentry;
>  		if (likely(__follow_mount_rcu(nd, path, inode, false)))
> @@ -1250,6 +1280,12 @@ unlazy:
>  		}
>  	} else {
>  		dentry = __d_lookup(parent, name);
> +		if (unlikely(!dentry))
> +			goto retry;
> +		if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
> +			dput(dentry);
> +			dentry = NULL;
> +		}
>  	}
>
>  retry:
> @@ -1268,6 +1304,18 @@ retry:
>  			/* known good */
>  			need_reval = 0;
>  			status = 1;
> +		} else if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
> +			struct dentry *old;
> +
> +			dentry->d_flags &= ~DCACHE_NEED_LOOKUP;
> +			dentry = d_inode_lookup(parent, dentry, nd);

Would it make sense to keep DCACHE_NEED_LOOKUP set in d_flags until _after_
the call to d_inode_lookup()?  That way the filesystem can positively know
it is doing the inode lookup from d_fsdata, instead of just inferring it
from the presence of d_fsdata?  It is already the filesystem that is setting
DCACHE_NEED_LOOKUP, so it should really be the one clearing this flag also.

I'm concerned that there may be filesystems that need d_fsdata for something
already, so the presence/absence of d_fsdata is not a clear indication to
the underlying filesystem of whether to do an inode lookup based on d_fsdata,
which mig

Re: What are thoses [btrfs-cache-nnn] kernel threads ?

2011-05-19 Thread Josef Bacik
On Wed, May 18, 2011 at 10:37:17PM -0400, Christian Robert wrote:
> hi,
> 
>  every day at around 17:00 (but today at 18:38) I start
>  a multithreaded job (a la make -j4) which rsyncs
>  257 huge directories from a remote host to my
>  machine (on kernel 2.6.39-rc7 since yesterday)
> 
> below is the load average on the machine
> 
> 2011-05-18 18:36 -> 0.00 0.11 0.13 1/218 1923
> 2011-05-18 18:37 -> 0.00 0.09 0.13 1/218 1923
> 2011-05-18 18:38 -> 0.42 0.17 0.15 1/240 2275  <- job started here
> 2011-05-18 18:39 -> 0.62 0.26 0.18 1/243 2287
> 2011-05-18 18:40 -> 1.78 0.64 0.32 1/244 2288
> 2011-05-18 18:41 -> 1.45 0.76 0.38 1/244 2297
> 2011-05-18 18:42 -> 1.94 1.02 0.49 1/244 2310
> 2011-05-18 18:43 -> 2.27 1.28 0.62 1/242 2325
> 2011-05-18 18:44 -> 2.39 1.45 0.71 1/247 2333
> 2011-05-18 18:45 -> 2.39 1.61 0.81 1/246 2359
> 2011-05-18 18:46 -> 3.22 1.95 0.98 1/248 2375
> 2011-05-18 18:47 -> 3.16 2.17 1.12 1/247 2386
> 2011-05-18 18:48 -> 3.40 2.39 1.25 1/249 2394
> 2011-05-18 18:49 -> 4.33 2.82 1.47 1/249 2395
> 2011-05-18 18:50 -> 3.93 3.01 1.63 1/246 2406
> 2011-05-18 18:51 -> 3.81 3.14 1.75 1/248 2410
> 2011-05-18 18:52 -> 3.92 3.32 1.90 1/247 2417
> 2011-05-18 18:53 -> 2.62 3.06 1.90 1/246 2431
> 2011-05-18 18:54 -> 3.52 3.22 2.03 1/248 2433
> 2011-05-18 18:55 -> 3.84 3.38 2.16 1/247 2439
> 2011-05-18 18:56 -> 2.91 3.26 2.20 1/245 2450
> 2011-05-18 18:57 -> 2.75 3.11 2.22 1/248 2457
> 2011-05-18 18:58 -> 3.59 3.29 2.33 1/248 2458
> 2011-05-18 18:59 -> 3.41 3.30 2.40 1/248 2468
> 2011-05-18 19:00 -> 3.76 3.40 2.49 1/247 2477
> 2011-05-18 19:01 -> 4.28 3.60 2.61 1/248 2488
> 2011-05-18 19:02 -> 4.06 3.67 2.70 1/247 2502
> 2011-05-18 19:03 -> 2.16 3.21 2.60 1/244 2515
> 2011-05-18 19:04 -> 2.81 3.17 2.62 1/247 2518
> 2011-05-18 19:05 -> 3.23 3.24 2.68 1/246 2522
> 2011-05-18 19:06 -> 3.53 3.32 2.74 1/246 2544
> 2011-05-18 19:07 -> 2.89 3.13 2.71 1/245 2550
> 2011-05-18 19:08 -> 3.68 3.28 2.79 1/247 2557
> 2011-05-18 19:09 -> 4.47 3.62 2.94 2/242 2571
> 2011-05-18 19:10 -> 3.20 3.39 2.90 1/244 2577
> 2011-05-18 19:11 -> 5.35 4.03 3.16 1/244 2584
> 2011-05-18 19:12 -> 5.07 4.20 3.26 3/245 2588
> 2011-05-18 19:13 -> 2.69 3.72 3.16 1/241 2602
> 2011-05-18 19:14 -> 3.37 3.70 3.19 1/244 2605
> 2011-05-18 19:15 -> 3.55 3.71 3.22 1/242 2611
> 2011-05-18 19:16 -> 2.94 3.48 3.18 1/243 2621
> 2011-05-18 19:17 -> 3.15 3.45 3.19 1/241 2630
> 2011-05-18 19:18 -> 3.02 3.37 3.17 1/242 2639
> 2011-05-18 19:19 -> 3.47 3.40 3.19 1/243 2649
> 2011-05-18 19:20 -> 3.71 3.46 3.23 1/243 2650
> 2011-05-18 19:21 -> 2.64 3.22 3.16 1/241 2664
> 2011-05-18 19:22 -> 3.48 3.31 3.19 1/244 2671
> 2011-05-18 19:23 -> 4.57 3.63 3.31 1/243 2675
> 2011-05-18 19:24 -> 4.06 3.66 3.34 1/242 2684
> 2011-05-18 19:25 -> 5.34 4.06 3.49 1/250 2699
> 2011-05-18 19:26 -> 6.18 4.57 3.71 1/244 2707
> 2011-05-18 19:27 -> 5.21 4.54 3.75 1/244 2711
> 2011-05-18 19:28 -> 4.17 4.37 3.74 1/244 2721
> 2011-05-18 19:29 -> 4.14 4.34 3.77 1/243 2728
> 2011-05-18 19:30 -> 4.52 4.40 3.82 1/243 2734
> 2011-05-18 19:31 -> 4.74 4.50 3.90 1/244 2743
> 2011-05-18 19:32 -> 5.09 4.60 3.97 3/244 2754
> 2011-05-18 19:33 -> 4.59 4.60 4.01 1/242 2765
> 2011-05-18 19:34 -> 4.39 4.53 4.02 1/243 2769
> 2011-05-18 19:35 -> 4.75 4.60 4.08 1/243 2773
> 2011-05-18 19:36 -> 4.98 4.66 4.13 2/245 2783
> 2011-05-18 19:37 -> 4.29 4.55 4.13 1/245 2797
> 2011-05-18 19:38 -> 7.16 5.27 4.39 1/243 2809
> 2011-05-18 19:39 -> 6.31 5.31 4.46 1/247 2815
> 2011-05-18 19:40 -> 6.29 5.40 4.54 1/251 2827
> 2011-05-18 19:41 -> 6.31 5.66 4.69 2/245 2838
> 2011-05-18 19:42 -> 4.43 5.25 4.61 1/243 2848
> 2011-05-18 19:43 -> 5.13 5.30 4.67 1/247 2856
> 2011-05-18 19:44 -> 5.39 5.31 4.71 1/243 2866
> 2011-05-18 19:45 -> 3.95 4.94 4.62 2/242 2874
> 2011-05-18 19:46 -> 4.59 4.99 4.66 1/243 2884
> 2011-05-18 19:47 -> 4.97 5.06 4.71 1/243 2889
> 2011-05-18 19:48 -> 5.51 5.25 4.79 1/245 2899
> 2011-05-18 19:49 -> 5.19 5.22 4.82 1/242 2905
> 2011-05-18 19:50 -> 5.21 5.18 4.82 2/246 2911
> 2011-05-18 19:51 -> 6.80 5.55 4.97 1/248 2918
> 2011-05-18 19:52 -> 6.11 5.67 5.05 1/241 2924
> 2011-05-18 19:53 -> 5.89 5.72 5.11 3/240 2938
> 2011-05-18 19:54 -> 3.96 5.20 4.96 1/241 2944
> 2011-05-18 19:55 -> 5.03 5.32 5.02 1/244 2952
> 2011-05-18 19:56 -> 4.81 5.22 5.01 1/243 2954
> 2011-05-18 19:57 -> 5.01 5.22 5.03 1/242 2958
> 2011-05-18 19:58 -> 4.82 5.15 5.02 1/242 2965
> 2011-05-18 19:59 -> 4.37 4.99 4.98 1/241 2978
> 2011-05-18 20:00 -> 4.93 4.98 4.97 3/243 2985
> 2011-05-18 20:01 -> 4.56 4.85 4.93 1/243 2995
> 2011-05-18 20:02 -> 4.41 4.76 4.89 2/243 3012
> 2011-05-18 20:03 -> 4.06 4.59 4.82 2/245 3022
> 2011-05-18 20:04 -> 3.94 4.47 4.77 2/243 3029
> 2011-05-18 20:05 -> 4.85 4.60 4.79 1/243 3034
> 2011-05-18 20:06 -> 4.90 4.71 4.82 1/242 3047
> 2011-05-18 20:07 -> 4.50 4.61 4.78 2/244 3055
> 2011-05-18 20:08 -> 4.06 4.48 4.72 2/245 3061
> 2011-05-18 20:09 -> 4.34 4.52 4.72 1/242 3070
> 2011-05-18 20:10 -> 4.95 4.63 4.75 1/246 3077
> 2011-05-18 20:11 -> 4.27 4.55 4.72 1/241 3089
> 2011-05-18 20:12 -> 4.44 4.58 4.72 1/241 309
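
Each line of the log above looks like a timestamp followed by the contents of /proc/loadavg (1, 5 and 15 minute load averages, running/total tasks, last PID). A rough sketch for pulling the peak out of such a log follows; the function name parse_load_line is made up, and the sample lines are copied from the log.

```python
def parse_load_line(line):
    """Parse 'YYYY-MM-DD HH:MM -> <contents of /proc/loadavg>' into a dict."""
    line = line.lstrip("> ").strip()            # tolerate mail quoting
    stamp, _, rest = line.partition(" -> ")
    load1, load5, load15, tasks, last_pid = rest.split()[:5]
    running, total = tasks.split("/")
    return {
        "time": stamp,
        "load1": float(load1),
        "load5": float(load5),
        "load15": float(load15),
        "running": int(running),
        "total": int(total),
        "last_pid": int(last_pid),
    }


SAMPLE = [
    "> 2011-05-18 18:37 -> 0.00 0.09 0.13 1/218 1923",
    "> 2011-05-18 18:38 -> 0.42 0.17 0.15 1/240 2275  <- job started here",
    "> 2011-05-18 19:26 -> 6.18 4.57 3.71 1/244 2707",
    "> 2011-05-18 19:38 -> 7.16 5.27 4.39 1/243 2809",
]

rows = [parse_load_line(line) for line in SAMPLE]
peak = max(rows, key=lambda r: r["load1"])
print(peak["time"], peak["load1"])
```

Trailing annotations such as "<- job started here" are ignored because only the first five fields after the arrow are used.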

[PATCH 2/2] Btrfs: load the key from the dir item in readdir into a fake dentry

2011-05-19 Thread Josef Bacik
In btrfs we have 2 indexes for inodes.  One is for readdir, it's in this nice
sequential order and works out brilliantly for readdir.  However if you use ls,
it usually stat's each file it gets from readdir.  This is where the second
index comes in, which is based on a hash of the name of the file.  So then the
lookup has to look up this index, and then look up the inode.  The index lookup
is going to be in random order (since it's based on the name hash), which gives
us less than stellar performance.  Since we know the inode location from the
readdir index, I create a dummy dentry and copy the location key into
dentry->d_fsdata.  Then on lookup if we have d_fsdata we use that location to
look up the inode, avoiding looking up the other directory index.  Thanks,
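
The idea described above can be illustrated with a user-space toy model. It is not btrfs code: ToyDirectory, ls_l and the key values are invented for the illustration, a plain dict stands in for the name-hash index, and cached_keys plays the role of dentry->d_fsdata. It only counts how many name-index lookups an "ls -l" style pass needs with and without keys cached during readdir.

```python
class ToyDirectory:
    """Toy model of a directory with two indexes (not btrfs code)."""

    def __init__(self, names):
        # readdir index: sequential order -> (name, inode key)
        self.by_sequence = [(name, 1000 + i) for i, name in enumerate(names)]
        # lookup index: keyed by name, standing in for the name-hash index
        self.by_name = {name: key for name, key in self.by_sequence}
        self.name_index_lookups = 0
        self.cached_keys = {}  # plays the role of dentry->d_fsdata

    def readdir(self, cache_keys):
        """Yield names in sequence order, optionally remembering each key."""
        for name, key in self.by_sequence:
            if cache_keys:
                self.cached_keys[name] = key
            yield name

    def stat(self, name):
        """Return the inode key, using the cached key when one is present."""
        key = self.cached_keys.pop(name, None)
        if key is None:
            self.name_index_lookups += 1   # the expensive, random-order lookup
            key = self.by_name[name]
        return key


def ls_l(directory, cache_keys):
    """readdir the whole directory, then stat every name, like ls -l."""
    names = list(directory.readdir(cache_keys))
    return [directory.stat(name) for name in names]


names = ["a.txt", "b.txt", "c.txt"]
plain = ToyDirectory(names)
cached = ToyDirectory(names)
keys_plain = ls_l(plain, cache_keys=False)
keys_cached = ls_l(cached, cache_keys=True)
print(plain.name_index_lookups, cached.name_index_lookups)
```

Both passes return the same keys, but the pass that cached keys during readdir never touches the name index.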

Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c |   51 +--
 1 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6228a30..3cd246c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4120,12 +4120,23 @@ struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry)
struct btrfs_root *sub_root = root;
struct btrfs_key location;
int index;
-   int ret;
+   int ret = 0;
 
if (dentry->d_name.len > BTRFS_NAME_LEN)
return ERR_PTR(-ENAMETOOLONG);
 
-   ret = btrfs_inode_by_name(dir, dentry, &location);
+   if (dentry->d_fsdata) {
+   memcpy(&location, dentry->d_fsdata, sizeof(struct btrfs_key));
+   kfree(dentry->d_fsdata);
+   dentry->d_fsdata = NULL;
+   /*
+* We need to unhash this dentry so we can rehash it when we
+* find the inode.
+*/
+   d_drop(dentry);
+   } else {
+   ret = btrfs_inode_by_name(dir, dentry, &location);
+   }
 
if (ret < 0)
return ERR_PTR(ret);
@@ -4180,6 +4191,12 @@ static int btrfs_dentry_delete(const struct dentry *dentry)
return 0;
 }
 
+static void btrfs_dentry_release(struct dentry *dentry)
+{
+   if (dentry->d_fsdata)
+   kfree(dentry->d_fsdata);
+}
+
 static struct dentry *btrfs_lookup(struct inode *dir, struct dentry *dentry,
   struct nameidata *nd)
 {
@@ -4206,6 +4223,7 @@ static int btrfs_real_readdir(struct file *filp, void *dirent,
struct btrfs_key key;
struct btrfs_key found_key;
struct btrfs_path *path;
+   struct qstr q;
int ret;
struct extent_buffer *leaf;
int slot;
@@ -4284,6 +4302,7 @@ static int btrfs_real_readdir(struct file *filp, void *dirent,
 
while (di_cur < di_total) {
struct btrfs_key location;
+   struct dentry *tmp;
 
if (verify_dir_item(root, leaf, di))
break;
@@ -4304,6 +4323,33 @@ static int btrfs_real_readdir(struct file *filp, void *dirent,
d_type = btrfs_filetype_table[btrfs_dir_type(leaf, di)];
btrfs_dir_item_key_to_cpu(leaf, di, &location);
 
+   q.name = name_ptr;
+   q.len = name_len;
+   q.hash = full_name_hash(q.name, q.len);
+   tmp = d_lookup(filp->f_dentry, &q);
+   if (!tmp) {
+   struct btrfs_key *newkey;
+
+   newkey = kzalloc(sizeof(struct btrfs_key),
+GFP_NOFS);
+   if (!newkey)
+   goto no_dentry;
+   tmp = d_alloc(filp->f_dentry, &q);
+   if (!tmp) {
+   kfree(newkey);
+   dput(tmp);
+   goto no_dentry;
+   }
+   memcpy(newkey, &location,
+  sizeof(struct btrfs_key));
+   tmp->d_fsdata = newkey;
+   tmp->d_flags |= DCACHE_NEED_LOOKUP;
+   d_rehash(tmp);
+   dput(tmp);
+   } else {
+   dput(tmp);
+   }
+no_dentry:
/* is this a reference to our own snapshot? If so
 * skip it
 */
@@ -7566,4 +7612,5 @@ static const struct inode_operations btrfs_symlink_inode_operations = {
 
 const struct dentry_operations btrfs_dentry_operations = {
.d_delete   = btrfs_dentry_delete,
+   .d_release  = btrfs_dentry_release,
 };
-- 
1.7.2.3


[PATCH 1/2] fs: add a DCACHE_NEED_LOOKUP flag for d_flags

2011-05-19 Thread Josef Bacik
Btrfs (and I'd venture most other filesystems) stores its indexes in nice disk
order for readdir, but unfortunately anything that stats the files in the order
that readdir returns them (like, say, ls) still has to do the normal lookup of
each file, which means looking up our other index and then looking up the
inode.  What I want is a way to create dummy dentries when we find entries in
readdir so that when ls or anything else subsequently does a stat(), we already
have the location information in the dentry and can go straight to the inode
itself.  The lookup code currently assumes that if it finds a dentry it is
done, and does not perform a lookup.  So add a DCACHE_NEED_LOOKUP flag so that
the lookup code knows it still needs to run i_op->lookup() on the parent to get
the inode for the dentry.  I have tested this with btrfs and I went from
something that looks like this

http://people.redhat.com/jwhiter/ls-noreada.png

To this

http://people.redhat.com/jwhiter/ls-good.png

That's a saving of 1300 seconds, or about 22 minutes, which is significant.
Thanks,

Signed-off-by: Josef Bacik 
---
 fs/namei.c |   48 
 include/linux/dcache.h |1 +
 2 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index e3c4f11..a1bff4f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1198,6 +1198,29 @@ static struct dentry *d_alloc_and_lookup(struct dentry *parent,
 }
 
 /*
+ * We already have a dentry, but require a lookup to be performed on the parent
+ * directory to fill in d_inode. Returns the new dentry, or ERR_PTR on error.
+ * parent->d_inode->i_mutex must be held. d_lookup must have verified that no
+ * child exists while under i_mutex.
+ */
+static struct dentry *d_inode_lookup(struct dentry *parent, struct dentry *dentry,
+struct nameidata *nd)
+{
+   struct inode *inode = parent->d_inode;
+   struct dentry *old;
+
+   /* Don't create child dentry for a dead directory. */
+   if (unlikely(IS_DEADDIR(inode)))
+   return ERR_PTR(-ENOENT);
+
+   old = inode->i_op->lookup(inode, dentry, nd);
+   if (unlikely(old)) {
+   dput(dentry);
+   dentry = old;
+   }
+   return dentry;
+}
+/*
  *  It's more convoluted than I'd like it to be, but... it's still fairly
  *  small and for now I'd prefer to have fast path as straight as possible.
  *  It _is_ time-critical.
@@ -1236,6 +1259,13 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
goto unlazy;
}
}
+   if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
+   if (nameidata_dentry_drop_rcu(nd, dentry))
+   return -ECHILD;
+   dput(dentry);
+   dentry = NULL;
+   goto retry;
+   }
path->mnt = mnt;
path->dentry = dentry;
if (likely(__follow_mount_rcu(nd, path, inode, false)))
@@ -1250,6 +1280,12 @@ unlazy:
}
} else {
dentry = __d_lookup(parent, name);
+   if (unlikely(!dentry))
+   goto retry;
+   if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
+   dput(dentry);
+   dentry = NULL;
+   }
}
 
 retry:
@@ -1268,6 +1304,18 @@ retry:
/* known good */
need_reval = 0;
status = 1;
+   } else if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
+   struct dentry *old;
+
+   dentry->d_flags &= ~DCACHE_NEED_LOOKUP;
+   dentry = d_inode_lookup(parent, dentry, nd);
+   if (IS_ERR(dentry)) {
+   mutex_unlock(&dir->i_mutex);
+   return PTR_ERR(dentry);
+   }
+   /* known good */
+   need_reval = 0;
+   status = 1;
}
mutex_unlock(&dir->i_mutex);
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 19d90a5..a8b2457 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -216,6 +216,7 @@ struct dentry_operations {
 #define DCACHE_MOUNTED 0x1 /* is a mountpoint */
 #define DCACHE_NEED_AUTOMOUNT  0x2 /* handle automount on this dir */
 #define DCACHE_MANAGE_TRANSIT  0x4 /* manage transit from this dirent */
+#define DCACHE_NEED_LOOKUP 0x8 /* dentry requires i_op->lookup */
 #define DCACHE_MANAGED_DENTRY \
(DCACHE_MOUNTED|DCACHE_NEED_AUTOMOUNT|DCACHE_MANAGE_TRANSIT)
 
-- 
1.7.2.3


Re: [Ocfs2-devel] [PATCH] ocfs2: Implement llseek()

2011-05-19 Thread Sunil Mushran

On 05/19/2011 02:13 AM, Tristan Ye wrote:
>> +   if (inode->i_size == 0 || *offset >= inode->i_size) {
>> +   ret = -ENXIO;
>> +   goto out_unlock;
>> +   }
>
> Why not use if (*offset >= inode->i_size) directly?

duh!

>> +   BUG_ON(cpos < le32_to_cpu(rec.e_cpos));
>
> The same assertion has already been performed inside
> ocfs2_get_clusters_nocache(); does it make sense to do it again here?

good catch

>> +
>> +   if ((!is_data && origin == SEEK_HOLE) ||
>> +   (is_data && origin == SEEK_DATA)) {
>> +   if (extoff > *offset)
>> +   *offset = extoff;
>> +   goto out_unlock;
>
> It seems the above logic stops the first time we find a hole.
>
> What if the offset is already within a hole when we do SEEK_HOLE?
> Shouldn't we keep searching until we find the next hole whose start
> offset is greater than the supplied offset? According to the semantics
> described in the header of this patch, should it be like the following?
>
> if (extoff > *offset) {
> *offset = extoff;
> goto out_unlock;
> }

So if the offset is in a hole, then we set the file pointer to it. Same for
data. The file pointer is set to the region asked at an offset that is equal
to or greater than the supplied offset.
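
The semantics described above can be sketched in user space as a walk over a
sorted extent list.  This is only an illustrative model, not the ocfs2 code:
struct model_extent, model_seek() and the MODEL_SEEK_* constants are
hypothetical names, -1 stands in for -ENXIO, and the extents are assumed to be
sorted, contiguous and to cover the whole file.

```c
#include <assert.h>
#include <stddef.h>

enum model_whence { MODEL_SEEK_DATA, MODEL_SEEK_HOLE };

/* One region of the file; is_data == 0 means hole or unwritten extent. */
struct model_extent {
    long long start;    /* byte offset of the region */
    long long len;      /* length in bytes */
    int is_data;
};

/*
 * Return the first offset >= 'offset' that lies in a region of the wanted
 * kind, or -1 (standing in for -ENXIO) if there is none.  The end of the
 * file counts as an implicit hole for MODEL_SEEK_HOLE.
 */
long long model_seek(const struct model_extent *ext, size_t n, long long size,
                     long long offset, enum model_whence whence)
{
    int want_data = (whence == MODEL_SEEK_DATA);
    size_t i;

    assert(ext != NULL || n == 0);
    if (offset < 0 || offset >= size)
        return -1;

    for (i = 0; i < n; i++) {
        long long end = ext[i].start + ext[i].len;

        if (end <= offset)
            continue;           /* region lies entirely before offset */
        if ((ext[i].is_data != 0) != want_data)
            continue;           /* wrong kind of region */
        /* Offset already inside the region: keep it; else jump to start. */
        return ext[i].start > offset ? ext[i].start : offset;
    }
    return whence == MODEL_SEEK_HOLE ? size : -1;
}
```

With a layout of data [0, 4096), hole [4096, 8192), data [8192, 12288), an
offset of 5000 stays at 5000 for a hole seek (it is already in a hole) and
moves to 8192 for a data seek.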


>> +   if (origin == SEEK_HOLE) {
>> +   extoff = cpos;
>> +   extoff <<= cs_bits;
>
> extoff already has been assigned properly above in while loop?

To handle the case when supplied cpos == cend.

As always, excellent review.

Thanks
Sunil


Re: [Ocfs2-devel] [PATCH] ocfs2: Implement llseek()

2011-05-19 Thread Sunil Mushran

On 05/19/2011 04:05 AM, Christoph Hellwig wrote:
> On Wed, May 18, 2011 at 07:44:44PM -0700, Sunil Mushran wrote:
>> Unwritten (preallocated) extents are considered holes because the file system
>> treats reads to such regions in the same way as it does to holes.
>
> How does this work for the case of an unwritten extent that has been
> written to in the pagecache but not converted yet?  Y'know the big data
> corruption and flamewar that started all this?

We don't delay splitting the extent. It is split in ->write_begin(). Delaying
it will be a challenge as we have to provide cache coherency across the
cluster.



Re: [Ocfs2-devel] [PATCH] ocfs2: Implement llseek()

2011-05-19 Thread Christoph Hellwig
On Wed, May 18, 2011 at 07:44:44PM -0700, Sunil Mushran wrote:
> Unwritten (preallocated) extents are considered holes because the file system
> treats reads to such regions in the same way as it does to holes.

How does this work for the case of an unwritten extent that has been
written to in the pagecache but not converted yet?  Y'know the big data
corruption and flamewar that started all this?



Re: [Ocfs2-devel] SEEK_DATA/HOLE on ocfs2 - v2

2011-05-19 Thread Christoph Hellwig
On Wed, May 18, 2011 at 07:44:42PM -0700, Sunil Mushran wrote:
> It is improved since the last post. It runs cleanly on zfs, ocfs2 and ext3
> (default behavior). Users testing on zfs will need to flip the values of
> SEEK_HOLE/DATA.

Sounds like we should switch them around, just to cause the least
amount of confusion.



Re: [Ocfs2-devel] [PATCH] ocfs2: Implement llseek()

2011-05-19 Thread Tristan Ye
Sunil Mushran wrote:
> ocfs2 implements its own llseek() to provide the SEEK_HOLE/SEEK_DATA
> functionality.
> 
> SEEK_HOLE sets the file pointer to the start of either a hole or an unwritten
> (preallocated) extent, that is greater than or equal to the supplied offset.
> 
> SEEK_DATA sets the file pointer to the start of an allocated extent (not
> unwritten) that is greater than or equal to the supplied offset.
> 
> If the supplied offset is on a desired region, then the file pointer is set
> to it. Offsets greater than or equal to the file size return -ENXIO.
> 
> Unwritten (preallocated) extents are considered holes because the file system
> treats reads to such regions in the same way as it does to holes.
> 
> Signed-off-by: Sunil Mushran 
> ---
>  fs/ocfs2/extent_map.c |   97 
> +
>  fs/ocfs2/extent_map.h |2 +
>  fs/ocfs2/file.c   |   53 ++-
>  3 files changed, 150 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c
> index 23457b4..6942c21 100644
> --- a/fs/ocfs2/extent_map.c
> +++ b/fs/ocfs2/extent_map.c
> @@ -832,6 +832,103 @@ out:
>   return ret;
>  }
>  
> +int ocfs2_seek_data_hole_offset(struct file *file, loff_t *offset, int origin)
> +{
> + struct inode *inode = file->f_mapping->host;
> + int ret;
> + unsigned int is_last = 0, is_data = 0;
> + u16 cs_bits = OCFS2_SB(inode->i_sb)->s_clustersize_bits;
> + u32 cpos, cend, clen, hole_size;
> + u64 extoff, extlen;
> + struct buffer_head *di_bh = NULL;
> + struct ocfs2_extent_rec rec;
> +
> + BUG_ON(origin != SEEK_DATA && origin != SEEK_HOLE);
> +
> + ret = ocfs2_inode_lock(inode, &di_bh, 0);
> + if (ret) {
> + mlog_errno(ret);
> + goto out;
> + }
> +
> + down_read(&OCFS2_I(inode)->ip_alloc_sem);
> +
> + if (inode->i_size == 0 || *offset >= inode->i_size) {
> + ret = -ENXIO;
> + goto out_unlock;
> + }

Why not use if (*offset >= inode->i_size) directly?

> +
> + if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
> + if (origin == SEEK_HOLE)
> + *offset = inode->i_size;
> + goto out_unlock;
> + }
> +
> + clen = 0;
> + cpos = *offset >> cs_bits;
> + cend = ocfs2_clusters_for_bytes(inode->i_sb, inode->i_size);
> +
> + while (cpos < cend && !is_last) {
> + ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos, &hole_size,
> +  &rec, &is_last);
> + if (ret) {
> + mlog_errno(ret);
> + goto out_unlock;
> + }
> +
> + extoff = cpos;
> + extoff <<= cs_bits;
> +
> + if (rec.e_blkno == 0ULL) {
> + clen = hole_size;
> + is_data = 0;
> + } else {
> + BUG_ON(cpos < le32_to_cpu(rec.e_cpos));


The same assertion has already been performed inside
ocfs2_get_clusters_nocache(); does it make sense to do it again here?


> + clen = le16_to_cpu(rec.e_leaf_clusters) -
> + (cpos - le32_to_cpu(rec.e_cpos));
> + is_data = (rec.e_flags & OCFS2_EXT_UNWRITTEN) ?  0 : 1;
> + }
> +
> + if ((!is_data && origin == SEEK_HOLE) ||
> + (is_data && origin == SEEK_DATA)) {
> + if (extoff > *offset)
> + *offset = extoff;
> + goto out_unlock;

It seems the above logic stops the first time we find a hole.

What if the offset is already within a hole when we do SEEK_HOLE? Shouldn't we
keep searching until we find the next hole whose start offset is greater than
the supplied offset? According to the semantics described in the header of this
patch, should it be like the following?

if (extoff > *offset) {
*offset = extoff;
goto out_unlock;
}

> + }
> +
> + if (!is_last)
> + cpos += clen;
> + }
> +
> + if (origin == SEEK_HOLE) {
> + extoff = cpos;
> + extoff <<= cs_bits;

extoff already has been assigned properly above in while loop?

> + extlen = clen;
> + extlen <<=  cs_bits;
> +
> + if ((extoff + extlen) > inode->i_size)
> + extlen = inode->i_size - extoff;
> + extoff += extlen;
> + if (extoff > *offset)
> + *offset = extoff;
> + goto out_unlock;
> + }
> +
> + ret = -ENXIO;
> +
> +out_unlock:
> +
> + brelse(di_bh);
> +
> + up_read(&OCFS2_I(inode)->ip_alloc_sem);
> +
> + ocfs2_inode_unlock(inode, 0);
> +out:
> + if (ret && ret != -ENXIO)
> +   

Re: [PATCH 0/9] Btrfs: improve write ahead log with sub transaction

2011-05-19 Thread liubo
On 05/19/2011 04:11 PM, Liu Bo wrote:
> I've been working to try to improve the write-ahead log's performance,
> and I found that the bottleneck addresses in the checksum items,
> especially when we want to make a random write on a large file, e.g a 4G file.
> 
> Then a idea for this suggested by Chris is to use sub transaction ids and just
> to log the part of inode that had changed since either the last log commit or
> the last transaction commit.  And as we also push the sub transid into the btree
> blocks, we'll get much faster tree walks.  As a result, we abandon the original
> brute force approach, which is "to delete all items of the inode in log",
> to making sure we get the most uptodate copies of everything, and instead
> we manage to "find and merge", i.e. finding extents in the log tree and merging
> in the new extents from the file.
> 
> This patchset puts the above idea into code, and although the code is now more
> complex, it brings us a great deal of performance improvement.
> 
> Beside the improvement of log, patch 8 fixes a small but critical bug of log code
> with sub transaction.
> 
> Here I have some test results to show, I use sysbench to do "random write + fsync".
> 
> ===
> sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K 
> --file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync 
> --file-extra-flags=  [prepare, run]
> ===
> 
> Sysbench args:
>   - Number of threads: 1
>   - Extra file open flags: 0
>   - 2 files, 4Gb each
>   - Block size 4Kb
>   - Number of random requests for random IO: 10000
>   - Read/Write ratio for combined random IO test: 1.50
>   - Periodic FSYNC enabled, calling fsync() each 100 requests.
>   - Calling fsync() at the end of test, Enabled.
>   - Using synchronous I/O mode
>   - Doing random write test
> 
> Sysbench results:
> ===
>Operations performed:  0 Read, 10000 Write, 200 Other = 10200 Total
>Read 0b  Written 39.062Mb  Total transferred 39.062Mb
> ===
> a) without patch:  (*SPEED* : 451.01Kb/sec)
>112.75 Requests/sec executed
> 
> b) with patch: (*SPEED* : 4.3621Mb/sec)
>1116.71 Requests/sec executed
> 
> 
> Liu Bo (10):
>   Btrfs: introduce sub transaction stuff
>   Btrfs: modify should_cow_block to update block's generation
>   Btrfs: modify btrfs_drop_extents API
>   Btrfs: introduce first sub trans
>   Btrfs: still update inode transid when size remains unchanged
>   Btrfs: main log stuff
>   Btrfs: add checksum check for log
>   Btrfs: fix a bug of log check
>   Btrfs: kick off useless code
>   Btrfs: ship trans->transid to trans->transaction->transid
> 
>  fs/btrfs/btrfs_inode.h |   12 ++-
>  fs/btrfs/ctree.c   |   71 ++-
>  fs/btrfs/ctree.h   |5 +-
>  fs/btrfs/disk-io.c |9 +-
>  fs/btrfs/extent-tree.c |   10 ++-
>  fs/btrfs/file.c|   22 ++---
>  fs/btrfs/inode.c   |   28 --
>  fs/btrfs/ioctl.c   |6 +-
>  fs/btrfs/relocation.c  |6 +-
>  fs/btrfs/transaction.c |   13 ++-
>  fs/btrfs/transaction.h |   19 -
>  fs/btrfs/tree-defrag.c |2 +-
>  fs/btrfs/tree-log.c|  222 ---
>  13 files changed, 279 insertions(+), 146 deletions(-)
> 
> 

Sorry, the shortlog and diffstat above were wrong; here are the correct ones:

Liu Bo (9):
  Btrfs: introduce sub transaction stuff
  Btrfs: update block generation if should_cow_block fails
  Btrfs: modify btrfs_drop_extents API
  Btrfs: introduce first sub trans
  Btrfs: still update inode trans stuff when size remains unchanged
  Btrfs: improve log with sub transaction
  Btrfs: add checksum check for log
  Btrfs: fix a bug of log check
  Btrfs: kick off useless code

 fs/btrfs/btrfs_inode.h |   12 ++-
 fs/btrfs/ctree.c   |   69 +++
 fs/btrfs/ctree.h   |5 +-
 fs/btrfs/disk-io.c |9 +-
 fs/btrfs/extent-tree.c |   10 ++-
 fs/btrfs/file.c|   22 ++---
 fs/btrfs/inode.c   |   28 --
 fs/btrfs/ioctl.c   |6 +-
 fs/btrfs/relocation.c  |6 +-
 fs/btrfs/transaction.c |   13 ++-
 fs/btrfs/transaction.h |   19 -
 fs/btrfs/tree-defrag.c |2 +-
 fs/btrfs/tree-log.c|  222 ---
 13 files changed, 282 insertions(+), 141 deletions(-)




[PATCH 9/9] Btrfs: kick off useless code

2011-05-19 Thread Liu Bo
fsync waits for writeback to finish, and last_trans gets the real transid
recorded during writeback, so the extra +1 is no longer needed to make sure
fsync processes the file.

Signed-off-by: Liu Bo 
---
 fs/btrfs/file.c |   13 -
 1 files changed, 0 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d19cf3a..73c46e2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1146,19 +1146,6 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 
mutex_unlock(&inode->i_mutex);
 
-   /*
-* we want to make sure fsync finds this change
-* but we haven't joined a transaction running right now.
-*
-* Later on, someone is sure to update the inode and get the
-* real transid recorded.
-*
-* We set last_trans now to the fs_info generation + 1,
-* this will either be one more than the running transaction
-* or the generation used for the next transaction if there isn't
-* one running right now.
-*/
-   BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
if (num_written > 0 || num_written == -EIOCBQUEUED) {
err = generic_write_sync(file, pos, num_written);
if (err < 0 && num_written > 0)
-- 
1.6.5.2



[PATCH 0/9] Btrfs: improve write ahead log with sub transaction

2011-05-19 Thread Liu Bo
I've been working on improving the write-ahead log's performance, and I found
that the bottleneck lies in the checksum items, especially when we make random
writes to a large file, e.g. a 4G file.

An idea for this, suggested by Chris, is to use sub transaction ids and to log
only the part of the inode that has changed since either the last log commit or
the last transaction commit.  And as we also push the sub transid into the btree
blocks, we get much faster tree walks.  As a result, we abandon the original
brute-force approach, which is to delete all items of the inode in the log to
make sure we get the most up-to-date copies of everything, and instead we
"find and merge", i.e. find extents in the log tree and merge in the new
extents from the file.

This patchset puts the above idea into code, and although the code is now more
complex, it brings a large performance improvement (roughly 10x in the test
below).

Besides the log improvement, patch 8 fixes a small but critical bug in the log
code with sub transactions.

Here are some test results; I use sysbench to do "random write + fsync".

===
sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K 
--file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync 
--file-extra-flags=  [prepare, run]
===

Sysbench args:
  - Number of threads: 1
  - Extra file open flags: 0
  - 2 files, 4Gb each
  - Block size 4Kb
  - Number of random requests for random IO: 10000
  - Read/Write ratio for combined random IO test: 1.50
  - Periodic FSYNC enabled, calling fsync() each 100 requests.
  - Calling fsync() at the end of test, Enabled.
  - Using synchronous I/O mode
  - Doing random write test

Sysbench results:
===
   Operations performed:  0 Read, 10000 Write, 200 Other = 10200 Total
   Read 0b  Written 39.062Mb  Total transferred 39.062Mb
===
a) without patch:  (*SPEED* : 451.01Kb/sec)
   112.75 Requests/sec executed

b) with patch: (*SPEED* : 4.3621Mb/sec)
   1116.71 Requests/sec executed
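
As a sanity check on the figures above, the arithmetic can be written out as a
small sketch.  It assumes 10000 write requests (the count implied by the 10200
total and the 200 "Other" operations) of 4Kb each; model_total_mib() and
model_speedup() are hypothetical helper names, not part of sysbench.

```c
#include <assert.h>

/* Data written, in Mb, by 'requests' writes of 'block_kib' Kb each. */
double model_total_mib(long requests, long block_kib)
{
    assert(requests >= 0 && block_kib >= 0);
    return (double)requests * (double)block_kib / 1024.0;
}

/* Ratio of the patched rate to the unpatched rate. */
double model_speedup(double before, double after)
{
    assert(before > 0.0);
    return after / before;
}
```

model_total_mib(10000, 4) gives 39.0625, matching the reported 39.062Mb, and
model_speedup(112.75, 1116.71) comes out at roughly 9.9, i.e. close to a 10x
improvement in requests per second.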


Liu Bo (10):
  Btrfs: introduce sub transaction stuff
  Btrfs: modify should_cow_block to update block's generation
  Btrfs: modify btrfs_drop_extents API
  Btrfs: introduce first sub trans
  Btrfs: still update inode transid when size remains unchanged
  Btrfs: main log stuff
  Btrfs: add checksum check for log
  Btrfs: fix a bug of log check
  Btrfs: kick off useless code
  Btrfs: ship trans->transid to trans->transaction->transid

 fs/btrfs/btrfs_inode.h |   12 ++-
 fs/btrfs/ctree.c   |   71 ++-
 fs/btrfs/ctree.h   |5 +-
 fs/btrfs/disk-io.c |9 +-
 fs/btrfs/extent-tree.c |   10 ++-
 fs/btrfs/file.c|   22 ++---
 fs/btrfs/inode.c   |   28 --
 fs/btrfs/ioctl.c   |6 +-
 fs/btrfs/relocation.c  |6 +-
 fs/btrfs/transaction.c |   13 ++-
 fs/btrfs/transaction.h |   19 -
 fs/btrfs/tree-defrag.c |2 +-
 fs/btrfs/tree-log.c|  222 ---
 13 files changed, 279 insertions(+), 146 deletions(-)



[PATCH 3/9] Btrfs: modify btrfs_drop_extents API

2011-05-19 Thread Liu Bo
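We want to use btrfs_drop_extents() in the log code, so add a 'log' argument
that makes it operate on the root's log tree without freeing the underlying
extents.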

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.h|3 ++-
 fs/btrfs/file.c |9 +++--
 fs/btrfs/inode.c|6 +++---
 fs/btrfs/ioctl.c|4 ++--
 fs/btrfs/tree-log.c |2 +-
 5 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ef68108..1ba3f91 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2575,7 +2575,8 @@ int btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end,
 int btrfs_check_file(struct btrfs_root *root, struct inode *inode);
 extern const struct file_operations btrfs_file_operations;
 int btrfs_drop_extents(struct btrfs_trans_handle *trans, struct inode *inode,
-  u64 start, u64 end, u64 *hint_byte, int drop_cache);
+  u64 start, u64 end, u64 *hint_byte, int drop_cache,
+  int log);
 int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
  struct inode *inode, u64 start, u64 end);
 int btrfs_release_file(struct inode *inode, struct file *file);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 75899a0..d19cf3a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -290,7 +290,8 @@ int btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end,
  * is deleted from the tree.
  */
 int btrfs_drop_extents(struct btrfs_trans_handle *trans, struct inode *inode,
-  u64 start, u64 end, u64 *hint_byte, int drop_cache)
+  u64 start, u64 end, u64 *hint_byte, int drop_cache,
+  int log)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
struct extent_buffer *leaf;
@@ -309,6 +310,10 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans, struct inode *inode,
int recow;
int ret;
 
+   /* drop the existed extents in log tree */
+   if (log)
+   root = root->log_root;
+
if (drop_cache)
btrfs_drop_extent_cache(inode, start, end - 1, 0);
 
@@ -489,7 +494,7 @@ next_slot:
extent_end - key.offset);
extent_end = ALIGN(extent_end,
   root->sectorsize);
-   } else if (disk_bytenr > 0) {
+   } else if (disk_bytenr > 0 && !log) {
ret = btrfs_free_extent(trans, root,
disk_bytenr, num_bytes, 0,
root->root_key.objectid,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4cec4c9..d823467 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -244,7 +244,7 @@ static noinline int cow_file_range_inline(struct btrfs_trans_handle *trans,
}
 
ret = btrfs_drop_extents(trans, inode, start, aligned_end,
-&hint_byte, 1);
+&hint_byte, 1, 0);
BUG_ON(ret);
 
if (isize > actual_end)
@@ -1639,7 +1639,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 * with the others.
 */
ret = btrfs_drop_extents(trans, inode, file_pos, file_pos + num_bytes,
-&hint, 0);
+&hint, 0, 0);
BUG_ON(ret);
 
ins.objectid = inode->i_ino;
@@ -3649,7 +3649,7 @@ int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size)
 
err = btrfs_drop_extents(trans, inode, cur_offset,
 cur_offset + hole_size,
-&hint_byte, 1);
+&hint_byte, 1, 0);
if (err)
break;
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 142a82d..d5a6a19 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2014,7 +2014,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
ret = btrfs_drop_extents(trans, inode,
 new_key.offset,
 new_key.offset + datal,
-&hint_byte, 1);
+&hint_byte, 1, 0);
BUG_ON(ret);
 
ret = btrfs_insert_empty_item(trans, root, path,
@@ -2069,7 +2069,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
ret = btrfs_drop_extents(trans, inode,
 new_key.offset,
 new_key.offset + datal,
-  

[PATCH 8/9] Btrfs: fix a bug of log check

2011-05-19 Thread Liu Bo
The current code uses the root's last_log_commit to check whether an inode
has been logged, but the problem is that root->last_log_commit is shared
among files.  Say we have N inodes to be logged: after the first inode is
logged, root->last_log_commit is updated and the remaining N-1 inodes will
not be logged.

As we've introduced sub transactions and filled the inode's last_trans and
logged_trans with the sub_transid instead of the transaction id, we can just
compare last_trans with logged_trans to determine whether the inode being
processed has been logged.  More importantly, these two values are per-inode,
so one inode's state will not interfere with another's.
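
The difference between the shared check and the per-inode check, as this
description presents it, can be modeled in user space as below.  This is a
sketch under assumptions, not the kernel code: the struct and function names
are hypothetical, and the exact per-inode comparison (logged_trans >=
last_trans) is an assumed form of the check rather than the one in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the per-root log state. */
struct model_root {
    unsigned long log_transid;
    unsigned long last_log_commit;      /* shared by every inode in the root */
};

/* Hypothetical model of the per-inode state. */
struct model_inode {
    unsigned long long last_trans;      /* sub transid of last modification */
    unsigned long long logged_trans;    /* sub transid when last logged */
    unsigned long last_sub_trans;       /* log transid at last modification */
};

/* Old-style check: depends on a counter shared among all inodes. */
int model_in_log_shared(const struct model_root *root,
                        const struct model_inode *inode)
{
    return inode->last_sub_trans <= root->last_log_commit;
}

/* New-style check (assumed form): uses only per-inode values. */
int model_in_log_per_inode(const struct model_inode *inode)
{
    return inode->logged_trans >= inode->last_trans;
}

/* Log one inode and commit the log, updating shared and per-inode state. */
void model_sync_log(struct model_root *root, struct model_inode *inode)
{
    assert(root != NULL && inode != NULL);
    inode->logged_trans = inode->last_trans;
    root->last_log_commit = root->log_transid;
    root->log_transid++;
}
```

With two inodes modified in the same log transaction, syncing the first one
makes the shared check report the second as already logged, while the
per-inode check still reports that it needs logging.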

Signed-off-by: Liu Bo 
---
 fs/btrfs/btrfs_inode.h |5 -
 fs/btrfs/ctree.h   |1 -
 fs/btrfs/disk-io.c |2 --
 fs/btrfs/inode.c   |2 --
 fs/btrfs/transaction.h |1 -
 fs/btrfs/tree-log.c|   16 +++-
 6 files changed, 3 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index fb5617a..d3a570c 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -94,11 +94,6 @@ struct btrfs_inode {
u64 last_trans;
 
/*
-* log transid when this inode was last modified
-*/
-   u64 last_sub_trans;
-
-   /*
 * transid that last logged this inode
 */
u64 logged_trans;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1ba3f91..73aa36b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1114,7 +1114,6 @@ struct btrfs_root {
atomic_t log_writers;
atomic_t log_commit[2];
unsigned long log_transid;
-   unsigned long last_log_commit;
unsigned long log_batch;
pid_t log_start_pid;
bool log_multiple_pids;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 54842fe..ac8d2ac 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1079,7 +1079,6 @@ static int __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
atomic_set(&root->log_writers, 0);
root->log_batch = 0;
root->log_transid = 0;
-   root->last_log_commit = 0;
extent_io_tree_init(&root->dirty_log_pages,
 fs_info->btree_inode->i_mapping, GFP_NOFS);
 
@@ -1216,7 +1215,6 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
WARN_ON(root->log_root);
root->log_root = log_root;
root->log_transid = 0;
-   root->last_log_commit = 0;
return 0;
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32eac29..40f6f8f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6580,7 +6580,6 @@ again:
spin_unlock(&BTRFS_I(inode)->sub_trans_lock);
 
BTRFS_I(inode)->last_trans = root->fs_info->sub_generation;
-   BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
 
unlock_extent_cached(io_tree, page_start, page_end, &cached_state, GFP_NOFS);
 
@@ -6775,7 +6774,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
ei->sequence = 0;
ei->first_sub_trans = 0;
ei->last_trans = 0;
-   ei->last_sub_trans = 0;
ei->logged_trans = 0;
ei->delalloc_bytes = 0;
ei->reserved_bytes = 0;
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index d531aea..e169553 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -99,7 +99,6 @@ static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans,
spin_unlock(&BTRFS_I(inode)->sub_trans_lock);
 
BTRFS_I(inode)->last_trans = trans->transid;
-   BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
 }
 
 int btrfs_end_transaction(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 28b088b..912397c 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1967,7 +1967,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
int ret;
struct btrfs_root *log = root->log_root;
struct btrfs_root *log_root_tree = root->fs_info->log_root_tree;
-   unsigned long log_transid = 0;
 
mutex_lock(&root->log_mutex);
index1 = root->log_transid % 2;
@@ -2002,8 +2001,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
goto out;
}
 
-   log_transid = root->log_transid;
-   if (log_transid % 2 == 0)
+   if (root->log_transid % 2 == 0)
mark = EXTENT_DIRTY;
else
mark = EXTENT_NEW;
@@ -2108,11 +2106,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
write_ctree_super(trans, root->fs_info->tree_root, 1);
ret = 0;
 
-   mutex_lock(&root->log_mutex);
-   if (root->last_log_commit < log_transid)
-   root->last_log_commit = log_transid;
-   mutex_unlock(&root->log_mutex);
-
 out_wake_log_root:
atomic_set(&log_root_tree->log_commit[index2], 0);
smp_mb();
@@ -3042,14 +3035,11 @@ out:
 static int inode_in_log(struct btrfs_trans_

[PATCH 5/9] Btrfs: still update inode trans stuff when size remains unchanged

2011-05-19 Thread Liu Bo
Because of direct I/O (DIO), commit 1ef30be142d2cc60e2687ef267de864cf31be995
makes btrfs skip btrfs_update_inode when i_disk_size is not updated. In the
buffered write case, however, we still need to update the btrfs inode's
transaction fields so that the log code can find the inode's changes.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index acd5a38..32eac29 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1773,7 +1773,8 @@ static int btrfs_finish_ordered_io(struct inode *inode, 
u64 start, u64 end)
if (!ret) {
ret = btrfs_update_inode(trans, root, inode);
BUG_ON(ret);
-   }
+   } else
+   btrfs_set_inode_last_trans(trans, inode);
ret = 0;
 out:
if (nolock) {
-- 
1.6.5.2



[PATCH 4/9] Btrfs: introduce first sub trans

2011-05-19 Thread Liu Bo
With multiple threads, writeback of a file may span several sub
transactions. Introduce first_sub_trans to record the sub_transid of the
first sub transaction that modified the inode, so that the log code can skip
file extents which have already been logged or committed to disk.
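
To make the rule concrete, here is a minimal userspace sketch of how
first_sub_trans is meant to be maintained. The struct and function names
(model_inode, model_set_inode_last_trans) are hypothetical stand-ins, not
kernel code; the condition mirrors the one this patch adds, and the
sub_trans_lock spinlock is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the btrfs_inode fields this patch touches. */
struct model_inode {
	uint64_t first_sub_trans; /* first sub transid since last commit/log sync */
	uint64_t last_trans;      /* sub transid of the latest modification */
	uint64_t logged_trans;    /* sub transid at the last log sync */
};

/*
 * Record a modification made under sub transaction cur_sub_transid.
 * first_sub_trans is reset when it is stale (larger than the current
 * sub transid) or when every earlier change has already been logged
 * or committed. The kernel version holds sub_trans_lock around the
 * check; this single-threaded model does not.
 */
static void model_set_inode_last_trans(struct model_inode *ino,
				       uint64_t cur_sub_transid,
				       uint64_t last_trans_committed)
{
	assert(ino != NULL);

	if (ino->first_sub_trans > cur_sub_transid ||
	    ino->last_trans <= ino->logged_trans ||
	    ino->last_trans <= last_trans_committed)
		ino->first_sub_trans = cur_sub_transid;

	ino->last_trans = cur_sub_transid;
}
```

With this model, two writes under sub transids 5 and 7 leave first_sub_trans
at 5, so a later log pass knows to look at everything from 5 onwards; once
the inode has been logged at 7, the next write starts a fresh window.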

Signed-off-by: Liu Bo 
---
 fs/btrfs/btrfs_inode.h |9 +
 fs/btrfs/inode.c   |   13 -
 fs/btrfs/transaction.h |   17 -
 3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 57c3bb2..fb5617a 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -79,6 +79,15 @@ struct btrfs_inode {
/* sequence number for NFS changes */
u64 sequence;
 
+   /* used to avoid race of first_sub_trans */
+   spinlock_t sub_trans_lock;
+
+   /*
+* sub transid of the trans that first modified this inode before
+* a trans commit or a log sync
+*/
+   u64 first_sub_trans;
+
/*
 * transid of the trans_handle that last modified this inode
 */
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d823467..acd5a38 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6569,7 +6569,16 @@ again:
set_page_dirty(page);
SetPageUptodate(page);
 
-   BTRFS_I(inode)->last_trans = root->fs_info->generation;
+   spin_lock(&BTRFS_I(inode)->sub_trans_lock);
+
+   if (BTRFS_I(inode)->first_sub_trans > root->fs_info->sub_generation ||
+   BTRFS_I(inode)->last_trans <= BTRFS_I(inode)->logged_trans ||
+   BTRFS_I(inode)->last_trans <= root->fs_info->last_trans_committed)
+   BTRFS_I(inode)->first_sub_trans = root->fs_info->sub_generation;
+
+   spin_unlock(&BTRFS_I(inode)->sub_trans_lock);
+
+   BTRFS_I(inode)->last_trans = root->fs_info->sub_generation;
BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
 
unlock_extent_cached(io_tree, page_start, page_end, &cached_state, 
GFP_NOFS);
@@ -6763,6 +6772,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
ei->space_info = NULL;
ei->generation = 0;
ei->sequence = 0;
+   ei->first_sub_trans = 0;
ei->last_trans = 0;
ei->last_sub_trans = 0;
ei->logged_trans = 0;
@@ -6786,6 +6796,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
extent_io_tree_init(&ei->io_tree, &inode->i_data, GFP_NOFS);
extent_io_tree_init(&ei->io_failure_tree, &inode->i_data, GFP_NOFS);
mutex_init(&ei->log_mutex);
+   spin_lock_init(&ei->sub_trans_lock);
btrfs_ordered_inode_tree_init(&ei->ordered_tree);
INIT_LIST_HEAD(&ei->i_orphan);
INIT_LIST_HEAD(&ei->delalloc_inodes);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 6dcdd28..d531aea 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -83,7 +83,22 @@ static inline void btrfs_update_inode_block_group(
 static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans,
  struct inode *inode)
 {
-   BTRFS_I(inode)->last_trans = trans->transaction->transid;
+   spin_lock(&BTRFS_I(inode)->sub_trans_lock);
+
+   /*
+* We have joined in a transaction, so btrfs_commit_transaction will
+* definitely wait for us and it does not need to add a extra
+* trans_mutex lock here.
+*/
+   if (BTRFS_I(inode)->first_sub_trans > trans->transid ||
+   BTRFS_I(inode)->last_trans <= BTRFS_I(inode)->logged_trans ||
+   BTRFS_I(inode)->last_trans <=
+BTRFS_I(inode)->root->fs_info->last_trans_committed)
+   BTRFS_I(inode)->first_sub_trans = trans->transid;
+
+   spin_unlock(&BTRFS_I(inode)->sub_trans_lock);
+
+   BTRFS_I(inode)->last_trans = trans->transid;
BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
 }
 
-- 
1.6.5.2



[PATCH 7/9] Btrfs: add checksum check for log

2011-05-19 Thread Liu Bo
If an inode has BTRFS_INODE_NODATASUM set, the log code does not need to look
for csum items for it.

Signed-off-by: Liu Bo 
---
 fs/btrfs/tree-log.c |   13 -
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 745933c..28b088b 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2652,7 +2652,8 @@ static noinline int copy_items(struct btrfs_trans_handle 
*trans,
   struct inode *inode,
   struct btrfs_path *dst_path,
   struct extent_buffer *src,
-  int start_slot, int nr, int inode_only)
+  int start_slot, int nr, int inode_only,
+  int csum)
 {
unsigned long src_offset;
unsigned long dst_offset;
@@ -2719,7 +2720,8 @@ static noinline int copy_items(struct btrfs_trans_handle 
*trans,
 * or deletes of this inode don't have to relog the inode
 * again
 */
-   if (btrfs_key_type(ins_keys + i) == BTRFS_EXTENT_DATA_KEY) {
+   if (btrfs_key_type(ins_keys + i) ==
+   BTRFS_EXTENT_DATA_KEY && csum) {
int found_type;
extent = btrfs_item_ptr(src, start_slot + i,
struct btrfs_file_extent_item);
@@ -2833,6 +2835,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle 
*trans,
int ins_start_slot = 0;
int ins_nr;
u64 transid;
+   int csum = (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) ? 0 : 1;
 
/*
 * We use transid in btrfs_search_forward() as a filter, in order to
@@ -2903,7 +2906,7 @@ filter:
if (ins_nr) {
ret = copy_items(trans, inode, dst_path, src,
 ins_start_slot,
-ins_nr, inode_only);
+ins_nr, inode_only, csum);
if (ret) {
err = ret;
goto out_unlock;
@@ -2922,7 +2925,7 @@ next_slot:
if (ins_nr) {
ret = copy_items(trans, inode, dst_path, src,
 ins_start_slot,
-ins_nr, inode_only);
+ins_nr, inode_only, csum);
if (ret) {
err = ret;
goto out_unlock;
@@ -2943,7 +2946,7 @@ next_slot:
if (ins_nr) {
ret = copy_items(trans, inode, dst_path, src,
 ins_start_slot,
-ins_nr, inode_only);
+ins_nr, inode_only, csum);
if (ret) {
err = ret;
goto out_unlock;
-- 
1.6.5.2



[PATCH 1/9] Btrfs: introduce sub transaction stuff

2011-05-19 Thread Liu Bo
Introduce a new concept, the "sub transaction". The relation between a
transaction and its sub transactions is:

transaction A   ---> transid = x
   sub trans a(1)   ---> sub_transid = x+1
   sub trans a(2)   ---> sub_transid = x+2
 ... ...
   sub trans a(n-1) ---> sub_transid = x+n-1
   sub trans a(n)   ---> sub_transid = x+n
transaction B   ---> transid = x+n+1
 ... ...

The most important points are:
a) a trans handle's transid now gets its value from the sub transid instead
   of the transid.
b) when a transaction commits, transid may not increase by exactly 1; it
   depends on the biggest sub_transid of the previous transaction,
   i.e.
B->transid = a(n)->sub_transid + 1,
(B->transid - A->transid) >= 1
c) we start a new sub transaction after a fsync.

We also switch some uses of 'trans->transid' to 'trans->transaction->transid'
to keep btrfs working correctly and to avoid triggering WARNings.

These changes are used by the new log code.
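
The numbering described above can be sketched as a small userspace model.
Everything here (model_fs and the three helpers) is hypothetical and only
illustrates rules a) to c); it also assumes that a handle opened before any
fsync sees the transaction's own transid, which is consistent with
(B->transid - A->transid) >= 1 but is not spelled out above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; these are not the kernel structures. */
struct model_fs {
	uint64_t generation;     /* transid of the running transaction */
	uint64_t sub_generation; /* newest sub transid handed out so far */
};

/* Rule b): a new transaction's transid is the last sub transid + 1. */
static uint64_t model_start_transaction(struct model_fs *fs)
{
	assert(fs != NULL);
	fs->generation = fs->sub_generation + 1;
	fs->sub_generation = fs->generation;
	return fs->generation;
}

/* Rule a): a handle takes its transid from the sub transid. */
static uint64_t model_handle_transid(const struct model_fs *fs)
{
	assert(fs != NULL);
	return fs->sub_generation;
}

/* Rule c): an fsync starts a new sub transaction. */
static uint64_t model_fsync(struct model_fs *fs)
{
	assert(fs != NULL);
	return ++fs->sub_generation;
}
```

Starting from an empty model, transaction A gets transid 1, two fsyncs hand
out sub transids 2 and 3, and transaction B then starts at 4.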

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.c   |   35 ++-
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/disk-io.c |7 ---
 fs/btrfs/extent-tree.c |   10 ++
 fs/btrfs/inode.c   |4 ++--
 fs/btrfs/ioctl.c   |2 +-
 fs/btrfs/relocation.c  |6 +++---
 fs/btrfs/transaction.c |   13 +
 fs/btrfs/transaction.h |1 +
 fs/btrfs/tree-defrag.c |2 +-
 fs/btrfs/tree-log.c|   16 ++--
 11 files changed, 60 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 84d7ca1..0c3b515 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -201,9 +201,9 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
int level;
struct btrfs_disk_key disk_key;
 
-   WARN_ON(root->ref_cows && trans->transid !=
+   WARN_ON(root->ref_cows && trans->transaction->transid !=
root->fs_info->running_transaction->transid);
-   WARN_ON(root->ref_cows && trans->transid != root->last_trans);
+   WARN_ON(root->ref_cows && trans->transid < root->last_trans);
 
level = btrfs_header_level(buf);
if (level == 0)
@@ -398,9 +398,9 @@ static noinline int __btrfs_cow_block(struct 
btrfs_trans_handle *trans,
 
btrfs_assert_tree_locked(buf);
 
-   WARN_ON(root->ref_cows && trans->transid !=
+   WARN_ON(root->ref_cows && trans->transaction->transid !=
root->fs_info->running_transaction->transid);
-   WARN_ON(root->ref_cows && trans->transid != root->last_trans);
+   WARN_ON(root->ref_cows && trans->transid < root->last_trans);
 
level = btrfs_header_level(buf);
 
@@ -466,7 +466,8 @@ static noinline int __btrfs_cow_block(struct 
btrfs_trans_handle *trans,
else
parent_start = 0;
 
-   WARN_ON(trans->transid != btrfs_header_generation(parent));
+   WARN_ON(btrfs_header_generation(parent) <
+   trans->transaction->transid);
btrfs_set_node_blockptr(parent, parent_slot,
cow->start);
btrfs_set_node_ptr_generation(parent, parent_slot,
@@ -487,7 +488,7 @@ static inline int should_cow_block(struct 
btrfs_trans_handle *trans,
   struct btrfs_root *root,
   struct extent_buffer *buf)
 {
-   if (btrfs_header_generation(buf) == trans->transid &&
+   if (btrfs_header_generation(buf) >= trans->transaction->transid &&
!btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
!(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
  btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
@@ -515,7 +516,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle 
*trans,
   root->fs_info->running_transaction->transid);
WARN_ON(1);
}
-   if (trans->transid != root->fs_info->generation) {
+   if (trans->transaction->transid != root->fs_info->generation) {
printk(KERN_CRIT "trans %llu running %llu\n",
   (unsigned long long)trans->transid,
   (unsigned long long)root->fs_info->generation);
@@ -618,7 +619,7 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
 
if (trans->transaction != root->fs_info->running_transaction)
WARN_ON(1);
-   if (trans->transid != root->fs_info->generation)
+   if (trans->transaction->transid != root->fs_info->generation)
WARN_ON(1);
 
parent_nritems = btrfs_header_nritems(parent);
@@ -898,7 +899,7 @@ static noinline int balance_level(struct btrfs_trans_handle 
*trans,
mid = path->nodes[level];
 
WARN_ON(!path->locks[level]);
-   WARN_ON(btrfs_header_generation(mid) != trans->transid);
+   WARN_ON(btrfs_header_generation(mid) < trans->transaction->transid);
 
orig_ptr = btrfs_node_blockptr(mid, orig_slot);
 
@@ -1105,7 +1106,7 @@ stat

[PATCH 6/9] Btrfs: improve log with sub transaction

2011-05-19 Thread Liu Bo
When logging an inode _A_, btrfs currently
a) clears all items belonging to _A_ in the log tree,
b) copies all items belonging to _A_ from the fs/file tree to the log tree,
which wastes a lot of time, especially when logging big files.

So we want a smarter approach, i.e. "find and merge".
File extent items are the most numerous, so we focus on them.
Thanks to sub transactions, we can now find the file extent items that
changed after the last _transaction commit_ or the last _log commit_, and
then merge them with the existing ones in the log tree.
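
As a rough illustration of "find and merge", the sketch below filters items
by the sub transid that last changed them and merges only those into a
logged set. It is a simplified model under stated assumptions: flat arrays
stand in for the fs and log trees, model_extent and model_merge_changed are
hypothetical names, and an item replaces a logged one only when the offsets
match exactly, whereas the real code drops any overlapping extents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat stand-in for a file extent item. */
struct model_extent {
	uint64_t offset; /* file offset of the extent */
	uint64_t len;    /* length in bytes */
	uint64_t gen;    /* sub transid that last changed the item */
};

/*
 * Merge into log[] every item of src[] changed at or after min_gen.
 * An item whose offset is already logged replaces the logged copy;
 * any other item is appended. Returns the new number of log entries,
 * or (size_t)-1 if log_cap is too small.
 */
static size_t model_merge_changed(const struct model_extent *src, size_t n_src,
				  struct model_extent *log, size_t n_log,
				  size_t log_cap, uint64_t min_gen)
{
	assert(src != NULL && log != NULL && n_log <= log_cap);

	for (size_t i = 0; i < n_src; i++) {
		size_t j;

		if (src[i].gen < min_gen)
			continue; /* unchanged since the last commit/log sync */

		for (j = 0; j < n_log; j++)
			if (log[j].offset == src[i].offset)
				break;

		if (j == n_log) {
			if (n_log == log_cap)
				return (size_t)-1;
			n_log++;
		}
		log[j] = src[i];
	}
	return n_log;
}
```

The point of the filter is that a large file with one modified extent costs
one merge step instead of a full clear-and-copy of every item.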

This greatly helps fsync performance, because the common case is
"make changes to a _part_ of an inode".

Signed-off-by: Liu Bo 
---
 fs/btrfs/tree-log.c |  177 ---
 1 files changed, 126 insertions(+), 51 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 51d5024..745933c 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2561,60 +2561,106 @@ again:
 }
 
 /*
- * a helper function to drop items from the log before we relog an
- * inode.  max_key_type indicates the highest item type to remove.
- * This cannot be run for file data extents because it does not
- * free the extents they point to.
+ * a helper function to drop items from the log before we merge
+ * the uptodate items into the log tree.
  */
-static int drop_objectid_items(struct btrfs_trans_handle *trans,
- struct btrfs_root *log,
- struct btrfs_path *path,
- u64 objectid, int max_key_type)
+static int prepare_for_merge_items(struct btrfs_trans_handle *trans,
+  struct inode *inode,
+  struct extent_buffer *eb,
+  int slot, int nr)
 {
-   int ret;
-   struct btrfs_key key;
+   struct btrfs_root *log = BTRFS_I(inode)->root->log_root;
+   struct btrfs_path *path;
struct btrfs_key found_key;
+   struct btrfs_key key;
+   int i;
+   int ret;
 
-   key.objectid = objectid;
-   key.type = max_key_type;
-   key.offset = (u64)-1;
+   /* There are no relative items of the inode in log. */
+   if (BTRFS_I(inode)->logged_trans < trans->transaction->transid)
+   return 0;
 
-   while (1) {
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   for (i = 0; i < nr; i++) {
+   btrfs_item_key_to_cpu(eb, &key, i + slot);
+
+   if (btrfs_key_type(&key) == BTRFS_EXTENT_DATA_KEY) {
+   struct btrfs_file_extent_item *fi;
+   int found_type;
+   u64 mask = BTRFS_I(inode)->root->sectorsize - 1;
+   u64 start = key.offset;
+   u64 extent_end;
+   u64 hint;
+   unsigned long size;
+
+   fi = btrfs_item_ptr(eb, slot + i,
+struct btrfs_file_extent_item);
+   found_type = btrfs_file_extent_type(eb, fi);
+
+   if (found_type == BTRFS_FILE_EXTENT_REG ||
+   found_type == BTRFS_FILE_EXTENT_PREALLOC)
+   extent_end = start +
+   btrfs_file_extent_num_bytes(eb, fi);
+   else if (found_type == BTRFS_FILE_EXTENT_INLINE) {
+   size = btrfs_file_extent_inline_len(eb, fi);
+   extent_end = (start + size + mask) & ~mask;
+   } else
+   BUG_ON(1);
+
+   /* drop any overlapping extents */
+   ret = btrfs_drop_extents(trans, inode, start,
+extent_end, &hint, 0, 1);
+   BUG_ON(ret);
+
+   continue;
+   }
+
+   /* non file extent */
ret = btrfs_search_slot(trans, log, &key, path, -1, 1);
-   BUG_ON(ret == 0);
if (ret < 0)
break;
 
-   if (path->slots[0] == 0)
+   /* empty log! */
+   if (ret > 0 && path->slots[0] == 0)
break;
 
-   path->slots[0]--;
+   if (ret > 0) {
+   btrfs_release_path(log, path);
+   continue;
+   }
+
btrfs_item_key_to_cpu(path->nodes[0], &found_key,
  path->slots[0]);
 
-   if (found_key.objectid != objectid)
-   break;
+   if (btrfs_comp_cpu_keys(&found_key, &key))
+   BUG_ON(1);
 
ret = btrfs_del_item(trans, log, path);
BUG_ON(ret);
btrfs_release_path(log,

[PATCH 2/9] Btrfs: update block generation if should_cow_block fails

2011-05-19 Thread Liu Bo
Because we have added sub transactions, when a block does not need to be
COWed we still need to record the new sub transid in it. The log code uses
this to find the most up-to-date file extents.
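
A minimal sketch of the idea, assuming a toy in-memory node: when COW is
skipped, the block's generation and the parent's pointer generation are
still bumped to the handle's transid. model_node and
model_update_block_generation are hypothetical; the root case is modelled
by passing a NULL parent, and the kernel's dirty-root list handling is
omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_FANOUT 4

/* Hypothetical stand-in for an extent buffer holding a tree node. */
struct model_node {
	uint64_t generation;            /* generation in the block header */
	uint64_t ptr_gen[MODEL_FANOUT]; /* generation stored per child pointer */
	int dirty;
};

/*
 * Bring buf up to transid without copying it. Returns 1 if anything
 * changed, 0 if buf already carried transid. parent == NULL means buf
 * is the root, so there is no pointer generation to update.
 */
static int model_update_block_generation(struct model_node *buf,
					 struct model_node *parent,
					 int slot, uint64_t transid)
{
	assert(buf != NULL);

	if (buf->generation == transid)
		return 0;

	if (parent) {
		assert(slot >= 0 && slot < MODEL_FANOUT);
		parent->ptr_gen[slot] = transid;
		parent->dirty = 1;
	}
	buf->generation = transid;
	buf->dirty = 1;
	return 1;
}
```

Note that only the pointer generation in the parent changes; the parent's
own header generation is left alone, as in the patch.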

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.c |   34 +-
 1 files changed, 33 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 0c3b515..7e21fa9 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -484,6 +484,33 @@ static noinline int __btrfs_cow_block(struct 
btrfs_trans_handle *trans,
return 0;
 }
 
+static inline void update_block_generation(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  struct extent_buffer *buf,
+  struct extent_buffer *parent,
+  int slot)
+{
+   /*
+* If it does not need to cow this block, we still need to
+* update the block's generation, for transid may have been
+* changed during fsync.
+   */
+   if (btrfs_header_generation(buf) == trans->transid)
+   return;
+
+   if (buf == root->node) {
+   btrfs_set_header_generation(buf, trans->transid);
+   btrfs_mark_buffer_dirty(buf);
+   add_root_to_dirty_list(root);
+   } else {
+   btrfs_set_node_ptr_generation(parent, slot,
+ trans->transid);
+   btrfs_set_header_generation(buf, trans->transid);
+   btrfs_mark_buffer_dirty(parent);
+   btrfs_mark_buffer_dirty(buf);
+   }
+}
+
 static inline int should_cow_block(struct btrfs_trans_handle *trans,
   struct btrfs_root *root,
   struct extent_buffer *buf)
@@ -524,6 +551,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle 
*trans,
}
 
if (!should_cow_block(trans, root, buf)) {
+   update_block_generation(trans, root, buf, parent, parent_slot);
*cow_ret = buf;
return 0;
}
@@ -1639,8 +1667,12 @@ again:
 * then we don't want to set the path blocking,
 * so we test it here
 */
-   if (!should_cow_block(trans, root, b))
+   if (!should_cow_block(trans, root, b)) {
+   update_block_generation(trans, root, b,
+   p->nodes[level + 1],
+   p->slots[level + 1]);
goto cow_done;
+   }
 
btrfs_set_path_blocking(p);
 
-- 
1.6.5.2



kernel BUG at fs/btrfs/extent-tree.c:5637!

2011-05-19 Thread Christian Brunner
Hi,

we are running a ceph cluster with a btrfs store. Last night we ran
across this btrfs BUG.

Any hints on how to solve this are welcome.

Regards
Christian

May 19 06:10:07 os00 kernel: [247212.342712] ------------[ cut here ]------------
May 19 06:10:07 os00 kernel: [247212.347953] kernel BUG at
fs/btrfs/extent-tree.c:5637!
May 19 06:10:07 os00 kernel: [247212.353773] invalid opcode:  [#1] SMP
May 19 06:10:07 os00 kernel: [247212.358449] last sysfs file:
/sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
May 19 06:10:07 os00 kernel: [247212.367268] CPU 6
May 19 06:10:07 os00 kernel: [247212.369407] Modules linked in: btrfs
zlib_deflate libcrc32c bonding ipv6 serio_raw pcspkr ghes hed iTCO_wdt
iTCO_vendor_support i7core_edac edac_core ixgbe mdio iomemory_vsl(P)
hpsa igb dca squashfs usb_storage [last unloaded: scsi_wait_scan]
May 19 06:10:07 os00 kernel: [247212.393864]
May 19 06:10:07 os00 kernel: [247212.395618] Pid: 3074, comm: cosd
Tainted: P2.6.38.6-1.fits.3.el6.x86_64 #1 HP ProLiant
DL180 G6
May 19 06:10:07 os00 kernel: [247212.406885] RIP:
0010:[]  []
run_clustered_refs+0x54d/0x800 [btrfs]
May 19 06:10:07 os00 kernel: [247212.417468] RSP:
0018:8805dc6b99b8  EFLAGS: 00010282
May 19 06:10:07 os00 kernel: [247212.423482] RAX: ffef
RBX: 88037570ac00 RCX: 8805dc6b8000
May 19 06:10:07 os00 kernel: [247212.431528] RDX: 0008
RSI: 8800 RDI: 8805c7acabb0
May 19 06:10:07 os00 kernel: [247212.439572] RBP: 8805dc6b9a98
R08: 0001 R09: 0001
May 19 06:10:07 os00 kernel: [247212.447617] R10: 8805e0947000
R11: 8802e7e04480 R12: 8805ac2150c0
May 19 06:10:07 os00 kernel: [247212.455663] R13: 8804e758bc00
R14: 8805e0963000 R15: 8802e7e04480
May 19 06:10:07 os00 kernel: [247212.463709] FS:
7fc1691d2700() GS:8800bf2c()
knlGS:
May 19 06:10:07 os00 kernel: [247212.472820] CS:  0010 DS:  ES:
 CR0: 80050033
May 19 06:10:07 os00 kernel: [247212.479317] CR2: 7f907f6313b0
CR3: 0005dfad1000 CR4: 06e0
May 19 06:10:07 os00 kernel: [247212.487363] DR0: 
DR1:  DR2: 
May 19 06:10:07 os00 kernel: [247212.495408] DR3: 
DR6: 0ff0 DR7: 0400
May 19 06:10:07 os00 kernel: [247212.503453] Process cosd (pid: 3074,
threadinfo 8805dc6b8000, task 8805dc56e4c0)
May 19 06:10:07 os00 kernel: [247212.512563] Stack:
May 19 06:10:07 os00 kernel: [247212.514896]  
 88040001 
May 19 06:10:07 os00 kernel: [247212.523301]  8805dfd1c000
8805e1d71288  8805dc6b9ad8
May 19 06:10:07 os00 kernel: [247212.531673]  
0dd0 8805e1d711d0 0002
May 19 06:10:07 os00 kernel: [247212.540054] Call Trace:
May 19 06:10:07 os00 kernel: [247212.542883]  [] ?
btrfs_find_ref_cluster+0x1/0x180 [btrfs]
May 19 06:10:07 os00 kernel: [247212.550840]  []
btrfs_run_delayed_refs+0xc8/0x230 [btrfs]
May 19 06:10:07 os00 kernel: [247212.558700]  []
__btrfs_end_transaction+0x71/0x210 [btrfs]
May 19 06:10:07 os00 kernel: [247212.566685]  []
btrfs_end_transaction+0x15/0x20 [btrfs]
May 19 06:10:07 os00 kernel: [247212.574382]  []
btrfs_dirty_inode+0x8a/0x130 [btrfs]
May 19 06:10:07 os00 kernel: [247212.581752]  []
__mark_inode_dirty+0x3f/0x1e0
May 19 06:10:07 os00 kernel: [247212.588446]  []
file_update_time+0xec/0x170
May 19 06:10:07 os00 kernel: [247212.594952]  []
btrfs_file_aio_write+0x1d0/0x4e0 [btrfs]
May 19 06:10:07 os00 kernel: [247212.602709]  [] ?
ima_counts_get+0x61/0x140
May 19 06:10:07 os00 kernel: [247212.609214]  [] ?
btrfs_file_aio_write+0x0/0x4e0 [btrfs]
May 19 06:10:07 os00 kernel: [247212.616970]  []
do_sync_readv_writev+0xd3/0x110
May 19 06:10:07 os00 kernel: [247212.623855]  [] ?
path_put+0x22/0x30
May 19 06:10:07 os00 kernel: [247212.629675]  [] ?
selinux_file_permission+0xf3/0x150
May 19 06:10:07 os00 kernel: [247212.637044]  [] ?
security_file_permission+0x23/0x90
May 19 06:10:07 os00 kernel: [247212.644415]  []
do_readv_writev+0xd4/0x1e0
May 19 06:10:07 os00 kernel: [247212.650818]  [] ?
mutex_lock+0x31/0x60
May 19 06:10:07 os00 kernel: [247212.656832]  []
vfs_writev+0x46/0x60
May 19 06:10:07 os00 kernel: [247212.662653]  []
sys_writev+0x51/0xc0
May 19 06:10:07 os00 kernel: [247212.668477]  []
system_call_fastpath+0x16/0x1b
May 19 06:10:07 os00 kernel: [247212.675264] Code: 48 8b 75 a0 48 8b
7d a8 ba b0 00 00 00 e8 7c 6c 02 00 48 8b 95 78 ff ff ff 48 8b 75 a0
48 8b 7d a8 e8 68 6b 02 00 e9 04 ff ff ff <0f> 0b eb fe 0f 0b eb fe 0f
0b 66 0f 1f 84 00 00 00 00 00 eb f5
May 19 06:10:07 os00 kernel: [247212.697014] RIP  []
run_clustered_refs+0x54d/0x800 [btrfs]
May 19 06:10:07 os00 kernel: [247212.704981]  RSP 
May 19 06:10:07 os00 kernel: [247212.709579] ---[ end trace
b0954a112f69e38b ]---

[PATCH] Btrfs: return error code to caller when btrfs_previous_item fails

2011-05-19 Thread Tsutomu Itoh
Return the error code to the caller instead of calling BUG_ON when
btrfs_previous_item fails.
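
The pattern is the usual "goto out" cleanup. Below is a small userspace
sketch of it; model_previous_item and model_free_dev_extent are
hypothetical stand-ins (a malloc'd buffer plays the role of the btrfs
path), so this shows the control flow only, not the real function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical lookup: 0 on success, negative errno on failure. */
static int model_previous_item(int fail)
{
	return fail ? -EIO : 0;
}

/*
 * On lookup failure, jump to the common exit so the path is still
 * freed and the error reaches the caller, instead of crashing the
 * way BUG_ON(ret) would.
 */
static int model_free_dev_extent(int lookup_fails, int *path_freed)
{
	int ret;
	void *path = malloc(16);

	assert(path_freed != NULL);
	if (!path)
		return -ENOMEM;

	ret = model_previous_item(lookup_fails);
	if (ret)
		goto out;

	/* ... the item would be located and deleted here ... */
	ret = 0;
out:
	free(path);
	*path_freed = 1;
	return ret;
}
```

Either way the path is released exactly once; the only difference is
whether the caller sees 0 or the lookup's error code.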

Signed-off-by: Tsutomu Itoh 
---
 fs/btrfs/volumes.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8b9fb8c..c95b214 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -983,14 +983,14 @@ static int btrfs_free_dev_extent(struct 
btrfs_trans_handle *trans,
if (ret > 0) {
ret = btrfs_previous_item(root, path, key.objectid,
  BTRFS_DEV_EXTENT_KEY);
-   BUG_ON(ret);
+   if (ret)
+   goto out;
leaf = path->nodes[0];
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
extent = btrfs_item_ptr(leaf, path->slots[0],
struct btrfs_dev_extent);
BUG_ON(found_key.offset > start || found_key.offset +
   btrfs_dev_extent_length(leaf, extent) < start);
-   ret = 0;
} else if (ret == 0) {
leaf = path->nodes[0];
extent = btrfs_item_ptr(leaf, path->slots[0],
@@ -1003,6 +1003,7 @@ static int btrfs_free_dev_extent(struct 
btrfs_trans_handle *trans,
ret = btrfs_del_item(trans, root, path);
BUG_ON(ret);
 
+out:
btrfs_free_path(path);
return ret;
 }

