Re: Mis-Design of Btrfs?

2011-07-14 Thread NeilBrown
On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) da...@lang.hm wrote:

> On Thu, 14 Jul 2011, Chris Mason wrote:
> 
> > Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:
> >> On 07/14/2011 07:38 AM, NeilBrown wrote:
> >>> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler  
> >>> wrote:
> >>>
> > I'm certainly open to suggestions and collaboration.  Do you have in 
> > mind any
> > particular way to make the interface richer??
> >
> > NeilBrown
>  Hi Neil,
> 
>  I know that Chris has a very specific set of use cases for btrfs and 
>  think that
>  Alasdair and others have started to look at what is doable.
> 
>  The obvious use case is the following:
> 
>  If a file system uses checksumming or other data corruption detection 
>  bits, it
>  can detect that it got bad data on a write. If that data was protected 
>  by RAID,
>  it would like to ask the block layer to try to read from another mirror 
>  (for
>  raid1) or try to validate/rebuild from parity.
> 
>  Today, I think that a retry will basically just give us back a random 
>  chance of
>  getting data from a different mirror or the same one that we got data 
>  from on
>  the first go.
> 
>  Chris, Alasdair, was that a good summary of one concern?
> 
>  Thanks!
> 
>  Ric
> >>> I imagine a new field in 'struct bio' which was normally zero but could be
> >>> some small integer.  It is only meaningful for read.
> >>> When 0 it means "get this data way you like".
> >>> When non-zero it means "get this data using method N", where the different
> >>> methods are up to the device.
> >>>
> >>> For a mirrored RAID, method N means read from device N-1.
> >>> For stripe/parity RAID, method 1 means "use other data blocks and parity
> >>> blocks to reconstruct data.
> >>>
> >>> The default for non RAID devices is to return EINVAL for any N>  0.
> >>> A remapping device (dm-linear, dm-stripe etc) would just pass the number
> >>> down.  I'm not sure how RAID1 over RAID5 would handle it... that might 
> >>> need
> >>> some thought.
> >>>
> >>> So if btrfs reads a block and the checksum looks wrong, it reads again 
> >>> with
> >>> a larger N.  It continues incrementing N and retrying until it gets a 
> >>> block
> >>> that it likes or it gets EINVAL.  There should probably be an error code
> >>> (EAGAIN?) which means "I cannot work with that number, but try the next 
> >>> one".
> >>>
> >>> It would be trivial for me to implement this for RAID1 and RAID10, and
> >>> relatively easy for RAID5.
> >>> I'd need to give a bit of thought to RAID6 as there are possibly multiple
> >>> ways to reconstruct from different combinations of parity and data.  I'm 
> >>> not
> >>> sure if there would be much point in doing that though.
> >>>
> >>> It might make sense for a device to be able to report what the maximum
> >>> 'N' supported is... that might make stacked raid easier to manage...
> >>>
> >>> NeilBrown
> >>>
> >>
> >> I think that the above makes sense. Not sure what your "0" definition is, 
> >> but I
> >> would assume that for non-raided devices (i.e., a single s-ata disk), "0" 
> >> would
> >> be an indication that there is nothing more that can be tried. The data 
> >> you got
> >> is all you are ever going to get :)
> >>
> >> For multiple mirrors, you might want to have a scheme where you would be 
> >> able to
> >> cycle through the mirrors. You could retry, cycling through the various 
> >> mirrors
> >> until you have tried and returned them all at which point you would get a 
> >> "no
> >> more" error back or some such thing.
> >
> > Hi everyone,
> >
> > The mirror number idea is basically what btrfs does today, and I think
> > it fits in with Neil's idea to have different policies for different
> > blocks.
> >
> > Basically what btrfs does:
> >
> > read_block(block_num, mirror = 0)
> > if (no io error and not csum error)
> > horray()
> >
> > num_mirrors = get_num_copies(block number)
> > for (i = 1; i < num_mirrors; i++) {
> > read_block(block_num, mirror = i);
> > }
> >
> > In a stacked configuration, the get_num_copies function can be smarter,
> > basically adding up all the copies of the lower levels and finding a way
> > to combine them.  We could just send down a fake bio that is responsible
> > for adding up the storage combinations into a bitmask or whatever works.
> >
> > We could also just keep retrying until the lower layers reported no more
> > mirror were available.
> >
> > In btrfs at least, we don't set the data blocks up to date until the crc
> > has passed, so replacing bogus blocks is easy.  Metadata is a little
> > more complex, but that's not really related to this topic.
> >
> > mirror number 0 just means "no mirror preference/pick the fastest
> > mirror" to the btrfs block mapping code.
> >
> > Btrfs has the concept of different raid levels for different logical
> > block numbers, so you get_num_copies m

Re: [PATCH 0/7] btrfs: don't BUG_ON btrfs_alloc_path errors

2011-07-14 Thread Tsutomu Itoh
Hi, Mark,

(2011/07/15 7:14), Mark Fasheh wrote:
> Hi,
> 
>   The following patches attempt to replace all the paths where we
> BUG_ON the return value of btrfs_alloc_path with proper error handling. It's
> pretty clear that these places aren't BUGing because of code error. To be
> explicit, much of the code is doing something like this:
> 
>   path = btrfs_alloc_path();
>   BUG_ON(!path);
> 
> which can be fixed by sending -ENOMEM back up the stack instead of the BUG.

I'm commenting to the similar patch posted before.
Please refer to the following mail.
  http://marc.info/?l=linux-btrfs&m=130258604905768&w=2

Thanks,
Tsutomu

> 
>   The first patch in my series fixes the most trivial sites in one go.
> The patches after the 1st fix one (more complicated) site each. In the patch
> descriptions I try my best to describe the thought process that went behind
> each change.
> 
>   Generally my guiding principle is that we want to "bubble up" some
> of the BUG_ON's that can be trapped and handled at a higher level -- the lower
> layer has an error and instead of killing the machine, sends it back up the
> stack for later handling
> 
>   I tested the patches with some kernel builds and snapshot commands.
> Please review - comments and feedback are welcome.
> 
> The patches can also be had from git:
> 
> git pull 
> git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/btrfs-error-handling.git
>  alloc_path
>   --Mark
> 
> 
>  extent-tree.c |   20 +++-
>  file-item.c   |7 +--
>  file.c|3 ++-
>  inode.c   |   49 +++--
>  tree-log.c|   12 +---
>  volumes.c |   12 
>  6 files changed, 74 insertions(+), 29 deletions(-)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mis-Design of Btrfs?

2011-07-14 Thread david

On Thu, 14 Jul 2011, Chris Mason wrote:


Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:

On 07/14/2011 07:38 AM, NeilBrown wrote:

On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler  wrote:


I'm certainly open to suggestions and collaboration.  Do you have in mind any
particular way to make the interface richer??

NeilBrown

Hi Neil,

I know that Chris has a very specific set of use cases for btrfs and think that
Alasdair and others have started to look at what is doable.

The obvious use case is the following:

If a file system uses checksumming or other data corruption detection bits, it
can detect that it got bad data on a write. If that data was protected by RAID,
it would like to ask the block layer to try to read from another mirror (for
raid1) or try to validate/rebuild from parity.

Today, I think that a retry will basically just give us back a random chance of
getting data from a different mirror or the same one that we got data from on
the first go.

Chris, Alasdair, was that a good summary of one concern?

Thanks!

Ric

I imagine a new field in 'struct bio' which was normally zero but could be
some small integer.  It is only meaningful for read.
When 0 it means "get this data way you like".
When non-zero it means "get this data using method N", where the different
methods are up to the device.

For a mirrored RAID, method N means read from device N-1.
For stripe/parity RAID, method 1 means "use other data blocks and parity
blocks to reconstruct data.

The default for non RAID devices is to return EINVAL for any N>  0.
A remapping device (dm-linear, dm-stripe etc) would just pass the number
down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
some thought.

So if btrfs reads a block and the checksum looks wrong, it reads again with
a larger N.  It continues incrementing N and retrying until it gets a block
that it likes or it gets EINVAL.  There should probably be an error code
(EAGAIN?) which means "I cannot work with that number, but try the next one".

It would be trivial for me to implement this for RAID1 and RAID10, and
relatively easy for RAID5.
I'd need to give a bit of thought to RAID6 as there are possibly multiple
ways to reconstruct from different combinations of parity and data.  I'm not
sure if there would be much point in doing that though.

It might make sense for a device to be able to report what the maximum
'N' supported is... that might make stacked raid easier to manage...

NeilBrown



I think that the above makes sense. Not sure what your "0" definition is, but I
would assume that for non-raided devices (i.e., a single s-ata disk), "0" would
be an indication that there is nothing more that can be tried. The data you got
is all you are ever going to get :)

For multiple mirrors, you might want to have a scheme where you would be able to
cycle through the mirrors. You could retry, cycling through the various mirrors
until you have tried and returned them all at which point you would get a "no
more" error back or some such thing.


Hi everyone,

The mirror number idea is basically what btrfs does today, and I think
it fits in with Neil's idea to have different policies for different
blocks.

Basically what btrfs does:

read_block(block_num, mirror = 0)
if (no io error and not csum error)
horray()

num_mirrors = get_num_copies(block number)
for (i = 1; i < num_mirrors; i++) {
read_block(block_num, mirror = i);
}

In a stacked configuration, the get_num_copies function can be smarter,
basically adding up all the copies of the lower levels and finding a way
to combine them.  We could just send down a fake bio that is responsible
for adding up the storage combinations into a bitmask or whatever works.

We could also just keep retrying until the lower layers reported no more
mirror were available.

In btrfs at least, we don't set the data blocks up to date until the crc
has passed, so replacing bogus blocks is easy.  Metadata is a little
more complex, but that's not really related to this topic.

mirror number 0 just means "no mirror preference/pick the fastest
mirror" to the btrfs block mapping code.

Btrfs has the concept of different raid levels for different logical
block numbers, so you get_num_copies might return one answer for block A
and a different answer for block B.

Either way, we could easily make use of a bio field here if it were
exported out.


you don't want to just pass the value down as that will cause problems 
with layering (especially if the lower layer supports more values than a 
higher layer)


I would suggest that each layer take the value it's give, do an integer 
division by the number of values that layer supports, using the modulo 
value for that layer and pass the rest of the result down to the next 
layer.


this works for just trying values until you get the error that tells you 
there are no more options.



things can get very 'intersesting' if the different options below you have 
different 

Re: [PATCH 7/7] btrfs: don't BUG_ON allocation errors in btrfs_drop_snapshot

2011-07-14 Thread Tsutomu Itoh
(2011/07/15 7:15), Mark Fasheh wrote:
> In addition to properly handling allocation failure from btrfs_alloc_path, I
> also fixed up the kzalloc error handling code immediately below it.

Need not you correct the caller of btrfs_drop_snapshot()?

Thanks,
Tsutomu

> 
> Signed-off-by: Mark Fasheh 
> ---
>  fs/btrfs/extent-tree.c |8 ++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index aa91773..a016590 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -6268,10 +6268,14 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>   int level;
>  
>   path = btrfs_alloc_path();
> - BUG_ON(!path);
> + if (!path)
> + return -ENOMEM;
>  
>   wc = kzalloc(sizeof(*wc), GFP_NOFS);
> - BUG_ON(!wc);
> + if (!wc) {
> + btrfs_free_path(path);
> + return -ENOMEM;
> + }
>  
>   trans = btrfs_start_transaction(tree_root, 0);
>   BUG_ON(IS_ERR(trans));


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix deadlock when throttling transactions

2011-07-14 Thread liubo
On 07/15/2011 01:26 AM, Josef Bacik wrote:
> Hit this nice little deadlock.  What happens is this
> 
> __btrfs_end_transaction with throttle set, --use_count so it equals 0
>   btrfs_commit_transaction
> 
> btrfs_end_transaction --use_count so now its -1 <== BAD
>   we just return and wait on the transaction
> 
> This is bad because we just return after our use_count is -1 and don't let go
> of our num_writer count on the transaction, so the guy committing the
> transaction just sits there forever.  Fix this by inc'ing our use_count if 
> we're
> going to call commit_transaction so that if we call btrfs_end_transaction it's
> valid.  Thanks,
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c |   13 ++---
>  1 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 654755b..00b81fb5 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -497,10 +497,17 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>   }
>  
>   if (lock && cur_trans->blocked && !cur_trans->in_commit) {
> - if (throttle)
> + if (throttle) {
> + /*
> +  * We may race with somebody else here so end up having
> +  * to call end_transaction on ourselves again, so inc
> +  * our use_count.
> +  */
> + trans->use_count++;
>   return btrfs_commit_transaction(trans, root);
> - else
> + } else {
>   wake_up_process(info->transaction_kthread);
> + }
>   }
>  
>   WARN_ON(cur_trans != info->running_transaction);
> @@ -1225,7 +1232,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans,
>   if (cur_trans->in_commit) {
>   spin_unlock(&cur_trans->commit_lock);
>   atomic_inc(&cur_trans->use_count);
> - btrfs_end_transaction(trans, root);
> + __btrfs_end_transaction(trans, root, 0, 1);
>  

Looks good.

BTW, btrfs_end_transaction(trans, root) is just __btrfs_end_transaction(trans, 
root, 0, 1).

thanks,
liubo

>   ret = wait_for_commit(root, cur_trans);
>   BUG_ON(ret);

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/7] btrfs: Don't BUG_ON alloc_path errors in find_next_chunk

2011-07-14 Thread Tsutomu Itoh
(2011/07/15 7:15), Mark Fasheh wrote:
> I also removed the BUG_ON from error return of find_next_chunk in
> init_first_rw_device(). It turns out that the only caller of
> init_first_rw_device() also BUGS on any nonzero return so no actual behavior
> change has occurred here.
> 
> Signed-off-by: Mark Fasheh 
> ---
>  fs/btrfs/volumes.c |6 --
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 530a2fc..90d956c 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1037,7 +1037,8 @@ static noinline int find_next_chunk(struct btrfs_root 
> *root,
>   struct btrfs_key found_key;
>  
>   path = btrfs_alloc_path();
> - BUG_ON(!path);
> + if (!path)
> + return -ENOMEM;

If find_next_chunk() returns -ENOMEM, space_info->full becomes 1 by following 
code.

3205 static int do_chunk_alloc(struct btrfs_trans_handle *trans,
3206   struct btrfs_root *extent_root, u64 alloc_bytes,
3207   u64 flags, int force)
3208 {
...
3277 ret = btrfs_alloc_chunk(trans, extent_root, flags);
3278 spin_lock(&space_info->lock);
3279 if (ret)
3280 space_info->full = 1;
3281 else
3282 ret = 1;

Is it OK?

Thanks,
Tsutomu

>  
>   key.objectid = objectid;
>   key.offset = (u64)-1;
> @@ -2663,7 +2664,8 @@ static noinline int init_first_rw_device(struct 
> btrfs_trans_handle *trans,
>  
>   ret = find_next_chunk(fs_info->chunk_root,
> BTRFS_FIRST_CHUNK_TREE_OBJECTID, &chunk_offset);
> - BUG_ON(ret);
> + if (ret)
> + return ret;
>  
>   alloc_profile = BTRFS_BLOCK_GROUP_METADATA |
>   (fs_info->metadata_alloc_profile &


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mis-Design of Btrfs?

2011-07-14 Thread Chris Mason
Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:
> On 07/14/2011 07:38 AM, NeilBrown wrote:
> > On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler  wrote:
> >
> >>> I'm certainly open to suggestions and collaboration.  Do you have in mind 
> >>> any
> >>> particular way to make the interface richer??
> >>>
> >>> NeilBrown
> >> Hi Neil,
> >>
> >> I know that Chris has a very specific set of use cases for btrfs and think 
> >> that
> >> Alasdair and others have started to look at what is doable.
> >>
> >> The obvious use case is the following:
> >>
> >> If a file system uses checksumming or other data corruption detection 
> >> bits, it
> >> can detect that it got bad data on a write. If that data was protected by 
> >> RAID,
> >> it would like to ask the block layer to try to read from another mirror 
> >> (for
> >> raid1) or try to validate/rebuild from parity.
> >>
> >> Today, I think that a retry will basically just give us back a random 
> >> chance of
> >> getting data from a different mirror or the same one that we got data from 
> >> on
> >> the first go.
> >>
> >> Chris, Alasdair, was that a good summary of one concern?
> >>
> >> Thanks!
> >>
> >> Ric
> > I imagine a new field in 'struct bio' which was normally zero but could be
> > some small integer.  It is only meaningful for read.
> > When 0 it means "get this data way you like".
> > When non-zero it means "get this data using method N", where the different
> > methods are up to the device.
> >
> > For a mirrored RAID, method N means read from device N-1.
> > For stripe/parity RAID, method 1 means "use other data blocks and parity
> > blocks to reconstruct data.
> >
> > The default for non RAID devices is to return EINVAL for any N>  0.
> > A remapping device (dm-linear, dm-stripe etc) would just pass the number
> > down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
> > some thought.
> >
> > So if btrfs reads a block and the checksum looks wrong, it reads again with
> > a larger N.  It continues incrementing N and retrying until it gets a block
> > that it likes or it gets EINVAL.  There should probably be an error code
> > (EAGAIN?) which means "I cannot work with that number, but try the next 
> > one".
> >
> > It would be trivial for me to implement this for RAID1 and RAID10, and
> > relatively easy for RAID5.
> > I'd need to give a bit of thought to RAID6 as there are possibly multiple
> > ways to reconstruct from different combinations of parity and data.  I'm not
> > sure if there would be much point in doing that though.
> >
> > It might make sense for a device to be able to report what the maximum
> > 'N' supported is... that might make stacked raid easier to manage...
> >
> > NeilBrown
> >
> 
> I think that the above makes sense. Not sure what your "0" definition is, but 
> I 
> would assume that for non-raided devices (i.e., a single s-ata disk), "0" 
> would 
> be an indication that there is nothing more that can be tried. The data you 
> got 
> is all you are ever going to get :)
> 
> For multiple mirrors, you might want to have a scheme where you would be able 
> to 
> cycle through the mirrors. You could retry, cycling through the various 
> mirrors 
> until you have tried and returned them all at which point you would get a "no 
> more" error back or some such thing.

Hi everyone,

The mirror number idea is basically what btrfs does today, and I think
it fits in with Neil's idea to have different policies for different
blocks.

Basically what btrfs does:

read_block(block_num, mirror = 0)
if (no io error and not csum error)
horray()

num_mirrors = get_num_copies(block number)
for (i = 1; i < num_mirrors; i++) {
read_block(block_num, mirror = i);
}

In a stacked configuration, the get_num_copies function can be smarter,
basically adding up all the copies of the lower levels and finding a way
to combine them.  We could just send down a fake bio that is responsible
for adding up the storage combinations into a bitmask or whatever works.

We could also just keep retrying until the lower layers reported no more
mirror were available.

In btrfs at least, we don't set the data blocks up to date until the crc
has passed, so replacing bogus blocks is easy.  Metadata is a little
more complex, but that's not really related to this topic.

mirror number 0 just means "no mirror preference/pick the fastest
mirror" to the btrfs block mapping code.

Btrfs has the concept of different raid levels for different logical
block numbers, so you get_num_copies might return one answer for block A
and a different answer for block B.

Either way, we could easily make use of a bio field here if it were
exported out.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] btrfs: don't BUG_ON btrfs_alloc_path() errors

2011-07-14 Thread Tsutomu Itoh
(2011/07/15 7:14), Mark Fasheh wrote:
> This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
> failure. All the sites that are fixed in this patch were checked by me to
> be fairly trivial to fix because of at least one of two criteria:
> 
>  - Callers of the function catch errors from it already so bubbling the
>error up will be handled.
>  - Callers of the function might BUG_ON any nonzero return code in which
>case there is no behavior changed (but we still got to remove a BUG_ON)
> 
> The following functions were updated:
> 
> btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
> btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
> btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
> insert_reserved_file_extent, and run_delalloc_nocow

Some patches have already been posted.
Please refer: http://marc.info/?l=linux-btrfs&m=131003114917755&w=2

Thanks,
Tsutomu

> 
> Signed-off-by: Mark Fasheh 
> ---
>  fs/btrfs/extent-tree.c |   12 +---
>  fs/btrfs/file-item.c   |7 +--
>  fs/btrfs/file.c|3 ++-
>  fs/btrfs/inode.c   |   18 +-
>  4 files changed, 29 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 71cd456..aa91773 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -667,7 +667,9 @@ int btrfs_lookup_extent(struct btrfs_root *root, u64 
> start, u64 len)
>   struct btrfs_path *path;
>  
>   path = btrfs_alloc_path();
> - BUG_ON(!path);
> + if (!path)
> + return -ENOMEM;
> +
>   key.objectid = start;
>   key.offset = len;
>   btrfs_set_key_type(&key, BTRFS_EXTENT_ITEM_KEY);
> @@ -5494,7 +5496,8 @@ static int alloc_reserved_tree_block(struct 
> btrfs_trans_handle *trans,
>   u32 size = sizeof(*extent_item) + sizeof(*block_info) + sizeof(*iref);
>  
>   path = btrfs_alloc_path();
> - BUG_ON(!path);
> + if (!path)
> + return -ENOMEM;
>  
>   path->leave_spinning = 1;
>   ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
> @@ -7162,7 +7165,10 @@ int btrfs_remove_block_group(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(&cluster->refill_lock);
>  
>   path = btrfs_alloc_path();
> - BUG_ON(!path);
> + if (!path) {
> + ret = -ENOMEM;
> + goto out;
> + }
>  
>   inode = lookup_free_space_inode(root, block_group, path);
>   if (!IS_ERR(inode)) {
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 90d4ee5..f92ff0e 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -282,7 +282,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 
> start, u64 end,
>   u16 csum_size = btrfs_super_csum_size(&root->fs_info->super_copy);
>  
>   path = btrfs_alloc_path();
> - BUG_ON(!path);
> + if (!path)
> + return -ENOMEM;
>  
>   if (search_commit) {
>   path->skip_locking = 1;
> @@ -672,7 +673,9 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle 
> *trans,
>   btrfs_super_csum_size(&root->fs_info->super_copy);
>  
>   path = btrfs_alloc_path();
> - BUG_ON(!path);
> + if (!path)
> + return -ENOMEM;
> +
>   sector_sum = sums->sums;
>  again:
>   next_offset = (u64)-1;
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index fa4ef18..23d1d81 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -855,7 +855,8 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle 
> *trans,
>   btrfs_drop_extent_cache(inode, start, end - 1, 0);
>  
>   path = btrfs_alloc_path();
> - BUG_ON(!path);
> + if (!path)
> + return -ENOMEM;
>  again:
>   recow = 0;
>   split = start;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 3601f0a..8be7d7a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1070,7 +1070,8 @@ static noinline int run_delalloc_nocow(struct inode 
> *inode,
>   u64 ino = btrfs_ino(inode);
>  
>   path = btrfs_alloc_path();
> - BUG_ON(!path);
> + if (!path)
> + return -ENOMEM;
>  
>   nolock = is_free_space_inode(root, inode);
>  
> @@ -1644,7 +1645,8 @@ static int insert_reserved_file_extent(struct 
> btrfs_trans_handle *trans,
>   int ret;
>  
>   path = btrfs_alloc_path();
> - BUG_ON(!path);
> + if (!path)
> + return -ENOMEM;
>  
>   path->leave_spinning = 1;
>  
> @@ -3713,7 +3715,8 @@ static int btrfs_inode_by_name(struct inode *dir, 
> struct dentry *dentry,
>   int ret = 0;
>  
>   path = btrfs_alloc_path();
> - BUG_ON(!path);
> + if (!path)
> + return -ENOMEM;
>  
>   di = btrfs_lookup_dir_item(NULL, root, path, btrfs_ino(dir), name,
>   namelen, 0);
> @@ -4438,7 +4441,8 @@ static struct inode *btrfs_new_inode(struct 
> btrfs_trans_handle *trans,
>   int owne

[PATCH 3/7] btrfs: Don't BUG_ON alloc_path errors in btrfs_truncate_inode_items

2011-07-14 Thread Mark Fasheh
I moved the path allocation up a few lines to the top of the function so
that we couldn't get into the state where we've dropped delayed items and
the extent cache but fail due to -ENOMEM.

Signed-off-by: Mark Fasheh 
---
 fs/btrfs/inode.c |9 +
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8be7d7a..a0faf7d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3172,6 +3172,11 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
 
BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
 
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+   path->reada = -1;
+
if (root->ref_cows || root == root->fs_info->tree_root)
btrfs_drop_extent_cache(inode, new_size & (~mask), (u64)-1, 0);
 
@@ -3184,10 +3189,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
if (min_type == 0 && root == BTRFS_I(inode)->root)
btrfs_kill_delayed_inode_items(inode);
 
-   path = btrfs_alloc_path();
-   BUG_ON(!path);
-   path->reada = -1;
-
key.objectid = ino;
key.offset = (u64)-1;
key.type = (u8)-1;
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/7] btrfs: Don't BUG_ON alloc_path errors in replay_one_buffer()

2011-07-14 Thread Mark Fasheh
The two ->process_func call sites in tree-log.c which were ignoring a return
code have also been updated to gracefully exit as well.

Signed-off-by: Mark Fasheh 
---
 fs/btrfs/tree-log.c |   12 +---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 4ce8a9f..f3cacc0 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1617,7 +1617,8 @@ static int replay_one_buffer(struct btrfs_root *log, 
struct extent_buffer *eb,
return 0;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
nritems = btrfs_header_nritems(eb);
for (i = 0; i < nritems; i++) {
@@ -1723,7 +1724,9 @@ static noinline int walk_down_log_tree(struct 
btrfs_trans_handle *trans,
return -ENOMEM;
 
if (*level == 1) {
-   wc->process_func(root, next, wc, ptr_gen);
+   ret = wc->process_func(root, next, wc, ptr_gen);
+   if (ret)
+   return ret;
 
path->slots[*level]++;
if (wc->free) {
@@ -1788,8 +1791,11 @@ static noinline int walk_up_log_tree(struct 
btrfs_trans_handle *trans,
parent = path->nodes[*level + 1];
 
root_owner = btrfs_header_owner(parent);
-   wc->process_func(root, path->nodes[*level], wc,
+   ret = wc->process_func(root, path->nodes[*level], wc,
 btrfs_header_generation(path->nodes[*level]));
+   if (ret)
+   return ret;
+
if (wc->free) {
struct extent_buffer *next;
 
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/7] btrfs: Don't BUG_ON alloc_path errors in find_next_chunk

2011-07-14 Thread Mark Fasheh
I also removed the BUG_ON from error return of find_next_chunk in
init_first_rw_device(). It turns out that the only caller of
init_first_rw_device() also BUGS on any nonzero return so no actual behavior
change has occurred here.

Signed-off-by: Mark Fasheh 
---
 fs/btrfs/volumes.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 530a2fc..90d956c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1037,7 +1037,8 @@ static noinline int find_next_chunk(struct btrfs_root 
*root,
struct btrfs_key found_key;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
key.objectid = objectid;
key.offset = (u64)-1;
@@ -2663,7 +2664,8 @@ static noinline int init_first_rw_device(struct 
btrfs_trans_handle *trans,
 
ret = find_next_chunk(fs_info->chunk_root,
  BTRFS_FIRST_CHUNK_TREE_OBJECTID, &chunk_offset);
-   BUG_ON(ret);
+   if (ret)
+   return ret;
 
alloc_profile = BTRFS_BLOCK_GROUP_METADATA |
(fs_info->metadata_alloc_profile &
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 7/7] btrfs: don't BUG_ON allocation errors in btrfs_drop_snapshot

2011-07-14 Thread Mark Fasheh
In addition to properly handling allocation failure from btrfs_alloc_path, I
also fixed up the kzalloc error handling code immediately below it.

Signed-off-by: Mark Fasheh 
---
 fs/btrfs/extent-tree.c |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index aa91773..a016590 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6268,10 +6268,14 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
int level;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
wc = kzalloc(sizeof(*wc), GFP_NOFS);
-   BUG_ON(!wc);
+   if (!wc) {
+   btrfs_free_path(path);
+   return -ENOMEM;
+   }
 
trans = btrfs_start_transaction(tree_root, 0);
BUG_ON(IS_ERR(trans));
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/7] btrfs: Don't BUG_ON alloc_path errors in btrfs_balance()

2011-07-14 Thread Mark Fasheh
Dealing with this seems trivial - the only caller of btrfs_balance() is
btrfs_ioctl() which passes the error code directly back to userspace. There
also isn't much state to unwind (if I'm wrong about this point, we can
always safely move the allocation to the top of btrfs_balance() anyway).

Signed-off-by: Mark Fasheh 
---
 fs/btrfs/volumes.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 19450bc..530a2fc 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2061,8 +2061,10 @@ int btrfs_balance(struct btrfs_root *dev_root)
 
/* step two, relocate all the chunks */
path = btrfs_alloc_path();
-   BUG_ON(!path);
-
+   if (!path) {
+   ret = -ENOMEM;
+   goto error;
+   }
key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
key.offset = (u64)-1;
key.type = BTRFS_CHUNK_ITEM_KEY;
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/7] btrfs: Don't BUG_ON alloc_path errors in btrfs_read_locked_inode

2011-07-14 Thread Mark Fasheh
btrfs_iget() also needed an update so that errors from btrfs_locked_inode()
are caught and bubbled back up.

Signed-off-by: Mark Fasheh 
---
 fs/btrfs/inode.c |   22 +-
 1 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a0faf7d..8882999 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2518,7 +2518,9 @@ static void btrfs_read_locked_inode(struct inode *inode)
filled = true;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   goto make_bad;
+
path->leave_spinning = 1;
memcpy(&location, &BTRFS_I(inode)->location, sizeof(location));
 
@@ -3973,6 +3975,7 @@ struct inode *btrfs_iget(struct super_block *s, struct 
btrfs_key *location,
 struct btrfs_root *root, int *new)
 {
struct inode *inode;
+   int bad_inode = 0;
 
inode = btrfs_iget_locked(s, location->objectid, root);
if (!inode)
@@ -3982,10 +3985,19 @@ struct inode *btrfs_iget(struct super_block *s, struct 
btrfs_key *location,
BTRFS_I(inode)->root = root;
memcpy(&BTRFS_I(inode)->location, location, sizeof(*location));
btrfs_read_locked_inode(inode);
-   inode_tree_add(inode);
-   unlock_new_inode(inode);
-   if (new)
-   *new = 1;
+   if (!is_bad_inode(inode)) {
+   inode_tree_add(inode);
+   unlock_new_inode(inode);
+   if (new)
+   *new = 1;
+   } else {
+   bad_inode = 1;
+   }
+   }
+
+   if (bad_inode) {
+   iput(inode);
+   inode = ERR_PTR(-ESTALE);
}
 
return inode;
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/7] btrfs: don't BUG_ON btrfs_alloc_path() errors

2011-07-14 Thread Mark Fasheh
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:

 - Callers of the function catch errors from it already so bubbling the
   error up will be handled.
 - Callers of the function might BUG_ON any nonzero return code in which
   case there is no behavior changed (but we still got to remove a BUG_ON)

The following functions were updated:

btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow

Signed-off-by: Mark Fasheh 
---
 fs/btrfs/extent-tree.c |   12 +---
 fs/btrfs/file-item.c   |7 +--
 fs/btrfs/file.c|3 ++-
 fs/btrfs/inode.c   |   18 +-
 4 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 71cd456..aa91773 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -667,7 +667,9 @@ int btrfs_lookup_extent(struct btrfs_root *root, u64 start, 
u64 len)
struct btrfs_path *path;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
+
key.objectid = start;
key.offset = len;
btrfs_set_key_type(&key, BTRFS_EXTENT_ITEM_KEY);
@@ -5494,7 +5496,8 @@ static int alloc_reserved_tree_block(struct 
btrfs_trans_handle *trans,
u32 size = sizeof(*extent_item) + sizeof(*block_info) + sizeof(*iref);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
path->leave_spinning = 1;
ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
@@ -7162,7 +7165,10 @@ int btrfs_remove_block_group(struct btrfs_trans_handle 
*trans,
spin_unlock(&cluster->refill_lock);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path) {
+   ret = -ENOMEM;
+   goto out;
+   }
 
inode = lookup_free_space_inode(root, block_group, path);
if (!IS_ERR(inode)) {
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 90d4ee5..f92ff0e 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -282,7 +282,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 
start, u64 end,
u16 csum_size = btrfs_super_csum_size(&root->fs_info->super_copy);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
if (search_commit) {
path->skip_locking = 1;
@@ -672,7 +673,9 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
btrfs_super_csum_size(&root->fs_info->super_copy);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
+
sector_sum = sums->sums;
 again:
next_offset = (u64)-1;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fa4ef18..23d1d81 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -855,7 +855,8 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle 
*trans,
btrfs_drop_extent_cache(inode, start, end - 1, 0);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 again:
recow = 0;
split = start;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3601f0a..8be7d7a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1070,7 +1070,8 @@ static noinline int run_delalloc_nocow(struct inode 
*inode,
u64 ino = btrfs_ino(inode);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
nolock = is_free_space_inode(root, inode);
 
@@ -1644,7 +1645,8 @@ static int insert_reserved_file_extent(struct 
btrfs_trans_handle *trans,
int ret;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
path->leave_spinning = 1;
 
@@ -3713,7 +3715,8 @@ static int btrfs_inode_by_name(struct inode *dir, struct 
dentry *dentry,
int ret = 0;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
di = btrfs_lookup_dir_item(NULL, root, path, btrfs_ino(dir), name,
namelen, 0);
@@ -4438,7 +4441,8 @@ static struct inode *btrfs_new_inode(struct 
btrfs_trans_handle *trans,
int owner;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return ERR_PTR(-ENOMEM);
 
inode = new_inode(root->fs_info->sb);
if (!inode) {
@@ -7194,7 +7198,11 @@ static int btrfs_symlink(struct inode *dir, struct 
dentry *dentry,
goto out_unlock;
 

[PATCH 0/7] btrfs: don't BUG_ON btrfs_alloc_path errors

2011-07-14 Thread Mark Fasheh
Hi,

The following patches attempt to replace all the paths where we
BUG_ON the return value of btrfs_alloc_path with proper error handling. It's
pretty clear that these places aren't BUGing because of code error. To be
explicit, much of the code is doing something like this:

path = btrfs_alloc_path();
BUG_ON(!path);

which can be fixed by sending -ENOMEM back up the stack instead of the BUG.

The first patch in my series fixes the most trivial sites in one go.
The patches after the 1st fix one (more complicated) site each. In the patch
descriptions I try my best to describe the thought process that went behind
each change.

Generally my guiding principle is that we want to "bubble up" some
of the BUG_ON's that can be trapped and handled at a higher level -- the lower
layer has an error and instead of killing the machine, sends it back up the
stack for later handling

I tested the patches with some kernel builds and snapshot commands.
Please review - comments and feedback are welcome.

The patches can also be had from git:

git pull 
git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/btrfs-error-handling.git 
alloc_path
--Mark


 extent-tree.c |   20 +++-
 file-item.c   |7 +--
 file.c|3 ++-
 inode.c   |   49 +++--
 tree-log.c|   12 +---
 volumes.c |   12 
 6 files changed, 74 insertions(+), 29 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] btrfs: don't BUG_ON btrfs_alloc_path() errors

2011-07-14 Thread Mark Fasheh
On Thu, Jul 14, 2011 at 03:00:07PM -0700, Mark Fasheh wrote:
> This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
> failure. All the sites that are fixed in this patch were checked by me to
> be fairly trivial to fix because of at least one of two criteria:

Please ignore - these didn't get sent out right, sorry for that. Will resend
ASAP.
--Mark

--
Mark Fasheh
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/7] btrfs: don't BUG_ON btrfs_alloc_path() errors

2011-07-14 Thread Mark Fasheh
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:

 - Callers of the function catch errors from it already so bubbling the
   error up will be handled.
 - Callers of the function might BUG_ON any nonzero return code in which
   case there is no behavior changed (but we still got to remove a BUG_ON)

The following functions were updated:

btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow

Signed-off-by: Mark Fasheh 
---
 fs/btrfs/extent-tree.c |   12 +---
 fs/btrfs/file-item.c   |7 +--
 fs/btrfs/file.c|3 ++-
 fs/btrfs/inode.c   |   18 +-
 4 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 71cd456..aa91773 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -667,7 +667,9 @@ int btrfs_lookup_extent(struct btrfs_root *root, u64 start, 
u64 len)
struct btrfs_path *path;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
+
key.objectid = start;
key.offset = len;
btrfs_set_key_type(&key, BTRFS_EXTENT_ITEM_KEY);
@@ -5494,7 +5496,8 @@ static int alloc_reserved_tree_block(struct 
btrfs_trans_handle *trans,
u32 size = sizeof(*extent_item) + sizeof(*block_info) + sizeof(*iref);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
path->leave_spinning = 1;
ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
@@ -7162,7 +7165,10 @@ int btrfs_remove_block_group(struct btrfs_trans_handle 
*trans,
spin_unlock(&cluster->refill_lock);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path) {
+   ret = -ENOMEM;
+   goto out;
+   }
 
inode = lookup_free_space_inode(root, block_group, path);
if (!IS_ERR(inode)) {
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 90d4ee5..f92ff0e 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -282,7 +282,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 
start, u64 end,
u16 csum_size = btrfs_super_csum_size(&root->fs_info->super_copy);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
if (search_commit) {
path->skip_locking = 1;
@@ -672,7 +673,9 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
btrfs_super_csum_size(&root->fs_info->super_copy);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
+
sector_sum = sums->sums;
 again:
next_offset = (u64)-1;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fa4ef18..23d1d81 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -855,7 +855,8 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle 
*trans,
btrfs_drop_extent_cache(inode, start, end - 1, 0);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 again:
recow = 0;
split = start;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3601f0a..8be7d7a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1070,7 +1070,8 @@ static noinline int run_delalloc_nocow(struct inode 
*inode,
u64 ino = btrfs_ino(inode);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
nolock = is_free_space_inode(root, inode);
 
@@ -1644,7 +1645,8 @@ static int insert_reserved_file_extent(struct 
btrfs_trans_handle *trans,
int ret;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
path->leave_spinning = 1;
 
@@ -3713,7 +3715,8 @@ static int btrfs_inode_by_name(struct inode *dir, struct 
dentry *dentry,
int ret = 0;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
di = btrfs_lookup_dir_item(NULL, root, path, btrfs_ino(dir), name,
namelen, 0);
@@ -4438,7 +4441,8 @@ static struct inode *btrfs_new_inode(struct 
btrfs_trans_handle *trans,
int owner;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return ERR_PTR(-ENOMEM);
 
inode = new_inode(root->fs_info->sb);
if (!inode) {
@@ -7194,7 +7198,11 @@ static int btrfs_symlink(struct inode *dir, struct 
dentry *dentry,
goto out_unlock;
 

[PATCH 0/7] btrfs: don't BUG_ON btrfs_alloc_path errors

2011-07-14 Thread Mark Fasheh
Hi,

The following patches attempt to replace all the paths where we
BUG_ON the return value of btrfs_alloc_path with proper error handling. It's
pretty clear that these places aren't BUGing because of code error. To be
explicit, much of the code is doing something like this:

path = btrfs_alloc_path();
BUG_ON(!path);

which can be fixed by sending -ENOMEM back up the stack instead of the BUG.

The first patch in my series fixes the most trivial sites in one go.
The patches after the 1st fix one (more complicated) site each. In the patch
descriptions I try my best to describe the thought process that went behind
each change.

Generally my guiding principle is that we want to "bubble up" some
of the BUG_ON's that can be trapped and handled at a higher level -- the lower
layer has an error and instead of killing the machine, sends it back up the
stack for later handling

I tested the patches with some kernel builds and snapshot commands.
Please review - comments and feedback are welcome.

The patches can also be had from git:

git pull 
git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/btrfs-error-handling.git 
alloc_path
--Mark


 extent-tree.c |   20 +++-
 file-item.c   |7 +--
 file.c|3 ++-
 inode.c   |   49 +++--
 tree-log.c|   12 +---
 volumes.c |   12 
 6 files changed, 74 insertions(+), 29 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: make btrfs_set_root_node void

2011-07-14 Thread Mark Fasheh
This is fairly trivial - btrfs_set_root_node() - always returns zero so we
can just make it void.  All callers ignore the return code now anyway.  I
also made sure to check that none of the functions that
btrfs_set_root_node() calls returns an error that we might have needed to
catch and pass back.

Signed-off-by: Mark Fasheh 
---
 fs/btrfs/ctree.h |4 ++--
 fs/btrfs/root-tree.c |5 ++---
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3b859a3..20bd05f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2404,8 +2404,8 @@ int btrfs_find_last_root(struct btrfs_root *root, u64 
objectid, struct
 btrfs_root_item *item, struct btrfs_key *key);
 int btrfs_find_dead_roots(struct btrfs_root *root, u64 objectid);
 int btrfs_find_orphan_roots(struct btrfs_root *tree_root);
-int btrfs_set_root_node(struct btrfs_root_item *item,
-   struct extent_buffer *node);
+void btrfs_set_root_node(struct btrfs_root_item *item,
+struct extent_buffer *node);
 void btrfs_check_and_init_root_item(struct btrfs_root_item *item);
 
 /* dir-item.c */
diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
index ea96ab8..d411cb4 100644
--- a/fs/btrfs/root-tree.c
+++ b/fs/btrfs/root-tree.c
@@ -71,13 +71,12 @@ out:
return ret;
 }
 
-int btrfs_set_root_node(struct btrfs_root_item *item,
-   struct extent_buffer *node)
+void btrfs_set_root_node(struct btrfs_root_item *item,
+struct extent_buffer *node)
 {
btrfs_set_root_bytenr(item, node->start);
btrfs_set_root_level(item, btrfs_header_level(node));
btrfs_set_root_generation(item, btrfs_header_generation(node));
-   return 0;
 }
 
 /*
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Delayed inode operations not doing the right thing with enospc

2011-07-14 Thread Josef Bacik
On 07/14/2011 03:27 AM, Christian Brunner wrote:
> 2011/7/13 Josef Bacik :
>> On 07/12/2011 11:20 AM, Christian Brunner wrote:
>>> 2011/6/7 Josef Bacik :
 On 06/06/2011 09:39 PM, Miao Xie wrote:
> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>> I got a lot of these when running stress.sh on my test box
>>
>>
>>
>> This is because use_block_rsv() is having to do a
>> reserve_metadata_bytes(), which shouldn't happen as we should have
>> reserved enough space for those operations to complete.  This is
>> happening because use_block_rsv() will call get_block_rsv(), which if
>> root->ref_cows is set (which is the case on all fs roots) we will use
>> trans->block_rsv, which will only have what the current transaction
>> starter had reserved.
>>
>> What needs to be done instead is we need to have a block reserve that
>> any reservation that is done at create time for these inodes is migrated
>> to this special reserve, and then when you run the delayed inode items
>> stuff you set trans->block_rsv to the special block reserve so the
>> accounting is all done properly.
>>
>> This is just off the top of my head, there may be a better way to do it,
>> I've not actually looked that the delayed inode code at all.
>>
>> I would do this myself but I have a ever increasing list of shit to do
>> so will somebody pick this up and fix it please?  Thanks,
>
> Sorry, it's my miss.
> I forgot to set trans->block_rsv to global_block_rsv, since we have 
> migrated
> the space from trans_block_rsv to global_block_rsv.
>
> I'll fix it soon.
>

 There is another problem, we're failing xfstest 204.  I tried making
 reserve_metadata_bytes commit the transaction regardless of whether or
 not there were pinned bytes but the test just hung there.  Usually it
 takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
 204 just creates a crap ton of files, which is what is killing us.
 There needs to be a way to start flushing delayed inode items so we can
 reclaim the space they are holding onto so we don't get enospc, and it
 needs to be better than just committing the transaction because that is
 dog slow.  Thanks,

 Josef
>>>
>>> Is there a solution for this?
>>>
>>> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
>>> (except the pluging). When starting a ceph rebuild on the btrfs
>>> volumes I get a lot of warnings from block_rsv_use_bytes in
>>> use_block_rsv:
>>>
>>
>> Ok I think I've got this nailed down.  Will you run with this patch and make 
>> sure the warnings go away?  Thanks,
> 
> I'm sorry, I'm still getting a lot of warnings like the one below.
> 
> I've also noticed, that I'm not getting these messages when the
> free_space_cache is disabled.
> 

Ok ditch that previous patch and try this one, it should work.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 52d7eca..2263d29 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -112,9 +112,6 @@ struct btrfs_inode {
 */
u64 disk_i_size;

-   /* flags field from the on disk inode */
-   u32 flags;
-
/*
 * if this is a directory then index_cnt is the counter for the index
 * number for new files that are created
@@ -128,14 +125,8 @@ struct btrfs_inode {
 */
u64 last_unlink_trans;

-   /*
-* Counters to keep track of the number of extent item's we may use due
-* to delalloc and such.  outstanding_extents is the number of extent
-* items we think we'll end up using, and reserved_extents is the number
-* of extent items we've reserved metadata for.
-*/
-   atomic_t outstanding_extents;
-   atomic_t reserved_extents;
+   /* flags field from the on disk inode */
+   u32 flags;

/*
 * ordered_data_close is set by truncate when a file that used
@@ -151,12 +142,21 @@ struct btrfs_inode {
unsigned orphan_meta_reserved:1;
unsigned dummy_inode:1;
unsigned in_defrag:1;
-
/*
 * always compress this one file
 */
unsigned force_compress:4;

+   /*
+* Counters to keep track of the number of extent item's we may use due
+* to delalloc and such.  outstanding_extents is the number of extent
+* items we think we'll end up using, and reserved_extents is the number
+* of extent items we've reserved metadata for.
+*/
+   spinlock_t extents_count_lock;
+   unsigned outstanding_extents;
+   unsigned reserved_extents;
+
struct btrfs_delayed_node *delayed_node;

struct inode vfs_inode;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index be02cae..3ba4d5f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2133,7 +2133,7 @@ static inline bool btrfs_mixed_space_info(s

Re: Mis-Design of Btrfs?

2011-07-14 Thread Erik Jensen
 On Thu, Jul 14, 2011 at 12:50 PM, John Stoffel  wrote:
>> "Alasdair" == Alasdair G Kergon  writes:
>
> Alasdair> On Thu, Jul 14, 2011 at 04:38:36PM +1000, Neil Brown wrote:
>>> It might make sense for a device to be able to report what the maximum
>>> 'N' supported is... that might make stacked raid easier to manage...
>
> Alasdair> I'll just say that any solution ought to be stackable.
>
> I've been mulling this over too and wondering how you'd handle this,
> because upper layers really can't peak down into lower layers easily.
> As far as I understand things.
>
> So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks
>
> How does btrfs handle errors (or does it even see them!) from the
> raid6 level when a single nbd device goes away?  Or taking the
> original example, when btrfs notices a checksum isn't correct, how
> would it push down multiple levels to try and find the correct data?
>
> Alasdair> This means understanding both that the number of data access
> Alasdair> routes may vary as you move through the stack, and that this
> Alasdair> number may depend on the offset within the device.
>
> It almost seems to me that the retry needs to be done at each level on
> it's own, without pushing down or up the stack.  But this doesn't
> handle the wrong file checksum issue.
>
> Hmm... maybe instead of just one number, we need another to count the
> levels down you go (or just split 16bit integer in half, bottom half
> being count of tries, the upper half being levels down to try that
> read?)
>
> It seems to defeat the purpose of layers if you can go down and find
> out how many layers there are underneath you
>
> John

A random thought: What if we allow the number to wrap at each level,
and, each time it wraps, increment the number passed to the next lower
level.

A zero would propagate down, letting each level do what it wants:
luks: 0
raid1: 0
raid6: 0
nbd: 0

And higher numbers would indicate the method at each level:

For a 1:
luks: 1
raid1: 1
raid6: 1
nbd: 1

For a 3:
luks: 1 (only one possibility, passes three down)
raid1: 1 (two possibilities, so we wrap back to one and pass two down,
since we wrapped once)
raid6: 2 (not wrapped)
nbd: 1

When the bottom-most level gets an N that it can't handle, it would
return EINVAL, which would be propagated up the stack.

This would allow the same algorithm of incrementing N until we receive
good data or EINVAL, and would exhaust all ways of reading the data at
all levels.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mis-Design of Btrfs?

2011-07-14 Thread david

On Thu, 14 Jul 2011, John Stoffel wrote:


Alasdair> I'll just say that any solution ought to be stackable.

I've been mulling this over too and wondering how you'd handle this,
because upper layers really can't peak down into lower layers easily.
As far as I understand things.

So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks

How does btrfs handle errors (or does it even see them!) from the
raid6 level when a single nbd device goes away?  Or taking the
original example, when btrfs notices a checksum isn't correct, how
would it push down multiple levels to try and find the correct data?

Alasdair> This means understanding both that the number of data access
Alasdair> routes may vary as you move through the stack, and that this
Alasdair> number may depend on the offset within the device.

It almost seems to me that the retry needs to be done at each level on
it's own, without pushing down or up the stack.  But this doesn't
handle the wrong file checksum issue.

Hmm... maybe instead of just one number, we need another to count the
levels down you go (or just split 16bit integer in half, bottom half
being count of tries, the upper half being levels down to try that
read?)

It seems to defeat the purpose of layers if you can go down and find
out how many layers there are underneath you


this is why just an arbatrary 'method number' rather than a bitmap or 
something like that should be used.


using your example:


So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks


raid1 has at least 2 values, raid 6 has at least 2 values, the combination 
of the two stacked should have at least 4 values.


if each layer can query the layer below it to find out how many methods it 
supports, it can then combine each of the methods it supports with each of 
the methods supported by the layer below it.


this will stack to an arbatrary number of layers, only limited by how 
large the value is allowed to be limiting the combinational permutations 
of all the layers options.


David Lang
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mis-Design of Btrfs?

2011-07-14 Thread Erik Jensen
 On Wed, Jun 29, 2011 at 3:47 AM, A. James Lewis  wrote:
> Is there a possibility that one could have a 3 disk RAID5 array, and
> then add a 4th disk and then do a balance, growing the RAID5 onto 4
> disks and gaining the space still with RAID5?  It seems that to be
> consistent, BTRFS would have to do this.
>
> If this is the case, then I think that the BTRFS implementation of RAID5
> would have to be quite different to the MD implementation.
>
> James.

 My understanding, gleaned from IRC several months ago, is that Btrfs
 would use the new drive, but not change the stripe size. Each
 allocation would then be written across some selection of three of the
 four drives.

 In other words, if you started with four stripes across three drives:
   AAA
   BBB
   CCC
   DDD
 and then added a drive and balanced, you might get something like:
   AAAB
   BBCC
   CDDD
 which would give you more space, but still use ⅓ of the space for parity.

 Trying to remove a drive from the original three-drive configuration
 would be an error, similarly to trying to remove the second to last
 drive in RAID 1, currently.

 Actually changing the stripe size would be done using the same
 mechanism as changing RAID levels.

 Again, this is just an interested but uninvolved person's
 understanding based on an IRC conversation, so please salt to taste.

 -- Erik
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mis-Design of Btrfs?

2011-07-14 Thread John Stoffel
> "Alasdair" == Alasdair G Kergon  writes:

Alasdair> On Thu, Jul 14, 2011 at 04:38:36PM +1000, Neil Brown wrote:
>> It might make sense for a device to be able to report what the maximum
>> 'N' supported is... that might make stacked raid easier to manage...
 
Alasdair> I'll just say that any solution ought to be stackable.

I've been mulling this over too and wondering how you'd handle this,
because upper layers really can't peak down into lower layers easily.
As far as I understand things.

So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks

How does btrfs handle errors (or does it even see them!) from the
raid6 level when a single nbd device goes away?  Or taking the
original example, when btrfs notices a checksum isn't correct, how
would it push down multiple levels to try and find the correct data? 

Alasdair> This means understanding both that the number of data access
Alasdair> routes may vary as you move through the stack, and that this
Alasdair> number may depend on the offset within the device.

It almost seems to me that the retry needs to be done at each level on
it's own, without pushing down or up the stack.  But this doesn't
handle the wrong file checksum issue.  

Hmm... maybe instead of just one number, we need another to count the
levels down you go (or just split 16bit integer in half, bottom half
being count of tries, the upper half being levels down to try that
read?)

It seems to defeat the purpose of layers if you can go down and find
out how many layers there are underneath you

John

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: kill reserved_bytes in inode

2011-07-14 Thread Josef Bacik
reserved_bytes is not used for anything in the inode, remove it.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/btrfs_inode.h |5 -
 fs/btrfs/extent-tree.c |2 --
 fs/btrfs/inode.c   |1 -
 3 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 120240a..25e2ceb 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -100,11 +100,6 @@ struct btrfs_inode {
 */
u64 delalloc_bytes;
 
-   /* total number of bytes that may be used for this inode for
-* delalloc
-*/
-   u64 reserved_bytes;
-
/*
 * the size of the file stored in the metadata on disk.  data=ordered
 * means the in-memory i_size might be larger than the size on disk
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 152669b..33e5205 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3113,7 +3113,6 @@ commit_trans:
return -ENOSPC;
}
data_sinfo->bytes_may_use += bytes;
-   BTRFS_I(inode)->reserved_bytes += bytes;
spin_unlock(&data_sinfo->lock);
 
return 0;
@@ -3135,7 +3134,6 @@ void btrfs_free_reserved_data_space(struct inode *inode, 
u64 bytes)
data_sinfo = BTRFS_I(inode)->space_info;
spin_lock(&data_sinfo->lock);
data_sinfo->bytes_may_use -= bytes;
-   BTRFS_I(inode)->reserved_bytes -= bytes;
spin_unlock(&data_sinfo->lock);
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index faf516e..3a32131 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6771,7 +6771,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
ei->last_sub_trans = 0;
ei->logged_trans = 0;
ei->delalloc_bytes = 0;
-   ei->reserved_bytes = 0;
ei->disk_i_size = 0;
ei->flags = 0;
ei->index_cnt = (u64)-1;
-- 
1.7.5.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: move stuff around in btrfs_inode to get better packing

2011-07-14 Thread Josef Bacik
Moving things around to give us better packing in the btrfs_inode.  This reduces
the size of our inode by 8 bytes.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/btrfs_inode.h |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 52d7eca..120240a 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -112,9 +112,6 @@ struct btrfs_inode {
 */
u64 disk_i_size;
 
-   /* flags field from the on disk inode */
-   u32 flags;
-
/*
 * if this is a directory then index_cnt is the counter for the index
 * number for new files that are created
@@ -128,6 +125,9 @@ struct btrfs_inode {
 */
u64 last_unlink_trans;
 
+   /* flags field from the on disk inode */
+   u32 flags;
+
/*
 * Counters to keep track of the number of extent item's we may use due
 * to delalloc and such.  outstanding_extents is the number of extent
-- 
1.7.5.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Delayed inode operations not doing the right thing with enospc

2011-07-14 Thread Josef Bacik
On 07/14/2011 03:27 AM, Christian Brunner wrote:
> 2011/7/13 Josef Bacik :
>> On 07/12/2011 11:20 AM, Christian Brunner wrote:
>>> 2011/6/7 Josef Bacik :
 On 06/06/2011 09:39 PM, Miao Xie wrote:
> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>> I got a lot of these when running stress.sh on my test box
>>
>>
>>
>> This is because use_block_rsv() is having to do a
>> reserve_metadata_bytes(), which shouldn't happen as we should have
>> reserved enough space for those operations to complete.  This is
>> happening because use_block_rsv() will call get_block_rsv(), which if
>> root->ref_cows is set (which is the case on all fs roots) we will use
>> trans->block_rsv, which will only have what the current transaction
>> starter had reserved.
>>
>> What needs to be done instead is we need to have a block reserve that
>> any reservation that is done at create time for these inodes is migrated
>> to this special reserve, and then when you run the delayed inode items
>> stuff you set trans->block_rsv to the special block reserve so the
>> accounting is all done properly.
>>
>> This is just off the top of my head, there may be a better way to do it,
>> I've not actually looked that the delayed inode code at all.
>>
>> I would do this myself but I have a ever increasing list of shit to do
>> so will somebody pick this up and fix it please?  Thanks,
>
> Sorry, it's my miss.
> I forgot to set trans->block_rsv to global_block_rsv, since we have 
> migrated
> the space from trans_block_rsv to global_block_rsv.
>
> I'll fix it soon.
>

 There is another problem, we're failing xfstest 204.  I tried making
 reserve_metadata_bytes commit the transaction regardless of whether or
 not there were pinned bytes but the test just hung there.  Usually it
 takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
 204 just creates a crap ton of files, which is what is killing us.
 There needs to be a way to start flushing delayed inode items so we can
 reclaim the space they are holding onto so we don't get enospc, and it
 needs to be better than just committing the transaction because that is
 dog slow.  Thanks,

 Josef
>>>
>>> Is there a solution for this?
>>>
>>> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
>>> (except the pluging). When starting a ceph rebuild on the btrfs
>>> volumes I get a lot of warnings from block_rsv_use_bytes in
>>> use_block_rsv:
>>>
>>
>> Ok I think I've got this nailed down.  Will you run with this patch and make 
>> sure the warnings go away?  Thanks,
> 
> I'm sorry, I'm still getting a lot of warnings like the one below.
> 
> I've also noticed, that I'm not getting these messages when the
> free_space_cache is disabled.
> 
>

Actually scratch that last note, it's wrong.  I'll send you an updated
patch when I've got this mess all sorted out.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix deadlock when throttling transactions

2011-07-14 Thread Josef Bacik
Hit this nice little deadlock.  What happens is this

__btrfs_end_transaction with throttle set, --use_count so it equals 0
  btrfs_commit_transaction

btrfs_end_transaction --use_count so now its -1 <== BAD
  we just return and wait on the transaction

This is bad because we just return after our use_count is -1 and don't let go
of our num_writer count on the transaction, so the guy committing the
transaction just sits there forever.  Fix this by inc'ing our use_count if we're
going to call commit_transaction so that if we call btrfs_end_transaction it's
valid.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/transaction.c |   13 ++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 654755b..00b81fb5 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -497,10 +497,17 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
}
 
if (lock && cur_trans->blocked && !cur_trans->in_commit) {
-   if (throttle)
+   if (throttle) {
+   /*
+* We may race with somebody else here so end up having
+* to call end_transaction on ourselves again, so inc
+* our use_count.
+*/
+   trans->use_count++;
return btrfs_commit_transaction(trans, root);
-   else
+   } else {
wake_up_process(info->transaction_kthread);
+   }
}
 
WARN_ON(cur_trans != info->running_transaction);
@@ -1225,7 +1232,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
if (cur_trans->in_commit) {
spin_unlock(&cur_trans->commit_lock);
atomic_inc(&cur_trans->use_count);
-   btrfs_end_transaction(trans, root);
+   __btrfs_end_transaction(trans, root, 0, 1);
 
ret = wait_for_commit(root, cur_trans);
BUG_ON(ret);
-- 
1.7.5.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mis-Design of Btrfs?

2011-07-14 Thread Alasdair G Kergon
On Thu, Jul 14, 2011 at 04:38:36PM +1000, Neil Brown wrote:
> It might make sense for a device to be able to report what the maximum
> 'N' supported is... that might make stacked raid easier to manage...
 
I'll just say that any solution ought to be stackable.
This means understanding both that the number of data access routes may
vary as you move through the stack, and that this number may depend on
the offset within the device.

Alasdair

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: Mis-Design of Btrfs?

2011-07-14 Thread Goffredo Baroncelli
On 07/14/2011 08:38 AM, NeilBrown wrote:
> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler  wrote:
> 
>>> I'm certainly open to suggestions and collaboration.  Do you have in mind 
>>> any
>>> particular way to make the interface richer??
>>>
>>> NeilBrown
>>
>> Hi Neil,
>>
>> I know that Chris has a very specific set of use cases for btrfs and think 
>> that 
>> Alasdair and others have started to look at what is doable.
>>
>> The obvious use case is the following:
>>
>> If a file system uses checksumming or other data corruption detection bits, 
>> it 
>> can detect that it got bad data on a write. If that data was protected by 
>> RAID, 
>> it would like to ask the block layer to try to read from another mirror (for 
>> raid1) or try to validate/rebuild from parity.
>>
>> Today, I think that a retry will basically just give us back a random chance 
>> of 
>> getting data from a different mirror or the same one that we got data from 
>> on 
>> the first go.
>>
>> Chris, Alasdair, was that a good summary of one concern?
>>
>> Thanks!
>>
>> Ric
> 
> I imagine a new field in 'struct bio' which was normally zero but could be
> some small integer.  It is only meaningful for read.
> When 0 it means "get this data way you like".
> When non-zero it means "get this data using method N", where the different
> methods are up to the device.

In more general terms, the filesystem should be able to require: try
another read different regarding the previous ones. The term are
important because we should differentiate the case of "wrong data from
disk1, read from disk0" and "wrong data from disk0 read disk1". I prefer
thinking the field as bitmap. Every bit represent a different way of
read. So it is possible to reuse to track which "kind of read" was
already used.

After a 2nd read, the block layer should:
a) redo the read if possible, otherwise FAIL
b) pass the data to the filesystem
c) if the filesystem accepts the new data, replace the wrong
   data with the correct one or mark the block as broken.
d) inform the userspace/filesystem of the result

> 
> For a mirrored RAID, method N means read from device N-1.
> For stripe/parity RAID, method 1 means "use other data blocks and parity
> blocks to reconstruct data.
> 
> The default for non RAID devices is to return EINVAL for any N > 0.
> A remapping device (dm-linear, dm-stripe etc) would just pass the number
> down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
> some thought.
> 
> So if btrfs reads a block and the checksum looks wrong, it reads again with
> a larger N.  It continues incrementing N and retrying until it gets a block
> that it likes or it gets EINVAL.  There should probably be an error code
> (EAGAIN?) which means "I cannot work with that number, but try the next one".
> 
> It would be trivial for me to implement this for RAID1 and RAID10, and
> relatively easy for RAID5.
> I'd need to give a bit of thought to RAID6 as there are possibly multiple
> ways to reconstruct from different combinations of parity and data.  I'm not
> sure if there would be much point in doing that though.
> 
> It might make sense for a device to be able to report what the maximum
> 'N' supported is... that might make stacked raid easier to manage...
> 
> NeilBrown
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> .
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Delayed inode operations not doing the right thing with enospc

2011-07-14 Thread Josef Bacik
On 07/14/2011 03:27 AM, Christian Brunner wrote:
> 2011/7/13 Josef Bacik :
>> On 07/12/2011 11:20 AM, Christian Brunner wrote:
>>> 2011/6/7 Josef Bacik :
 On 06/06/2011 09:39 PM, Miao Xie wrote:
> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>> I got a lot of these when running stress.sh on my test box
>>
>>
>>
>> This is because use_block_rsv() is having to do a
>> reserve_metadata_bytes(), which shouldn't happen as we should have
>> reserved enough space for those operations to complete.  This is
>> happening because use_block_rsv() will call get_block_rsv(), which if
>> root->ref_cows is set (which is the case on all fs roots) we will use
>> trans->block_rsv, which will only have what the current transaction
>> starter had reserved.
>>
>> What needs to be done instead is we need to have a block reserve that
>> any reservation that is done at create time for these inodes is migrated
>> to this special reserve, and then when you run the delayed inode items
>> stuff you set trans->block_rsv to the special block reserve so the
>> accounting is all done properly.
>>
>> This is just off the top of my head, there may be a better way to do it,
>> I've not actually looked that the delayed inode code at all.
>>
>> I would do this myself but I have a ever increasing list of shit to do
>> so will somebody pick this up and fix it please?  Thanks,
>
> Sorry, it's my miss.
> I forgot to set trans->block_rsv to global_block_rsv, since we have 
> migrated
> the space from trans_block_rsv to global_block_rsv.
>
> I'll fix it soon.
>

 There is another problem, we're failing xfstest 204.  I tried making
 reserve_metadata_bytes commit the transaction regardless of whether or
 not there were pinned bytes but the test just hung there.  Usually it
 takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
 204 just creates a crap ton of files, which is what is killing us.
 There needs to be a way to start flushing delayed inode items so we can
 reclaim the space they are holding onto so we don't get enospc, and it
 needs to be better than just committing the transaction because that is
 dog slow.  Thanks,

 Josef
>>>
>>> Is there a solution for this?
>>>
>>> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
>>> (except the pluging). When starting a ceph rebuild on the btrfs
>>> volumes I get a lot of warnings from block_rsv_use_bytes in
>>> use_block_rsv:
>>>
>>
>> Ok I think I've got this nailed down.  Will you run with this patch and make 
>> sure the warnings go away?  Thanks,
> 
> I'm sorry, I'm still getting a lot of warnings like the one below.
> 
> I've also noticed, that I'm not getting these messages when the
> free_space_cache is disabled.
> 
>

Ok I see what's wrong, our checksum calculation is completely bogus.
I'm in the middle of something big so I can't give you a nice clean
patch, so if you can just go into extent-tree.c and replace
calc_csum_metadata_size with this you should be good to go

static u64 calc_csum_metadata_size(struct inode *inode, u64 num_bytes)
{
struct btrfs_root *root = BTRFS_I(inode)->root;
int num_leaves;
int num_csums;
u16 csum_size =
btrfs_super_csum_size(&root->fs_info->super_copy);

num_csums = (int)div64_u64(num_bytes, root->sectorsize);
num_leaves = (int)((num_csums * csum_size) / root->leafsize);

return btrfs_calc_trans_metadata_size(root, num_leaves);
}


Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs panic

2011-07-14 Thread Xiao Guangrong
When xfstests 224 was running, the box was panic, and i got this message:

[ 1998.327235] =
[ 1998.329940] [ INFO: possible recursive locking detected ]
[ 1998.329940] 2.6.39+ #3
[ 1998.329940] -
[ 1998.329940] dd/25718 is trying to acquire lock:
[ 1998.329940]  (&(&eb->lock)->rlock){+.+...}, at: [] 
btrfs_try_spin_lock+0x2a/0x89 [btrfs]
[ 1998.329940] 
[ 1998.329940] but task is already holding lock:
[ 1998.329940]  (&(&eb->lock)->rlock){+.+...}, at: [] 
btrfs_clear_lock_blocking+0x22/0x2b [btrfs]
[ 1998.478275] 
[ 1998.478275] other info that might help us debug this:
[ 1998.478275] 2 locks held by dd/25718:
[ 1998.478275]  #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: 
[] btrfs_file_aio_write+0xdc/0x49a [btrfs]
[ 1998.478275]  #1:  (&(&eb->lock)->rlock){+.+...}, at: [] 
btrfs_clear_lock_blocking+0x22/0x2b [btrfs]
[ 1998.478275] 
[ 1998.478275] stack backtrace:
[ 1998.478275] Pid: 25718, comm: dd Not tainted 2.6.39+ #3
[ 1998.478275] Call Trace:
[ 1998.478275]  [] __lock_acquire+0xd47/0xdcf
[ 1998.478275]  [] ? sched_clock+0x9/0xd
[ 1998.478275]  [] ? sched_clock_local+0x12/0x75
[ 1998.478275]  [] ? btrfs_clear_lock_blocking+0x22/0x2b 
[btrfs]
[ 1998.478275]  [] ? btrfs_try_spin_lock+0x2a/0x89 [btrfs]
[ 1998.478275]  [] lock_acquire+0xd1/0xfb
[ 1998.478275]  [] ? btrfs_try_spin_lock+0x2a/0x89 [btrfs]
[ 1998.478275]  [] _raw_spin_lock+0x36/0x69
[ 1998.478275]  [] ? btrfs_try_spin_lock+0x2a/0x89 [btrfs]
[ 1998.478275]  [] btrfs_try_spin_lock+0x2a/0x89 [btrfs]
[ 1998.478275]  [] btrfs_search_slot+0x39c/0x4c0 [btrfs]
[ 1998.478275]  [] btrfs_lookup_xattr+0x76/0xd7 [btrfs]
[ 1998.478275]  [] ? btrfs_alloc_path+0x1a/0x1c [btrfs]
[ 1998.478275]  [] ? kmem_cache_alloc+0x57/0xfc
[ 1998.478275]  [] ? btrfs_file_aio_write+0x45/0x49a [btrfs]
[ 1998.478275]  [] __btrfs_getxattr+0x86/0x11c [btrfs]
[ 1998.478275]  [] btrfs_getxattr+0x77/0x82 [btrfs]
[ 1998.478275]  [] cap_inode_need_killpriv+0x2d/0x37
[ 1998.478275]  [] file_remove_suid+0x27/0x64
[ 1998.478275]  [] btrfs_file_aio_write+0x159/0x49a [btrfs]
[ 1998.478275]  [] ? trace_hardirqs_off+0xd/0xf
[ 1998.478275]  [] ? local_clock+0x36/0x4d
[ 1998.478275]  [] ? lock_release_non_nested+0xdb/0x263
[ 1998.478275]  [] do_sync_write+0xcb/0x108
[ 1998.478275]  [] ? might_fault+0x5c/0xac
[ 1998.478275]  [] ? lock_is_held+0x8d/0x98
[ 1998.478275]  [] vfs_write+0xaf/0x102
[ 1998.478275]  [] ? fget_light+0x3a/0xa1
[ 1998.478275]  [] sys_write+0x4d/0x74
[ 1998.478275]  [] system_call_fastpath+0x16/0x1b

[ 2160.937580] INFO: task xfs_io:22734 blocked for more than 120 seconds.
[ 2160.953899] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2160.978494] xfs_io  D  0 22734  21963 0x
[ 2160.996597]  88000ac8dc68 0046 88000ac8dc08 

[ 2161.107976]  001d3ec0 001d3ec0 001d3ec0 
882423a0
[ 2161.117511]  001d3ec0 88000ac8dfd8 001d3ec0 
001d3ec0
[ 2161.127543] Call Trace:
[ 2161.131247]  [] ? do_last+0x1d2/0x59d
[ 2161.136678]  [] ? do_last+0x1d2/0x59d
[ 2161.142181]  [] __mutex_lock_common+0x22b/0x35b
[ 2161.148104]  [] ? do_last+0x1d2/0x59d
[ 2161.153578]  [] mutex_lock_nested+0x3e/0x43
[ 2161.159300]  [] do_last+0x1d2/0x59d
[ 2161.164589]  [] path_openat+0xcb/0x33a
[ 2161.170358]  [] ? sched_clock+0x9/0xd
[ 2161.175941]  [] ? sched_clock_local+0x12/0x75
[ 2161.182033]  [] do_filp_open+0x3d/0x89
[ 2161.187301]  [] ? _raw_spin_unlock+0x2b/0x2f
[ 2161.192937]  [] ? alloc_fd+0x181/0x193
[ 2161.198541]  [] do_sys_open+0x74/0x106
[ 2161.204058]  [] sys_open+0x20/0x22
[ 2161.209488]  [] system_call_fastpath+0x16/0x1b
[ 2161.215279] INFO: lockdep is turned off.
[ 2161.219841] Kernel panic - not syncing: hung_task: blocked tasks
[ 2161.225647] Pid: 42, comm: khungtaskd Not tainted 2.6.39+ #3
[ 2161.231535] Call Trace:
[ 2161.235146]  [] panic+0x91/0x1a9
[ 2161.240266]  [] watchdog+0x1ae/0x219
[ 2161.244863]  [] ? rcu_read_unlock+0x23/0x23
[ 2161.250816]  [] kthread+0xa0/0xa8
[ 2161.255995]  [] ? trace_hardirqs_on_caller+0x13f/0x172
[ 2161.262333]  [] kernel_thread_helper+0x4/0x10
[ 2161.268211]  [] ? retint_restore_args+0x13/0x13
[ 2161.274113]  [] ? __init_kthread_worker+0x5b/0x5b
[ 2161.280135]  [] ? gs_change+0x13/0x13


I am happy to answer any questions.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mis-Design of Btrfs?

2011-07-14 Thread NeilBrown
On Thu, 14 Jul 2011 11:37:41 +0200 Jan Schmidt 
wrote:

> Hi Neil,
> 
> On 14.07.2011 08:38, NeilBrown wrote:
> > I imagine a new field in 'struct bio' which was normally zero but could be
> > some small integer.  It is only meaningful for read.
> > When 0 it means "get this data way you like".
> > When non-zero it means "get this data using method N", where the different
> > methods are up to the device.
> > 
> > For a mirrored RAID, method N means read from device N-1.
> > For stripe/parity RAID, method 1 means "use other data blocks and parity
> > blocks to reconstruct data.
> > 
> > The default for non RAID devices is to return EINVAL for any N > 0.
> > A remapping device (dm-linear, dm-stripe etc) would just pass the number
> > down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
> > some thought.
> > 
> > So if btrfs reads a block and the checksum looks wrong, it reads again with
> > a larger N.  It continues incrementing N and retrying until it gets a block
> > that it likes or it gets EINVAL.  There should probably be an error code
> > (EAGAIN?) which means "I cannot work with that number, but try the next 
> > one".
> 
> I like this idea. It comes pretty close to what btrfs is currently doing
> (what was the reason for this thread being started, wasn't it?), only
> not within struct bio.
> 
> The way you describe the extra parameter is input only. I think it would
> be a nice add on if we knew which "N" was used when "0" passed for the
> request. At least for the RAID1 case, I'd like to see that when I submit
> a bio with method (or whatever we call it) "0", it comes back with the
> method field set to the value that reflects the method the device
> actually used. Obviously, when passing non-zero values, the bio would
> have to come back with that value unmodified.
> 
> That way, we'll get more control on the retry algorithms and are free to
> decide not to use the failed method again, if we like.
> 
> Setting "method" on the return path might be valuable not only for
> RAID1, but I haven't thought that trough.
> 
> -Jan

That sounds like it would be reasonable...

It would be important not to read too much into the number though.  i.e.
don't think of it as "RAID1" but keep a much more abstract idea, else you
could run into trouble.

A (near) future addition to md is keeping track of 'bad blocks' so we can
fail more gracefully as devices start to fail.
So a read request to a RAID1 might not be served by just one device, but
might be served by one device for some parts, and another device for other
parts, so as to avoid one or more 'bad blocks'.

I think I could still provide a reasonably consistent mapping from 'arbitrary
number' to 'set of devices', but it may not be what you expect.  And the
number '1' may well correspond to different devices for different bi_sector
offsets.

i.e. the abstraction must allow the low level implementation substantial
freedom to choosing how to implement each request.

But yes - I don't see why we couldn't report which strategy was used with the
implication that using that same strategy at the same offset with the same
size would be expected to produce the same result.

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mis-Design of Btrfs?

2011-07-14 Thread Jan Schmidt
Hi Neil,

On 14.07.2011 08:38, NeilBrown wrote:
> I imagine a new field in 'struct bio' which was normally zero but could be
> some small integer.  It is only meaningful for read.
> When 0 it means "get this data way you like".
> When non-zero it means "get this data using method N", where the different
> methods are up to the device.
> 
> For a mirrored RAID, method N means read from device N-1.
> For stripe/parity RAID, method 1 means "use other data blocks and parity
> blocks to reconstruct data.
> 
> The default for non RAID devices is to return EINVAL for any N > 0.
> A remapping device (dm-linear, dm-stripe etc) would just pass the number
> down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
> some thought.
> 
> So if btrfs reads a block and the checksum looks wrong, it reads again with
> a larger N.  It continues incrementing N and retrying until it gets a block
> that it likes or it gets EINVAL.  There should probably be an error code
> (EAGAIN?) which means "I cannot work with that number, but try the next one".

I like this idea. It comes pretty close to what btrfs is currently doing
(what was the reason for this thread being started, wasn't it?), only
not within struct bio.

The way you describe the extra parameter is input only. I think it would
be a nice add on if we knew which "N" was used when "0" passed for the
request. At least for the RAID1 case, I'd like to see that when I submit
a bio with method (or whatever we call it) "0", it comes back with the
method field set to the value that reflects the method the device
actually used. Obviously, when passing non-zero values, the bio would
have to come back with that value unmodified.

That way, we'll get more control on the retry algorithms and are free to
decide not to use the failed method again, if we like.

Setting "method" on the return path might be valuable not only for
RAID1, but I haven't thought that trough.

-Jan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Delayed inode operations not doing the right thing with enospc

2011-07-14 Thread Christian Brunner
2011/7/13 Josef Bacik :
> On 07/12/2011 11:20 AM, Christian Brunner wrote:
>> 2011/6/7 Josef Bacik :
>>> On 06/06/2011 09:39 PM, Miao Xie wrote:
 On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
> I got a lot of these when running stress.sh on my test box
>
>
>
> This is because use_block_rsv() is having to do a
> reserve_metadata_bytes(), which shouldn't happen as we should have
> reserved enough space for those operations to complete.  This is
> happening because use_block_rsv() will call get_block_rsv(), which if
> root->ref_cows is set (which is the case on all fs roots) we will use
> trans->block_rsv, which will only have what the current transaction
> starter had reserved.
>
> What needs to be done instead is we need to have a block reserve that
> any reservation that is done at create time for these inodes is migrated
> to this special reserve, and then when you run the delayed inode items
> stuff you set trans->block_rsv to the special block reserve so the
> accounting is all done properly.
>
> This is just off the top of my head, there may be a better way to do it,
> I've not actually looked that the delayed inode code at all.
>
> I would do this myself but I have a ever increasing list of shit to do
> so will somebody pick this up and fix it please?  Thanks,

 Sorry, it's my miss.
 I forgot to set trans->block_rsv to global_block_rsv, since we have 
 migrated
 the space from trans_block_rsv to global_block_rsv.

 I'll fix it soon.

>>>
>>> There is another problem, we're failing xfstest 204.  I tried making
>>> reserve_metadata_bytes commit the transaction regardless of whether or
>>> not there were pinned bytes but the test just hung there.  Usually it
>>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
>>> 204 just creates a crap ton of files, which is what is killing us.
>>> There needs to be a way to start flushing delayed inode items so we can
>>> reclaim the space they are holding onto so we don't get enospc, and it
>>> needs to be better than just committing the transaction because that is
>>> dog slow.  Thanks,
>>>
>>> Josef
>>
>> Is there a solution for this?
>>
>> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
>> (except the pluging). When starting a ceph rebuild on the btrfs
>> volumes I get a lot of warnings from block_rsv_use_bytes in
>> use_block_rsv:
>>
>
> Ok I think I've got this nailed down.  Will you run with this patch and make 
> sure the warnings go away?  Thanks,

I'm sorry, I'm still getting a lot of warnings like the one below.

I've also noticed, that I'm not getting these messages when the
free_space_cache is disabled.

Christian

[  697.398097] [ cut here ]
[  697.398109] WARNING: at fs/btrfs/extent-tree.c:5693
btrfs_alloc_free_block+0x1f8/0x360 [btrfs]()
[  697.398111] Hardware name: ProLiant DL180 G6
[  697.398112] Modules linked in: btrfs zlib_deflate libcrc32c bonding
ipv6 serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
usb_storage [last unloaded: scsi_wait_scan]
[  697.398122] Pid: 6591, comm: btrfs-freespace Tainted: PW
3.0.0-1.fits.1.el6.x86_64 #1
[  697.398124] Call Trace:
[  697.398128]  [] warn_slowpath_common+0x7f/0xc0
[  697.398131]  [] warn_slowpath_null+0x1a/0x20
[  697.398142]  [] btrfs_alloc_free_block+0x1f8/0x360 [btrfs]
[  697.398156]  [] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
[  697.398316]  [] split_leaf+0x142/0x8c0 [btrfs]
[  697.398325]  [] ? generic_bin_search+0x19b/0x210 [btrfs]
[  697.398334]  [] ? btrfs_leaf_free_space+0x8a/0xe0 [btrfs]
[  697.398344]  [] btrfs_search_slot+0x6d3/0x7a0 [btrfs]
[  697.398355]  [] btrfs_csum_file_blocks+0x632/0x830 [btrfs]
[  697.398369]  [] ? clear_extent_bit+0x17a/0x440 [btrfs]
[  697.398382]  [] add_pending_csums+0x49/0x70 [btrfs]
[  697.398395]  [] btrfs_finish_ordered_io+0x22d/0x360 [btrfs]
[  697.398408]  []
btrfs_writepage_end_io_hook+0x4c/0xa0 [btrfs]
[  697.398422]  []
end_bio_extent_writepage+0x13b/0x180 [btrfs]
[  697.398425]  [] ? schedule_timeout+0x17b/0x2e0
[  697.398436]  [] ? end_workqueue_fn+0xe9/0x130 [btrfs]
[  697.398439]  [] bio_endio+0x1d/0x40
[  697.398451]  [] end_workqueue_fn+0xf4/0x130 [btrfs]
[  697.398464]  [] worker_loop+0x13e/0x540 [btrfs]
[  697.398477]  [] ? btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
[  697.398490]  [] ? btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
[  697.398493]  [] kthread+0x96/0xa0
[  697.398496]  [] kernel_thread_helper+0x4/0x10
[  697.398499]  [] ? kthread_worker_fn+0x1a0/0x1a0
[  697.398502]  [] ? gs_change+0x13/0x13
[  697.398503] ---[ end trace 8c77269b0de3f0fb ]---
[  697.432225] [ cut here ]
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org