Re: [PATCH] Btrfs: Don't error on resizing FS to same size

2011-11-18 Thread Mike Fleetwood
On Fri, Nov 18, 2011 at 03:52:00PM +1100, Chris Samuel wrote:
 On 18/11/11 08:04, Mike Fleetwood wrote:
 
  It seems overly harsh to fail a resize of a btrfs file system to the
  same size when a shrink or grow would succeed.  User app GParted trips
  over this error.  Allow it by bypassing the shrink or grow operation.
 
 OK - I'm a newbie with the code (and I'm looking at Linus's current git
 rather than any dev tree of Chris's), but...
 
  Signed-off-by: Mike Fleetwood mike.fleetw...@googlemail.com
  ---
 [...]
  ---
   fs/btrfs/ioctl.c |2 +-
   1 files changed, 1 insertions(+), 1 deletions(-)
  
  diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
  index dae5dfe..00b7024 100644
  --- a/fs/btrfs/ioctl.c
  +++ b/fs/btrfs/ioctl.c
  @@ -1251,7 +1251,7 @@ static noinline int btrfs_ioctl_resize(struct 
  btrfs_root *root,
  }
  ret = btrfs_grow_device(trans, device, new_size);
  btrfs_commit_transaction(trans, root);
  -   } else {
  +   } else if (new_size  old_size) {
 
 shouldn't that be:
 
 + } else if (new_size  old_size) {
 
 otherwise you'll never try and shrink if new_size is  old_size..
 
  ret = btrfs_shrink_device(device, new_size);
  }
   

Chris, you're correct.  I have messed up a 1 line patch by rushing.
Will send corrected patch after some more testing!

Embarrassed,
Mike
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] 254: disable space cache

2011-11-18 Thread Li Zefan
I can't pass 254, and below is the output:

254 3s ... - output mismatch (see 254.out.bad)
...
 ID 256 top level 5 path snap
-ID 257 top level 5 path subvol
+ID 258 top level 5 path subvol

When space cache is enabled (and now mkfs.btrfs always enables it),
there will be some space cache inodes in the root tree, and they
consume some IDs, and that's why subvol has the ID 258 but not 257.

Just disable space cache for this test case.

Signed-off-by: Li Zefan l...@cn.fujitsu.com
---
 254 |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/254 b/254
index 5216120..1cd4832 100755
--- a/254
+++ b/254
@@ -48,7 +48,7 @@ _supported_os Linux
 _require_scratch
 
 _scratch_mkfs  /dev/null 21
-_scratch_mount
+_scratch_mount -o nospace_cache
 
 # First test basic snapshotting
 echo Creating file foo in root dir
-- 
1.7.3.1
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix deadlock on metadata reservation when evicting a inode

2011-11-18 Thread Miao Xie
When I ran the xfstests, I found the test tasks was blocked on meta-data
reservation.

By debugging, I found the reason of this bug:
   start transaction
|
v
   reserve meta-data space
|
v
   flush delay allocation - iput inode - evict inode
^   |
|   v
   wait for delay allocation flush - reserve meta-data space

And besides that, the flush on evicting inode will block the thread, which
is reclaiming the memory, and make oom happen easily.

Fix this bug by skipping the flush step when evicting inode.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/ctree.h   |3 +++
 fs/btrfs/extent-tree.c |   22 ++
 fs/btrfs/inode.c   |2 +-
 3 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b1cb3c0..04631af 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2368,6 +2368,9 @@ int btrfs_block_rsv_check(struct btrfs_root *root,
 int btrfs_block_rsv_refill(struct btrfs_root *root,
  struct btrfs_block_rsv *block_rsv,
  u64 min_reserved);
+int btrfs_block_rsv_refill_noflush(struct btrfs_root *root,
+  struct btrfs_block_rsv *block_rsv,
+  u64 min_reserved);
 int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src_rsv,
struct btrfs_block_rsv *dst_rsv,
u64 num_bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0f47b3e..33f4b0b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3846,9 +3846,9 @@ int btrfs_block_rsv_check(struct btrfs_root *root,
return ret;
 }
 
-int btrfs_block_rsv_refill(struct btrfs_root *root,
- struct btrfs_block_rsv *block_rsv,
- u64 min_reserved)
+static inline int __btrfs_block_rsv_refill(struct btrfs_root *root,
+  struct btrfs_block_rsv *block_rsv,
+  u64 min_reserved, int flush)
 {
u64 num_bytes = 0;
int ret = -ENOSPC;
@@ -3867,7 +3867,7 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
if (!ret)
return 0;
 
-   ret = reserve_metadata_bytes(root, block_rsv, num_bytes, 1);
+   ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
if (!ret) {
block_rsv_add_bytes(block_rsv, num_bytes, 0);
return 0;
@@ -3876,6 +3876,20 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
return ret;
 }
 
+int btrfs_block_rsv_refill(struct btrfs_root *root,
+  struct btrfs_block_rsv *block_rsv,
+  u64 min_reserved)
+{
+   return __btrfs_block_rsv_refill(root, block_rsv, min_reserved, 1);
+}
+
+int btrfs_block_rsv_refill_noflush(struct btrfs_root *root,
+  struct btrfs_block_rsv *block_rsv,
+  u64 min_reserved)
+{
+   return __btrfs_block_rsv_refill(root, block_rsv, min_reserved, 0);
+}
+
 int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src_rsv,
struct btrfs_block_rsv *dst_rsv,
u64 num_bytes)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e16215f..d02 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3490,7 +3490,7 @@ void btrfs_evict_inode(struct inode *inode)
 * doing the truncate.
 */
while (1) {
-   ret = btrfs_block_rsv_refill(root, rsv, min_size);
+   ret = btrfs_block_rsv_refill_noflush(root, rsv, min_size);
 
/*
 * Try and steal from the global reserve since we will
-- 
1.7.1
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Cluster-devel] fallocate vs O_(D)SYNC

2011-11-18 Thread Steven Whitehouse
Hi,

Here is what I'm planning for GFS2:


Add sync of metadata after fallocate for O_SYNC files to ensure that we
meet expectations for everything being on disk in this case.
Unfortunately, the offset and len parameters are modified during the
course of the fallocate function, so I've had to add a couple of new
variables to call generic_write_sync() at the end.

I know that potentially this will sync data as well within the range,
but I think that is a fairly harmless side-effect overall, since we
would not normally expect there to be any dirty data within the range in
question.

Signed-off-by: Steven Whitehouse swhit...@redhat.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Benjamin Marzinski bmarz...@redhat.com

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 6336bc6..9b6c6ac 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -752,6 +752,8 @@ static long gfs2_fallocate(struct file *file, int mode, 
loff_t offset,
loff_t bytes, max_bytes;
struct gfs2_alloc *al;
int error;
+   const loff_t pos = offset;
+   const loff_t count = len;
loff_t bsize_mask = ~((loff_t)sdp-sd_sb.sb_bsize - 1);
loff_t next = (offset + len - 1)  sdp-sd_sb.sb_bsize_shift;
loff_t max_chunk_size = UINT_MAX  bsize_mask;
@@ -834,6 +836,9 @@ retry:
gfs2_quota_unlock(ip);
gfs2_alloc_put(ip);
}
+
+   if (error == 0)
+   error = generic_write_sync(file, pos, count);
goto out_unlock;
 
 out_trans_fail:


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Segmentation Faults

2011-11-18 Thread Timothy Crone
Okay, I installed 3.1.1 and continue to oops when trying to delete a
directory. Please let me know if you would like any additional
information. I'm going to rebuild from a backup later today.
Cheers,Tim

On Wed, Nov 16, 2011 at 6:00 PM, David Sterba d...@jikos.cz wrote:
 Hi,

 On Wed, Nov 16, 2011 at 04:39:21PM +, Tim Crone wrote:
 root@berna:~# uname -a
 Linux berna 2.6.38-bpo.2-amd64 #1 SMP Mon Jun 6 15:24:02 UTC 2011
 x86_64 GNU/Linux

 2.6.38 is quite old from btrfs perspective and it's highly possible that
 the bug you've hit is already fixed. Try newer (like 3.1) kernel and if
 you still hit some sort of crash, please send the report.

 root@berna:/home/tjc# rm -rf .cache/chromium
 Segmentation fault
 root@berna:/home/tjc#
 Message from syslogd@localhost at Nov 16 11:19:35 ...
  kernel:[   66.877568] [ cut here ]

 Message from syslogd@localhost at Nov 16 11:19:35 ...
  kernel:[   66.877572] invalid opcode:  [#1] SMP

 Message from syslogd@localhost at Nov 16 11:19:35 ...
  kernel:[   66.877573] last sysfs file: /sys/devices/virtual/block/md0/uevent

 Message from syslogd@localhost at Nov 16 11:19:35 ...
  kernel:[   66.877633] Stack:

 Message from syslogd@localhost at Nov 16 11:19:35 ...
  kernel:[   66.877640] Call Trace:

 Message from syslogd@localhost at Nov 16 11:19:35 ...
  kernel:[   66.877703] Code: 24 24 48 8b 74 24 28 48 8d 54 24 50 41 b9 01 00
 00 00 48 89 d9 4c 89 e7 e8 80 6e ff ff 83 f8 00 41 89 c5 0f 8c c5 02 00 00 74
 04 0f 0b eb fe 4c 8b 2b 8b 73 40 4c 89 ef e8 a3 e5 ff ff 41 89 c6

 btw such output is printed to every console when the crash occurs, but
 lacks important information like stacktrace and names of functions.
 without that it's very hard to get a clue what happened.

 (I was going to include some log entries, but the lines are longer than
 80 characters and your system will not accept them.)

 As for stacktrace and line length, I personally want to see the lines
 not wrapped at 80 (except the eg. byte dump of instructions), it helps
 readability.


 david




-- 
Timothy J. Crone
Lamont-Doherty Earth Observatory
Columbia University
61 Route 9W
Palisades, NY 10964
Tel: 845-365-8687

http://www.fluidcontinuity.org
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Announcing btrfs-gui

2011-11-18 Thread Phillip Susi

On 6/1/2011 7:20 PM, Hugo Mills wrote:

Over the last few weeks, I've been playing with a foolish idea,
mostly triggered by a cluster of people being confused by btrfs's free
space reporting (df vs btrfs fi df vs btrfs fi show). I also wanted an
excuse, and some code, to mess around in the depths of the FS data
structures.

Like all silly ideas, this one got a bit out of hand, and seems to
have turned into something vaguely useful. I'm therefore pleased to
announce the first major public release of btrfs-gui[1]: a point-and-
click tool for managing btrfs filesystems.


This is a nice little tool.  The one suggestion that I have is that it 
display the actual chunks and where they are located.  It seems that 
right now it uses the same ioctl that btrfs fi df uses, and that just 
gets the total combined size for all chunks of a given type.  It would 
be nice if the gui actually parsed the chunk tree and showed each 
individual chunk with lines showing how they are mapped to the physical 
disks.


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: wait on caching if we're loading the free space cache V2

2011-11-18 Thread Josef Bacik
We've been hitting panics when running xfstest 13 in a loop for long periods of
time.  And actually this problem has always existed so we've been hitting these
things randomly for a while.  Basically what happens is we get a thread coming
into the allocator and reading the space cache off of disk and adding the
entries to the free space cache as we go.  Then we get another thread that comes
in and tries to allocate from that block group.  Since block_group-cached !=
BTRFS_CACHE_NO it goes ahead and tries to do the allocation.  We do this because
if we're doing the old slow way of caching we don't want to hold people up and
wait for everything to finish.  The problem with this is we could end up
discarding the space cache at some arbitrary point in the future, which means we
could very well end up allocating space that is either bad, or when the real
caching happens it could end up thinking the space isn't in use when it really
is and cause all sorts of other problems.

The solution is to add a new flag to indicate we are loading the free space
cache from disk, and always try to cache the block group if cache-cached !=
BTRFS_CACHE_FINISHED.  That way if we are loading the space cache anybody else
who tries to allocate from the block group will have to wait until it's finished
to make sure it completes successfully.  Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
V1-V2:
-fix memory leak
-fix hang because we were missing the cache-cached change when being woken up

 fs/btrfs/ctree.h   |3 +-
 fs/btrfs/extent-tree.c |  119 
 2 files changed, 81 insertions(+), 41 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b1cb3c0..04a5dfc 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -848,7 +848,8 @@ struct btrfs_free_cluster {
 enum btrfs_caching_type {
BTRFS_CACHE_NO  = 0,
BTRFS_CACHE_STARTED = 1,
-   BTRFS_CACHE_FINISHED= 2,
+   BTRFS_CACHE_FAST= 2,
+   BTRFS_CACHE_FINISHED= 3,
 };
 
 enum btrfs_disk_cache_state {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fa4f602..24cfd10 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -467,13 +467,59 @@ static int cache_block_group(struct 
btrfs_block_group_cache *cache,
 struct btrfs_root *root,
 int load_cache_only)
 {
+   DEFINE_WAIT(wait);
struct btrfs_fs_info *fs_info = cache-fs_info;
struct btrfs_caching_control *caching_ctl;
int ret = 0;
 
-   smp_mb();
-   if (cache-cached != BTRFS_CACHE_NO)
+   caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
+   BUG_ON(!caching_ctl);
+
+   INIT_LIST_HEAD(caching_ctl-list);
+   mutex_init(caching_ctl-mutex);
+   init_waitqueue_head(caching_ctl-wait);
+   caching_ctl-block_group = cache;
+   caching_ctl-progress = cache-key.objectid;
+   atomic_set(caching_ctl-count, 1);
+   caching_ctl-work.func = caching_thread;
+
+   spin_lock(cache-lock);
+   /*
+* This should be a rare occasion, but this could happen I think in the
+* case where one thread starts to load the space cache info, and then
+* some other thread starts a transaction commit which tries to do an
+* allocation while the other thread is still loading the space cache
+* info.  The previous loop should have kept us from choosing this block
+* group, but if we've moved to the state where we will wait on caching
+* block groups we need to first check if we're doing a fast load here,
+* so we can wait for it to finish, otherwise we could end up allocating
+* from a block group who's cache gets evicted for one reason or
+* another.
+*/
+   while (cache-cached == BTRFS_CACHE_FAST) {
+   struct btrfs_caching_control *ctl;
+
+   ctl = cache-caching_ctl;
+   atomic_inc(ctl-count);
+   prepare_to_wait(ctl-wait, wait, TASK_UNINTERRUPTIBLE);
+   spin_unlock(cache-lock);
+
+   schedule();
+
+   finish_wait(ctl-wait, wait);
+   put_caching_control(ctl);
+   spin_lock(cache-lock);
+   }
+
+   if (cache-cached != BTRFS_CACHE_NO) {
+   spin_unlock(cache-lock);
+   kfree(caching_ctl);
return 0;
+   }
+   WARN_ON(cache-caching_ctl);
+   cache-caching_ctl = caching_ctl;
+   cache-cached = BTRFS_CACHE_FAST;
+   spin_unlock(cache-lock);
 
/*
 * We can't do the read from on-disk cache during a commit since we need
@@ -484,56 +530,51 @@ static int cache_block_group(struct 
btrfs_block_group_cache *cache,
if (trans  (!trans-transaction-in_commit) 
(root  root != root-fs_info-tree_root) 
btrfs_test_opt(root, SPACE_CACHE)) {
-   spin_lock(cache-lock);
-

[PATCH v2] Btrfs: Don't error on resizing FS to same size

2011-11-18 Thread Mike Fleetwood
It seems overly harsh to fail a resize of a btrfs file system to the
same size when a shrink or grow would succeed.  User app GParted trips
over this error.  Allow it by bypassing the shrink or grow operation.

Signed-off-by: Mike Fleetwood mike.fleetw...@googlemail.com
---

v2: Fix FS shrink prevention error spotted by Chris Samuel

Example failed resize:
# strace -e trace=ioctl btrfs filesystem resize max /mnt/0
Resize '/mnt/0' of 'max'
ioctl(3, 0x50009403, 0xbfa5029c)= -1 EINVAL (Invalid argument)
ERROR: unable to resize '/mnt/0' - Invalid argument
# echo $?
30
# dmesg | tail -1
[426094.235018] new size for /dev/loop1 is 1073741824

---
 fs/btrfs/ioctl.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index dae5dfe..bd32cce 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1251,7 +1251,7 @@ static noinline int btrfs_ioctl_resize(struct btrfs_root 
*root,
}
ret = btrfs_grow_device(trans, device, new_size);
btrfs_commit_transaction(trans, root);
-   } else {
+   } else if (new_size  old_size) {
ret = btrfs_shrink_device(device, new_size);
}
 
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: WARNING: at fs/btrfs/inode.c:2198 btrfs_orphan_commit_root+0xa8/0xc0

2011-11-18 Thread Stefan Kleijkers

Hello Josef,

I've two new dmesg's (ceph osd 0 and 1). Both filesystems wheren't 
responding anymore. Please let me know if you need more information or 
another run. Both are made with the 3.1.1 kernel and your patches 
applied (the patches with the extra warning messages).


Paste:
OSD.0: http://pastebin.com/tPjMk0kH
OSD.1: http://pastebin.com/ef7A16g2

Stefan

On 11/15/2011 08:29 PM, Josef Bacik wrote:

On Tue, Nov 15, 2011 at 08:13:43PM +0100, Stefan Kleijkers wrote:

Hello Josef,

We have patched the 3.1.1 kernel with your patch and after a short
time one of the ceph osds crashed (core dumped) and I found this in
the dmesg, please let me know if that's enough information or if you
need more.


Yeah I was hoping for a WARN_ON() that should have shown up before that, do you
have the entire dmesg?  Thanks,

Josef


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix num_start_workers count if we fail to make an alloc

2011-11-18 Thread Josef Bacik
Al pointed out that if we fail to start a worker for whatever reason (ENOMEM
basically), we could leak our count for num_start_workers, and so we'd think we
had more workers than we actually do.  This could cause us to shrink workers
when we shouldn't or not start workers when we should.  So check the return
value and if we failed fix num_start_workers and fallback.  Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/async-thread.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index 7ec1409..09ef1b0 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -568,6 +568,7 @@ static struct btrfs_worker_thread *find_worker(struct 
btrfs_workers *workers)
struct btrfs_worker_thread *worker;
unsigned long flags;
struct list_head *fallback;
+   int ret;
 
 again:
spin_lock_irqsave(workers-lock, flags);
@@ -584,7 +585,13 @@ again:
workers-num_workers_starting++;
spin_unlock_irqrestore(workers-lock, flags);
/* we're below the limit, start another worker */
-   __btrfs_start_workers(workers, 1);
+   ret = __btrfs_start_workers(workers, 1);
+   if (ret) {
+   spin_lock_irqsave(workers-lock, flags);
+   workers-num_workers_starting--;
+   spin_unlock_irqrestore(workers-lock, flags);
+   goto fallback;
+   }
goto again;
}
}
-- 
1.7.5.2

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: WARNING: at fs/btrfs/inode.c:2198 btrfs_orphan_commit_root+0xa8/0xc0

2011-11-18 Thread Josef Bacik
On Tue, Nov 15, 2011 at 09:19:12PM +0100, Stefan Kleijkers wrote:
 Hello Josef,
 
 You can find the complete dmesg paste on: http://pastebin.com/R4dFfSdQ
 
 But I doubt it will add more information.
 

Sorry I forgot about you :).  Here is a new debug patch, it will print something
out right before it panics so make sure to get that info, and it will also be
spitting stuff out to the trace buffer.  So make sure you have tracing enabled,
usually distros mount debugfs so you can cd /sys/kernel/debug/tracing and cat
trace and you should see something.  When it panics cat trace  somefile and
send me the file and the dmesg so I can figure out what went wrong.  Thanks,

Josef


diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 24cfd10..d887608 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3762,6 +3762,11 @@ static void block_rsv_release_bytes(struct 
btrfs_block_rsv *block_rsv,
spin_lock(block_rsv-lock);
if (num_bytes == (u64)-1)
num_bytes = block_rsv-size;
+   if (unlikely(block_rsv-size  num_bytes)) {
+   printk(KERN_ERR block rsv %p has size %Lu, want to take 
+  away %Lu\n, block_rsv, block_rsv-size, num_bytes);
+   BUG();
+   }
block_rsv-size -= num_bytes;
if (block_rsv-reserved = block_rsv-size) {
num_bytes = block_rsv-reserved - block_rsv-size;
@@ -4064,6 +4069,8 @@ int btrfs_orphan_reserve_metadata(struct 
btrfs_trans_handle *trans,
 * when we are truly done with the orphan item.
 */
u64 num_bytes = btrfs_calc_trans_metadata_size(root, 1);
+   trace_printk(Reserved %Lu bytes for inode %Lu on rsv %p\n,
+num_bytes, btrfs_ino(inode), dst_rsv);
return block_rsv_migrate_bytes(src_rsv, dst_rsv, num_bytes);
 }
 
@@ -4071,6 +4078,8 @@ void btrfs_orphan_release_metadata(struct inode *inode)
 {
struct btrfs_root *root = BTRFS_I(inode)-root;
u64 num_bytes = btrfs_calc_trans_metadata_size(root, 1);
+   trace_printk(Releaseing %Lu bytes for inode %Lu on rsv %p\n,
+num_bytes, btrfs_ino(inode), root-orphan_block_rsv);
btrfs_block_rsv_release(root, root-orphan_block_rsv, num_bytes);
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e16215f..38a4efb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1994,6 +1994,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, 
struct inode *inode)
 
spin_lock(root-orphan_lock);
if (!root-orphan_block_rsv) {
+   trace_printk(Added new orphan rsv %p for root %Lu\n,
+block_rsv, root-objectid);
root-orphan_block_rsv = block_rsv;
} else if (block_rsv) {
btrfs_free_block_rsv(root, block_rsv);
@@ -2002,6 +2004,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, 
struct inode *inode)
 
if (list_empty(BTRFS_I(inode)-i_orphan)) {
list_add(BTRFS_I(inode)-i_orphan, root-orphan_list);
+   trace_printk(Added inode %Lu to orphan list for %Lu\n,
+btrfs_ino(inode), root-objectid);
 #if 0
/*
 * For proper ENOSPC handling, we should do orphan
@@ -2019,6 +2023,9 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, 
struct inode *inode)
if (!BTRFS_I(inode)-orphan_meta_reserved) {
BTRFS_I(inode)-orphan_meta_reserved = 1;
reserve = 1;
+   } else {
+   trace_printk(Didn't do reservation for inode %Lu on %Lu\n,
+btrfs_ino(inode), root-objectid);
}
spin_unlock(root-orphan_lock);
 
@@ -2056,6 +2063,8 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, 
struct inode *inode)
 
spin_lock(root-orphan_lock);
if (!list_empty(BTRFS_I(inode)-i_orphan)) {
+   trace_printk(Removed inode %Lu from orphan list for %Lu\n,
+btrfs_ino(inode), root-objectid);
list_del_init(BTRFS_I(inode)-i_orphan);
delete_item = 1;
}
@@ -2063,6 +2072,9 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, 
struct inode *inode)
if (BTRFS_I(inode)-orphan_meta_reserved) {
BTRFS_I(inode)-orphan_meta_reserved = 0;
release_rsv = 1;
+   } else {
+   trace_printk(Inode %Lu didn't release metadata\n,
+btrfs_ino(inode));
}
spin_unlock(root-orphan_lock);
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix num_start_workers count if we fail to make an alloc

2011-11-18 Thread Al Viro
On Fri, Nov 18, 2011 at 02:38:54PM -0500, Josef Bacik wrote:
 Al pointed out that if we fail to start a worker for whatever reason (ENOMEM
 basically), we could leak our count for num_start_workers, and so we'd think 
 we
 had more workers than we actually do.  This could cause us to shrink workers
 when we shouldn't or not start workers when we should.  So check the return
 value and if we failed fix num_start_workers and fallback.  Thanks,

It's actually uglier than that; consider check_pending_workers_create()
where we
* bump the num_start_workers
* call start_new_worker(), which can fail, and then we have the same
leak; if it doesn't fail, it schedules a call of start_new_worker_func()
* when start_new_worker_func() runs, it does btrfs_start_workers(),
which can run into the same leak again (this time on another pool - one
we have as -async_helper).

Worse, __btrfs_start_workers() does btrfs_stop_workers() on failure.  That,
to put it mildly, is using excessive force.  As far as I can see, it's
_never_ the right thing to do - __btrfs_start_workers() is always getting
1 as the second argument, so even calls from mount path don't need that
kind of kill ones we'd already created if we fail halfway through.  It
used to make sense when they had all been started at mount time, but now
it's useless in the best case (mount) and destructive elsewhere (when
pool had already been non-empty).

So I'd suggest killing that call of btrfs_stop_workers() completely, losing
the num_workers argument (along with a loop in __btrfs_start_workers())
and looking into check_pending_workers_create() path.

Probably I'd put decrement on -num_workers_starting into failure exits of
__btrfs_start_workers() and start_new_worker(), but... can btrfs_queue_worker()
ever return non-zero?  AFAICS it can't and we could just merge
start_new_worker() into check_pending_workers() and pull allocation before
incrementing the -num_workers_starting...
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


WARNING: at fs/btrfs/inode.c:2408 btrfs_orphan_cleanup

2011-11-18 Thread Gregory Farnum
I'm running Ceph OSDs on btrfs and have managed to corrupt several of
them so that on mount I get an error:
root@cephstore6356:~# mount /dev/sde1 /mnt/osd.2/
2011 Nov 18 10:44:52 cephstore6356 [68494.771472] btrfs: could not do
orphan cleanup -116
mount: Stale NFS file handle

Attempting to mount again works, though:
root@cephstore6356:~# mount /dev/sde1 /mnt/osd.2/
root@cephstore6356:~# ls /mnt/osd.2/
async_snap_test  ceph_fsid  current  fsid  keyring  magic
snap_5014103  snap_5027478  snap_5031904  store_version  whoami

However, once I start up the ceph-osd daemon (or do much of anything
else) I get a repeating warning:
[  715.820406] [ cut here ]
[  715.820409] WARNING: at fs/btrfs/inode.c:2408
btrfs_orphan_cleanup+0x1f1/0x2f5()
[  715.820410] Hardware name: PowerEdge R510
[  715.820411] Modules linked in:
[  715.820413] Pid: 13238, comm: ceph-osd Tainted: GW
3.1.0-dho-4-g1ffcb5c-dirty #1
[  715.820414] Call Trace:
[  715.820416]  [8103c645] ? warn_slowpath_common+0x78/0x8c
[  715.820419]  [812372e3] ? btrfs_orphan_cleanup+0x1f1/0x2f5
[  715.820422]  [8123776c] ? btrfs_lookup_dentry+0x385/0x3ee
[  715.820425]  [810e2bd1] ? __d_lookup+0x71/0x108
[  715.820427]  [812377e2] ? btrfs_lookup+0xd/0x43
[  715.820429]  [810db695] ? d_inode_lookup+0x22/0x3c
[  715.820431]  [810dbd92] ? do_lookup+0x1f7/0x2e3
[  715.820434]  [810dc81a] ? link_path_walk+0x1a5/0x709
[  715.820436]  [810b95fb] ? __do_fault+0x40f/0x44d
[  715.820439]  [810ded62] ? path_openat+0xac/0x358
[  715.820441]  [810df0db] ? do_filp_open+0x2c/0x72
[  715.820444]  [810e82c0] ? alloc_fd+0x69/0x10a
[  715.820446]  [810d1cb9] ? do_sys_open+0x103/0x18a
[  715.820449]  [8166c07b] ? system_call_fastpath+0x16/0x1b
[  715.820450] ---[ end trace dd9e40fabcd2d83c ]---

Pretty shortly afterwards the machine goes completely unresponsive and
I need to powercycle it (in fact I believe I managed to go from one to
three dead filesystems doing this). Googe only found me one related
reference (http://comments.gmane.org/gmane.comp.file-systems.btrfs/12501)
that didn't have a solution (and the backtrace was different anyway).
This is running tag 3.1 plus the current btrfs/for-linus branch.

Any advice, solutions, requests for other information, etc are much
appreciated. :)
-Greg
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix num_start_workers count if we fail to make an alloc

2011-11-18 Thread Al Viro
On Fri, Nov 18, 2011 at 08:20:56PM +, Al Viro wrote:
 On Fri, Nov 18, 2011 at 02:38:54PM -0500, Josef Bacik wrote:
  Al pointed out that if we fail to start a worker for whatever reason (ENOMEM
  basically), we could leak our count for num_start_workers, and so we'd 
  think we
  had more workers than we actually do.  This could cause us to shrink workers
  when we shouldn't or not start workers when we should.  So check the return
  value and if we failed fix num_start_workers and fallback.  Thanks,
 
 It's actually uglier than that; consider check_pending_workers_create()
 where we
   * bump the num_start_workers
   * call start_new_worker(), which can fail, and then we have the same
 leak; if it doesn't fail, it schedules a call of start_new_worker_func()
   * when start_new_worker_func() runs, it does btrfs_start_workers(),
 which can run into the same leak again (this time on another pool - one
 we have as -async_helper).

Nuts...  AFAICS, we _always_ leak -num_start_workers here (i.e. when
check_pending_workers_create() finds -atomic_start_pending set).  In
effect, we bump it once in check_pending_workers_create() itself, then
another time (on the same pool) when start_new_worker_func() calls
btrfs_start_workers().  That one will be dropped when we manage to 
start the thread, but the first one won't.

Shouldn't we use __btrfs_start_workers() instead here?
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix num_start_workers count if we fail to make an alloc

2011-11-18 Thread Al Viro
On Sat, Nov 19, 2011 at 01:37:39AM +, Al Viro wrote:
 On Fri, Nov 18, 2011 at 08:20:56PM +, Al Viro wrote:
  On Fri, Nov 18, 2011 at 02:38:54PM -0500, Josef Bacik wrote:
   Al pointed out that if we fail to start a worker for whatever reason 
   (ENOMEM
   basically), we could leak our count for num_start_workers, and so we'd 
   think we
   had more workers than we actually do.  This could cause us to shrink 
   workers
   when we shouldn't or not start workers when we should.  So check the 
   return
   value and if we failed fix num_start_workers and fallback.  Thanks,
  
  It's actually uglier than that; consider check_pending_workers_create()
  where we
  * bump the num_start_workers
  * call start_new_worker(), which can fail, and then we have the same
  leak; if it doesn't fail, it schedules a call of start_new_worker_func()
  * when start_new_worker_func() runs, it does btrfs_start_workers(),
  which can run into the same leak again (this time on another pool - one
  we have as -async_helper).
 
 Nuts...  AFAICS, we _always_ leak -num_start_workers here (i.e. when
 check_pending_workers_create() finds -atomic_start_pending set).  In
 effect, we bump it once in check_pending_workers_create() itself, then
 another time (on the same pool) when start_new_worker_func() calls
 btrfs_start_workers().  That one will be dropped when we manage to 
 start the thread, but the first one won't.
 
 Shouldn't we use __btrfs_start_workers() instead here?

OK, tentative fixes for that stuff are pushed into #btrfs in vfs.git; comments
would be welcome...
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html