Re: warning and bug on 3.2-rc4 + for-linus from yesterday

2011-12-09 Thread Simon Kirby
On Fri, Dec 09, 2011 at 12:39:48PM -0800, Simon Kirby wrote:

> Hello!
> 
> We recently upgraded our backup server kernel (rsync with snapshots and
> compression) to Linus git master from yesterday (3.2-rc4+ 09d9673d53005)
> that contains the btrfs for-linus as of yesterday. We've been seeing a
> few warnings and bugs:

Then it kept pinging but didn't accept SSH anymore, with this captured
via serial console:

[79214.481458] [ cut here ]
[79214.485335] kernel BUG at fs/btrfs/inode.c:2893!
[79214.485335] invalid opcode:  [#2] SMP
[79214.485335] CPU 0
[79214.485335] Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe bnx2
[79214.485335]
[79214.485335] Pid: 24202, comm: btrfsctl Tainted: G  D W3.2.0-rc4-hw+ 
#71 Dell Inc. PowerEdge 1950/0NK937
[79214.485335] RIP: 0010:[]  [] 
btrfs_unlink_subvol+0x268/0x270
[79214.485335] RSP: 0018:880344babd28  EFLAGS: 00010286
[79214.485335] RAX: ffe4 RBX: 0c46 RCX: 880336fd1588
[79214.485335] RDX: ffe4 RSI:  RDI: 880336fd15a8
[79214.485335] RBP: 880344babda8 R08:  R09: 
[79214.485335] R10:  R11: 9001 R12: 880405cf5e88
[79214.485335] R13: 880428a9ba20 R14: 880405158c00 R15: 0100
[79214.485335] FS:  7f27ff13d740() GS:88043fc0() 
knlGS:
[79214.485335] CS:  0010 DS:  ES:  CR0: 8005003b
[79214.485335] CR2: 7fffdf79f950 CR3: 0003f79fe000 CR4: 06f0
[79214.485335] DR0:  DR1:  DR2: 
[79214.485335] DR3:  DR6: 0ff0 DR7: 0400
[79214.485335] Process btrfsctl (pid: 24202, threadinfo 880344baa000, task 
8803dcec)
[79214.485335] Stack:
[79214.485335]  8804037d53f8 88030010 001044babd58 
8804037d53f8
[79214.485335]  08a0 8803fd8b43f0 08a0 
ff84
[79214.485335]  00ff 0268 880037e73008 

[79214.485335] Call Trace:
[79214.485335]  [] btrfs_ioctl_snap_destroy+0x3b5/0x480
[79214.485335]  [] btrfs_ioctl+0x3a2/0x10d0
[79214.485335]  [] ? do_page_fault+0x254/0x4b0
[79214.485335]  [] do_vfs_ioctl+0xa0/0x520
[79214.485335]  [] sys_ioctl+0x4a/0x80
[79214.485335]  [] system_call_fastpath+0x16/0x1b
[79214.485335] Code: 48 8d 54 92 65 e8 89 f2 00 00 48 8b 5d b9 4c 89 ef e8 4d 
2c fd ff 48 89 5d c8 e9 ca fe ff ff 0f 0b eb fe 0f 0b eb fe 0f 1f 40 00 <0f> 0b 
eb fe 0f 0b eb fe 55 48 89 e5 48 83 ec 20 48 89 5d e8 4c
[79214.485335] RIP  [] btrfs_unlink_subvol+0x268/0x270
[79214.485335]  RSP 
[79214.700401] ---[ end trace 52453f1ad38744ba ]---

Simon-
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


warning and bug on 3.2-rc4 + for-linus from yesterday

2011-12-09 Thread Simon Kirby
Hello!

We recently upgraded our backup server kernel (rsync with snapshots and
compression) to Linus git master from yesterday (3.2-rc4+ 09d9673d53005)
that contains the btrfs for-linus as of yesterday. We've been seeing a
few warnings and bugs:

[ cut here ]
WARNING: at mm/page-writeback.c:1763 __set_page_dirty_nobuffers+0x17b/0x190()
Hardware name: PowerEdge 1950
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe bnx2
Pid: 14299, comm: btrfs-delalloc- Tainted: GW3.2.0-rc4-hw+ #71
Call Trace:
 [] ? __set_page_dirty_nobuffers+0x17b/0x190
 [] warn_slowpath_common+0x80/0xc0
 [] warn_slowpath_null+0x15/0x20
 [] __set_page_dirty_nobuffers+0x17b/0x190
 [] compress_file_range+0x535/0x5e0
 [] ? kfree+0xee/0x120
 [] async_cow_start+0x30/0x50
 [] worker_loop+0x173/0x530
 [] ? btrfs_queue_worker+0x310/0x310
 [] ? btrfs_queue_worker+0x310/0x310
 [] kthread+0x96/0xb0
 [] kernel_thread_helper+0x4/0x10
 [] ? kthread_worker_fn+0x190/0x190
 [] ? gs_change+0x13/0x13
---[ end trace 52453f1ad38744b8 ]---

(several hours later)

[ cut here ]
kernel BUG at fs/btrfs/inode.c:1587!
invalid opcode:  [#1] SMP
CPU 2
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe bnx2

Pid: 4477, comm: btrfs-fixup-0 Tainted: GW3.2.0-rc4-hw+ #71 Dell 
Inc. PowerEdge 1950/0NK937
RIP: 0010:[]  [] 
btrfs_writepage_fixup_worker+0x160/0x170
RSP: 0018:88040ff1dde0  EFLAGS: 00010246
RAX:  RBX: 013d6000 RCX: 
RDX: 0065 RSI: 013d6000 RDI: 8800996fe8e0
RBP: 88040ff1de30 R08: 88040ff1dd34 R09: 88040ff1dda0
R10: dead00200200 R11:  R12: ea000ea54840
R13: 013d6fff R14: 8800996fe9b0 R15: 8800996fe850
FS:  () GS:88043fc8() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 02051c80 CR3: 000427492000 CR4: 06e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process btrfs-fixup-0 (pid: 4477, threadinfo 88040ff1c000, task 
8804135b9630)
Stack:
 8106b2c0 880261a9bae0  0286
 8801b10103f0 880261a9bae8 880261a9bb10 880412f606c0
 880412f60710 880412f606d8 88040ff1dee0 813220a3
Call Trace:
 [] ? del_timer+0xd0/0xd0
 [] worker_loop+0x173/0x530
 [] ? btrfs_queue_worker+0x310/0x310
 [] ? btrfs_queue_worker+0x310/0x310
 [] kthread+0x96/0xb0
 [] kernel_thread_helper+0x4/0x10
 [] ? kthread_worker_fn+0x190/0x190
 [] ? gs_change+0x13/0x13
Code: 5d 41 5e 41 5f c9 c3 0f 1f 40 00 48 8d 4d d0 41 b8 50 00 00 00 4c 89 ea 
48 89 de 4c 89 ff e8 f8 c1 01 00 eb ba 66 0f 1f 44 00 00 <0f> 0b eb fe 66 66 66 
2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41
RIP  [] btrfs_writepage_fixup_worker+0x160/0x170
 RSP 
---[ end trace 52453f1ad38744b9 ]---

Simon-


[PATCH] Btrfs: fix leaked space in truncate

2011-12-09 Thread Josef Bacik
We were occasionally leaking space when running xfstest 269.  This is because if
we failed to start the transaction in the truncate loop we'd just goto out, but
we need to break so that the inode is removed from the orphan list and the space
is properly freed.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d3e3ca2..ae5b354a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6581,8 +6581,9 @@ static int btrfs_truncate(struct inode *inode)
/* Just need the 1 for updating the inode */
trans = btrfs_start_transaction(root, 1);
if (IS_ERR(trans)) {
-   err = PTR_ERR(trans);
-   goto out;
+   ret = err = PTR_ERR(trans);
+   trans = NULL;
+   break;
}
}
 
-- 
1.7.5.2
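
The control-flow issue described in the patch above can be sketched in isolation. This is only an illustrative model, not the kernel code: `fake_inode`, `start_transaction` and `truncate_sketch` are hypothetical names, and the "cleanup" stands in for removing the inode from the orphan list and freeing its space.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the inode state the patch cares about. */
struct fake_inode {
    bool on_orphan_list;   /* set while a truncate is in flight */
    int  reserved_bytes;   /* space held for the truncate */
};

/* Hypothetical: fails with -ENOSPC on attempt number fail_at. */
static int start_transaction(int attempt, int fail_at)
{
    return attempt == fail_at ? -ENOSPC : 0;
}

/*
 * Truncate in several passes. A failure inside the loop must "break"
 * so the cleanup after the loop still runs; jumping past the cleanup
 * would leave the inode on the orphan list with its space still held.
 */
static int truncate_sketch(struct fake_inode *inode, int passes, int fail_at)
{
    int err = 0;

    assert(inode != NULL);
    inode->on_orphan_list = true;
    inode->reserved_bytes = 4096;

    for (int attempt = 0; attempt < passes; attempt++) {
        err = start_transaction(attempt, fail_at);
        if (err)
            break;          /* fall through to the cleanup below */
    }

    /* Cleanup that must run on both the success and the error path. */
    inode->on_orphan_list = false;
    inode->reserved_bytes = 0;
    return err;
}
```

With `break`, a failed pass still returns the error to the caller, but the orphan entry and the reservation are dropped first.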



Re: BUG during btrfs device delete missing

2011-12-09 Thread Chris Mason
On Thu, Dec 08, 2011 at 12:27:52PM -0800, David Marcin wrote:
> Hi Chris,
> This was on 3.2-rc2 but I tried with rc4 and it segfaulted again.  I
> think the traces were the same but I've rebooted and can't say for
> sure.
> David
> On Thu, Dec 8, 2011 at 11:45 AM, Chris Mason  wrote:
> > Which kernel is this?  This looks like one I recently fixed.
> >
> > -chris
> >
> > On Thu, Dec 08, 2011 at 11:06:47AM -0800, David Marcin wrote:
> >> raid10 metadata and data filesystem.  dmesg log follows.  The system
> >> is unable to unmount the filesystem after this occurs.
> >>
> >> Filesystem mounted at/mnt/btrfs with -o compress,degraded
> >> Command: btrfs device delete missing /mnt/btrfs
> >>
> >> [  283.398222] [ cut here ]
> >> [  283.398289] kernel BUG at 
> >> /home/apw/COD/linux/fs/btrfs/transaction.c:1329!

So this crash means we failed to write all the blocks required to commit
the transaction.  The reason is that we're getting failed bios to the
missing device, and that failure isn't properly eaten by the
raid aware endio code.

If you pull the top commit from my for-linus branch, it should all work.

I know you've got a big FS here, and I haven't tested this on raid10 yet,
only raid1.  If you want to wait a bit for safety I'll do a raid10 run
too.

-chris
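
The idea of a RAID-aware completion handler "eating" a failure can be sketched as follows. This is a simplified model under stated assumptions, not the btrfs endio code: `multi_write`, `multi_write_init` and `multi_write_end_io` are hypothetical names, and `max_errors` stands for however many failed copies the profile can absorb.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of one logical write fanned out to several devices. */
struct multi_write {
    int pending;     /* per-device writes still in flight */
    int errors;      /* per-device writes that failed */
    int max_errors;  /* failures the RAID profile can absorb */
    int result;      /* final status, valid once pending == 0 */
};

static void multi_write_init(struct multi_write *mw, int stripes, int max_errors)
{
    assert(stripes > 0 && max_errors >= 0);
    mw->pending = stripes;
    mw->errors = 0;
    mw->max_errors = max_errors;
    mw->result = 0;
}

/*
 * Completion callback for one per-device write. An individual failure
 * (for example a write aimed at a missing device) is counted, and it
 * only becomes an error for the caller if more copies failed than the
 * profile tolerates.
 */
static void multi_write_end_io(struct multi_write *mw, int err)
{
    if (err)
        mw->errors++;
    if (--mw->pending == 0)
        mw->result = mw->errors > mw->max_errors ? -EIO : 0;
}
```

In this model a two-copy write with one tolerated failure still completes successfully when one device is missing, which is what the transaction commit needs in degraded mode.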


[PATCH] Btrfs: fix how we do delalloc reservations and how we free reservations on error

2011-12-09 Thread Josef Bacik
Running xfstests 269 with some tracing, my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved.  This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong:
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such.  The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally making it impossible for me to track down the
original issue I was looking for.  The other problem is with our error handling
in the reservation code.  There are two cases that we need to deal with:

1) We raced with free.  In this case free won't free anything because csum_bytes
is modified before we drop the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation.  However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use.  So as it stands the code was
doing this fine and it worked out, except in case #2.

2) We don't race with free.  Nobody comes in and changes anything, and our
reservation fails.  In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything.  So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.

Because we can race with free() (we have to drop our spin_lock to do the
reservation), I'm going to serialize all reservations with the i_mutex.  We
already get this for free in the heavy use paths: truncate and file write both
hold the i_mutex, so I just needed to add it to page_mkwrite and various
ioctl/balance things.  With this patch my space leak scripts no longer scream
bloody murder.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c |   43 ++-
 fs/btrfs/inode-map.c   |2 ++
 fs/btrfs/inode.c   |   10 ++
 fs/btrfs/ioctl.c   |2 ++
 fs/btrfs/relocation.c  |2 ++
 5 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 24cfd10..6dd0406 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4189,10 +4189,15 @@ int btrfs_delalloc_reserve_metadata(struct inode 
*inode, u64 num_bytes)
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_block_rsv *block_rsv = &root->fs_info->delalloc_block_rsv;
u64 to_reserve = 0;
+   u64 csum_bytes;
unsigned nr_extents = 0;
+   int extra_reserve = 0;
int flush = 1;
int ret;
 
+   /* Need to be holding the i_mutex here */
+   WARN_ON(!mutex_is_locked(&inode->i_mutex));
+
if (btrfs_is_free_space_inode(root, inode))
flush = 0;
 
@@ -4205,11 +4210,9 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, 
u64 num_bytes)
BTRFS_I(inode)->outstanding_extents++;
 
if (BTRFS_I(inode)->outstanding_extents >
-   BTRFS_I(inode)->reserved_extents) {
+   BTRFS_I(inode)->reserved_extents)
nr_extents = BTRFS_I(inode)->outstanding_extents -
BTRFS_I(inode)->reserved_extents;
-   BTRFS_I(inode)->reserved_extents += nr_extents;
-   }
 
/*
 * Add an item to reserve for updating the inode when we complete the
@@ -4217,11 +4220,12 @@ int btrfs_delalloc_reserve_metadata(struct inode 
*inode, u64 num_bytes)
 */
if (!BTRFS_I(inode)->delalloc_meta_reserved) {
nr_extents++;
-   BTRFS_I(inode)->delalloc_meta_reserved = 1;
+   extra_reserve = 1;
}
 
to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents);
to_reserve += calc_csum_metadata_size(inode, num_bytes, 1);
+   csum_bytes = BTRFS_I(inode)->csum_bytes;
spin_unlock(&BTRFS_I(inode)->lock);
 
ret = reserve_metadata_bytes(root, block_rsv, to_reserve, flush);
@@ -4231,22 +4235,35 @@ int btrfs_delalloc_reserve_metadata(struct inode 
*inode, u64 num_bytes)
 
spin_lock(&BTRFS_I(inode)->lock);
dropped = drop_outstanding_extent(inode);
-   to_free = calc_csum_metadata_size(inode, num_bytes, 0);
-   spin_unlock(&BTRFS_I(inode)->lock);
-   to_free += btrfs_calc_trans_metadata_size(root, dropped);
-
/*
-* Somebody could have come in and twiddled with the
-* reservation, so if we have to free more than we would have
-* reserved from this reservation go ahead and release those
-* bytes.
+* If the inodes csum_bytes is the same as the original
+* csum_bytes then 
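
The two failure cases described in the commit message can be modelled with a snapshot comparison. This is a hedged sketch of the decision only, not the patch itself: `rsv_inode`, `undo_action` and `undo_failed_reserve` are hypothetical names, and the amount that would actually be released in the raced case is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-inode accounting. */
struct rsv_inode {
    uint64_t csum_bytes;  /* bytes accounted for checksum items */
};

enum undo_action {
    UNDO_ACCOUNTING_ONLY, /* case 2: nothing raced, nothing to free */
    UNDO_AND_RELEASE      /* case 1: a concurrent free deferred to us */
};

/*
 * Called after the reservation failed. "snapshot" is csum_bytes as it
 * was just before the lock was dropped, with num_bytes already added.
 * If the value is unchanged nobody raced with us, so only our own
 * accounting is dropped. If it changed, a free ran in between and
 * skipped releasing space, so the reservation side has to release it.
 */
static enum undo_action undo_failed_reserve(struct rsv_inode *inode,
                                            uint64_t snapshot,
                                            uint64_t num_bytes)
{
    enum undo_action action = inode->csum_bytes == snapshot
        ? UNDO_ACCOUNTING_ONLY : UNDO_AND_RELEASE;

    assert(inode->csum_bytes >= num_bytes);
    inode->csum_bytes -= num_bytes;   /* drop our own accounting */
    return action;
}
```

The comparison is only reliable if no second reservation can change csum_bytes in the meantime, which is the reason given above for serializing reservations under the i_mutex.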

[PATCH 3/3] Btrfs: read device stats on mount, write modified ones during commit

2011-12-09 Thread Stefan Behrens
The device statistics are written into the device tree with each
transaction commit. Only modified statistics are written.
When a filesystem is mounted, the device statistics for each involved
device are read from the device tree and used to initialize the
counters.

Signed-off-by: Stefan Behrens 
---
 fs/btrfs/ctree.h   |   51 
 fs/btrfs/disk-io.c |7 ++
 fs/btrfs/print-tree.c  |3 +
 fs/btrfs/transaction.c |4 +
 fs/btrfs/volumes.c |  205 
 fs/btrfs/volumes.h |9 ++
 6 files changed, 279 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 89fab53..f5e2429 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -750,6 +750,26 @@ struct btrfs_csum_item {
u8 csum;
 } __attribute__ ((__packed__));
 
+struct btrfs_device_stats_item {
+   /*
+* grow this item struct at the end for future enhancements and keep
+* the existing values unchanged
+*/
+   __le64 cnt_write_io_errs; /* EIO or EREMOTEIO from lower layers */
+   __le64 cnt_read_io_errs; /* EIO or EREMOTEIO from lower layers */
+   __le64 cnt_flush_io_errs; /* EIO or EREMOTEIO from lower layers */
+
+   /* stats for indirect indications for I/O failures */
+   __le64 cnt_corruption_errs; /* checksum error, bytenr error or
+* contents is illegal: this is an
+* indication that the block was damaged
+* during read or write, or written to
+* wrong location or read from wrong
+* location */
+   __le64 cnt_generation_errs; /* an indication that blocks have not
+* been written */
+} __attribute__ ((__packed__));
+
 /* different types of block groups (and chunks) */
 #define BTRFS_BLOCK_GROUP_DATA (1 << 0)
 #define BTRFS_BLOCK_GROUP_SYSTEM   (1 << 1)
@@ -1388,6 +1408,12 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_CHUNK_ITEM_KEY   228
 
 /*
+ * Persistantly stores the io stats in the device tree.
+ * One key for all stats, (0, BTRFS_DEVICE_STATS_KEY, devid).
+ */
+#define BTRFS_DEVICE_STATS_KEY 248
+
+/*
  * string items are for debugging.  They just store a short string of
  * data in the FS
  */
@@ -2202,6 +2228,31 @@ static inline u32 
btrfs_file_extent_inline_item_len(struct extent_buffer *eb,
return btrfs_item_size(eb, e) - offset;
 }
 
+/* btrfs_device_stats_item */
+BTRFS_SETGET_FUNCS(device_stats_cnt_write_io_errs,
+  struct btrfs_device_stats_item, cnt_write_io_errs, 64);
+BTRFS_SETGET_FUNCS(device_stats_cnt_read_io_errs,
+  struct btrfs_device_stats_item, cnt_read_io_errs, 64);
+BTRFS_SETGET_FUNCS(device_stats_cnt_flush_io_errs,
+  struct btrfs_device_stats_item, cnt_flush_io_errs, 64);
+BTRFS_SETGET_FUNCS(device_stats_cnt_corruption_errs,
+  struct btrfs_device_stats_item, cnt_corruption_errs, 64);
+BTRFS_SETGET_FUNCS(device_stats_cnt_generation_errs,
+  struct btrfs_device_stats_item, cnt_generation_errs, 64);
+
+BTRFS_SETGET_STACK_FUNCS(stack_device_stats_cnt_write_io_errs,
+struct btrfs_device_stats_item, cnt_write_io_errs, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_stats_cnt_read_io_errs,
+struct btrfs_device_stats_item, cnt_read_io_errs, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_stats_cnt_flush_io_errs,
+struct btrfs_device_stats_item, cnt_flush_io_errs, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_stats_cnt_corruption_errs,
+struct btrfs_device_stats_item, cnt_corruption_errs,
+64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_stats_cnt_generation_errs,
+struct btrfs_device_stats_item, cnt_generation_errs,
+64);
+
 static inline struct btrfs_root *btrfs_sb(struct super_block *sb)
 {
return sb->s_fs_info;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b0f2a37..cac8f51 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2321,6 +2321,13 @@ retry_root_backup:
fs_info->metadata_alloc_profile = (u64)-1;
fs_info->system_alloc_profile = fs_info->metadata_alloc_profile;
 
+   ret = btrfs_init_device_stats(fs_info);
+   if (ret) {
+   printk(KERN_ERR "btrfs: failed to init device_stats: %d\n",
+  ret);
+   goto fail_block_groups;
+   }
+
ret = btrfs_init_space_info(fs_info);
if (ret) {
printk(KERN_ERR "Failed to initial space info: %d\n", ret);
diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index f38e452..a9e45e4 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -294,6 +294,9 @@ void btrfs_print_leaf(struct btrfs_root *root, struct 
extent_buffe
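
The item above stores five `__le64` counters in a packed struct. A userspace sketch of what that on-disk layout implies, assuming the field order shown in the patch (write, read, flush, corruption, generation), might look like this. The helper names (`put_le64`, `get_le64`, `stats_to_disk`, `stats_from_disk`) are hypothetical and are not the kernel's accessors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_STATS 5  /* write, read, flush, corruption, generation */

/* Store a 64-bit value as 8 little-endian bytes, independent of host order. */
static void put_le64(uint8_t *dst, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t)(v >> (8 * i));
}

/* Read 8 little-endian bytes back into a host-order 64-bit value. */
static uint64_t get_le64(const uint8_t *src)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v |= (uint64_t)src[i] << (8 * i);
    return v;
}

/* Serialize the counters into an item buffer; returns bytes written. */
static size_t stats_to_disk(uint8_t *item, const uint64_t stats[NR_STATS])
{
    assert(item != NULL && stats != NULL);
    for (int i = 0; i < NR_STATS; i++)
        put_le64(item + 8 * i, stats[i]);
    return 8 * NR_STATS;
}

/* Initialize in-memory counters from an item buffer, as done at mount. */
static void stats_from_disk(uint64_t stats[NR_STATS], const uint8_t *item)
{
    for (int i = 0; i < NR_STATS; i++)
        stats[i] = get_le64(item + 8 * i);
}
```

Because new counters are meant to be appended at the end of the struct, a reader that only knows the first five fields can still decode an item written by a newer version.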

[PATCH 1/3] Btrfs: add device counters for detected IO and checksum errors

2011-12-09 Thread Stefan Behrens
The goal is to detect when drives start to show an increased error rate
and should be replaced soon. Therefore statistics counters are
added that count IO errors (read, write and flush). Additionally,
software-detected errors such as checksum errors and corrupted blocks are
counted.

Signed-off-by: Stefan Behrens 
---
 fs/btrfs/disk-io.c   |   18 +++---
 fs/btrfs/extent_io.c |   27 -
 fs/btrfs/scrub.c |   52 +++---
 fs/btrfs/volumes.c   |   61 +++--
 fs/btrfs/volumes.h   |   21 +
 5 files changed, 161 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 89094ee..b0f2a37 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2511,18 +2511,24 @@ recovery_tree_root:
 
 static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 {
-   char b[BDEVNAME_SIZE];
-
if (uptodate) {
set_buffer_uptodate(bh);
} else {
+   struct btrfs_device *device = (struct btrfs_device *)
+   (((uintptr_t) bh->b_private) & ~((uintptr_t) 1));
+   unsigned int with_flush = ((uintptr_t) bh->b_private) & 1;
+
printk_ratelimited(KERN_WARNING "lost page write due to "
-   "I/O error on %s\n",
-  bdevname(bh->b_bdev, b));
+  "I/O error on %s\n", device->name);
/* note, we dont' set_buffer_write_io_error because we have
 * our own ways of dealing with the IO errors
 */
clear_buffer_uptodate(bh);
+   btrfs_device_stat_inc(&device->cnt_write_io_errs);
+   if (with_flush)
+   btrfs_device_stat_inc(&device->cnt_flush_io_errs);
+   device->device_stats_dirty = 1;
+   btrfs_device_stat_print_on_error(device);
}
unlock_buffer(bh);
put_bh(bh);
@@ -2637,6 +2643,7 @@ static int write_dev_supers(struct btrfs_device *device,
set_buffer_uptodate(bh);
lock_buffer(bh);
bh->b_end_io = btrfs_end_buffer_write_sync;
+   bh->b_private = device;
}
 
/*
@@ -2695,6 +2702,9 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
}
if (!bio_flagged(bio, BIO_UPTODATE)) {
ret = -EIO;
+   btrfs_device_stat_inc(&device->cnt_flush_io_errs);
+   device->device_stats_dirty = 1;
+   btrfs_device_stat_print_on_error(device);
}
 
/* drop the reference from the wait == 0 run */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7609d28..566d262 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1894,6 +1894,9 @@ int repair_io_failure(struct btrfs_mapping_tree 
*map_tree, u64 start,
if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) {
/* try to remap that extent elsewhere? */
bio_put(bio);
+   btrfs_device_stat_inc(&dev->cnt_write_io_errs);
+   dev->device_stats_dirty = 1;
+   btrfs_device_stat_print_on_error(dev);
return -EIO;
}
 
@@ -2280,10 +2283,30 @@ static void end_bio_extent_readpage(struct bio *bio, 
int err)
if (uptodate && tree->ops && tree->ops->readpage_end_io_hook) {
ret = tree->ops->readpage_end_io_hook(page, start, end,
  state);
-   if (ret)
+   if (ret) {
+   /* no IO indicated but software detected errors
+* in the block, either checksum errros or
+* issues with the contents */
+   int failed_mirror = (int)(uintptr_t)
+   bio->bi_bdev;
+   struct btrfs_root *root =
+   BTRFS_I(page->mapping->host)->root;
+   struct btrfs_device *device;
+
uptodate = 0;
-   else
+   device = btrfs_find_device_for_logical(
+   root, start,
+   (int)failed_mirror);
+   if (device) {
+   btrfs_device_stat_inc(
+   &device->cnt_corruption_errs);
+   device->device_stats_dirty = 1;
+   btrfs_device_stat_pri
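
The `btrfs_end_buffer_write_sync` hunk above recovers both a device pointer and a "with flush" flag from `bh->b_private` by masking bit 0. A standalone sketch of that pointer-tagging technique, with hypothetical names (`device`, `tag_ptr`, `untag_ptr`, `ptr_flag`, `write_failed`) and assuming the pointed-to object is at least 2-byte aligned:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a device with two error counters. */
struct device {
    const char *name;
    unsigned long write_errs;
    unsigned long flush_errs;
};

/* Pack a one-bit flag into bit 0 of a pointer; needs alignment >= 2. */
static void *tag_ptr(struct device *dev, unsigned int with_flush)
{
    uintptr_t p = (uintptr_t)dev;

    assert((p & 1) == 0);
    return (void *)(p | (with_flush & 1));
}

/* Strip the flag bit to get the original pointer back. */
static struct device *untag_ptr(void *priv)
{
    return (struct device *)((uintptr_t)priv & ~(uintptr_t)1);
}

/* Extract the flag bit. */
static unsigned int ptr_flag(void *priv)
{
    return (unsigned int)((uintptr_t)priv & 1);
}

/*
 * Error path of a write completion: always count the write error, and
 * count a flush error too when the tag says the write carried a flush.
 */
static void write_failed(void *priv)
{
    struct device *dev = untag_ptr(priv);

    dev->write_errs++;
    if (ptr_flag(priv))
        dev->flush_errs++;
}
```

The trick avoids a separate allocation for the completion context, at the cost of every reader having to mask the pointer before dereferencing it.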

[PATCH 2/3] Btrfs: add ioctl to get and reset the device stats

2011-12-09 Thread Stefan Behrens
An ioctl interface is added to get the device statistic counters.
A second ioctl is added to atomically get and reset these counters.

Signed-off-by: Stefan Behrens 
---
 fs/btrfs/ioctl.c   |   26 +++
 fs/btrfs/ioctl.h   |   27 
 fs/btrfs/volumes.c |   69 
 fs/btrfs/volumes.h |   13 ++
 4 files changed, 135 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 72d4616..bce3f92 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2891,6 +2891,28 @@ static long btrfs_ioctl_scrub_progress(struct btrfs_root 
*root,
return ret;
 }
 
+static long btrfs_ioctl_get_device_stats(struct btrfs_root *root,
+void __user *arg, int reset_after_read)
+{
+   struct btrfs_ioctl_get_device_stats *sa;
+   int ret;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
+   sa = memdup_user(arg, sizeof(*sa));
+   if (IS_ERR(sa))
+   return PTR_ERR(sa);
+
+   ret = btrfs_get_device_stats(root, sa, reset_after_read);
+
+   if (copy_to_user(arg, sa, sizeof(*sa)))
+   ret = -EFAULT;
+
+   kfree(sa);
+   return ret;
+}
+
 static long btrfs_ioctl_ino_to_path(struct btrfs_root *root, void __user *arg)
 {
int ret = 0;
@@ -3108,6 +3130,10 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_ioctl_scrub_cancel(root, argp);
case BTRFS_IOC_SCRUB_PROGRESS:
return btrfs_ioctl_scrub_progress(root, argp);
+   case BTRFS_IOC_GET_DEVICE_STATS:
+   return btrfs_ioctl_get_device_stats(root, argp, 0);
+   case BTRFS_IOC_GET_AND_RESET_DEVICE_STATS:
+   return btrfs_ioctl_get_device_stats(root, argp, 1);
}
 
return -ENOTTY;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 252ae99..b9ffd0b 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -217,6 +217,29 @@ struct btrfs_ioctl_logical_ino_args {
__u64   inodes;
 };
 
+#define BTRFS_IOCTL_GET_DEVICE_STATS_MAX_NR_ITEMS  5
+struct btrfs_ioctl_get_device_stats {
+   __u64 devid;/* in */
+   __u64 nr_items; /* in/out */
+
+   /* out values: */
+
+   /* disk I/O failure stats */
+   __u64 cnt_write_io_errs; /* EIO or EREMOTEIO from lower layers */
+   __u64 cnt_read_io_errs; /* EIO or EREMOTEIO from lower layers */
+   __u64 cnt_flush_io_errs; /* EIO or EREMOTEIO from lower layers */
+
+   /* stats for indirect indications for I/O failures */
+   __u64 cnt_corruption_errs; /* checksum error, bytenr error or
+   * contents is illegal: this is an
+   * indication that the block was damaged
+   * during read or write, or written to
+   * wrong location or read from wrong
+   * location */
+   __u64 cnt_generation_errs; /* an indication that blocks have not
+   * been written */
+};
+
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -276,5 +299,9 @@ struct btrfs_ioctl_logical_ino_args {
struct btrfs_ioctl_ino_path_args)
 #define BTRFS_IOC_LOGICAL_INO _IOWR(BTRFS_IOCTL_MAGIC, 36, \
struct btrfs_ioctl_ino_path_args)
+#define BTRFS_IOC_GET_DEVICE_STATS _IOWR(BTRFS_IOCTL_MAGIC, 52, \
+struct btrfs_ioctl_get_device_stats)
+#define BTRFS_IOC_GET_AND_RESET_DEVICE_STATS _IOWR(BTRFS_IOCTL_MAGIC, 53, \
+struct btrfs_ioctl_get_device_stats)
 
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index cc21e14..99dfd00 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3852,3 +3852,72 @@ void btrfs_device_stat_print_on_error(struct 
btrfs_device *device)
   btrfs_device_stat_read(
&device->cnt_generation_errs));
 }
+
+int btrfs_get_device_stats(struct btrfs_root *root,
+  struct btrfs_ioctl_get_device_stats *stats,
+  int reset_after_read)
+{
+   struct btrfs_device *dev;
+   struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
+
+   mutex_lock(&fs_devices->device_list_mutex);
+   dev = btrfs_find_device(root, stats->devid, NULL, NULL);
+   mutex_unlock(&fs_devices->device_list_mutex);
+
+   if (!dev) {
+   printk(KERN_WARNING
+  "btrfs: get device_stats failed, device not found\n");
+   return -ENODEV;
+   } else if (reset_after_read) {
+   if (
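
The patch text is cut off before it shows how the counters are reset, so the following is only a sketch of what "atomically get and reset" can mean, using C11 atomics in userspace as a stand-in for whatever primitives the kernel code uses. `dev_stat` and the three helpers are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter type; one of these per statistic. */
struct dev_stat {
    atomic_uint_fast64_t cnt;
};

static void dev_stat_inc(struct dev_stat *s)
{
    assert(s != NULL);
    atomic_fetch_add(&s->cnt, 1);
}

static uint64_t dev_stat_read(struct dev_stat *s)
{
    return atomic_load(&s->cnt);
}

/*
 * Read and zero the counter in one atomic exchange, so that no
 * increment can land between the read and the reset and be lost.
 */
static uint64_t dev_stat_read_and_reset(struct dev_stat *s)
{
    return atomic_exchange(&s->cnt, 0);
}
```

A separate read followed by a store of zero would drop any error counted in between, which is the case the second ioctl is meant to avoid.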

[PATCH 0/3] Btrfs: add IO error device stats

2011-12-09 Thread Stefan Behrens
The goal is to detect when drives start to show an increased error rate
and should be replaced soon. Therefore statistics counters are
added that count IO errors (read, write and flush). Additionally,
software-detected errors such as checksum errors and corrupted blocks are
counted.

An ioctl interface is added to get the device statistic counters.
A second ioctl is added to atomically get and reset these counters.

The device statistics are written into the device tree with each
transaction commit. Only modified statistics are written.
When a filesystem is mounted, the device statistics for each involved
device are read from the device tree and used to initialize the
counters.

A patch for the btrfs-progs world will also be sent.

The patches are based on v3.1-161-gf4a8e65 (btrfs pull request from
12/1/2011).

Stefan Behrens (3):
  Btrfs: add device counters for detected IO and checksum errors
  Btrfs: add ioctl to get and reset the device stats
  Btrfs: read device stats on mount, write modified ones during commit

 fs/btrfs/ctree.h   |   51 
 fs/btrfs/disk-io.c |   25 +++-
 fs/btrfs/extent_io.c   |   27 -
 fs/btrfs/ioctl.c   |   26 
 fs/btrfs/ioctl.h   |   27 
 fs/btrfs/print-tree.c  |3 +
 fs/btrfs/scrub.c   |   52 ++--
 fs/btrfs/transaction.c |4 +
 fs/btrfs/volumes.c |  335 +++-
 fs/btrfs/volumes.h |   43 ++
 10 files changed, 575 insertions(+), 18 deletions(-)

-- 
1.7.3.4



[PATCH] Btrfs-progs: add command to get/reset device stats via ioctl

2011-12-09 Thread Stefan Behrens
"btrfs device stats" is used to retrieve and print the device stats.
"btrfs device stats -z" is used atomically retrieve, reset and print
the stats.

Signed-off-by: Stefan Behrens 
---
 Makefile |4 +-
 btrfs.c  |5 ++
 btrfs_cmds.c |   67 +
 btrfs_cmds.h |5 ++
 ctree.h  |6 +++
 devstats.c   |  131 ++
 ioctl.h  |   28 
 print-tree.c |7 +++
 scrub.c  |   74 +---
 9 files changed, 254 insertions(+), 73 deletions(-)

diff --git a/Makefile b/Makefile
index eeb92ad..c7ad82b 100644
--- a/Makefile
+++ b/Makefile
@@ -36,8 +36,8 @@ all: version $(progs) manpages
 version:
bash version.sh
 
-btrfs: $(objects) btrfs.o btrfs_cmds.o scrub.o
-   $(CC) $(CFLAGS) -o btrfs btrfs.o btrfs_cmds.o scrub.o \
+btrfs: $(objects) btrfs.o btrfs_cmds.o scrub.o devstats.o
+   $(CC) $(CFLAGS) -o btrfs btrfs.o btrfs_cmds.o scrub.o devstats.o \
$(objects) $(LDFLAGS) $(LIBS) -lpthread
 
 calc-size: $(objects) calc-size.o
diff --git a/btrfs.c b/btrfs.c
index 1def354..078729a 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -159,6 +159,11 @@ static struct Command commands[] = {
"filesystem.",
  NULL
},
+   { do_device_stats, -1,
+ "device stats", "[-z] |\n"
+   "Show current device IO stats. -z to reset stats afterwards.",
+ NULL
+   },
{ do_add_volume, -2,
  "device add", " [...] \n"
"Add a device to a filesystem.",
diff --git a/btrfs_cmds.c b/btrfs_cmds.c
index b59e9cb..065e103 100644
--- a/btrfs_cmds.c
+++ b/btrfs_cmds.c
@@ -117,6 +117,73 @@ int open_file_or_dir(const char *fname)
return fd;
 }
 
+int get_device_info(int fd, u64 devid,
+   struct btrfs_ioctl_dev_info_args *di_args)
+{
+   int ret;
+
+   di_args->devid = devid;
+   memset(&di_args->uuid, '\0', sizeof(di_args->uuid));
+
+   ret = ioctl(fd, BTRFS_IOC_DEV_INFO, di_args);
+   return ret ? -errno : 0;
+}
+
+int get_fs_info(int fd, char *path, struct btrfs_ioctl_fs_info_args *fi_args,
+   struct btrfs_ioctl_dev_info_args **di_ret)
+{
+   int ret = 0;
+   int ndevs = 0;
+   int i = 1;
+   struct btrfs_fs_devices *fs_devices_mnt = NULL;
+   struct btrfs_ioctl_dev_info_args *di_args;
+   char mp[BTRFS_PATH_NAME_MAX + 1];
+
+   memset(fi_args, 0, sizeof(*fi_args));
+
+   ret = ioctl(fd, BTRFS_IOC_FS_INFO, fi_args);
+   if (ret && (errno == EINVAL || errno == ENOTTY)) {
+   /* path is not a mounted btrfs. Try if it's a device */
+   ret = check_mounted_where(fd, path, mp, sizeof(mp),
+ &fs_devices_mnt);
+   if (!ret)
+   return -EINVAL;
+   if (ret < 0)
+   return ret;
+   fi_args->num_devices = 1;
+   fi_args->max_id = fs_devices_mnt->latest_devid;
+   i = fs_devices_mnt->latest_devid;
+   memcpy(fi_args->fsid, fs_devices_mnt->fsid, BTRFS_FSID_SIZE);
+   close(fd);
+   fd = open_file_or_dir(mp);
+   if (fd < 0)
+   return -errno;
+   } else if (ret) {
+   return -errno;
+   }
+
+   if (!fi_args->num_devices)
+   return 0;
+
+   di_args = *di_ret = malloc(fi_args->num_devices * sizeof(*di_args));
+   if (!di_args)
+   return -errno;
+
+   for (; i <= fi_args->max_id; ++i) {
+   BUG_ON(ndevs >= fi_args->num_devices);
+   ret = get_device_info(fd, i, &di_args[ndevs]);
+   if (ret == -ENODEV)
+   continue;
+   if (ret)
+   return ret;
+   ndevs++;
+   }
+
+   BUG_ON(ndevs == 0);
+
+   return 0;
+}
+
 static u64 parse_size(char *s)
 {
int len = strlen(s);
diff --git a/btrfs_cmds.h b/btrfs_cmds.h
index 81182b1..6be9cc5 100644
--- a/btrfs_cmds.h
+++ b/btrfs_cmds.h
@@ -41,4 +41,9 @@ int do_change_label(int argc, char **argv);
 int open_file_or_dir(const char *fname);
 int do_ino_to_path(int nargs, char **argv);
 int do_logical_to_ino(int nargs, char **argv);
+int do_device_stats(int nargs, char **argv);
+int get_device_info(int fd, u64 devid,
+   struct btrfs_ioctl_dev_info_args *di_args);
+int get_fs_info(int fd, char *path, struct btrfs_ioctl_fs_info_args *fi_args,
+   struct btrfs_ioctl_dev_info_args **di_ret);
 char *path_for_root(int fd, u64 root);
diff --git a/ctree.h b/ctree.h
index 54748c8..12a0603 100644
--- a/ctree.h
+++ b/ctree.h
@@ -912,6 +912,12 @@ struct btrfs_root {
 #define BTRFS_CHUNK_ITEM_KEY   228
 
 /*
+ * Persistantly stores the io stats in the device tree.
+ * One key for all stats, (0, BTRFS_DEVICE_STATS_KEY, devid).
+ */
+#define BTRFS_DEVICE_STATS_KEY 248
+
+/*
  *

Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-12-09 Thread Christian Brunner
2011/12/7 Christian Brunner :
> 2011/12/1 Christian Brunner :
>> 2011/12/1 Alexandre Oliva :
>>> On Nov 29, 2011, Christian Brunner  wrote:
>>>
>>>> When I'm doing heavy reading in our ceph cluster, the load and wait-io
>>>> on the patched servers is higher than on the unpatched ones.
>>>
>>> That's unexpected.
>
> In the mean time I know, that it's not related to the reads.
>
>>> I suppose I could wave my hands while explaining that you're getting
>>> higher data throughput, so it's natural that it would take up more
>>> resources, but that explanation doesn't satisfy me.  I suppose
>>> allocation might have got slightly more CPU intensive in some cases, as
>>> we now use bitmaps where before we'd only use the cheaper-to-allocate
>>> extents.  But that's unsafisfying as well.
>>
>> I must admit, that I do not completely understand the difference
>> between bitmaps and extents.
>>
>> From what I see on my servers, I can tell, that the degradation over
>> time is gone. (Rebooting the servers every day is no longer needed.
>> This is a real plus.) But the performance compared to a freshly
>> booted, unpatched server is much slower with my ceph workload.
>>
>> I wonder if it would make sense to initialize the list field only,
>> when the cluster setup fails? This would avoid the fallback to the
>> much unclustered allocation and would give us the cheaper-to-allocate
>> extents.
>
> I've now tried various combinations of you patches and I can really
> nail it down to this one line.
>
> With this patch applied I get much higher write-io values than without
> it. Some of the other patches help to reduce the effect, but it's
> still significant.
>
> iostat on an unpatched node is giving me:
>
> Device:  rrqm/s  wrqm/s    r/s    w/s   rsec/s  wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda      105.90    0.37  15.42  14.48  2657.33  560.13    107.61      1.89   62.75   6.26  18.71
>
> while on a node with this patch it's
>
> sda      128.20    0.97  11.10  57.15  3376.80  552.80     57.58     20.58  296.33   4.16  28.36
>
>
> Also interesting, is the fact that the average request size on the
> patched node is much smaller.
>
> Josef was telling me, that this could be related to the number of
> bitmaps we write out, but I've no idea how to trace this.
>
> I would be very happy if someone could give me a hint on what to do
> next, as this is one of the last remaining issues with our ceph
> cluster.

This is still bugging me and I just remembered something that might be
helpful. I also hope that this is not misleading...

Back in 2.6.38 we were running ceph without btrfs performance
degradation. I found a thread on the list where similar problems were
reported:

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg10346.html

In that thread someone bisected the issue to

From 4e69b598f6cfb0940b75abf7e179d6020e94ad1e Mon Sep 17 00:00:00 2001
From: Josef Bacik 
Date: Mon, 21 Mar 2011 10:11:24 -0400
Subject: [PATCH] Btrfs: cleanup how we setup free space clusters

In this commit the bitmaps handling was changed. So I just thought
that this may be related.

I'm still hoping, that someone with a deeper understanding of btrfs
could take a look at this.

Thanks,
Christian
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
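
A worked check of the iostat figures quoted in the message above, offered as a sketch rather than as part of the original post. It assumes that avgrq-sz is the number of sectors transferred per request issued, i.e. (rsec/s + wsec/s) / (r/s + w/s); the struct and function names are hypothetical. Under that assumption the two quoted rows give roughly 107.6 and 57.6 sectors, which matches the avgrq-sz column. The second helper looks at writes only.

```c
/* One row of the quoted iostat output; only the fields used here. */
struct iostat_row {
	double r_per_s;    /* r/s: read requests per second */
	double w_per_s;    /* w/s: write requests per second */
	double rsec_per_s; /* rsec/s: sectors read per second */
	double wsec_per_s; /* wsec/s: sectors written per second */
};

/* Average request size in sectors, reads and writes combined. */
double avg_request_sectors(const struct iostat_row *row)
{
	double requests = row->r_per_s + row->w_per_s;

	if (requests <= 0.0)
		return 0.0; /* idle device: no requests to average over */
	return (row->rsec_per_s + row->wsec_per_s) / requests;
}

/* Average write size in sectors (write requests only). */
double avg_write_sectors(const struct iostat_row *row)
{
	if (row->w_per_s <= 0.0)
		return 0.0;
	return row->wsec_per_s / row->w_per_s;
}
```

With the quoted numbers, the write-only figure drops from about 38.7 sectors per write on the unpatched node to about 9.7 on the patched one: nearly the same write volume is being issued as roughly four times as many, much smaller, requests.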


Re: Btrfs switches to using mostly one thread

2011-12-09 Thread Jeremy Sanders

On 09/12/11 14:18, Chris Mason wrote:


> According to this you've only got one delalloc worker.  That would
> explain it.  Could you please confirm with ps?


Yes - only one delalloc worker is now present, but there were at least 
three initially.


Jeremy




Re: Btrfs switches to using mostly one thread

2011-12-09 Thread Chris Mason
On Fri, Dec 09, 2011 at 12:05:40PM +0000, Jeremy Sanders wrote:
> On 08/12/11 20:11, Chris Mason wrote:
> >On Thu, Dec 08, 2011 at 05:39:16PM +0000, Jeremy Sanders wrote:
> >>On 08/12/11 17:23, Chris Mason wrote:
> >>>On Thu, Dec 08, 2011 at 04:57:12PM +0000, Jeremy Sanders wrote:
> >>>>On 08/12/11 15:32, Chris Mason wrote:
> >>>>>On Thu, Dec 08, 2011 at 03:19:38PM +0000, Jeremy Sanders wrote:
> >>>>>>Hi - I'm trying out btrfs again, and I see the same old bug in kernel 3.1.4
> >>>>>>(Fedora 16, x86_64, dual-core), where after a few hours of writing, it
> >>>>>>switches from writing with several threads to writing with one:
> >>>>>
> >>>>>Ok, I'll try to reproduce this here.  Could you please do a sysrq-t, I'd
> >>>>>like to see what the other delalloc-writers are doing.
> >>>>
> >>>>I've attached sysrq-t. It looks like it might be truncated at the
> >>>>beginning, however.
> >>>
> >>>/var/log/messages may have the whole thing, please do check.
> >>
> >>That was from /var/log/messages. I think it needs a longer
> >>log_buf_len. Unfortunately the system hasn't come back from its
> >>reboot, so it will have to wait until tomorrow when I can get to it
> >>physically.
> >
> >Ok, this trace shows that we have tar sitting in balance_dirty_pages and
> >we have the single delalloc worker doing requests.  The other delalloc
> >workers don't show up at all.
> >
> >So either they are earlier in the trace or they disappeared somehow.
> >I'll definitely need the full trace if you can send it.
> 
> I've got the full trace now. It's pretty big (430KB), so I've put it
> on the web.
> 
> Here's the state before switching to one thread
> http://www-xray.ast.cam.ac.uk/~jss/data/btrfs-before.txt
> 
> Here it is after it has switched to one thread:
> http://www-xray.ast.cam.ac.uk/~jss/data/btrfs-after.txt

According to this you've only got one delalloc worker.  That would
explain it.  Could you please confirm with ps?

You might be hitting a problem Josef sent patches for, I'll dig in.

-chris



Re: WARNING: at fs/btrfs/extent-tree.c:4754 followed by BUG: unable to handle kernel NULL pointer dereference at (null)

2011-12-09 Thread Kai Krakow
Hello!

2011/12/8 Jan Schmidt :
> On 07.12.2011 21:40, Kai Krakow wrote:
[...]
>> The problematic file seems to be in /usr/portage but scrubbing doesn't tell
>> me the filename (I was under the impression 3.2.x adds a patch which should
>> report filenames).
>
> It should. Did you take a look at dmesg output after scrubbing? If it
> doesn't contain a hint on the file or block, please paste what you get.

I watched dmesg while scrubbing. Nothing there. To paste what I got I
need to find a way to make my 3.2-rc4 system boot again (without
freezing due to services and background jobs touching certain parts of
the broken filesystem) or create a 3.2 rescue system...

>> Every time I run "emerge" (it is a gentoo system) my
>> screen goes black after a few seconds and I can only revert to using ssh.
>>
>> Problem is: As soon as this happens, some filesystem accesses block the
>> process in disk state, it cannot be killed. This initiates some feedback
>> loop: From now on any other process trying to access the FS freezes. I can
>> only reisub now. It seems to be fine if data comes from cache instead of from
>> disk.
>
> Please try to grab sysrq+w output in this state.

I tried, but there was nothing there. I wondered why... This changed
between 3.1 and 3.2. There is probably no blocking process because it
got killed by the kernel. The next process accessing the filesystem
blocks (and does not get killed). I will try to get a sysrq+w from this
situation via ssh to copy&paste dmesg somewhere, but it will be
difficult because usually ssh communication freezes, too.

Maybe related: When the system was still running I was sometimes
seeing it use 100% CPU on one or two cores, looking at "top" I could
not see a process or kernel thread using the CPU but I saw the CPU
usage distributed across SYS%, WA% and USER%... This effect could only
be resolved by rebooting. It can be seen in both kernel 3.1 and 3.2, but
with much lower likelihood in 3.2. However, even nice'd processes were
still able to acquire 100% cpu usage per core, so it didn't have any
effect on system performance.

I think I even made my situation worse... In an attempt to get the
error fixed, I deleted and recreated the subvolume with /usr/portage
(content is easily restorable from the internet). On next reboot the
btrfs cleaner kernel thread spit out a lot of errors and traces into
dmesg, system froze some minutes later so I couldn't save the output.
Now I cannot reliably boot and btrfs has problems accessing files all
over the filesystem, even in subvolumes that worked fine before. I
thought subvolumes are clearly separated from each other? Now I have
at least 3 different classes of error messages instead of only 1
single error.

Josef's repair program fails an assertion and cannot continue on the volume.

I think in order to stabilize btrfs it is important to make it handle
structure errors gracefully, and then invest into some repair utility.
I'd like to contribute but at some point in time I will need to get my
system back into a stable state and will recreate my filesystem from
scratch. Mounting the fs read-only allows me to access all parts of
the filesystem without problems. I still see errors in dmesg but no
kernel bugs or warnings with traces.

Regards,
Kai


Re: Btrfs switches to using mostly one thread

2011-12-09 Thread Jeremy Sanders

On 09/12/11 12:15, Arne Jansen wrote:
> On 08.12.2011 16:19, Jeremy Sanders wrote:
>> Hi - I'm trying out btrfs again, and I see the same old bug in kernel 3.1.4
>> (Fedora 16, x86_64, dual-core), where after a few hours of writing, it
>> switches from writing with several threads to writing with one:
>
> How many disks does the fs have?


One - it's writing onto a "linear" MD array (for testing purposes). I 
disabled duplication of metadata as well, and zlib compression is forced.


Jeremy



Re: Btrfs switches to using mostly one thread

2011-12-09 Thread Arne Jansen
On 08.12.2011 16:19, Jeremy Sanders wrote:
> Hi - I'm trying out btrfs again, and I see the same old bug in kernel 3.1.4 
> (Fedora 16, x86_64, dual-core), where after a few hours of writing, it 
> switches from writing with several threads to writing with one:

How many disks does the fs have?

-Arne


Re: Btrfs switches to using mostly one thread

2011-12-09 Thread Jeremy Sanders

On 08/12/11 20:11, Chris Mason wrote:
> On Thu, Dec 08, 2011 at 05:39:16PM +0000, Jeremy Sanders wrote:
>> On 08/12/11 17:23, Chris Mason wrote:
>>> On Thu, Dec 08, 2011 at 04:57:12PM +0000, Jeremy Sanders wrote:
>>>> On 08/12/11 15:32, Chris Mason wrote:
>>>>> On Thu, Dec 08, 2011 at 03:19:38PM +0000, Jeremy Sanders wrote:
>>>>>> Hi - I'm trying out btrfs again, and I see the same old bug in kernel 3.1.4
>>>>>> (Fedora 16, x86_64, dual-core), where after a few hours of writing, it
>>>>>> switches from writing with several threads to writing with one:
>>>>>
>>>>> Ok, I'll try to reproduce this here.  Could you please do a sysrq-t, I'd
>>>>> like to see what the other delalloc-writers are doing.
>>>>
>>>> I've attached sysrq-t. It looks like it might be truncated at the
>>>> beginning, however.
>>>
>>> /var/log/messages may have the whole thing, please do check.
>>
>> That was from /var/log/messages. I think it needs a longer
>> log_buf_len. Unfortunately the system hasn't come back from its
>> reboot, so it will have to wait until tomorrow when I can get to it
>> physically.
>
> Ok, this trace shows that we have tar sitting in balance_dirty_pages and
> we have the single delalloc worker doing requests.  The other delalloc
> workers don't show up at all.
>
> So either they are earlier in the trace or they disappeared somehow.
> I'll definitely need the full trace if you can send it.


I've got the full trace now. It's pretty big (430KB), so I've put it on 
the web.


Here's the state before switching to one thread
http://www-xray.ast.cam.ac.uk/~jss/data/btrfs-before.txt

Here it is after it has switched to one thread:
http://www-xray.ast.cam.ac.uk/~jss/data/btrfs-after.txt

Jeremy


[PATCH] btrfs: keep orphans for subvolume deletion

2011-12-09 Thread Arne Jansen
Since we have the free space caches, btrfs_orphan_cleanup also runs for
the tree_root. Unfortunately this also cleans up the orphans used to mark
subvol deletions in progress.
Currently if a subvol deletion gets interrupted twice by umount/mount, the
deletion will not be continued and the space permanently lost, though it
would be possible to write a tool to recover those lost subvol deletions.
This patch checks if the orphan belongs to a subvol (dead root) and skips
the deletion.

Signed-off-by: Arne Jansen 
---
 fs/btrfs/inode.c |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c5ccec2..e30d38f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2158,6 +2158,38 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 		if (ret && ret != -ESTALE)
 			goto out;
 
+		if (ret == -ESTALE && root == root->fs_info->tree_root) {
+			struct btrfs_root *dead_root;
+			struct btrfs_fs_info *fs_info = root->fs_info;
+			int is_dead_root = 0;
+
+			/*
+			 * this is an orphan in the tree root. Currently these
+			 * could come from 2 sources:
+			 *  a) a snapshot deletion in progress
+			 *  b) a free space cache inode
+			 * We need to distinguish those two, as the snapshot
+			 * orphan must not get deleted.
+			 * find_dead_roots already ran before us, so if this
+			 * is a snapshot deletion, we should find the root
+			 * in the dead_roots list
+			 */
+			spin_lock(&fs_info->trans_lock);
+			list_for_each_entry(dead_root, &fs_info->dead_roots,
+					    root_list) {
+				if (dead_root->root_key.objectid ==
+				    found_key.objectid) {
+					is_dead_root = 1;
+					break;
+				}
+			}
+			spin_unlock(&fs_info->trans_lock);
+			if (is_dead_root) {
+				/* prevent this orphan from being found again */
+				key.offset = found_key.objectid - 1;
+				continue;
+			}
+		}
 		/*
 		 * Inode is already gone but the orphan item is still there,
 		 * kill the orphan item.
-- 
1.7.3.4
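
The check added by the patch above can be modelled in userspace roughly as follows. This is a minimal sketch under loud assumptions, not kernel code: the dead_roots list and the orphan items are replaced by plain sorted arrays, there is no locking, and the names is_dead_root_id and cleanup_orphans are hypothetical. It only illustrates the two ideas in the hunk: an orphan whose ID matches a dead root is kept, and the search cursor is moved to just below that ID so the same orphan is not found again.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: an array of root object IDs stands in for fs_info->dead_roots. */
int is_dead_root_id(const uint64_t *dead_roots, size_t n_dead, uint64_t objectid)
{
	for (size_t i = 0; i < n_dead; i++)
		if (dead_roots[i] == objectid)
			return 1;
	return 0;
}

/*
 * Walk the orphan IDs (sorted ascending) from the highest downward.
 * Orphans that belong to a dead root are copied to kept[] and skipped;
 * all others are counted as deleted. Returns the number deleted.
 * kept[] must have room for n_orphans entries.
 */
size_t cleanup_orphans(const uint64_t *orphans, size_t n_orphans,
		       const uint64_t *dead_roots, size_t n_dead,
		       uint64_t *kept, size_t *n_kept)
{
	uint64_t cursor = UINT64_MAX; /* plays the role of key.offset */
	size_t deleted = 0;

	*n_kept = 0;
	for (;;) {
		size_t i = n_orphans;

		/* Find the largest orphan ID that is <= cursor. */
		while (i > 0 && orphans[i - 1] > cursor)
			i--;
		if (i == 0)
			break; /* nothing left at or below the cursor */

		uint64_t id = orphans[i - 1];

		if (is_dead_root_id(dead_roots, n_dead, id))
			kept[(*n_kept)++] = id; /* subvol deletion in progress */
		else
			deleted++; /* stale orphan item: would be removed */

		if (id == 0)
			break; /* avoid wrapping the cursor below zero */
		/*
		 * In this model the array is immutable, so the cursor moves
		 * past every item; the patch only needs this for kept orphans.
		 */
		cursor = id - 1;
	}
	return deleted;
}
```

For example, with orphan IDs {5, 256, 257, 300} and a single dead root 257, the walk deletes three orphans and keeps 257.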
