raid 10 corruption from single drive failure

2013-06-29 Thread D. Spindel

Hi,
  I'm evaluating btrfs for a future deployment, and managed to (repeatedly)
get btrfs into a state where the filesystem can't be mounted, can't be
fsck'd and can't be recovered.


The test setup is pretty small, 6 devices of various sizes:

  butter-1.5GA vg_dolt -wi-a 1.50g
  butter-1.5GB vg_dolt -wi-a 1.50g
  butter-2GA   vg_dolt -wi-a 2.00g
  butter-2GB   vg_dolt -wi-a 2.00g
  butter-3GA   vg_dolt -wi-a 3.00g
  butter-3GB   vg_dolt -wi-a 3.00g


Created a btrfs volume:
mkfs.btrfs -d raid10 -m raid1 /dev/mapper/vg_dolt-butter--1.5GA
/dev/mapper/vg_dolt-butter--1.5GA /dev/mapper/vg_dolt-butter--2GA
/dev/mapper/vg_dolt-butter--2GB /dev/mapper/vg_dolt-butter--3GA
/dev/mapper/vg_dolt-butter--3GB


( Note the typo above: 1.5GA was listed twice, so this is effectively a
5-disk raid10. )
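
( For reference, the command as presumably intended, with each of the six
LVs listed once: )

mkfs.btrfs -d raid10 -m raid1 /dev/mapper/vg_dolt-butter--1.5GA \
    /dev/mapper/vg_dolt-butter--1.5GB /dev/mapper/vg_dolt-butter--2GA \
    /dev/mapper/vg_dolt-butter--2GB /dev/mapper/vg_dolt-butter--3GA \
    /dev/mapper/vg_dolt-butter--3GB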


--
mount it and fill it with files ( I downloaded parts of the fedora src.rpm
tree ).

unmount the partition

Zero one drive
dd if=/dev/zero of=/dev/vg_dolt/butter-3GB bs=1M skip=100

( It's sort of hard to fake a corrupt drive, this is a decent way of doing it )
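
( Side note: dd's skip= skips blocks of the *input*, which is a no-op when
reading /dev/zero, so the command above overwrites the LV from offset 0
onwards, superblock included. A variant using seek= on the output would
leave the first 100MB, and thus the primary superblock, intact: )

dd if=/dev/zero of=/dev/vg_dolt/butter-3GB bs=1M seek=100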

Trying to mount it gives the following in the kernel log:
Jun 28 23:58:34 dolt kernel: [2815554.803082] device fsid 379e495a-9ba7-4485-ae74-6c8939f7b22e devid 3 transid 27 /dev/mapper/vg_dolt-butter--2GB
Jun 28 23:58:34 dolt kernel: [2815554.850211] btrfs: disk space caching is enabled
Jun 28 23:58:34 dolt kernel: [2815554.850856] btrfs: failed to read chunk tree on dm-6
Jun 28 23:58:34 dolt kernel: [2815554.856453] btrfs: open_ctree failed
Jun 28 23:58:44 dolt kernel: [2815565.475519] device fsid 379e495a-9ba7-4485-ae74-6c8939f7b22e devid 3 transid 27 /dev/mapper/vg_dolt-butter--2GB
Jun 28 23:58:44 dolt kernel: [2815565.476939] btrfs: enabling auto recovery
Jun 28 23:58:44 dolt kernel: [2815565.476944] btrfs: disk space caching is enabled
Jun 28 23:58:44 dolt kernel: [2815565.477648] btrfs: failed to read chunk tree on dm-6
Jun 28 23:58:44 dolt kernel: [2815565.486300] btrfs: open_ctree failed
Jun 28 23:58:52 dolt kernel: [2815573.522271] device fsid 379e495a-9ba7-4485-ae74-6c8939f7b22e devid 2 transid 27 /dev/mapper/vg_dolt-butter--2GA
Jun 28 23:58:52 dolt kernel: [2815573.536624] btrfs: enabling auto recovery
Jun 28 23:58:52 dolt kernel: [2815573.536628] btrfs: disk space caching is enabled
Jun 28 23:58:52 dolt kernel: [2815573.537185] btrfs: failed to read chunk tree on dm-6
Jun 28 23:58:52 dolt kernel: [2815573.542938] btrfs: open_ctree failed


[root@dolt mnt]# btrfsck /dev/vg_dolt/butter-2GA
failed to read /dev/sr0
failed to read /dev/sr0
warning, device 5 is missing
warning devid 5 not found already
checking extents
checking fs roots
checking root refs
Segmentation fault

[root@dolt mnt]# mount -o recovery,ro /dev/mapper/vg_dolt-butter--2GA
/mnt/test/
mount: wrong fs type, bad option, bad superblock on
/dev/mapper/vg_dolt-butter--2GA,
   missing codepage or helper program, or other error
   In some cases useful info is found in syslog - try
   dmesg | tail or so
[root@dolt mnt]#

debuginfo-install btrfs-progs-0.20.rc1.20121017git91d9eec-3.fc18.x86_64


[root@dolt mnt]# gdb btrfsck /dev/vg_dolt/butter-2GA
GNU gdb (GDB) Fedora (7.5.1-38.fc18)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as x86_64-redhat-linux-gnu.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/sbin/btrfsck...Reading symbols from
/usr/lib/debug/usr/sbin/btrfsck.debug...done.
done.
/dev/vg_dolt/butter-2GA is not a core dump: File format not recognized
(gdb) run /dev/vg_dolt/butter-2GA
Starting program: /usr/sbin/btrfsck /dev/vg_dolt/butter-2GA
failed to read /dev/sr0
failed to read /dev/sr0
warning, device 5 is missing
warning devid 5 not found already
checking extents
checking fs roots
checking root refs

Program received signal SIGSEGV, Segmentation fault.
__GI___libc_free (mem=0x80) at malloc.c:2907
2907        if (chunk_is_mmapped(p))  /* release mmapped memory. */
(gdb) bt full
#0  __GI___libc_free (mem=0x80) at malloc.c:2907
        ar_ptr = <optimized out>
        p = <optimized out>
        hook = 0x0
#1  0x0040d429 in close_all_devices (fs_info=0x6323e0) at disk-io.c:1088
        list = 0x631050
        next = 0x6300b0
        tmp = 0x630430
        device = 0x6300b0
#2  0x0040e3df in close_ctree (root=root@entry=0x6426e0) at disk-io.c:1135
        ret = <optimized out>
        fs_info = 0x6323e0
        __PRETTY_FUNCTION__ = "close_ctree"
#3  0x00401d8d in main (ac=<optimized out>, av=<optimized out>) at btrfsck.c:3593
        root_cache = {root = {rb_node = 0x0, rotate_notify = 0x423aad <__libc_csu_init+93>}}
        root = <optimized out>
        info = <optimized out>
        trans = <optimized out>
        bytenr = 

Re: raid 10 corruption from single drive failure

2013-06-29 Thread cwillu
 Making this with all 6 devices from the beginning and btrfsck doesn't
 segfault. But it also doesn't repair the system enough to make it
 mountable. ( neither does -o recovery, however -o degraded works, and
 files are then accessible )

Not sure I entirely follow: mounting with -o degraded (not -o
recovery) is how you're supposed to mount if there's a disk missing.
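
Roughly, the usual recovery sequence once a device has actually died (a
sketch -- the replacement LV name below is made up, everything else is from
the setup above):

mount -o degraded /dev/mapper/vg_dolt-butter--2GA /mnt/test
btrfs device add /dev/mapper/vg_dolt-butter--new /mnt/test
btrfs device delete missing /mnt/test

"device delete missing" drops the dead device from the filesystem and
re-replicates the chunks that lived on it onto the remaining devices.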


Re: raid1 inefficient unbalanced filesystem reads

2013-06-29 Thread Russell Coker
On Sat, 29 Jun 2013, Martin m_bt...@ml1.co.uk wrote:
 Mmmm... I'm not sure trying to balance historical read/write counts is
 the way to go... What happens for the use case of an SSD paired up with
 a HDD? (For example an SSD and a similarly sized Raptor or enterprise
 SCSI?...) Or even just JBODs of a mishmash of different speeds?
 
 Rather than trying to balance io counts, can a realtime utilisation
 check be made and go for the least busy?

It would also be nice to be able to tune this.  For example I've got a RAID-1 
array that's mounted noatime, hardly ever written, and accessed via NFS on 
100baseT.  It would be nice if one disk could be spun down for most of the 
time and save 7W of system power.  Something like the --write-mostly option of 
mdadm would be good here.
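
For reference, mdadm expresses this preference per device at array creation
time, something like the following (device names are just placeholders):

mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    /dev/sda1 --write-mostly /dev/sdb1

Reads are steered away from the --write-mostly members whenever possible.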

Also it should be possible for a RAID-1 array to allow faster reads for a 
single process reading a single file if the file in question is fragmented.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: [PATCH v3] Btrfs: fix crash regarding to ulist_add_merge

2013-06-29 Thread Liu Bo
On Fri, Jun 28, 2013 at 12:43:14PM -0700, Zach Brown wrote:
 On Fri, Jun 28, 2013 at 12:37:45PM +0800, Liu Bo wrote:
  Several users reported crashes (NULL pointer dereference or general
  protection fault).  The story is that we added an rbtree to speed up
  ulist iteration, but we handle ulist growth with krealloc(), which uses
  memcpy() to move the old data to the new memory area.  That is fine for
  an array, which holds no pointers, but not for an rbtree, which does.
  
  So krealloc() ends up corrupting our rbtree, and we crash.
  
  Reviewed-by: Wang Shilong wangsl-f...@cn.fujitsu.com
  Signed-off-by: Liu Bo bo.li@oracle.com
 
 Yeah, this should fix the problem.  Thanks for being persistent.
 
 Reviewed-by: Zach Brown z...@redhat.com
 
  +   for (i = 0; i < ulist->nnodes; i++)
  +   rb_erase(&ulist->nodes[i].rb_node, &ulist->root);
 
 (still twitching over here because this is a bunch of work that achieves
 nothing :))

Hmm, I think that this is necessary for the inline array inside ulist,
so I keep it :)

- liubo


Re: [PATCH v2] Btrfs: fix crash regarding to ulist_add_merge

2013-06-29 Thread Liu Bo
On Fri, Jun 28, 2013 at 01:08:21PM -0400, Josef Bacik wrote:
 On Fri, Jun 28, 2013 at 10:25:39AM +0800, Liu Bo wrote:
  Several users reported crashes (NULL pointer dereference or general
  protection fault).  The story is that we added an rbtree to speed up
  ulist iteration, but we handle ulist growth with krealloc(), which uses
  memcpy() to move the old data to the new memory area.  That is fine for
  an array, which holds no pointers, but not for an rbtree, which does.
  
  So krealloc() ends up corrupting our rbtree, and we crash.
  
  Signed-off-by: Liu Bo bo.li@oracle.com
  ---
  v2: fix a use-after-free bug and a finger error (thanks Zach and Josef).
  
 
 Is this supposed to fix this bug?
 
 [ 1215.561033] [ cut here ]
 [ 1215.561064] kernel BUG at fs/btrfs/ctree.c:1183!
 [ 1215.561087] invalid opcode:  [#1] PREEMPT SMP
 [ 1215.561114] Modules linked in: btrfs raid6_pq zlib_deflate xor libcrc32c 
 ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM 
 iptable_mangle bridge stp llc lockd be2iscsi iscsi_boot_sysfs bnx2i cnic uio 
 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad 
 ip6t_REJECT nf_conntrack_ipv6 ib_core nf_defrag_ipv6 ib_addr nf_conntrack_ipv4 
 iscsi_tcp nf_defrag_ipv4 xt_state nf_conntrack libiscsi_tcp ip6table_filter 
 libiscsi ip6_tables scsi_transport_iscsi snd_hda_codec_hdmi snd_hda_codec_realtek 
 snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm vhost_net 
 snd_timer macvtap snd macvlan tun virtio_net soundcore kvm_amd sunrpc kvm 
 snd_page_alloc sp5100_tco edac_core microcode pcspkr serio_raw k10temp edac_mce_amd 
 i2c_piix4 r8169 mii iomemory_vsl(OF) floppy firewire_ohci firewire_core ata_generic 
 pata_acpi crc_itu_t pata_via radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core
 [ 1215.561585] CPU 1
 [ 1215.561597] Pid: 28188, comm: btrfs-endio-wri Tainted: GF  O 
 3.9.0+ #9 To Be Filled By O.E.M. To Be Filled By
 O.E.M./890FX Deluxe5
 [ 1215.561649] RIP: 0010:[a06f529b]  [a06f529b] 
 __tree_mod_log_rewind+0x26b/0x270 [btrfs]
 [ 1215.561706] RSP: 0018:8803b7529828  EFLAGS: 00010293
 [ 1215.561729] RAX:  RBX: 8803b42d5960 RCX: 
 8803b75297c8
 [ 1215.561759] RDX: 0002577d RSI: 0921 RDI: 
 8803b3e92440
 [ 1215.561788] RBP: 8803b7529858 R08: 1000 R09: 
 8803b75297d8
 [ 1215.561818] R10: 1bbb R11:  R12: 
 8803b630ddc0
 [ 1215.561848] R13: 0044 R14: 8803b3e92540 R15: 
 00017add
 [ 1215.561878] FS:  7f9ba1ce7700() GS:88043fc4() 
 knlGS:
 [ 1215.561911] CS:  0010 DS:  ES:  CR0: 8005003b
 [ 1215.561936] CR2: 7fa4a6148d90 CR3: 000427ff7000 CR4: 
 07e0
 [ 1215.561965] DR0:  DR1:  DR2: 
 
 [ 1215.561995] DR3:  DR6: 0ff0 DR7: 
 0400
 [ 1215.562025] Process btrfs-endio-wri (pid: 28188, threadinfo 
 8803b7528000, task 8803eb5a97d0)
 [ 1215.562063] Stack:
 [ 1215.562073]  88042998e1c0 8800 88042998e1c0 
 8803c41b8000
 [ 1215.562109]  8803b43c4e20 0001 8803b7529908 
 a06fda47
 [ 1215.562146]  8803b7694458 00017add 8803b7529888 
 8803b42d5960
 [ 1215.562182] Call Trace:
 [ 1215.562200]  [a06fda47] btrfs_search_old_slot+0x757/0xa40 [btrfs]
 [ 1215.562237]  [a0779fcd] __resolve_indirect_refs+0x11d/0x670 
 [btrfs]
 [ 1215.562273]  [a077ab4c] find_parent_nodes+0x1fc/0xe90 [btrfs]
 [ 1215.562307]  [a077b879] btrfs_find_all_roots+0x99/0x100 [btrfs]
 [ 1215.562341]  [a07240b0] ? btrfs_submit_direct+0x680/0x680 [btrfs]
 [ 1215.562376]  [a077c224] iterate_extent_inodes+0x144/0x2f0 [btrfs]
 [ 1215.562412]  [a077c462] iterate_inodes_from_logical+0x92/0xb0 
 [btrfs]
 [ 1215.562449]  [a07240b0] ? btrfs_submit_direct+0x680/0x680 [btrfs]
 [ 1215.562484]  [a07214f8] record_extent_backrefs+0x78/0xf0 [btrfs]
 [ 1215.562519]  [a072bac6] btrfs_finish_ordered_io+0x156/0x9d0 
 [btrfs]
 [ 1215.562556]  [a072c355] finish_ordered_fn+0x15/0x20 [btrfs]
 [ 1215.562589]  [a074d96a] worker_loop+0x16a/0x570 [btrfs]
 [ 1215.562618]  [8108f348] ? __wake_up_common+0x58/0x90
 [ 1215.562649]  [a074d800] ? btrfs_queue_worker+0x300/0x300 [btrfs]
 [ 1215.562680]  [81086c10] kthread+0xc0/0xd0
 [ 1215.562703]  [8165] ? acpi_processor_add+0xcb/0x47d
 [ 1215.562731]  [81086b50] ? flush_kthread_worker+0xb0/0xb0
 [ 1215.562758]  [8166452c] ret_from_fork+0x7c/0xb0
 [ 1215.562783]  [81086b50] ? flush_kthread_worker+0xb0/0xb0
 [ 1215.562809] Code: c1 49 63 46 58 48 89 c2 48 c1 e2 05 48 8d 54 10 65 49 63 
 46 2c 48 89 c6 48 c1 e6 05 48 8d 74 30 65
 e8 0a c7 04 00 e9 9d fe ff ff 0f 

Re: [PATCH] Btrfs: make backref walking code handle skinny metadata

2013-06-29 Thread Liu Bo
On Fri, Jun 28, 2013 at 01:12:58PM -0400, Josef Bacik wrote:
 I missed fixing the backref stuff when I introduced the skinny metadata.  If you
 try and do things like snapshot aware defrag with skinny metadata you are going
 to see tons of warnings related to the backref count being less than 0.  This is
 because the delayed refs will be found for stuff just fine, but it won't find
 the skinny metadata extent refs.  With this patch I'm not seeing warnings
 anymore.  Thanks,

Reviewed-by: Liu Bo bo.li@oracle.com

- liubo

 
 Signed-off-by: Josef Bacik jba...@fusionio.com
 ---
  fs/btrfs/backref.c |   31 +--
  1 files changed, 25 insertions(+), 6 deletions(-)
 
 diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
 index 431ea92..eaf1333 100644
 --- a/fs/btrfs/backref.c
 +++ b/fs/btrfs/backref.c
 @@ -597,6 +597,7 @@ static int __add_inline_refs(struct btrfs_fs_info *fs_info,
   int slot;
   struct extent_buffer *leaf;
   struct btrfs_key key;
 + struct btrfs_key found_key;
   unsigned long ptr;
   unsigned long end;
   struct btrfs_extent_item *ei;
 @@ -614,17 +615,21 @@ static int __add_inline_refs(struct btrfs_fs_info *fs_info,
  
   ei = btrfs_item_ptr(leaf, slot, struct btrfs_extent_item);
   flags = btrfs_extent_flags(leaf, ei);
 + btrfs_item_key_to_cpu(leaf, &found_key, slot);
  
   ptr = (unsigned long)(ei + 1);
   end = (unsigned long)ei + item_size;
  
 - if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
 + if (found_key.type == BTRFS_EXTENT_ITEM_KEY &&
 +     flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
   struct btrfs_tree_block_info *info;
  
   info = (struct btrfs_tree_block_info *)ptr;
   *info_level = btrfs_tree_block_level(leaf, info);
   ptr += sizeof(struct btrfs_tree_block_info);
  BUG_ON(ptr > end);
 + } else if (found_key.type == BTRFS_METADATA_ITEM_KEY) {
 + *info_level = found_key.offset;
   } else {
  BUG_ON(!(flags & BTRFS_EXTENT_FLAG_DATA));
   }
 @@ -796,8 +801,11 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
  INIT_LIST_HEAD(&prefs_delayed);
  
   key.objectid = bytenr;
 - key.type = BTRFS_EXTENT_ITEM_KEY;
   key.offset = (u64)-1;
 + if (btrfs_fs_incompat(fs_info, SKINNY_METADATA))
 + key.type = BTRFS_METADATA_ITEM_KEY;
 + else
 + key.type = BTRFS_EXTENT_ITEM_KEY;
  
   path = btrfs_alloc_path();
   if (!path)
 @@ -862,7 +870,8 @@ again:
  slot = path->slots[0];
  btrfs_item_key_to_cpu(leaf, &key, slot);
  if (key.objectid == bytenr &&
 - key.type == BTRFS_EXTENT_ITEM_KEY) {
 + (key.type == BTRFS_EXTENT_ITEM_KEY ||
 +  key.type == BTRFS_METADATA_ITEM_KEY)) {
  ret = __add_inline_refs(fs_info, path, bytenr,
  &info_level, &prefs);
   if (ret)
 @@ -1276,12 +1285,16 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, u64 logical,
  {
   int ret;
   u64 flags;
 + u64 size = 0;
   u32 item_size;
   struct extent_buffer *eb;
   struct btrfs_extent_item *ei;
   struct btrfs_key key;
  
 - key.type = BTRFS_EXTENT_ITEM_KEY;
 + if (btrfs_fs_incompat(fs_info, SKINNY_METADATA))
 + key.type = BTRFS_METADATA_ITEM_KEY;
 + else
 + key.type = BTRFS_EXTENT_ITEM_KEY;
   key.objectid = logical;
   key.offset = (u64)-1;
  
 @@ -1294,9 +1307,15 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, u64 logical,
   return ret;
  
  btrfs_item_key_to_cpu(path->nodes[0], found_key, path->slots[0]);
 - if (found_key->type != BTRFS_EXTENT_ITEM_KEY ||
 + if (found_key->type == BTRFS_METADATA_ITEM_KEY)
 + size = fs_info->extent_root->leafsize;
 + else if (found_key->type == BTRFS_EXTENT_ITEM_KEY)
 + size = found_key->offset;
 +
 + if ((found_key->type != BTRFS_EXTENT_ITEM_KEY &&
 +  found_key->type != BTRFS_METADATA_ITEM_KEY) ||
  found_key->objectid > logical ||
 - found_key->objectid + found_key->offset <= logical) {
 + found_key->objectid + size <= logical) {
  pr_debug("logical %llu is not within any extent\n",
(unsigned long long)logical);
   return -ENOENT;
 -- 
 1.7.7.6
 


btrfsck output: What does it all mean?

2013-06-29 Thread Martin
This is the btrfsck output for a real-world rsync backup onto a btrfs
raid1 mirror across 4 drives (yes, I know at the moment for btrfs raid1
there's only ever two copies of the data...)


checking extents
checking fs roots
root 5 inode 18446744073709551604 errors 2000
root 5 inode 18446744073709551605 errors 1
root 256 inode 18446744073709551604 errors 2000
root 256 inode 18446744073709551605 errors 1
found 3183604633600 bytes used err is 1
total csum bytes: 3080472924
total tree bytes: 28427821056
total fs tree bytes: 23409475584
btree space waste bytes: 4698218231
file data blocks allocated: 3155176812544
 referenced 3155176812544
Btrfs Btrfs v0.19
Command exited with non-zero status 1


So: What does that little lot mean?

The drives were mounted and active during an unexpected power-plug pull :-(


Safe to mount again or are there other checks/fixes needed?

Thanks,
Martin



Re: raid1 inefficient unbalanced filesystem reads

2013-06-29 Thread Martin
On 29/06/13 10:41, Russell Coker wrote:
 On Sat, 29 Jun 2013, Martin wrote:
 Mmmm... I'm not sure trying to balance historical read/write counts is
 the way to go... What happens for the use case of an SSD paired up with
 a HDD? (For example an SSD and a similarly sized Raptor or enterprise
 SCSI?...) Or even just JBODs of a mishmash of different speeds?

 Rather than trying to balance io counts, can a realtime utilisation
 check be made and go for the least busy?
 
 It would also be nice to be able to tune this.  For example I've got a RAID-1
 array that's mounted noatime, hardly ever written, and accessed via NFS on
 100baseT.  It would be nice if one disk could be spun down for most of the
 time and save 7W of system power.  Something like the --write-mostly option
 of mdadm would be good here.

For that case, a --read-mostly would be more apt ;-)

Hence, add a check to preferentially use last disk used if all are idle?


 Also it should be possible for a RAID-1 array to allow faster reads for a 
 single process reading a single file if the file in question is fragmented.

That sounds good but complicated to gather and sort the fragments into
groups per disk... Or is something like that already done by the block
device elevator for HDDs?

Also, is head seek optimisation turned off for SSD accesses?


(This is sounding like a lot more than just swapping:

current->pid % map->num_stripes

to a

pseudorandomhash( current->pid ) % map->num_stripes

... ;-) )


Is there any readily accessible present state, such as disk activity,
queue length or access latency, available for the btrfs code to read?

I suspect a good first guess to cover many conditions would be to
'simply' choose whichever device is powered up and has the lowest
current latency, or if idle has the lowest historical latency...


Regards,
Martin



Re: btrfsck output: What does it all mean?

2013-06-29 Thread Duncan
Martin posted on Sat, 29 Jun 2013 14:48:40 +0100 as excerpted:

 This is the btrfsck output for a real-world rsync backup onto a btrfs
 raid1 mirror across 4 drives (yes, I know at the moment for btrfs raid1
 there's only ever two copies of the data...)

Being just a btrfs user I don't have a detailed answer, but perhaps this 
helps.

First of all, a btrfs-tools update is available, v0.20-rc1.  Given that 
btrfs is still experimental and the rate of development, even using the 
live-git version (as I do) is probably the best idea, but certainly I'd 
encourage you to get the 0.20-rc1 version at least.  FWIW, I'm running 
v0.20-rc1-335-gf00dd83, that's 335 commits after rc1, on git commit f00dd83.

(Of course similarly with the kernel.  You may not want to run the
live-git mainline kernel during the commit window or even the first 
couple of rcs, but starting with rc3 or so, a new mainline pre-release 
kernel should be /reasonably/ safe to run in general, and the new kernel 
will have enough fixes to btrfs that you really should be running it.  Of 
course if you've experienced and filed a bug with it and are back on the 
latest full stable release until it's fixed, or if there's a known btrfs 
regression in the new version that you're waiting on a fix for, then the 
latest version without that fix is good, but otherwise, if you're not 
running the latest kernel and btrfs-tools, you really might be taking 
chances with your data that you don't need to take, due to already 
existing fixes you're not yet running.)
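
(To check what's actually in use, something like the following works -- the
exact btrfs-progs subcommand may differ on very old versions:

uname -r
btrfs version
)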

 checking extents
 checking fs roots
 root 5 inode 18446744073709551604 errors 2000
 root 5 inode 18446744073709551605 errors 1
 root 256 inode 18446744073709551604 errors 2000
 root 256 inode 18446744073709551605 errors 1

Based on the root numbers, I'd guess those are subvolume IDs.  The 
original root volume has ID 5, and the first subvolume created under it 
has ID 256, based on my own experience.
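
(If the filesystem mounts, the subvolume IDs can be listed and matched up
against those root numbers -- /mountpoint below is just a placeholder:

btrfs subvolume list /mountpoint
)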

What the error numbers refer to I don't know.  However, based on the 
identical inode and error numbers seen in both subvolumes, I'd guess that 
#256 is a snapshot of #5, and that whatever is triggering the errors 
hadn't been written after the snapshot (thus copying the data to a new 
location), so when the errors happened in the one, it happened in the 
other as well, since they're both the same location.

The good news of that is that in reality that's only the one set of 
errors duplicated twice.  The bad news is that it affects both snapshots, 
so if you don't have different snapshot with a newer/older copy of 
whatever's damaged in those two, you may simply lose it.

 found 3183604633600 bytes used err is 1
 total csum bytes: 3080472924

csum would be checksum...  The rest, above and below, says in the output 
pretty much what I'd be able to make of it, so I've nothing really to add 
about that.

 total tree bytes: 28427821056
 total fs tree bytes: 23409475584
 btree space waste bytes: 4698218231
 file data blocks allocated: 3155176812544
  referenced 3155176812544

 Btrfs Btrfs v0.19

Meanwhile, you didn't mention anything about the --repair option.  If you 
didn't use it just because you want to know a bit more about what it's 
doing first, OK, but while btrfsck lacked a repair option for quite some 
time, it has had a --repair option for over a year now, so it /is/ 
possible to try to repair the detected damage these days.
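
(With the filesystem unmounted, that's simply the following -- the device
name is a placeholder:

btrfsck --repair /dev/sdX
)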

Of course you might be running a really old 0.19+ snapshot without that 
ability (distros packaged 0.19+ snapshots for some time during which 
there was no upstream release; hopefully the distro package says which 
snapshot it was, but we know your version is old in any case, since it's 
not 0.20-rc1 or newer but still 0.19-something).

I'd suggest ensuring that you're running the latest almost-release 
3.10-rc7+ kernel and the latest btrfs-tools, then both trying a mount and 
running the btrfsck again.  You can both watch the output and check the 
kernel log for output as it runs, and as you try to mount the 
filesystem.  It may be that a newer kernel (presuming your kernel is as 
old as your btrfs-tools appear to be) might fix whatever's damaged on-
mount, so btrfsck won't have anything left to do.  If not, since you have 
backups of the data (well, this was the backup, you have the originals) 
if anything goes wrong, you can try the --repair option and see what 
happens.  If that doesn't fix it, post the logs and output from the 
updated kernel and btrfs-tools btrfsck, and ask the experts about it once 
they have that to look at too.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



[PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

2013-06-29 Thread Josef Bacik
We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
forward in the tree mod entries, otherwise we'll end up with random entries and
trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes the panics
people were seeing when running

find /whatever -type f -exec btrfs fi defrag {} \;

Thanks,

Signed-off-by: Josef Bacik jba...@fusionio.com
---
 fs/btrfs/ctree.c |   10 ++
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index c32d03d..7921e1d 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1161,8 +1161,8 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info *fs_info,
  * time_seq).
  */
 static void
-__tree_mod_log_rewind(struct extent_buffer *eb, u64 time_seq,
- struct tree_mod_elem *first_tm)
+__tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer *eb,
+ u64 time_seq, struct tree_mod_elem *first_tm)
 {
u32 n;
struct rb_node *next;
@@ -1172,6 +1172,7 @@ __tree_mod_log_rewind(struct extent_buffer *eb, u64 time_seq,
unsigned long p_size = sizeof(struct btrfs_key_ptr);
 
n = btrfs_header_nritems(eb);
+   tree_mod_log_read_lock(fs_info);
while (tm && tm->seq >= time_seq) {
/*
 * all the operations are recorded with the operator used for
@@ -1226,6 +1227,7 @@ __tree_mod_log_rewind(struct extent_buffer *eb, u64 time_seq,
if (tm->index != first_tm->index)
break;
}
+   tree_mod_log_read_unlock(fs_info);
btrfs_set_header_nritems(eb, n);
 }
 
@@ -1274,7 +1276,7 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer *eb,
 
extent_buffer_get(eb_rewin);
btrfs_tree_read_lock(eb_rewin);
-   __tree_mod_log_rewind(eb_rewin, time_seq, tm);
+   __tree_mod_log_rewind(fs_info, eb_rewin, time_seq, tm);
WARN_ON(btrfs_header_nritems(eb_rewin) >
BTRFS_NODEPTRS_PER_BLOCK(fs_info->tree_root));
 
@@ -1350,7 +1352,7 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
btrfs_set_header_generation(eb, old_generation);
}
if (tm)
-   __tree_mod_log_rewind(eb, time_seq, tm);
+   __tree_mod_log_rewind(root->fs_info, eb, time_seq, tm);
else
WARN_ON(btrfs_header_level(eb) != 0);
WARN_ON(btrfs_header_nritems(eb) > BTRFS_NODEPTRS_PER_BLOCK(root));
-- 
1.7.7.6
