[PATCH] Btrfs: fix race deleting block group from space_info->ro_bgs list

2015-01-15 Thread Filipe Manana
When removing a block group we were deleting it from its space_info's
ro_bgs list, using list_del_init, without any synchronization.
Fix this by doing the list delete while holding the space info and
block group spinlocks.

This issue was introduced in the 3.19 kernel by the following change:

Btrfs: move read only block groups onto their own list V2
commit 633c0aad4c0243a506a3e8590551085ad78af82d

I ran into a kernel crash while a block group was being removed, another
task was executing statfs in parallel (iterating the space_info->ro_bgs
list) and yet another task was setting another block group to readonly
mode (which adds it to the space_info->ro_bgs list). This happened while
running the stress test xfstests/generic/038 I recently made.

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/extent-tree.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5a45253..09145ac 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9424,7 +9424,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	 * are still on the list after taking the semaphore
 	 */
 	list_del_init(&block_group->list);
-	list_del_init(&block_group->ro_list);
 	if (list_empty(&block_group->space_info->block_groups[index])) {
 		kobj = block_group->space_info->block_group_kobjs[index];
 		block_group->space_info->block_group_kobjs[index] = NULL;
@@ -9466,6 +9465,9 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	btrfs_remove_free_space_cache(block_group);
 
 	spin_lock(&block_group->space_info->lock);
+	spin_lock(&block_group->lock);
+	list_del_init(&block_group->ro_list);
+	spin_unlock(&block_group->lock);
 	block_group->space_info->total_bytes -= block_group->key.offset;
 	block_group->space_info->bytes_readonly -= block_group->key.offset;
 	block_group->space_info->disk_total -= block_group->key.offset * factor;
-- 
2.1.3
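
For readers less familiar with the rule this patch enforces, here is a
minimal userspace sketch, not kernel code. It assumes a pthread mutex in
place of the space_info spinlock and a hand-rolled circular list in place
of the kernel's list_head; the names space_info_sketch, ro_list_add,
ro_list_del and ro_list_total are made up for illustration. The only point
is that the list delete and the list walk take the same lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Minimal circular doubly linked list node, loosely modelled on list_head. */
struct node {
	struct node *prev, *next;
	unsigned long long bytes;	/* payload: size of the block group */
};

/* Hypothetical stand-in for the space info: one lock protects the list. */
struct space_info_sketch {
	pthread_mutex_t lock;
	struct node ro_bgs;		/* list head; its payload is unused */
};

static void space_info_init(struct space_info_sketch *si)
{
	pthread_mutex_init(&si->lock, NULL);
	si->ro_bgs.prev = si->ro_bgs.next = &si->ro_bgs;
	si->ro_bgs.bytes = 0;
}

static void node_init(struct node *n, unsigned long long bytes)
{
	n->prev = n->next = n;
	n->bytes = bytes;
}

/* Writer: link the node at the front of the list, under the lock. */
static void ro_list_add(struct space_info_sketch *si, struct node *n)
{
	pthread_mutex_lock(&si->lock);
	n->next = si->ro_bgs.next;
	n->prev = &si->ro_bgs;
	si->ro_bgs.next->prev = n;
	si->ro_bgs.next = n;
	pthread_mutex_unlock(&si->lock);
}

/* Writer: unlink and re-initialise the node (like list_del_init), under the lock. */
static void ro_list_del(struct space_info_sketch *si, struct node *n)
{
	pthread_mutex_lock(&si->lock);
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
	pthread_mutex_unlock(&si->lock);
}

/* Reader: walk the list (as a statfs-like caller would) under the same lock. */
static unsigned long long ro_list_total(struct space_info_sketch *si)
{
	unsigned long long total = 0;

	pthread_mutex_lock(&si->lock);
	for (struct node *n = si->ro_bgs.next; n != &si->ro_bgs; n = n->next)
		total += n->bytes;
	pthread_mutex_unlock(&si->lock);
	return total;
}
```

Because the node is re-initialised to point at itself on removal, calling
ro_list_del() a second time on the same node is harmless in this sketch.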

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs scrub status reports not running when it is

2015-01-15 Thread Zach Brown
On Thu, Jan 15, 2015 at 12:24:41PM +0100, David Sterba wrote:
 On Wed, Jan 14, 2015 at 02:27:17PM -0800, Zach Brown wrote:
  On Wed, Jan 14, 2015 at 04:06:02PM -0500, Sandy McArthur Jr wrote:
   Sometimes btrfs scrub status reports that it is not running when it still is.
   
   I think this is a cosmetic bug. And I believe this is related to the
   scrub completing on some drives before others in a multi-drive btrfs
   filesystem that is not well balanced.
  
  Boy, I don't really know this code, but it looks like:
  
      if (ss->in_progress)
          printf(", running for %llu seconds\n", ss->duration);
      else
          printf(", interrupted after %llu seconds, not running\n",
                 ss->duration);
  
      in_progress = is_scrub_running_in_kernel(fdmnt, di_args,
                                               fi_args.num_devices);
  
      static int is_scrub_running_in_kernel(int fd,
              struct btrfs_ioctl_dev_info_args *di_args, u64 max_devices)
      {
          struct scrub_progress sp;
          int i;
          int ret;
  
          for (i = 0; i < max_devices; i++) {
              memset(&sp, 0, sizeof(sp));
              sp.scrub_args.devid = di_args[i].devid;
              ret = ioctl(fd, BTRFS_IOC_SCRUB_PROGRESS, &sp.scrub_args);
              if (ret < 0 && errno == ENODEV)
                  continue;
              if (ret < 0 && errno == ENOTCONN)
                  return 0;
  
  It says that scrub isn't running if any devices have completed.  If you drop
  all those ret < 0 conditional branches that are either noops or wrong, does
  it work like you'd expect?
 
 Why wrong? The ioctl callback returns -ENODEV or -ENOTCONN that get
 translated to the errno values and ioctl(...) returns -1 in both cases.

Wrong because returning 0 on the first ENOTCONN, instead of continuing
to find more devices which might still be scrubbing, leads to this
confusing status message.

That's my working theory having spent 15 seconds reading code.  I would
not be surprised at all if I'm missing something here.
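
To make the theory concrete, here is a small sketch of the loop shape being
suggested: keep scanning past devices that report ENOTCONN and only answer
"not running" once every device has been checked. It is not the
btrfs-progs code. The probe callback stands in for the
BTRFS_IOC_SCRUB_PROGRESS ioctl, and scrub_running_on_any and fake_probe
are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical per-device probe standing in for the scrub progress ioctl:
 * returns 0 when a scrub is running on the device, or -1 with errno set
 * (ENODEV: no such device, ENOTCONN: no scrub running on that device).
 */
typedef int (*scrub_probe_fn)(unsigned long long devid);

/*
 * Returns 1 if a scrub is running on at least one device, 0 if it is
 * running on none, and -1 (errno preserved) on an unexpected error.
 */
static int scrub_running_on_any(const unsigned long long *devids, size_t n,
				scrub_probe_fn probe)
{
	for (size_t i = 0; i < n; i++) {
		if (probe(devids[i]) == 0)
			return 1;	/* one running device is enough */
		if (errno == ENODEV || errno == ENOTCONN)
			continue;	/* keep checking the remaining devices */
		return -1;
	}
	return 0;
}

/* Fake probe for demonstration: only devid 3 is still being scrubbed. */
static int fake_probe(unsigned long long devid)
{
	if (devid == 3)
		return 0;
	errno = (devid == 2) ? ENODEV : ENOTCONN;
	return -1;
}
```

With the fake probe, devids {1, 2, 3} report "running" even though devid 1
has already finished, which is the case from the original report.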


- z


Kernel bug in 3.19-rc4

2015-01-15 Thread Marcel Ritter
Hi,

I just started some btrfs stress testing on latest linux kernel 3.19-rc4:
A few hours later, filesystem stopped working - the kernel bug report
can be found below.

The test consists of one massive IO thread (writing 100GB files with dd),
and 2 tar instances extracting kernel sources and deleting them afterwards
(I can provide the simple bash script doing this, if needed).

System information (Ubuntu 14.04.1, latest kernel):

root@thunder # uname -a
Linux thunder 3.19.0-rc4-custom #1 SMP Mon Jan 12 16:13:44 CET 2015
x86_64 x86_64 x86_64 GNU/Linux

root@thunder # /root/btrfs-progs/btrfs --version
Btrfs v3.18-36-g0173148

Tests are done on 14 SCSI disks, using raid6 for data and metadata:

root@thunder # /root/btrfs-progs/btrfs fi show
Label: 'raid6'  uuid: cbe34d2b-5f75-46cf-9263-9813028ebc19
	Total devices 14 FS bytes used 674.62GiB
	devid    1 size 279.39GiB used 59.24GiB path /dev/cciss/c1d0
	devid    2 size 279.39GiB used 59.22GiB path /dev/cciss/c1d1
	devid    3 size 279.39GiB used 59.22GiB path /dev/cciss/c1d10
	devid    4 size 279.39GiB used 59.22GiB path /dev/cciss/c1d11
	devid    5 size 279.39GiB used 59.22GiB path /dev/cciss/c1d12
	devid    6 size 279.39GiB used 59.22GiB path /dev/cciss/c1d13
	devid    7 size 279.39GiB used 59.22GiB path /dev/cciss/c1d2
	devid    8 size 279.39GiB used 59.22GiB path /dev/cciss/c1d3
	devid    9 size 279.39GiB used 59.22GiB path /dev/cciss/c1d4
	devid   10 size 279.39GiB used 59.22GiB path /dev/cciss/c1d5
	devid   11 size 279.39GiB used 59.22GiB path /dev/cciss/c1d6
	devid   12 size 279.39GiB used 59.22GiB path /dev/cciss/c1d7
	devid   13 size 279.39GiB used 59.22GiB path /dev/cciss/c1d8
	devid   14 size 279.39GiB used 59.22GiB path /dev/cciss/c1d9

Btrfs v3.18-36-g0173148

# This is provided for completeness only, and was taken
# some time *before* the kernel crash occurred, so the basic
# setup is the same, but allocated/free sizes won't match
root@thunder # /root/btrfs-progs/btrfs fi df /tmp/m
Data, single: total=8.00MiB, used=0.00B
Data, RAID6: total=727.45GiB, used=697.84GiB
System, single: total=4.00MiB, used=0.00B
System, RAID6: total=13.50MiB, used=64.00KiB
Metadata, single: total=8.00MiB, used=0.00B
Metadata, RAID6: total=3.43GiB, used=805.91MiB
GlobalReserve, single: total=272.00MiB, used=0.00B


Here's what happens after some hours of stress testing:

[85162.472989] [ cut here ]
[85162.473071] kernel BUG at fs/btrfs/inode.c:3142!
[85162.473139] invalid opcode:  [#1] SMP
[85162.473212] Modules linked in: btrfs(E) xor(E) raid6_pq(E)
radeon(E) ttm(E) drm_kms_helper(E) drm(E) hpwdt(E) amd64_edac_mod(E)
kvm(E) edac_core(E) shpchp(E) k8temp(E) serio_raw(E) hpilo(E)
edac_mce_amd(E) mac_hid(E) i2c_algo_bit(E) ipmi_si(E) nfsd(E)
auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) grace(E) sunrpc(E) lp(E)
fscache(E) parport(E) hid_generic(E) usbhid(E) hid(E) hpsa(E)
psmouse(E) bnx2(E) cciss(E) pata_acpi(E) pata_amd(E)
[85162.473911] CPU: 4 PID: 3039 Comm: btrfs-cleaner Tainted: G
   E  3.19.0-rc4-custom #1
[85162.474028] Hardware name: HP ProLiant DL585 G2   , BIOS A07 05/02/2011
[85162.474122] task: 88085b054aa0 ti: 88205ad4c000 task.ti:
88205ad4c000
[85162.474230] RIP: 0010:[a06a8182]  [a06a8182]
btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
[85162.474422] RSP: 0018:88205ad4fc48  EFLAGS: 00010286
[85162.474497] RAX: ffe4 RBX: 8810a35d42f8 RCX: 88185b896000
[85162.474595] RDX: 6a54 RSI: 0004 RDI: 88185b896138
[85162.474694] RBP: 88205ad4fc88 R08: 0001e670 R09: 88016194b240
[85162.474793] R10: a06bd797 R11: ea0004f71800 R12: 88185baa2000
[85162.474892] R13: 88085f6d7630 R14: 88185baa2458 R15: 0001
[85162.474992] FS:  7fb3f27fb740() GS:88085fd0()
knlGS:
[85162.475105] CS:  0010 DS:  ES:  CR0: 8005003b
[85162.475184] CR2: 7f896c02c220 CR3: 00085b328000 CR4: 07e0
[85162.475286] Stack:
[85162.475318]  88205ad4fc88 a06e6a14 88185b896b04
88105b03e800
[85162.475442]  88016194b240 8810a35d42f8 881e8ffe9a00
88133dc48ea0
[85162.475561]  88205ad4fd18 a0691a57 88016194b244
88016194b240
[85162.475680] Call Trace:
[85162.475738]  [a06e6a14] ?
lookup_free_space_inode+0x44/0x100 [btrfs]
[85162.475849]  [a0691a57]
btrfs_remove_block_group+0x137/0x740 [btrfs]
[85162.475964]  [a06ca8d2] btrfs_remove_chunk+0x672/0x780 [btrfs]
[85162.476065]  [a06922bf] btrfs_delete_unused_bgs+0x25f/0x280 [btrfs]
[85162.476172]  [a0699e0c] cleaner_kthread+0x12c/0x190 [btrfs]
[85162.476269]  [a0699ce0] ? check_leaf+0x350/0x350 [btrfs]
[85162.476355]  [8108f8d2] kthread+0xd2/0xf0
[85162.476424]  [8108f800] ? kthread_create_on_node+0x180/0x180
[85162.476519]  [8177bcbc] 

BtrFs on drives with error recovery control / TLER?

2015-01-15 Thread Daniel Pocock


Hi,

Can anybody comment on how BtrFs (particularly RAID1 mirroring)
interacts with drives that offer error recovery control (or TLER in WDC
terms)?

I generally prefer to buy this type of drive for any serious data
storage purposes.

I notice ZFS gets a mention in the Wikipedia article about the topic:
http://en.wikipedia.org/wiki/Error_recovery_control

Should BtrFs be mentioned there too?

Regards,

Daniel


Re: price to pay for nocow file bit?

2015-01-15 Thread Chris Mason



On Thu, Jan 8, 2015 at 11:53 AM, Lennart Poettering lenn...@poettering.net wrote:
On Thu, 08.01.15 10:56, Zygo Blaxell (ce3g8...@umail.furryterror.org) wrote:

 On Wed, Jan 07, 2015 at 06:43:15PM +0100, Lennart Poettering wrote:
  Heya!
 
  Currently, systemd-journald's disk access patterns (appending to the
  end of files, then updating a few pointers in the front) result in
  awfully fragmented journal files on btrfs, which has a pretty
  negative effect on performance when accessing them.
 
  Now, to improve things a bit, I yesterday made a change to journald,
  to issue the btrfs defrag ioctl when a journal file is rotated,
  i.e. when we know that no further writes will be ever done on the
  file.
 
  However, I wonder now if I should go one step further even, and use
  the equivalent of chattr +C (i.e. nocow) on all journal files. I am
  wondering what price I would precisely have to pay for
  that. Judging by this earlier thread:
 
  http://www.spinics.net/lists/linux-btrfs/msg33134.html
 
  it's mostly about data integrity, which is something I can live with,
  given the conservative write patterns of journald, and the fact that
  we do our own checksumming and careful data validation. I mean, if
  btrfs in this mode provides no worse data integrity semantics than
  ext4 I am fully fine with losing this feature for these files.

 This sounds to me like a job for fallocate with FALLOC_FL_KEEP_SIZE.


We already use fallocate(), but this is not enough on cow file
systems. With fallocate() you can certainly improve fragmentation when
appending things to a file. But on a COW file system this will help
little if we change things in the beginning of the file, since COW
means that it will then make a copy of those blocks and alter the
copy, but leave the original version unmodified. And if we do that all
the time the files get heavily fragmented, even though all the blocks
we modify have been fallocate()d initially...

 This would work on ext4, xfs, and others, and provide the same benefit
 (or even better) without filesystem-specific code.  journald would
 preallocate a contiguous chunk past the end of the file for appends,
 and


That's precisely what we do. But journald's write pattern is not
purely appending to files, it's append something to the end, then
link it up in the beginning. And for the append part we are
fine with fallocate(). It's the link up part that completely fucks
up fragmentation so far.
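
For reference, the nocow bit discussed in this thread is the attribute
that chattr +C sets, and it can be set from a program with the generic
inode-flags ioctls. The following is a minimal sketch rather than
journald's actual code; set_nocow is a name made up for this example. As
far as I know the attribute should be applied while the file is still
empty, and on btrfs it also turns off data checksumming for that file,
which is the data integrity price mentioned above.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Set the No_COW attribute on an open file, like `chattr +C` does.
 * Returns 0 on success, -1 with errno set on failure (for example when
 * the filesystem does not support inode flags).
 */
static int set_nocow(int fd)
{
	int flags;

	/* Read the current inode flags so the other bits are preserved. */
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
		return -1;

	flags |= FS_NOCOW_FL;

	return ioctl(fd, FS_IOC_SETFLAGS, &flags);
}
```

A caller would typically open the new journal file, call set_nocow(fd)
before writing anything, and treat a failure as non-fatal, since other
filesystems have no use for the flag.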


I think a per-file autodefrag flag would help a lot here.  We've made 
some improvements for autodefrag and slowly growing log files because 
we noticed that compression ratios on slowly growing files really 
weren't very good.  The problem was we'd never have more than a single 
block to compress, so the compression code would give up and write the 
raw data.


compression + autodefrag on the other hand would take 64-128K and recow 
it down, giving very good results.


The second problem we hit was with stable page writes.  If bdflush 
decides to write the last block in the file, it's really a wasted IO 
unless the block is fully filled.  We've been experimenting with a 
patch to leave the last block out of writepages unless it's an 
fsync/O_SYNC.


I'll code up the per-file autodefrag, we've hit a few use cases that 
make sense.


-chris





btrfs command completion Was: [PATCH v2 RESEND] btrfs-progs: make btrfs qgroups show human readable sizes

2015-01-15 Thread Duncan
David Sterba posted on Thu, 15 Jan 2015 13:05:46 +0100 as excerpted:

 A shell completion would be great of course, it's in the project ideas.
 There's a starting point
 http://www.spinics.net/lists/linux-btrfs/msg15899.html .

FWIW, in case anyone is interested...

What I did here is a bit different; shell completion would be better, but 
while I know bash, I don't know bash/shell completion so I couldn't write 
that without learning it.

What I did is a btrfs wrapper script (which I simply called 'b', a 
previously here-unused single letter command) that based on the first 
parameter or two and the number of parameters, decides if what's there 
matches a valid btrfs subcommand and whether it looks complete or not, 
and if it's valid but incomplete, echoes the appropriate btrfs sub help 
command and prompts for more input.

So just 'b' gives me:


b: btrfs helper (common commands only)

btrfs cmd (Just Enter for help):
b(alance) c(heck) d(ev) f(ilesystem) i(nspect) p(roperty)
qg(roup) qu(ota) rec(eive) rep(lace) resc(ue) rest(ore)
sc(rub) se(nd) su(bvolume) v(ersion):

There it waits for further input.

If I then add an 'f', or if I had typed 'b f' (or 'b fi' or
'b filesystem', the script checks status of btrfs cmd --help to see if 
it's valid or not), I get this:

f

btrfs filesystem action (Just Enter for help):
de(frag) df l(abel) r(esize) sh(ow) sy(nc) u(sage):

Further input.

If I then add 'df' (or if I had typed 'b fi df'), I get:

df

usage: btrfs filesystem df [options] <path>

    Show space usage information for a mount point

    -b|--raw           raw numbers in bytes
    -h                 human friendly numbers, base 1024 (default)
    -H                 human friendly numbers, base 1000
    --iec              use 1024 as a base (KiB, MiB, GiB, TiB)
    --si               use 1000 as a base (kB, MB, GB, TB)
    -k|--kbytes        show sizes in KiB, or kB with --si
    -m|--mbytes        show sizes in MiB, or MB with --si
    -g|--gbytes        show sizes in GiB, or GB with --si
    -t|--tbytes        show sizes in TiB, or TB with --si


btrfs filesystem df usage is printed above.
Please enter additional parameters here:

Further input.

At this point it doesn't check further input, instead simply echoing back 
what will be the final command and prompting whether to run it or not.  
If I simply type '/'...

/

btrfs filesystem df /
Final check: OK to run above command (y/N)?


Note the default to N...

If at that point I simply hit Enter (or anything else besides y/Y), 
of course it doesn't run the command, it simply exits.  However, the 
built command was printed above, making it simple enough to select/paste 
(assuming gpm or a terminal window in X, thus mouse selection).

If I hit 'y', it executes the command.

Meanwhile, if there's more than two parameters and the first two 
validate, the script assumes the user knows what they are doing and 
simply executes it as-is.

b fi df /

b: btrfs helper (common commands only)

Data, RAID1: total=3.00GiB, used=1.74GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=768.00MiB, used=299.89MiB
GlobalReserve, single: total=48.00MiB, used=0.00B

b fi df invalid

b: btrfs helper (common commands only)

ERROR: can't access 'invalid'

b fxxx df /

This one's interesting.  The script checks the status of
btrfs fxxx --help and determines that btrfs doesn't consider the fxxx 
valid, so it simply prints the full btrfs --help output, running it thru 
$PAGER (if unset, less if it's executable, else no pager) due to length.


Anyone interested in a script such as this?  Absent a bash completion 
script, I found it /tremendously/ helpful with btrfs command basics, and 
because it prompts with the final command before execution, it helps in 
learning the commands as well.  I still use it for commands I don't use 
frequently enough to have memorized.  While it's a bit of a hack, thus my 
not posting it previously, if anyone else would find such a script 
useful, I could post it.

I don't have hosting for it but I suppose it could go on the wiki if 
enough other folks find it useful.  It's 125 lines including comments ATM.
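
Since the script itself is not posted here, the following is only a sketch
of the abbreviation matching that the prompts above imply (f for
filesystem, resc for rescue, and so on): accept a prefix only when it
selects exactly one subcommand. It is written in C purely for
illustration; the subcommand list is the set of btrfs command groups
shown in the first prompt, and resolve_subcommand is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Command groups as listed in the helper's first prompt. */
static const char *const subcommands[] = {
	"balance", "check", "device", "filesystem", "inspect-internal",
	"property", "qgroup", "quota", "receive", "replace", "rescue",
	"restore", "scrub", "send", "subvolume", "version",
};

/*
 * Return the single subcommand that starts with `abbr`, or NULL when the
 * abbreviation is empty, unknown, or matches more than one subcommand.
 */
static const char *resolve_subcommand(const char *abbr)
{
	size_t len = strlen(abbr);
	const char *match = NULL;

	if (len == 0)
		return NULL;

	for (size_t i = 0; i < sizeof(subcommands) / sizeof(subcommands[0]); i++) {
		if (strncmp(subcommands[i], abbr, len) != 0)
			continue;
		if (match)
			return NULL;	/* second hit: ambiguous */
		match = subcommands[i];
	}
	return match;
}
```

With this list, "re" is rejected as ambiguous (receive, replace, rescue,
restore), which matches the three- and four-letter minimum abbreviations
shown in the prompt.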

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman


filesystem corruption/unable to scrub

2015-01-15 Thread Pavol Cupka
Hello,

I am having trouble with my btrfs setup. An unwanted reset probably
caused the corruption. I can mount the filesystem, but cannot perform
scrub as this ends with GPF.

uname -a
Linux sysresccd 3.14.24-alt441-amd64 #2 SMP Sun Nov 16 08:27:16 UTC
2014 x86_64 AMD Phenom(tm) II X4 965 Processor AuthenticAMD GNU/Linux

btrfs --version
Btrfs v3.17.1

btrfs fi show
Label: 'suc_storage'  uuid: 76bf605a-936b-4fce-8a74-1eb2c750f51c
	Total devices 2 FS bytes used 2.57TiB
	devid    1 size 3.64TiB used 3.63TiB path /dev/sda1
	devid    2 size 3.64TiB used 3.63TiB path /dev/sdb1

btrfs fi df /home # Replace /home with the mount point of your btrfs-filesystem
Data, RAID1: total=3.63TiB, used=2.56TiB
System, RAID1: total=32.00MiB, used=532.00KiB
Metadata, RAID1: total=7.00GiB, used=5.83GiB

dmesg > dmesg.log

[14408.258515] general protection fault:  [#1] SMP
[14408.258520] Modules linked in: ppdev microcode parport_pc parport
acpi_cpufreq serio_raw edac_core edac_mce_amd k10temp sp5100_tco
i2c_piix4 shpchp raid10 raid456 async_raid6_recov async_pq async_xor
async_memcpy async_tx raid1 raid0 multipath linear ata_generic
pata_acpi usb_storage nouveau firewire_ohci firewire_core pata_atiixp
ttm drm_kms_helper drm r8169 mii i2c_algo_bit i2c_core mxm_wmi video
wmi
[14408.258543] CPU: 3 PID: 3100 Comm: btrfs-scrub-2 Not tainted
3.14.24-alt441-amd64 #2
[14408.258546] Hardware name: Gigabyte Technology Co., Ltd.
GA-MA785GT-UD3H/GA-MA785GT-UD3H, BIOS F8 05/25/2010
[14408.258549] task: 88007158a600 ti: 88011b9e4000 task.ti:
88011b9e4000
[14408.258551] RIP: 0010:[81423e91]  [81423e91]
scrub_bio_end_io_worker+0xb6/0x5fb
[14408.258558] RSP: :88011b9e5d28  EFLAGS: 00010287
[14408.258560] RAX: 8800779ace00 RBX: fffe880118114e40 RCX: 8800b79dd800
[14408.258562] RDX: 8800b79dd8a8 RSI: 0001 RDI: 0014
[14408.258564] RBP: 88011b9e5df8 R08: ea0004604518 R09: 88011b9e5ce8
[14408.258566] R10: 81420d8a R11: 88010001 R12: 8800779ac100
[14408.258568] R13:  R14: 8800779ac100 R15: 
[14408.258570] FS:  () GS:88013fcc()
knlGS:f75ccb40
[14408.258572] CS:  0010 DS:  ES:  CR0: 8005003b
[14408.258574] CR2: f77c6000 CR3: 3ee94000 CR4: 07e0
[14408.258575] Stack:
[14408.258577]  88011b9e5d88 88011b9e5da8 880139ab7200
8801
[14408.258581]  880118114f00 8800b79dd940 8800b7fa1800
000100d8f612
[14408.258583]  8800b79dd8a8 00158107d8ca 8800b79dd800
8800b7fa1800
[14408.258586] Call Trace:
[14408.258592]  [817b73c1] ? schedule_timeout+0xa1/0xbd
[14408.258596]  [8107d3aa] ? lock_timer_base+0x4d/0x4d
[14408.258600]  [81401d95] worker_loop+0x194/0x527
[14408.258604]  [81401c01] ? btrfs_queue_worker+0x239/0x239
[14408.258607]  [8108eb43] kthread+0xc9/0xd1
[14408.258611]  [8108ea7a] ? kthread_freezable_should_stop+0x60/0x60
[14408.258614]  [817c11cc] ret_from_fork+0x7c/0xb0
[14408.258617]  [8108ea7a] ? kthread_freezable_should_stop+0x60/0x60
[14408.258618] Code: 02 48 8b 09 80 a1 98 00 00 00 fb 48 8b 4d 80 48
83 c2 08 3b 81 38 01 00 00 7c dc eb 9c 48 8b 95 70 ff ff ff 48 8b 42
38 48 8b 18 f0 ff 8b 84 00 00 00 0f 94 c0 84 c0 0f 84 2c 04 00 00 f6
83 98
[14408.258639] RIP  [81423e91] scrub_bio_end_io_worker+0xb6/0x5fb
[14408.258642]  RSP 88011b9e5d28
[14408.258645] ---[ end trace 168b2c0c1e0d1fcc ]---

Running btrfsck with --repair also ends with errors.
...
Device extent[2, 1522566955008, 1073741824] didn't find its device.
Device extent[2, 1523640696832, 1073741824] didn't find its device.
Device extent[2, 1524714438656, 1073741824] didn't find its device.
Device extent[2, 1525788180480, 1073741824] didn't find its device.
Device extent[2, 1526861922304, 1073741824] didn't find its device.
Device extent[2, 1527935664128, 1073741824] didn't find its device.
Errors found in extent allocation tree or chunk allocation
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
bad key ordering 1 2
Deleting bad dir index [618631,96,57124] root 270
volumes.c:978: btrfs_alloc_chunk: Assertion `ret` failed.
btrfs check[0x8083f7d]
btrfs check[0x8087624]
btrfs check[0x807d132]
btrfs check[0x807d4fc]
btrfs check[0x807dfc4]
btrfs check[0x80710d7]
btrfs check[0x8071774]
btrfs check[0x80737c0]
btrfs check[0x808060a]
btrfs check[0x805f56d]
btrfs check[0x8061f82]
btrfs check[0x8067797]
btrfs check[0x804afd9]
/lib/libc.so.6(__libc_start_main+0xe6)[0xf7619346]
btrfs check[0x804ac31]

Could you please help me correct the current state?

Thank you in advance
Pavol

Re: BtrFs on drives with error recovery control / TLER?

2015-01-15 Thread Duncan
Daniel Pocock posted on Thu, 15 Jan 2015 20:54:10 +0100 as excerpted:

 Can anybody comment on how BtrFs (particularly RAID1 mirroring)
 interacts with drives that offer error recovery control (or TLER in WDC
 terms)?
 
 I generally prefer to buy this type of drive for any serious data
 storage purposes
 
 I notice ZFS gets a mention in the Wikipedia article about the topic:
 http://en.wikipedia.org/wiki/Error_recovery_control
 
 Should BtrFs be mentioned there too?

I make no claims to being an expert in this area and others with more 
expertise will likely be along shortly.  However...

In general you have a valid worry, and the recommendation is as with 
other raid technology: if possible, set your device to a recovery time 
under 30 seconds, as that's the default Linux SCSI level link reset time. 
If the device takes longer than that, the reset short-circuits the 
device's own recovery and the bad sector doesn't get marked as such and 
remapped to a reserve sector on the device.

On consumer level devices where setting the device recovery time isn't 
possible, the hard-wired recovery time can be near two minutes, so the 
recommendation is to set the Linux SCSI level link reset time to 120 
seconds or so, thus allowing the hardware device to timeout first so it 
can again recognize the bad sector and do its remapping thing.

In general, this recommendation should apply to all Linux-kernel-based 
soft-raid technologies (including btrfs, mdraid, dmraid...) where the 
raid redundancy can fill in the missing data so letting it fail and 
potentially trigger a remap is the best strategy.

OTOH, the shorter time wouldn't be recommended (tho a longer SCSI reset 
time well could be) for a single-device btrfs or a multi-device btrfs in 
raid0 or single mode, because in those cases, the assumption is that 
there are no other copies of the data, so letting the device take up to 
two minutes to try to retrieve that data in the hope that the extra tries 
will finally be successful, can very possibly save that data... of course 
at the cost of a system that goes unresponsive for up to two minutes at a 
time, which clearly isn't going to work if it's happening frequently.
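
For anyone wanting to script the kernel-side half of this, the per-device
SCSI command timeout is exposed in sysfs as
/sys/block/<disk>/device/timeout (in seconds, normally 30). Below is a
minimal sketch of raising it, with made-up helper names scsi_timeout_path
and write_timeout; it assumes root privileges and a disk name such as sda,
and it does not touch the drive's own recovery limit, which is set with
drive tools instead.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "/sys/block/<disk>/device/timeout" into buf.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int scsi_timeout_path(char *buf, size_t size, const char *disk)
{
	int n = snprintf(buf, size, "/sys/block/%s/device/timeout", disk);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Write a timeout in seconds to the given sysfs file.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int write_timeout(const char *path, unsigned int seconds)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%u\n", seconds) > 0;
	if (fclose(f) != 0)	/* sysfs reports write errors at flush time */
		ok = 0;
	return ok ? 0 : -1;
}
```

For a consumer drive without configurable recovery time, a caller would
build the path for the disk and then call write_timeout(path, 120),
following the "120 seconds or so" suggestion above. The setting does not
persist across reboots, so it would have to be reapplied at boot.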

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: btrfs_inode_item's otime?

2015-01-15 Thread Chris Samuel
On 15/01/15 21:48, David Sterba wrote:

 Chandan, please drop the btrfs_inode_otime helper and resend. Thanks.

Thanks!

Sorry I'd had no further time to look at this, I've been fully committed
with $DAY_JOB and on a number of projects with our local community
observatory (if anyone is in/visiting Melbourne and into astronomy ping
me for details).

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC


Re: Kernel bug in 3.19-rc4

2015-01-15 Thread Satoru Takeuchi

Hi,

On 2015/01/16 10:05, Tomasz Chmielewski wrote:

I just started some btrfs stress testing on latest linux kernel 3.19-rc4:
A few hours later, filesystem stopped working - the kernel bug report
can be found below.


Hi,

your kernel BUG at fs/btrfs/inode.c:3142! from 3.19-rc4 corresponds to 
http://marc.info/?l=linux-btrfs&m=141903172106342&w=2 - it was kernel BUG at 
/home/apw/COD/linux/fs/btrfs/inode.c:3123! in 3.18.1, and is exactly the same code in both cases:


        /* grab metadata reservation from transaction handle */
        if (reserve) {
                ret = btrfs_orphan_reserve_metadata(trans, inode);
                BUG_ON(ret); /* -ENOSPC in reservation; Logic error? JDM */
        }


Yeah, it's the same.

BTW, I've tried to reproduce this problem the way
you told me. However, I haven't been able to reproduce it yet.

Thanks,
Satoru



Re: Kernel bug in 3.19-rc4

2015-01-15 Thread Satoru Takeuchi

Hi Marcel,

On 2015/01/16 4:46, Marcel Ritter wrote:

Hi,

I just started some btrfs stress testing on latest linux kernel 3.19-rc4:
A few hours later, filesystem stopped working - the kernel bug report
can be found below.

The test consists of one massive IO thread (writing 100GB files with dd),
and 2 tar instances extracting kernel sources and deleting them afterwards
(I can provide the simple bash script doing this, if needed).


Could you give me this script?

Thanks,
Satoru




Re: Kernel bug in 3.19-rc4

2015-01-15 Thread Tomasz Chmielewski
I just started some btrfs stress testing on latest linux kernel 3.19-rc4:
A few hours later, filesystem stopped working - the kernel bug report
can be found below.


Hi,

your kernel BUG at fs/btrfs/inode.c:3142! from 3.19-rc4 corresponds to 
http://marc.info/?l=linux-btrfs&m=141903172106342&w=2 - it was kernel 
BUG at /home/apw/COD/linux/fs/btrfs/inode.c:3123! in 3.18.1, and is 
exactly the same code in both cases:


        /* grab metadata reservation from transaction handle */
        if (reserve) {
                ret = btrfs_orphan_reserve_metadata(trans, inode);
                BUG_ON(ret); /* -ENOSPC in reservation; Logic error? JDM */
        }


--
Tomasz Chmielewski
http://www.sslrack.com



[PATCH] fstest: btrfs/006: Add extra check on return value and 'fi show' by device

2015-01-15 Thread Qu Wenruo
Reported in Red Hat BZ#1181627: 'btrfs fi show' on an unmounted device
returns 1 even when no error occurs.

Introduced by: commit 2513077f
btrfs-progs: fix device missing of btrfs fi show with seed devices

Patch fixing it:
https://patchwork.kernel.org/patch/5626001/
btrfs-progs: Fix wrong return value when executing 'fi show' on
umounted device.

Reported-by: Vratislav Podzimek vpodz...@redhat.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 tests/btrfs/006 | 51 ---
 tests/btrfs/006.out | 10 ++
 2 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/tests/btrfs/006 b/tests/btrfs/006
index 715fd80..2d8c1c0 100755
--- a/tests/btrfs/006
+++ b/tests/btrfs/006
@@ -62,33 +62,70 @@ _scratch_pool_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
 
 # These have to be done unmounted...?
 echo "== Set filesystem label to $LABEL"
-$BTRFS_UTIL_PROG filesystem label $SCRATCH_DEV $LABEL
+$BTRFS_UTIL_PROG filesystem label $SCRATCH_DEV $LABEL || \
+   _fail "set lable failed"
 echo "== Get filesystem label"
-$BTRFS_UTIL_PROG filesystem label $SCRATCH_DEV
+$BTRFS_UTIL_PROG filesystem label $SCRATCH_DEV || \
+   _fail "get lable failed"
+
+echo "== Show filesystem by device(offline)"
+$BTRFS_UTIL_PROG filesystem show $FIRST_POOL_DEV | \
+   _filter_btrfs_filesystem_show $TOTAL_DEVS $UUID
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+   _fail "show filesystem by device(offline) return value wrong"
 
 echo "== Mount."
 _scratch_mount
 
 echo "== Show filesystem by label"
-$BTRFS_UTIL_PROG filesystem show $LABEL | _filter_btrfs_filesystem_show $TOTAL_DEVS
+$BTRFS_UTIL_PROG filesystem show $LABEL | \
+   _filter_btrfs_filesystem_show $TOTAL_DEVS
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+   _fail "show filesystem by lable return value wrong"
 UUID=`$BTRFS_UTIL_PROG filesystem show $LABEL | grep uuid: | awk '{print $NF}'`
 
 echo "UUID $UUID" >> $seqres.full
 
 echo "== Show filesystem by UUID"
-$BTRFS_UTIL_PROG filesystem show $UUID | _filter_btrfs_filesystem_show $TOTAL_DEVS $UUID
+$BTRFS_UTIL_PROG filesystem show $UUID | \
+   _filter_btrfs_filesystem_show $TOTAL_DEVS $UUID
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+   _fail "show filesystem by UUID return value wrong"
+
+echo "== Show filesystem by device(online)"
+$BTRFS_UTIL_PROG filesystem show $FIRST_POOL_DEV | \
+   _filter_btrfs_filesystem_show $TOTAL_DEVS $UUID
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+   _fail "show filesystem by UUID return value wrong"
 
 echo "== Sync filesystem"
 $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+   _fail "sync filesystem failed"
+
 
 echo "== Show device stats by mountpoint"
-$BTRFS_UTIL_PROG device stats $SCRATCH_MNT | _filter_btrfs_device_stats $TOTAL_DEVS
+$BTRFS_UTIL_PROG device stats $SCRATCH_MNT | \
+   _filter_btrfs_device_stats $TOTAL_DEVS
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+   _fail "show device status return value wrong"
+
 echo "== Show device stats by first/scratch dev"
 $BTRFS_UTIL_PROG device stats $SCRATCH_DEV | _filter_btrfs_device_stats
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+   _fail "show device status return value wrong"
+
 echo "== Show device stats by second dev"
-$BTRFS_UTIL_PROG device stats $FIRST_POOL_DEV | sed -e "s,$FIRST_POOL_DEV,FIRST_POOL_DEV,g"
+$BTRFS_UTIL_PROG device stats $FIRST_POOL_DEV | \
+   sed -e "s,$FIRST_POOL_DEV,FIRST_POOL_DEV,g"
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+   _fail "show device status return value wrong"
+
 echo "== Show device stats by last dev"
-$BTRFS_UTIL_PROG device stats $LAST_POOL_DEV | sed -e "s,$LAST_POOL_DEV,LAST_POOL_DEV,g"
+$BTRFS_UTIL_PROG device stats $LAST_POOL_DEV | \
+   sed -e "s,$LAST_POOL_DEV,LAST_POOL_DEV,g"
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+   _fail "show device status return value wrong"
 
 # success, all done
 status=0
diff --git a/tests/btrfs/006.out b/tests/btrfs/006.out
index 22bcb77..497de67 100644
--- a/tests/btrfs/006.out
+++ b/tests/btrfs/006.out
@@ -2,6 +2,11 @@
 == Set filesystem label to TestLabel.006
 == Get filesystem label
 TestLabel.006
+== Show filesystem by device(offline)
+Label: 'TestLabel.006'  uuid: UUID
+   Total devices EXACTNUM FS bytes used SIZE
+   devid DEVID size SIZE used SIZE path SCRATCH_DEV
+
 == Mount.
 == Show filesystem by label
 Label: 'TestLabel.006'  uuid: UUID
@@ -13,6 +18,11 @@ Label: 'TestLabel.006'  uuid: EXACTUUID
Total devices EXACTNUM FS bytes used SIZE
devid DEVID size SIZE used SIZE path SCRATCH_DEV
 
+== Show filesystem by device(online)
+Label: 'TestLabel.006'  uuid: EXACTUUID
+   Total devices EXACTNUM FS bytes used SIZE
+   devid DEVID size SIZE used SIZE path SCRATCH_DEV
+
 == Sync filesystem
 FSSync 'SCRATCH_MNT'
 == Show device stats by mountpoint
-- 
2.2.2



Re: btrfs scrub status reports not running when it is

2015-01-15 Thread David Sterba
On Wed, Jan 14, 2015 at 02:27:17PM -0800, Zach Brown wrote:
 On Wed, Jan 14, 2015 at 04:06:02PM -0500, Sandy McArthur Jr wrote:
  Sometimes btrfs scrub status reports that it is not running when it still is.
  
  I think this is a cosmetic bug. And I believe this is related to the
  scrub completing on some drives before others in a multi-drive btrfs
  filesystem that is not well balanced.
 
 Boy, I don't really know this code, but it looks like:
 
 if (ss->in_progress)
   printf(", running for %llu seconds\n", ss->duration);
 else
   printf(", interrupted after %llu seconds, not running\n",
          ss->duration);
 
 in_progress = is_scrub_running_in_kernel(fdmnt, di_args, fi_args.num_devices);
 
 static int is_scrub_running_in_kernel(int fd,
         struct btrfs_ioctl_dev_info_args *di_args, u64 max_devices)
 {
         struct scrub_progress sp;
         int i;
         int ret;
 
         for (i = 0; i < max_devices; i++) {
                 memset(&sp, 0, sizeof(sp));
                 sp.scrub_args.devid = di_args[i].devid;
                 ret = ioctl(fd, BTRFS_IOC_SCRUB_PROGRESS, &sp.scrub_args);
                 if (ret < 0 && errno == ENODEV)
                         continue;
                 if (ret < 0 && errno == ENOTCONN)
                         return 0;
 
 It says that scrub isn't running if any devices have completed.  If you drop
 all those ret < 0 conditional branches that are either noops or wrong, does it
 work like you'd expect?

Why wrong? The ioctl callback returns -ENODEV or -ENOTCONN that get
translated to the errno values and ioctl(...) returns -1 in both cases.
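
Independent of whether the ret < 0 checks are needed, the behaviour
discussed above (reporting "not running" as soon as one device has finished)
can be contrasted with a loop that checks every device. The following is only
a minimal userspace sketch, not the btrfs-progs code: sketch_probe_fn and
sketch_probe_example are hypothetical stand-ins for the per-device
BTRFS_IOC_SCRUB_PROGRESS ioctl, and the "keep looking after ENOTCONN" policy
is an assumption about what a fix might look like.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the per-device progress ioctl: returns 0 if a
 * scrub is running on the device, or -1 with errno set (ENODEV: device is
 * missing, ENOTCONN: no scrub is running on that device).
 */
typedef int (*sketch_probe_fn)(size_t dev_index);

/* Example probe: device 0 has finished its scrub, device 1 is still running. */
static int sketch_probe_example(size_t dev_index)
{
	if (dev_index == 1)
		return 0;
	errno = ENOTCONN;
	return -1;
}

/*
 * Return 1 if any device still reports a running scrub, 0 if none does,
 * and -1 on an unexpected error. Unlike the quoted loop, a finished device
 * (ENOTCONN) does not end the search.
 */
static int sketch_any_scrub_running(sketch_probe_fn probe, size_t num_devices)
{
	size_t i;

	for (i = 0; i < num_devices; i++) {
		if (probe(i) == 0)
			return 1;
		if (errno != ENODEV && errno != ENOTCONN)
			return -1;
	}
	return 0;
}
```

With the example probe, checking two devices reports a running scrub even
though the first device has already finished.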


Re: [PATCH] fstests: fix test btrfs/017 (qgroup shared extent accounting test)

2015-01-15 Thread David Sterba
On Wed, Jan 14, 2015 at 11:21:43PM +, Filipe Manana wrote:
 Currently this test fails on 2 situations:
 
 1) The scratch device supports trim/discard. In this case any modern
version of mkfs.btrfs outputs a message (to stderr) informing that
a trim is performed, which the golden output doesn't expect:
 
btrfs/017   - output mismatch (see 
 /git/xfstests/results//btrfs/017.out.bad)
--- tests/btrfs/017.out2015-01-06 11:14:22.730143144 +
+++ /git/xfstests/results//btrfs/017.out.bad   2015-01-14 
 22:33:01.582195719 +
@@ -1,4 +1,5 @@
 QA output created by 017
+Performing full device TRIM (100.00GiB) ...
 wrote 8192/8192 bytes at offset 0
   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 4096 4096
 ...
 (Run 'diff -u tests/btrfs/017.out 
 /git/xfstests/results//btrfs/017.out.bad'  to see the entire diff)
 
 So, like other tests do, just redirect mkfs' standard error.
 
 2) On platforms with a page size greater than 4Kb. At the moment btrfs
doesn't support a node/leaf size smaller than the page size, but it
supports a larger one. So use the max supported node size (64Kb) so
that the test runs on any platform currently supported by Linux.
 
 Signed-off-by: Filipe Manana fdman...@suse.com
Reviewed-by: David Sterba dste...@suse.cz


Re: btrfs_inode_item's otime?

2015-01-15 Thread David Sterba
On Fri, Jan 09, 2015 at 05:11:42PM +0100, David Sterba wrote:
  --- a/fs/btrfs/inode.c
  +++ b/fs/btrfs/inode.c
  @@ -5835,6 +5835,11 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
   sizeof(*inode_item));
  fill_inode_item(trans, path->nodes[0], inode_item, inode);
   
  +   /*
  +* Set the creation time on the inode.
  +*/
  +   btrfs_set_stack_timespec_sec( inode.otime, cur_time.tv_sec );
 
 Drop the spaces after/before parens and also set nsec the same way.
 There's no such thing as 'current_time', only CURRENT_TIME but that
 cannot be used directly as a structure.
 
 Given that the mtime is set a few lines above, copy the tv_sec and
 tv_nsec from there.

chandan pointed out on IRC the other day that he'd sent a patch for that
already

http://www.mail-archive.com/linux-btrfs%40vger.kernel.org/msg17508.html

Though the patch cannot be applied as-is, it's more complete (I've
missed a few places where the otime has to be set).

Chandan, please drop the btrfs_inode_otime helper and resend. Thanks.


[PATCH v2 1/2 RESEND] btrfs: remove empty fs_devices to prevent memory runout

2015-01-15 Thread Gui Hecheng
There is a global list @fs_uuids that keeps an @fs_devices object
for each created btrfs. But when a btrfs becomes empty
(all devices belonging to it are gone), its @fs_devices remains
in the @fs_uuids list until module exit.
If we keep running mkfs.btrfs on the same device again and again,
the empty @fs_devices objects produced will eventually eat up our memory.
So this case is better prevented.

I think that each time we set up btrfs on a device, we should
check whether we are stealing a device from another btrfs
seen before. To facilitate the search, we could insert
all @btrfs_device objects into an rb_root, one @btrfs_device per physical
device, with @bdev->bd_dev as the key. Each time device stealing happens,
we should replace the corresponding @btrfs_device in the rb_root with
an up-to-date version.
If the stolen device is the last device in its @fs_devices,
then we have an empty btrfs to be deleted.
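
The insert-or-replace lookup described above can be sketched in userspace as
follows. This is only an illustration under stated assumptions: it uses a
plain unbalanced binary search tree instead of the kernel rbtree, and
struct sketch_dev and sketch_insert_dev are hypothetical names, not part of
the patch.

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for struct btrfs_device. */
struct sketch_dev {
	unsigned long devnum;		/* key, plays the role of bdev->bd_dev */
	struct sketch_dev *left;
	struct sketch_dev *right;
};

/*
 * Insert new_dev into the tree rooted at *root, keyed by devnum.
 * If an entry with the same devnum already exists, new_dev takes its place
 * in the tree and the old entry is returned, so the caller can remove it
 * from its old filesystem. Returns NULL if no entry was replaced.
 */
static struct sketch_dev *sketch_insert_dev(struct sketch_dev **root,
					    struct sketch_dev *new_dev)
{
	struct sketch_dev **p = root;

	new_dev->left = NULL;
	new_dev->right = NULL;

	while (*p) {
		struct sketch_dev *old = *p;

		if (new_dev->devnum < old->devnum) {
			p = &old->left;
		} else if (new_dev->devnum > old->devnum) {
			p = &old->right;
		} else {
			/* same physical device: swap the node in place */
			new_dev->left = old->left;
			new_dev->right = old->right;
			*p = new_dev;
			old->left = NULL;
			old->right = NULL;
			return old;
		}
	}

	*p = new_dev;
	return NULL;
}
```

A caller that gets a non-NULL return value would then check whether the
returned device was the last one in its old filesystem.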

Actually there are 3 ways to steal devices and end up with an empty btrfs:
1. mkfs, with -f option
2. device add, with -f option
3. device replace, with -f option
We should handle these cases.

Moreover, there are special cases to consider:
o If there are seed devices, then it is assured that
  the devices in cloned @fs_devices are not treated as valid devices.
o If a device disappears and reappears without being touched, its
  @bdev->bd_dev may change, so we have to re-insert it into the rb_root.

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
changelog
v1->v2: add handling for the device disappears-and-reappears event

*Note*
This only handles the case where a device disappears and
reappears without being touched.
We are going to recycle all dead btrfs_device objects in another patch.
Two events lead to dead devices:
1) a device disappears and never returns
2) a device disappears and returns with a new fs on it
A shrinker shall kill the dead ones.
---
 fs/btrfs/super.c   |   1 +
 fs/btrfs/volumes.c | 281 ++---
 fs/btrfs/volumes.h |   6 ++
 3 files changed, 230 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 60f7cbe..001cba5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2184,6 +2184,7 @@ static void __exit exit_btrfs_fs(void)
btrfs_end_io_wq_exit();
unregister_filesystem(&btrfs_fs_type);
btrfs_exit_sysfs();
+   btrfs_cleanup_valid_dev_root();
btrfs_cleanup_fs_uuids();
btrfs_exit_compress();
btrfs_hash_exit();
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0144790..228a7e0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -27,6 +27,7 @@
 #include <linux/kthread.h>
 #include <linux/raid/pq.h>
 #include <linux/semaphore.h>
+#include <linux/rbtree.h>
 #include <asm/div64.h>
 #include "ctree.h"
 #include "extent_map.h"
@@ -52,6 +53,126 @@ static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
 
 DEFINE_MUTEX(uuid_mutex);
 static LIST_HEAD(fs_uuids);
+static struct rb_root valid_dev_root = RB_ROOT;
+
+static struct btrfs_device *insert_valid_device(struct btrfs_device *new_dev)
+{
+	struct rb_node **p;
+	struct rb_node *parent;
+	struct rb_node *new;
+	struct btrfs_device *old_dev;
+
+	WARN_ON(!mutex_is_locked(&uuid_mutex));
+
+	parent = NULL;
+	new = &new_dev->rb_node;
+
+	p = &valid_dev_root.rb_node;
+	while (*p) {
+		parent = *p;
+		old_dev = rb_entry(parent, struct btrfs_device, rb_node);
+
+		if (new_dev->devnum < old_dev->devnum)
+			p = &parent->rb_left;
+		else if (new_dev->devnum > old_dev->devnum)
+			p = &parent->rb_right;
+		else {
+			rb_replace_node(parent, new, &valid_dev_root);
+			RB_CLEAR_NODE(parent);
+
+			goto out;
+		}
+	}
+
+	old_dev = NULL;
+	rb_link_node(new, parent, p);
+	rb_insert_color(new, &valid_dev_root);
+
+out:
+	return old_dev;
+}
+
+static void free_fs_devices(struct btrfs_fs_devices *fs_devices)
+{
+	struct btrfs_device *device;
+	WARN_ON(fs_devices->opened);
+	while (!list_empty(&fs_devices->devices)) {
+		device = list_entry(fs_devices->devices.next,
+				    struct btrfs_device, dev_list);
+		list_del(&device->dev_list);
+		rcu_string_free(device->name);
+		kfree(device);
+	}
+	kfree(fs_devices);
+}
+
+static void remove_empty_fs_if_need(struct btrfs_fs_devices *old_fs)
+{
+	struct btrfs_fs_devices *seed_fs;
+
+	if (!list_empty(&old_fs->devices))
+		return;
+
+	list_del(&old_fs->list);
+
+	/* free the seed clones */
+	seed_fs = old_fs->seed;
+	free_fs_devices(old_fs);
+	while (seed_fs) {
+		old_fs = seed_fs;
+   

[PATCH 2/2 RESEND] btrfs: introduce shrinker for rb_tree that keeps valid btrfs_devices

2015-01-15 Thread Gui Hecheng
The following patch:
btrfs: remove empty fs_devices to prevent memory runout

introduces @valid_dev_root, which records the @btrfs_device objects that
have corresponding block devices with btrfs.
But if a block device is broken or unplugged, no one tells
@valid_dev_root to clean up the dead objects.

To recycle the memory occupied by those dead objects, we could rely on
the shrinker. The shrinker's scan function will traverse
@valid_dev_root and try to open the devices one by one; if that fails
or it encounters a non-btrfs device, it will remove the dead @btrfs_device.

A special case to deal with is a block device that is unplugged and
replugged, so that it appears with a new @bdev->bd_dev as devnum.
In this case, we should remove the older object, since we should already
have a new one for that block device.
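
The scan step described above can be pictured with a small userspace sketch.
It is not the patch's code: struct sketch_entry, its opens_as_btrfs flag
(standing in for "the device could be opened and still holds btrfs") and
sketch_scan_dead are hypothetical, and a flat array replaces the rb-tree
walk for brevity.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one tracked device. */
struct sketch_entry {
	unsigned long devnum;
	bool opens_as_btrfs;	/* result of the (simulated) open attempt */
};

/*
 * Walk all tracked entries, keep the ones that still open as btrfs and
 * drop the rest. The array is compacted in place, *count is updated, and
 * the number of removed entries is returned, which is what a shrinker's
 * scan callback would report as freed objects.
 */
static size_t sketch_scan_dead(struct sketch_entry *entries, size_t *count)
{
	size_t kept = 0;
	size_t freed = 0;
	size_t i;

	for (i = 0; i < *count; i++) {
		if (entries[i].opens_as_btrfs)
			entries[kept++] = entries[i];
		else
			freed++;
	}

	*count = kept;
	return freed;
}
```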

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
 fs/btrfs/super.c   | 10 
 fs/btrfs/volumes.c | 74 +-
 fs/btrfs/volumes.h |  4 +++
 3 files changed, 87 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 001cba5..022381e 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2017,6 +2017,12 @@ static struct miscdevice btrfs_misc = {
.fops   = &btrfs_ctl_fops
 };
 
+static struct shrinker btrfs_valid_dev_shrinker = {
+   .scan_objects = btrfs_valid_dev_scan,
+   .count_objects = btrfs_valid_dev_count,
+   .seeks = DEFAULT_SEEKS,
+};
+
 MODULE_ALIAS_MISCDEV(BTRFS_MINOR);
 MODULE_ALIAS("devname:btrfs-control");
 
@@ -2130,6 +2136,8 @@ static int __init init_btrfs_fs(void)
 
btrfs_init_lockdep();
 
+   register_shrinker(&btrfs_valid_dev_shrinker);
+
btrfs_print_info();
 
err = btrfs_run_sanity_tests();
@@ -2143,6 +2151,7 @@ static int __init init_btrfs_fs(void)
return 0;
 
 unregister_ioctl:
+   unregister_shrinker(&btrfs_valid_dev_shrinker);
btrfs_interface_exit();
 free_end_io_wq:
btrfs_end_io_wq_exit();
@@ -2183,6 +2192,7 @@ static void __exit exit_btrfs_fs(void)
btrfs_interface_exit();
btrfs_end_io_wq_exit();
unregister_filesystem(&btrfs_fs_type);
+   unregister_shrinker(&btrfs_valid_dev_shrinker);
btrfs_exit_sysfs();
btrfs_cleanup_valid_dev_root();
btrfs_cleanup_fs_uuids();
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 228a7e0..5462557 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -54,6 +54,7 @@ static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
 DEFINE_MUTEX(uuid_mutex);
 static LIST_HEAD(fs_uuids);
 static struct rb_root valid_dev_root = RB_ROOT;
+static atomic_long_t unopened_dev_count = ATOMIC_LONG_INIT(0);
 
 static struct btrfs_device *insert_valid_device(struct btrfs_device *new_dev)
 {
@@ -130,6 +131,8 @@ static void free_invalid_device(struct btrfs_device *invalid_dev)
 {
    struct btrfs_fs_devices *old_fs;
 
+   atomic_long_dec(&unopened_dev_count);
+
    old_fs = invalid_dev->fs_devices;
    mutex_lock(&old_fs->device_list_mutex);
    list_del(&invalid_dev->dev_list);
@@ -605,6 +608,7 @@ static noinline int device_list_add(const char *path,
    list_add_rcu(&device->dev_list, &fs_devices->devices);
    fs_devices->num_devices++;
    mutex_unlock(&fs_devices->device_list_mutex);
+   atomic_long_inc(&unopened_dev_count);
 
    ret = 1;
    device->fs_devices = fs_devices;
@@ -778,6 +782,7 @@ again:
    blkdev_put(device->bdev, device->mode);
    device->bdev = NULL;
    fs_devices->open_devices--;
+   atomic_long_inc(&unopened_dev_count);
    }
    if (device->writeable) {
    list_del_init(&device->dev_alloc_list);
@@ -840,8 +845,10 @@ static int __btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
    struct btrfs_device *new_device;
    struct rcu_string *name;
 
-   if (device->bdev)
+   if (device->bdev) {
    fs_devices->open_devices--;
+   atomic_long_inc(&unopened_dev_count);
+   }
 
    if (device->writeable &&
    device->devid != BTRFS_DEV_REPLACE_DEVID) {
@@ -971,6 +978,7 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
    fs_devices->rotating = 1;
 
    fs_devices->open_devices++;
+   atomic_long_dec(&unopened_dev_count);
    if (device->writeable &&
    device->devid != BTRFS_DEV_REPLACE_DEVID) {
    fs_devices->rw_devices++;
@@ -6848,3 +6856,67 @@ void btrfs_update_commit_device_bytes_used(struct btrfs_root *root,
}
unlock_chunks(root);
 }
+
+static unsigned long shrink_valid_dev_root(void)
+{
+   struct rb_node *n;
+   struct btrfs_device *device;
+   struct 

Re: [PATCH 04/15] Btrfs: add ref_count and free function for btrfs_bio

2015-01-15 Thread David Sterba
The cleanups look good in general, some minor nitpicks below.

On Tue, Jan 13, 2015 at 08:34:37PM +0800, Zhaolei wrote:
 - kfree(bbio);
 + put_btrfs_bio(bbio);

Please rename it to btrfs_put_bbio, this is more consistent with other
*_put_* helpers and 'bbio' distinguishes btrfs_bio from regular 'bio'.

  
  static void btrfs_end_bio(struct bio *bio, int err)
 diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
 index fb0e8c3..db195f0 100644
 --- a/fs/btrfs/volumes.h
 +++ b/fs/btrfs/volumes.h
 @@ -295,6 +295,7 @@ typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
  #define BTRFS_BIO_ORIG_BIO_SUBMITTED (1 << 0)
  
  struct btrfs_bio {
 + atomic_t ref_count;

atomic_t refs;

   atomic_t stripes_pending;
   struct btrfs_fs_info *fs_info;
   bio_end_io_t *end_io;
 @@ -394,13 +395,8 @@ struct btrfs_balance_control {
  
  int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
  u64 end, u64 *length);
 -
 -#define btrfs_bio_size(total_stripes, real_stripes)  \
 - (sizeof(struct btrfs_bio) + \
 -  (sizeof(struct btrfs_bio_stripe) * (total_stripes)) +  \
 -  (sizeof(int) * (real_stripes)) +   \
 -  (sizeof(u64) * (real_stripes)))
 -
 +void get_btrfs_bio(struct btrfs_bio *bbio);

btrfs_get_bbio

 +void put_btrfs_bio(struct btrfs_bio *bbio);
  int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
   u64 logical, u64 *length,
   struct btrfs_bio **bbio_ret, int mirror_num);
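
The get/put pattern under review can be sketched in userspace as below.
This is only an illustration, not the patch: struct sketch_bbio and the
sketch_* helpers are hypothetical, C11 atomics stand in for the kernel's
atomic_t, and the return value of the put helper exists only so the
behaviour can be checked.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct btrfs_bio with its reference counter. */
struct sketch_bbio {
	atomic_int refs;
};

/* Allocate an object that starts with one reference held by the caller. */
static struct sketch_bbio *sketch_alloc_bbio(void)
{
	struct sketch_bbio *bbio = malloc(sizeof(*bbio));

	if (bbio)
		atomic_init(&bbio->refs, 1);
	return bbio;
}

/* Take an additional reference. */
static void sketch_get_bbio(struct sketch_bbio *bbio)
{
	atomic_fetch_add(&bbio->refs, 1);
}

/*
 * Drop one reference and free the object when the last one goes away.
 * Returns 1 if the object was freed, 0 otherwise.
 */
static int sketch_put_bbio(struct sketch_bbio *bbio)
{
	if (!bbio)
		return 0;
	if (atomic_fetch_sub(&bbio->refs, 1) == 1) {
		free(bbio);
		return 1;
	}
	return 0;
}
```

Replacing a bare kfree() with a put helper of this shape means every path
that still holds a reference keeps the object alive until its own put.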


Re: [PATCH v2 RESEND] btrfs-progs: make btrfs qgroups show human readable sizes

2015-01-15 Thread David Sterba
On Thu, Jan 15, 2015 at 09:17:01AM +0800, Fan Chengniang/樊成酿 wrote:
 
 On 2015-01-14 23:46, David Sterba wrote:
  On Tue, Jan 13, 2015 at 01:53:39PM +0800, Fan Chengniang wrote:
  make btrfs qgroups show human readable sizes
  using --human-readable option, example:
  That's too long to type and the idea was to add all the long options
  that force the specific unit base, ie. --kbytes/--mbytes/..., --raw,
  --si and --iec. We can possibly make the human readable the default
  because that's what I'd expect to see to have a quick overview and can
  use the other options otherwise.
 
  The getopt parser accepts short options if they're unique, so --kb or
  even --k works as a very convenient shortcut for frequent commandline
  use.
 I sent a mail asking for your advice on adding options. In that mail, I
 asked whether I should use --human-readable and add --kbytes, --mbytes, ...
 But you have not replied to me.

So you've sent a v2 where we can see if our ideas match or not and
continue from there. Timely replies are not always feasible, I get a lot
of mails. Patch iterations are normal, nothing new here.

 So, your advice is to add --kbytes, --mbytes, ... and make human-readable
 the default behaviour?
  qgroupid rfer      excl      max_rfer  max_excl  parent  child
  -------- ----      ----      --------  --------  ------  -----
  0/5      299.58MiB 299.58MiB 400.00MiB 0.00B     1/1     ---
  0/265    299.58MiB 16.00KiB  0.00B     320.00MiB 1/1     ---
  0/266    299.58MiB 16.00KiB  350.00MiB 0.00B     ---     ---
  1/1      599.16MiB 299.59MiB 800.00MiB 0.00B     ---     0/5,0/265
  The values should be also aligned to the right.
 It was aligned to the left before my patch. I just kept it.

Ok, take it as a hint for another patch.
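
The abbreviation behaviour mentioned in this thread comes from
getopt_long(), which accepts any unambiguous prefix of a long option. A
minimal sketch follows; the option names are the ones proposed in the
thread, while sketch_parse_unit and the single-character return codes are
assumptions made for the example, not btrfs-progs code.

```c
#include <getopt.h>
#include <stddef.h>

/*
 * Parse the unit options proposed in the thread and return a code for the
 * last one seen: 'k', 'm', 'r', 's' or 'i'. Returns 0 if none was given
 * and '?' for an unknown or ambiguous option.
 */
static int sketch_parse_unit(int argc, char **argv)
{
	static const struct option long_opts[] = {
		{ "kbytes", no_argument, NULL, 'k' },
		{ "mbytes", no_argument, NULL, 'm' },
		{ "raw",    no_argument, NULL, 'r' },
		{ "si",     no_argument, NULL, 's' },
		{ "iec",    no_argument, NULL, 'i' },
		{ NULL, 0, NULL, 0 }
	};
	int unit = 0;
	int c;

	optind = 1;	/* restart scanning so the function can be called again */
	opterr = 0;	/* do not print diagnostics for unknown options */

	while ((c = getopt_long(argc, argv, "", long_opts, NULL)) != -1) {
		if (c == '?')
			return '?';
		unit = c;
	}
	return unit;
}
```

Because only one option starts with "k", both --kb and --k select the same
unit as --kbytes.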


Re: btrfs performance - ssd array

2015-01-15 Thread P. Remek
Hello,


 Could you check how many extents with BTRFS and Ext4:
 # filefrag test1

So my findings are odd:

On BTRFS, when I run fio with a single worker thread (the target file is
12GB large, and it is 100% random write of 4kb blocks), the number of
extents reported by filefrag is around 3.
However, when I do the same with 4 worker threads, I get a crazy
number of extents - test1: 3141866 extents found. Also, when running
with 4 threads, the sys% utilization takes 80% of CPU when I check it
(in the top output I see that it is all consumed by kworker processes).

On the EXT4 I get only 13 extents even when running with 4 worker
threads. (note that I created RAID10 using mdadm before setting up
ext4 there in order to get comparable storage solution to what we
test with  BTRFS).

Another odd thing is that it takes a very long time for the filefrag
utility to return the result on BTRFS, and not only for the case
where I got 3 million extents but also for the first case where I
ran a single worker and the number of extents was only 3. Filefrag on
EXT4 returns immediately.


 To see if this is because of bad fragmentation on BTRFS: I am still not
 sure how fio tests randwrite here, so here are the possibilities:

 case1:
  if fio doesn't write the same position several times, I think
  you could add --overite=0, and retest to see if it helps.

Not sure which parameter you mean here.

 case2:
 if fio randwrite did write the same position several times, I think
 you could use the ‘-o nodatacow’ mount option to verify whether
 BTRFS COW caused the serious fragmentation.


It seems that mounting it with this option does have some effect, but
not a very significant one, and it is not very deterministic. The IOPS are
slightly higher at the beginning (~25 000 IOPS) but IOPS performance
is very spiky and I can still see that CPU sys% is very high. As soon
as the kworker threads start consuming CPU, the IOPS performance goes
down again to some ~15 000 IOPS.


Re: [PATCH v2 RESEND] btrfs-progs: make btrfs qgroups show human readable sizes

2015-01-15 Thread 樊成酿


On 2015-01-15 20:30, David Sterba wrote:

On Thu, Jan 15, 2015 at 09:17:01AM +0800, Fan Chengniang/樊成酿 wrote:

On 2015-01-14 23:46, David Sterba wrote:

On Tue, Jan 13, 2015 at 01:53:39PM +0800, Fan Chengniang wrote:

make btrfs qgroups show human readable sizes
using --human-readable option, example:

That's too long to type and the idea was to add all the long options
that force the specific unit base, ie. --kbytes/--mbytes/..., --raw,
--si and --iec. We can possibly make the human readable the default
because that's what I'd expect to see to have a quick overview and can
use the other options otherwise.

The getopt parser accepts short options if they're unique, so --kb or
even --k works as a very convenient shortcut for frequent commandline
use.

I sent a mail asking for your advice on adding options. In that mail, I
asked whether I should use --human-readable and add --kbytes, --mbytes, ...
But you have not replied to me.

So you've sent a v2 where we can see if our ideas match or not and
continue from there. Timely replies are not always feasible, I get a lot
of mails. Patch iterations are normal, nothing new here.

I apologize for my words. I did not consider that you get a lot of
mail. I will take your advice and improve my patch.

This is my personal mail address.

So, your advice is to add --kbytes, --mbytes, ... and make human-readable
the default behaviour?

qgroupid rfer      excl      max_rfer  max_excl  parent  child
-------- ----      ----      --------  --------  ------  -----
0/5      299.58MiB 299.58MiB 400.00MiB 0.00B     1/1     ---
0/265    299.58MiB 16.00KiB  0.00B     320.00MiB 1/1     ---
0/266    299.58MiB 16.00KiB  350.00MiB 0.00B     ---     ---
1/1      599.16MiB 299.59MiB 800.00MiB 0.00B     ---     0/5,0/265

The values should be also aligned to the right.

It was aligned to the left before my patch. I just kept it.

Ok, take it as a hint for another patch.

I have combined them into one patch. Maybe I should separate them.



RE: [PATCH 04/15] Btrfs: add ref_count and free function for btrfs_bio

2015-01-15 Thread Zhao Lei
Hi, David Sterba

* From: David Sterba [mailto:dste...@suse.cz]
 The cleanups look good in general, some minor nitpicks below.
 
 On Tue, Jan 13, 2015 at 08:34:37PM +0800, Zhaolei wrote:
  -   kfree(bbio);
  +   put_btrfs_bio(bbio);
 
 Please rename it to btrfs_put_bbio, this is more consistent with other
 *_put_* helpers and 'bbio' distinguishes btrfs_bio from regular 'bio'.
 
Good suggestion, I like these unified-format names.

 
   static void btrfs_end_bio(struct bio *bio, int err)
  diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
  index fb0e8c3..db195f0 100644
  --- a/fs/btrfs/volumes.h
  +++ b/fs/btrfs/volumes.h
  @@ -295,6 +295,7 @@ typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
   #define BTRFS_BIO_ORIG_BIO_SUBMITTED   (1 << 0)
 
   struct btrfs_bio {
  +   atomic_t ref_count;
 
   atomic_t refs;
 
Ok.

  atomic_t stripes_pending;
  struct btrfs_fs_info *fs_info;
  bio_end_io_t *end_io;
  @@ -394,13 +395,8 @@ struct btrfs_balance_control {
 
   int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
 u64 end, u64 *length);
  -
  -#define btrfs_bio_size(total_stripes, real_stripes)\
  -   (sizeof(struct btrfs_bio) + \
  -(sizeof(struct btrfs_bio_stripe) * (total_stripes)) +  \
  -(sizeof(int) * (real_stripes)) +   \
  -(sizeof(u64) * (real_stripes)))
  -
  +void get_btrfs_bio(struct btrfs_bio *bbio);
 
   btrfs_get_bbio
 
Thanks for your suggestion, I'll include above changes in v2.

Thanks
Zhaolei

  +void put_btrfs_bio(struct btrfs_bio *bbio);
   int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
  u64 logical, u64 *length,
  struct btrfs_bio **bbio_ret, int mirror_num);

