Re: [PATCH 3/3] btrfs: switch to kvmalloc and GFP_KERNEL in lzo/zlib alloc_workspace

2017-05-31 Thread Anand Jain




@@ -59,7 +59,7 @@ static struct list_head *zlib_alloc_workspace(void)

workspacesize = max(zlib_deflate_workspacesize(MAX_WBITS, 
MAX_MEM_LEVEL),
zlib_inflate_workspacesize());
-   workspace->strm.workspace = vmalloc(workspacesize);
+   workspace->strm.workspace = kvmalloc(workspacesize, GFP_KERNEL);
workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);


 Workspace->buf can be a kvmalloc as well as in lzo? or I don't know if
 there was any purpose for not doing it for zlib.

Thanks, Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] btrfs: switch kmallocs to GFP_KERNEL in lzo/zlib alloc_workspace

2017-05-31 Thread Anand Jain



On 05/31/17 23:41, David Sterba wrote:

As alloc_workspace is now protected by memalloc_nofs where needed,
we can switch the kmalloc to use GFP_KERNEL.

Signed-off-by: David Sterba 



Reviewed-by: Anand Jain 

Thanks, Anand


---
 fs/btrfs/lzo.c  | 2 +-
 fs/btrfs/zlib.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
index a554856a5f8a..c556f3f3fbf0 100644
--- a/fs/btrfs/lzo.c
+++ b/fs/btrfs/lzo.c
@@ -51,7 +51,7 @@ static struct list_head *lzo_alloc_workspace(void)
 {
struct workspace *workspace;

-   workspace = kzalloc(sizeof(*workspace), GFP_NOFS);
+   workspace = kzalloc(sizeof(*workspace), GFP_KERNEL);
if (!workspace)
return ERR_PTR(-ENOMEM);

diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index d5446e18bb59..c1db7572283b 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -53,14 +53,14 @@ static struct list_head *zlib_alloc_workspace(void)
struct workspace *workspace;
int workspacesize;

-   workspace = kzalloc(sizeof(*workspace), GFP_NOFS);
+   workspace = kzalloc(sizeof(*workspace), GFP_KERNEL);
if (!workspace)
return ERR_PTR(-ENOMEM);

workspacesize = max(zlib_deflate_workspacesize(MAX_WBITS, 
MAX_MEM_LEVEL),
zlib_inflate_workspacesize());
workspace->strm.workspace = vmalloc(workspacesize);
-   workspace->buf = kmalloc(PAGE_SIZE, GFP_NOFS);
+   workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (!workspace->strm.workspace || !workspace->buf)
goto fail;



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] btrfs: add memalloc_nofs protections around alloc_workspace callback

2017-05-31 Thread Anand Jain



On 05/31/17 23:41, David Sterba wrote:

The workspaces are preallocated at the beginning where we can safely use
GFP_KERNEL, but in some cases the find_workspace might reach the
allocation again, now in a more restricted context when the bios or
pages are being compressed.

To avoid potential lockup when alloc_workspace -> vmalloc would silently
use the GFP_KERNEL, add the memalloc_nofs helpers around the critical
call site.


 Reviewed-by: Anand Jain 

Thanks, Anand


Signed-off-by: David Sterba 
---
 fs/btrfs/compression.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ba511dd454d5..39cd164e5a62 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -757,6 +758,7 @@ static struct list_head *find_workspace(int type)
struct list_head *workspace;
int cpus = num_online_cpus();
int idx = type - 1;
+   unsigned nofs_flag;

struct list_head *idle_ws   = _comp_ws[idx].idle_ws;
spinlock_t *ws_lock = _comp_ws[idx].ws_lock;
@@ -786,7 +788,15 @@ static struct list_head *find_workspace(int type)
atomic_inc(total_ws);
spin_unlock(ws_lock);

+   /*
+* Allocation helpers call vmalloc that can't use GFP_NOFS, so we have
+* to turn it off here because we might get called from the restricted
+* context of btrfs_compress_bio/btrfs_compress_pages
+*/
+   nofs_flag = memalloc_nofs_save();
workspace = btrfs_compress_op[idx]->alloc_workspace();
+   memalloc_nofs_restore(nofs_flag);
+
if (IS_ERR(workspace)) {
atomic_dec(total_ws);
wake_up(ws_wait);





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 05/20] btrfs-progs: Introduce wrapper to recover raid56 data

2017-05-31 Thread Qu Wenruo



At 05/31/2017 09:52 PM, David Sterba wrote:

On Thu, May 25, 2017 at 02:21:50PM +0800, Qu Wenruo wrote:

Introduce a wrapper to recover raid56 data.

The logical is the same with kernel one, but with different interfaces,
since kernel ones cares the performance while in btrfs we don't care
that much.


Can you please rephrase this paragraph? I'll commit the patch as-is for
now but want replace the changelog with an improved one.




I'm not sure which direction I should change the paragraph to.

To explain more about the wrapper or remove unrelated difference between 
kernel and btrfs-progs part?


Or why we need the wrapper?

Thanks,
Qu


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3.2 5/6] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions

2017-05-31 Thread Qu Wenruo



At 05/31/2017 10:30 PM, David Sterba wrote:

On Wed, May 31, 2017 at 08:31:35AM +0800, Qu Wenruo wrote:

Yes it's hard to find such deadlock especially when lockdep will not
detect it.

And this makes the advantage of using stack memory in v3 patch more obvious.

I didn't realize the extra possible deadlock when memory pressure is
high, and to make completely correct usage of GFP_ flags we should let
caller to choose its GFP_ flag, which will introduce more modification
and more possibility to cause problem.

So now I prefer the stack version a little more.


The difference is that the stack version will always consume the stack
at runtime.  The dynamic allocation will not, but we have to add error
handling and make sure we use right gfp flags. So it's runtime vs review
trade off, I choose to spend time on review.


OK, then I'll update the patchset to allow passing gfp flags for each
reservation.


You mean to add gfp flags to extent_changeset_alloc and update the
direct callers or to add gfp flags to the whole reservation codepath?


Yes, I was planning to do it.


I strongly prefer to use GFP_NOFS for now, although it's not ideal.


OK, then keep GFP_NOFS.
But I also want to know the reason why.

Is it just because we don't have good enough tool to detect possible 
deadlock caused by wrong GFP_* flags in write path?


Thanks,
Qu


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Debugging Deadlocks?

2017-05-31 Thread Sargun Dhillon
On Wed, May 31, 2017 at 5:54 AM, David Sterba  wrote:
> On Tue, May 30, 2017 at 09:12:39AM -0700, Sargun Dhillon wrote:
>> We've been running BtrFS for a couple months now in production on
>> several clusters. We're running on Canonical's 4.8 kernel, and
>> currently, in the process of moving to our own patchset atop vanilla
>> 4.10+. I'm glad to say it's been a fairly good experience for us. Bar
>> some performance issues, it's been largely smooth sailing.
>
> Yay, thanks for the feedback.
>
>> There has been one class of persistent issues that has been plaguing
>> our cluster is deadlocks. We've seen a fair number of issues where
>> there are some number of background threads and user threads are in
>> the process of performing operations where some are waiting to start a
>> transaction, and at least one background thread or user thread is in
>> the process of committing a transaction. Unfortunately, these
>> situations are ending in deadlocks, where no threads are making
>> progress.
>
> In such situations, save stacks of all processes (/proc/PID/stack). I
> don't want to play terminology here, so by a deadlock I could understand
> a system that's making progress so slow that's effectively stuck. This
> could happen if the files are freamgented, so eg. traversing extents
> takes locks and has a lot of work before it unlocks. Add some extent
 > sharing and updating references, this adds some points where the threads
> just wait.
>
> The stacktraces could give an idea of what kind of hang it is.
>
We're saving a dump of the tasks currently running. A recent dump can
be found here: http://cwillu.com:8080/50.19.255.106/1. This is the
only snapshot I have from a node that's not making any progress.

We also see the other case, where tasks are not making progress very
quickly, and it causes the kernel hung task detector to kick in. This
happens pretty often, and it's difficult to catch when it's happening,
but the symptoms can be frustrating, including failed instance
healthchecks, poor performance, and high latency for interactive
services. Some of the traces we've gotten from the stuck task detector
include:
https://gist.github.com/sargun/9643c0c380d27a147ef3486e1d51dbdb
https://gist.github.com/sargun/8858263b8d04c8ab726738022725ec12


>> We've talked about a couple ideas internally, like adding the ability
>> to timeout transactions, abort commits or start_transactions which are
>> taking too long, and adding more debugging to get insights into the
>> state of the filesystem. Unfortunately, since our usage and knowledge
>> of BtrFS is still somewhat nascent, we're unsure of what is the right
>> investment.
>
> There's a kernel-wide hung task detection, but I think a similar
> mechanism around just the transaction commits would be useful, as a
> debugging option.
>
> There are number of ways how a transaction can be blocked though, so
> we'd need to choose the starting point. Extent-related locks, waiting
> for writes, other locks, the intenral transactional logic (and possibly
> more).
>
As a first step, it'd be nice to have the transaction wrapped in a
stack frame. We can then instrument it much more easily with off the
shelf tools like simple BPF-based kprobes / kretprobes, or ftraces,
rather than having to write a custom probe that's familiar with the
innards of the txn datastructure, and does its own accounting to keep
track of what's in flight.

I'll take a cut at something as simple as an in-memory list of
transactions which is periodically scanned for transactions which are
taking too long, and log whether they're stuck starting, commiting, or
in-flight and uncommitted.

>> I'm curious, are other people seeing deadlocks crop up in production
>> often? How are you going about debugging them, and are there any good
>> pieces of advice on avoiding these for production workloads?
>
> I have seen hangs with kernel 4.9 a while back triggered by a
> long-running iozone stress test, but 4.8 was not affected, and 4.10+
> worked fine again. I don't know if/which btrfs patches the 'canonical
> 4.8' kernel has, so this might not be related.
>
> As for deadlocks (double taken lock, lock inversion), I haven't seen
> them for a long time. The testing kernels run with lockdep, so we should
> be able to see them early. You could try to run turn lockdep on if the
> performance penalty is still acceptable for you.  But there are still
> cases that lockdep does not cover IIRC, due to the higher-level
> semantics of the various btrfs trees and locking of extent buffers.
For some of these use-cases, we can pretty easily recreate the pattern
on the machine. For others, it's more complicated to find out which
containers, and datasets were scheduled to be processed on the
machine. We've run some sanity, and stress tests, but we can rarely
get the filesystem to fall over in a predictable way in these tests
compared to some production workloads.
--
To unsubscribe from this list: send the line "unsubscribe 

Re: [PATCH v2] Btrfs: fix invalid extent maps due to hole punching

2017-05-31 Thread Liu Bo
On Sun, May 28, 2017 at 10:31:05PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> While punching a hole in a range that is not aligned with the sector size
> (currently the same as the page size) we can end up leaving an extent map
> in memory with a length that is smaller then the sector size, which is
> not expected and can lead to problems. This issue is easily detected
> after the patch from commit a7e3b975a0f9 ("Btrfs: fix reported number of
> inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the
> following for example:
> 
>   $ mkfs.btrfs -f /dev/sdb
>   $ mount /dev/sdb /mnt
>   $ xfs_io -c "pwrite -S 0xaa -b 100K 0 100K" /mnt/foo
>   $ xfs_io -c "fpunch 60K 90K" /mnt/foo
>   $ xfs_io -c "pwrite -S 0xbb -b 100K 50K 100K" /mnt/foo
>   $ xfs_io -c "pwrite -S 0xcc -b 50K 100K 50K" /mnt/foo
>   $ umount /mnt
> 
> After the unmount operation we can see several warnings emmitted due to
> underflows related to space reservation counters:
> 
> [ 2837.443299] [ cut here ]
> [ 2837.447395] WARNING: CPU: 8 PID: 2474 at fs/btrfs/inode.c:9444 
> btrfs_destroy_inode+0xe8/0x27e [btrfs]
> [ 2837.452108] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse 
> parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev 
> tpm button se
> rio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
> libcrc32c crc32c_gene
> ric raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic 
> virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod 
> floppy
> [ 2837.458389] CPU: 8 PID: 2474 Comm: umount Tainted: GW   
> 4.10.0-rc8-btrfs-next-43+ #1
> [ 2837.459754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> [ 2837.462379] Call Trace:
> [ 2837.462379]  dump_stack+0x68/0x92
> [ 2837.462379]  __warn+0xc2/0xdd
> [ 2837.462379]  warn_slowpath_null+0x1d/0x1f
> [ 2837.462379]  btrfs_destroy_inode+0xe8/0x27e [btrfs]
> [ 2837.462379]  destroy_inode+0x3d/0x55
> [ 2837.462379]  evict+0x177/0x17e
> [ 2837.462379]  dispose_list+0x50/0x71
> [ 2837.462379]  evict_inodes+0x132/0x141
> [ 2837.462379]  generic_shutdown_super+0x3f/0xeb
> [ 2837.462379]  kill_anon_super+0x12/0x1c
> [ 2837.462379]  btrfs_kill_super+0x16/0x21 [btrfs]
> [ 2837.462379]  deactivate_locked_super+0x30/0x68
> [ 2837.462379]  deactivate_super+0x36/0x39
> [ 2837.462379]  cleanup_mnt+0x58/0x76
> [ 2837.462379]  __cleanup_mnt+0x12/0x14
> [ 2837.462379]  task_work_run+0x77/0x9b
> [ 2837.462379]  prepare_exit_to_usermode+0x9d/0xc5
> [ 2837.462379]  syscall_return_slowpath+0x196/0x1b9
> [ 2837.462379]  entry_SYSCALL_64_fastpath+0xab/0xad
> [ 2837.462379] RIP: 0033:0x7f3ef3e6b9a7
> [ 2837.462379] RSP: 002b:7ffdd0d8de58 EFLAGS: 0246 ORIG_RAX: 
> 00a6
> [ 2837.462379] RAX:  RBX: 556f76a39060 RCX: 
> 7f3ef3e6b9a7
> [ 2837.462379] RDX: 0001 RSI:  RDI: 
> 556f76a3f910
> [ 2837.462379] RBP: 556f76a3f910 R08: 556f76a3e670 R09: 
> 0015
> [ 2837.462379] R10: 06b4 R11: 0246 R12: 
> 7f3ef436ce64
> [ 2837.462379] R13:  R14: 556f76a39240 R15: 
> 7ffdd0d8e0e0
> [ 2837.519355] ---[ end trace e79345fe24b30b8d ]---
> [ 2837.596256] [ cut here ]
> [ 2837.597625] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5699 
> btrfs_free_block_groups+0x246/0x3eb [btrfs]
> [ 2837.603547] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse 
> parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev 
> tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 
> raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor 
> raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod 
> cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring 
> virtio e1000 scsi_mod floppy
> [ 2837.659372] CPU: 8 PID: 2474 Comm: umount Tainted: GW   
> 4.10.0-rc8-btrfs-next-43+ #1
> [ 2837.663359] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> [ 2837.663359] Call Trace:
> [ 2837.663359]  dump_stack+0x68/0x92
> [ 2837.663359]  __warn+0xc2/0xdd
> [ 2837.663359]  warn_slowpath_null+0x1d/0x1f
> [ 2837.663359]  btrfs_free_block_groups+0x246/0x3eb [btrfs]
> [ 2837.663359]  close_ctree+0x1dd/0x2e1 [btrfs]
> [ 2837.663359]  ? evict_inodes+0x132/0x141
> [ 2837.663359]  btrfs_put_super+0x15/0x17 [btrfs]
> [ 2837.663359]  generic_shutdown_super+0x6a/0xeb
> [ 2837.663359]  kill_anon_super+0x12/0x1c
> [ 2837.663359]  btrfs_kill_super+0x16/0x21 [btrfs]
> [ 2837.663359]  deactivate_locked_super+0x30/0x68
> [ 2837.663359]  deactivate_super+0x36/0x39
> [ 2837.663359]  

Re: Debugging Deadlocks?

2017-05-31 Thread Adam Borowski
On Wed, May 31, 2017 at 06:47:09AM +, Duncan wrote:
> Sargun Dhillon posted on Tue, 30 May 2017 09:12:39 -0700 as excerpted:
> > currently, in the process of moving to our own patchset atop vanilla
> > 4.10+
> 
> Good for getting off kernel 4.8, as in mainline kernel terms that's only 
> a short-term stable release and support has now ended.

In fact, support for 4.10 has already ended too.

You either stick with LTS (currently 4.9) or keep upgrading to the
newest-and-greatest bleeding edge as soon as it's released.
For production machines, the choice is obvious.

-- 
Don't be racist.  White, amber or black, all beers should be judged based
solely on their merits.  Heck, even if occasionally a cider applies for a
beer's job, why not?
On the other hand, corpo lager is not a race.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


hi

2017-05-31 Thread griestkrist
-- 
Please contact me in the my e-mail addres,(griestkrist1...@gmail.com)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/5] btrfs: use GFP_KERNEL in init_ipath

2017-05-31 Thread David Sterba
Now that init_ipath is called either from a safe context or with
memalloc_nofs protection, we can switch to GFP_KERNEL allocations in
init_path and init_data_container.

Signed-off-by: David Sterba 
---
 fs/btrfs/backref.c | 10 +-
 fs/btrfs/ioctl.c   |  4 ++--
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 24865da63d8f..f723c11bb763 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -16,7 +16,7 @@
  * Boston, MA 021110-1307, USA.
  */
 
-#include 
+#include 
 #include 
 #include "ctree.h"
 #include "disk-io.h"
@@ -2305,7 +2305,7 @@ struct btrfs_data_container *init_data_container(u32 
total_bytes)
size_t alloc_bytes;
 
alloc_bytes = max_t(size_t, total_bytes, sizeof(*data));
-   data = vmalloc(alloc_bytes);
+   data = kvmalloc(alloc_bytes, GFP_KERNEL);
if (!data)
return ERR_PTR(-ENOMEM);
 
@@ -2339,9 +2339,9 @@ struct inode_fs_paths *init_ipath(s32 total_bytes, struct 
btrfs_root *fs_root,
if (IS_ERR(fspath))
return (void *)fspath;
 
-   ifp = kmalloc(sizeof(*ifp), GFP_NOFS);
+   ifp = kmalloc(sizeof(*ifp), GFP_KERNEL);
if (!ifp) {
-   vfree(fspath);
+   kvfree(fspath);
return ERR_PTR(-ENOMEM);
}
 
@@ -2356,6 +2356,6 @@ void free_ipath(struct inode_fs_paths *ipath)
 {
if (!ipath)
return;
-   vfree(ipath->fspath);
+   kvfree(ipath->fspath);
kfree(ipath);
 }
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index c9cdea8061bc..e4116f9248c2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -37,7 +37,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -4588,7 +4588,7 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
 
 out:
btrfs_free_path(path);
-   vfree(inodes);
+   kvfree(inodes);
kfree(loi);
 
return ret;
-- 
2.12.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/5] More vmalloc cleanups

2017-05-31 Thread David Sterba
All vmalloc/vzalloc calls should be addressed, in this or the currently pending
patches.

David Sterba (5):
  btrfs: replace opencoded kvzalloc with the helper
  btrfs: send: use kvmalloc in iterate_dir_item
  btrfs: scrub: add memalloc_nofs protection around init_ipath
  btrfs: use GFP_KERNEL in init_ipath
  btrfs: adjust includes after vmalloc removal

 fs/btrfs/backref.c | 10 +-
 fs/btrfs/check-integrity.c | 11 ---
 fs/btrfs/ctree.c   |  2 +-
 fs/btrfs/ioctl.c   |  4 ++--
 fs/btrfs/raid56.c  | 11 ---
 fs/btrfs/scrub.c   |  9 +
 fs/btrfs/send.c| 11 ---
 7 files changed, 29 insertions(+), 29 deletions(-)

-- 
2.12.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/5] btrfs: send: use kvmalloc in iterate_dir_item

2017-05-31 Thread David Sterba
We use a growing buffer for xattrs larger than a page size, at some
point vmalloc is unconditionally used for larger buffers. We can still
try to avoid it using the kvmalloc helper.

Signed-off-by: David Sterba 
---
 fs/btrfs/send.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 924b1d941b53..7416b17c0eac 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -1083,7 +1083,7 @@ static int iterate_dir_item(struct btrfs_root *root, 
struct btrfs_path *path,
buf = tmp;
}
if (!buf) {
-   buf = vmalloc(buf_len);
+   buf = kvmalloc(buf_len, GFP_KERNEL);
if (!buf) {
ret = -ENOMEM;
goto out;
-- 
2.12.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/5] btrfs: scrub: add memalloc_nofs protection around init_ipath

2017-05-31 Thread David Sterba
init_ipath is called from a safe ioctl context and from scrub when
printing an error.  The protection is added for three reasons:

* init_data_container calls vmalloc and this does not work as expected
  in the GFP_NOFS context, so this silently does GFP_KERNEL and might
  deadlock in some cases
* keep the context constraint of GFP_NOFS, used by scrub
* we want to use GFP_KERNEL unconditionally inside init_ipath or its
  callees

Signed-off-by: David Sterba 
---
 fs/btrfs/scrub.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e99be644b19f..096e503e3ddc 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -18,6 +18,7 @@
 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "volumes.h"
 #include "disk-io.h"
@@ -733,6 +734,7 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, 
u64 root,
u32 nlink;
int ret;
int i;
+   unsigned nofs_flag;
struct extent_buffer *eb;
struct btrfs_inode_item *inode_item;
struct scrub_warning *swarn = warn_ctx;
@@ -771,7 +773,14 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, 
u64 root,
nlink = btrfs_inode_nlink(eb, inode_item);
btrfs_release_path(swarn->path);
 
+   /*
+* init_path might indirectly call vmalloc, or use GFP_KERNEL. Scrub
+* uses GFP_NOFS in this context, so we keep it consistent but it does
+* not seem to be strictly necessary.
+*/
+   nofs_flag = memalloc_nofs_save();
ipath = init_ipath(4096, local_root, swarn->path);
+   memalloc_nofs_restore(nofs_flag);
if (IS_ERR(ipath)) {
ret = PTR_ERR(ipath);
ipath = NULL;
-- 
2.12.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/5] btrfs: replace opencoded kvzalloc with the helper

2017-05-31 Thread David Sterba
The logic of kmalloc and vmalloc fallback is open coded in
several places, we can now use the existing helper.

Signed-off-by: David Sterba 
---
 fs/btrfs/check-integrity.c | 11 ---
 fs/btrfs/raid56.c  | 11 ---
 fs/btrfs/send.c|  9 +++--
 3 files changed, 11 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 6cabc8acee2a..5f8006e4de9d 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -94,7 +94,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include "ctree.h"
 #include "disk-io.h"
@@ -2920,13 +2920,10 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
   fs_info->sectorsize, PAGE_SIZE);
return -1;
}
-   state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | 
__GFP_REPEAT);
+   state = kvzalloc(sizeof(*state), GFP_KERNEL);
if (!state) {
-   state = vzalloc(sizeof(*state));
-   if (!state) {
-   pr_info("btrfs check-integrity: vzalloc() failed!\n");
-   return -1;
-   }
+   pr_info("btrfs check-integrity: allocation failed!\n");
+   return -1;
}
 
if (!btrfsic_is_initialized) {
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index d8ea0eb76325..d68af3c61b49 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -31,7 +31,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -218,12 +218,9 @@ int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info 
*info)
 * of a failing mount.
 */
table_size = sizeof(*table) + sizeof(*h) * num_entries;
-   table = kzalloc(table_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-   if (!table) {
-   table = vzalloc(table_size);
-   if (!table)
-   return -ENOMEM;
-   }
+   table = kvzalloc(table_size, GFP_KERNEL);
+   if (!table)
+   return -ENOMEM;
 
spin_lock_init(>cache_lock);
INIT_LIST_HEAD(>stripe_cache);
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index e8185c83f667..924b1d941b53 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -6389,13 +6389,10 @@ long btrfs_ioctl_send(struct file *mnt_file, void 
__user *arg_)
 
alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1);
 
-   sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
+   sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL);
if (!sctx->clone_roots) {
-   sctx->clone_roots = vzalloc(alloc_size);
-   if (!sctx->clone_roots) {
-   ret = -ENOMEM;
-   goto out;
-   }
+   ret = -ENOMEM;
+   goto out;
}
 
alloc_size = arg->clone_sources_count * sizeof(*arg->clone_sources);
-- 
2.12.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/5] btrfs: adjust includes after vmalloc removal

2017-05-31 Thread David Sterba
As we don't use vmalloc/vzalloc/vfree directly in ctree.c, we can now
use the proper header that defines kvmalloc.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 6e1b02dd72d3..3f4daa9d6e2c 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -19,7 +19,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
-- 
2.12.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] Update gfp flags in alloc_workspace

2017-05-31 Thread David Sterba
We can cheaply get rid of a few GFP_NOFS, as there are only two contexts,
initializaton (GFP_KERNEL is ok), and potentially a "NOFS", but with
memalloc_nofs it's made safe.

David Sterba (3):
  btrfs: add memalloc_nofs protections around alloc_workspace callback
  btrfs: switch kmallocs to GFP_KERNEL in lzo/zlib alloc_workspace
  btrfs: switch to kvmalloc and GFP_KERNEL in lzo/zlib alloc_workspace

 fs/btrfs/compression.c | 10 ++
 fs/btrfs/lzo.c | 16 
 fs/btrfs/zlib.c| 10 +-
 3 files changed, 23 insertions(+), 13 deletions(-)

-- 
2.12.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] btrfs: add memalloc_nofs protections around alloc_workspace callback

2017-05-31 Thread David Sterba
The workspaces are preallocated at the beginning where we can safely use
GFP_KERNEL, but in some cases the find_workspace might reach the
allocation again, now in a more restricted context when the bios or
pages are being compressed.

To avoid potential lockup when alloc_workspace -> vmalloc would silently
use the GFP_KERNEL, add the memalloc_nofs helpers around the critical
call site.

Signed-off-by: David Sterba 
---
 fs/btrfs/compression.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ba511dd454d5..39cd164e5a62 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -757,6 +758,7 @@ static struct list_head *find_workspace(int type)
struct list_head *workspace;
int cpus = num_online_cpus();
int idx = type - 1;
+   unsigned nofs_flag;
 
struct list_head *idle_ws   = _comp_ws[idx].idle_ws;
spinlock_t *ws_lock = _comp_ws[idx].ws_lock;
@@ -786,7 +788,15 @@ static struct list_head *find_workspace(int type)
atomic_inc(total_ws);
spin_unlock(ws_lock);
 
+   /*
+* Allocation helpers call vmalloc that can't use GFP_NOFS, so we have
+* to turn it off here because we might get called from the restricted
+* context of btrfs_compress_bio/btrfs_compress_pages
+*/
+   nofs_flag = memalloc_nofs_save();
workspace = btrfs_compress_op[idx]->alloc_workspace();
+   memalloc_nofs_restore(nofs_flag);
+
if (IS_ERR(workspace)) {
atomic_dec(total_ws);
wake_up(ws_wait);
-- 
2.12.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] btrfs: switch kmallocs to GFP_KERNEL in lzo/zlib alloc_workspace

2017-05-31 Thread David Sterba
As alloc_workspace is now protected by memalloc_nofs where needed,
we can switch the kmalloc to use GFP_KERNEL.

Signed-off-by: David Sterba 
---
 fs/btrfs/lzo.c  | 2 +-
 fs/btrfs/zlib.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
index a554856a5f8a..c556f3f3fbf0 100644
--- a/fs/btrfs/lzo.c
+++ b/fs/btrfs/lzo.c
@@ -51,7 +51,7 @@ static struct list_head *lzo_alloc_workspace(void)
 {
struct workspace *workspace;
 
-   workspace = kzalloc(sizeof(*workspace), GFP_NOFS);
+   workspace = kzalloc(sizeof(*workspace), GFP_KERNEL);
if (!workspace)
return ERR_PTR(-ENOMEM);
 
diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index d5446e18bb59..c1db7572283b 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -53,14 +53,14 @@ static struct list_head *zlib_alloc_workspace(void)
struct workspace *workspace;
int workspacesize;
 
-   workspace = kzalloc(sizeof(*workspace), GFP_NOFS);
+   workspace = kzalloc(sizeof(*workspace), GFP_KERNEL);
if (!workspace)
return ERR_PTR(-ENOMEM);
 
workspacesize = max(zlib_deflate_workspacesize(MAX_WBITS, 
MAX_MEM_LEVEL),
zlib_inflate_workspacesize());
workspace->strm.workspace = vmalloc(workspacesize);
-   workspace->buf = kmalloc(PAGE_SIZE, GFP_NOFS);
+   workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (!workspace->strm.workspace || !workspace->buf)
goto fail;
 
-- 
2.12.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] btrfs: switch to kvmalloc and GFP_KERNEL in lzo/zlib alloc_workspace

2017-05-31 Thread David Sterba
The compression workspace buffers are larger than a page so we use
vmalloc, unconditionally. This is not always necessary as there might be
contiguous memory available.

Let's use the kvmalloc helpers that will try kmalloc first and fallback
to vmalloc. For that they require GFP_KERNEL flags. As we now have the
alloc_workspace calls protected by memalloc_nofs in the critical
contexts, we can safely use GFP_KERNEL.

Signed-off-by: David Sterba 
---
 fs/btrfs/lzo.c  | 14 +++---
 fs/btrfs/zlib.c |  6 +++---
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
index c556f3f3fbf0..cde13cce01a0 100644
--- a/fs/btrfs/lzo.c
+++ b/fs/btrfs/lzo.c
@@ -18,7 +18,7 @@
 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -41,9 +41,9 @@ static void lzo_free_workspace(struct list_head *ws)
 {
struct workspace *workspace = list_entry(ws, struct workspace, list);
 
-   vfree(workspace->buf);
-   vfree(workspace->cbuf);
-   vfree(workspace->mem);
+   kvfree(workspace->buf);
+   kvfree(workspace->cbuf);
+   kvfree(workspace->mem);
kfree(workspace);
 }
 
@@ -55,9 +55,9 @@ static struct list_head *lzo_alloc_workspace(void)
if (!workspace)
return ERR_PTR(-ENOMEM);
 
-   workspace->mem = vmalloc(LZO1X_MEM_COMPRESS);
-   workspace->buf = vmalloc(lzo1x_worst_compress(PAGE_SIZE));
-   workspace->cbuf = vmalloc(lzo1x_worst_compress(PAGE_SIZE));
+   workspace->mem = kvmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL);
+   workspace->buf = kvmalloc(lzo1x_worst_compress(PAGE_SIZE), GFP_KERNEL);
+   workspace->cbuf = kvmalloc(lzo1x_worst_compress(PAGE_SIZE), GFP_KERNEL);
if (!workspace->mem || !workspace->buf || !workspace->cbuf)
goto fail;
 
diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index c1db7572283b..c248f9286366 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -24,7 +24,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -43,7 +43,7 @@ static void zlib_free_workspace(struct list_head *ws)
 {
struct workspace *workspace = list_entry(ws, struct workspace, list);
 
-   vfree(workspace->strm.workspace);
+   kvfree(workspace->strm.workspace);
kfree(workspace->buf);
kfree(workspace);
 }
@@ -59,7 +59,7 @@ static struct list_head *zlib_alloc_workspace(void)
 
workspacesize = max(zlib_deflate_workspacesize(MAX_WBITS, 
MAX_MEM_LEVEL),
zlib_inflate_workspacesize());
-   workspace->strm.workspace = vmalloc(workspacesize);
+   workspace->strm.workspace = kvmalloc(workspacesize, GFP_KERNEL);
workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (!workspace->strm.workspace || !workspace->buf)
goto fail;
-- 
2.12.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3.2 5/6] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions

2017-05-31 Thread David Sterba
On Wed, May 31, 2017 at 08:31:35AM +0800, Qu Wenruo wrote:
> >> Yes it's hard to find such deadlock especially when lockdep will not
> >> detect it.
> >>
> >> And this makes the advantage of using stack memory in v3 patch more 
> >> obvious.
> >>
> >> I didn't realize the extra possible deadlock when memory pressure is
> >> high, and to make completely correct usage of GFP_ flags we should let
> >> caller to choose its GFP_ flag, which will introduce more modification
> >> and more possibility to cause problem.
> >>
> >> So now I prefer the stack version a little more.
> > 
> > The difference is that the stack version will always consume the stack
> > at runtime.  The dynamic allocation will not, but we have to add error
> > handling and make sure we use right gfp flags. So it's runtime vs review
> > trade off, I choose to spend time on review.
> 
> OK, then I'll update the patchset to allow passing gfp flags for each 
> reservation.

You mean to add gfp flags to extent_changeset_alloc and update the
direct callers or to add gfp flags to the whole reservation codepath?
I strongly prefer to use GFP_NOFS for now, although it's not ideal.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 00/20] Btrfs-progs offline scrub

2017-05-31 Thread David Sterba
On Tue, May 30, 2017 at 08:54:32PM +0200, David Sterba wrote:
> On Thu, May 25, 2017 at 02:21:45PM +0800, Qu Wenruo wrote:
> > For any one who wants to try it, it can be get from my repo:
> > https://github.com/adam900710/btrfs-progs/tree/offline_scrub
> > Qu Wenruo (20):
> >   btrfs-progs: raid56: Introduce raid56 header for later recovery usage
> >   btrfs-progs: raid56: Introduce tables for RAID6 recovery
> >   btrfs-progs: raid56: Allow raid6 to recover 2 data stripes
> >   btrfs-progs: raid56: Allow raid6 to recover data and p

Patches 1-5 applied, with small fixups here and there, that took me more
time than I wanted. Yet I'm not sure if I should bother you with the
coding style things and grammar fixes. Maybe I should, it becomes too
distracting namely in the rest of the patch series.

> >   btrfs-progs: Introduce wrapper to recover raid56 data
> >   btrfs-progs: Introduce new btrfs_map_block function which returns more
> > unified result.
> >   btrfs-progs: Allow __btrfs_map_block_v2 to remove unrelated stripes
> >   btrfs-progs: csum: Introduce function to read out data csums
> 
> I'm about to start merging this patches, in parts. First the patches 1-8
> as they're independent and not intrusive.
> 
> >   btrfs-progs: scrub: Introduce structures to support offline scrub for
> > RAID56
> >   btrfs-progs: scrub: Introduce functions to scrub mirror based tree
> > block
> >   btrfs-progs: scrub: Introduce functions to scrub mirror based data
> > blocks
> >   btrfs-progs: scrub: Introduce function to scrub one mirror-based
> > extent
> >   btrfs-progs: scrub: Introduce function to scrub one data stripe
> >   btrfs-progs: scrub: Introduce function to verify parities
> >   btrfs-progs: extent-tree: Introduce function to check if there is any
> > extent in given range.
> >   btrfs-progs: scrub: Introduce function to recover data parity
> >   btrfs-progs: scrub: Introduce helper to write a full stripe
> >   btrfs-progs: scrub: Introduce a function to scrub one full stripe
> >   btrfs-progs: scrub: Introduce function to check a whole block group
> >   btrfs-progs: scrub: Introduce offline scrub function
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 05/20] btrfs-progs: Introduce wrapper to recover raid56 data

2017-05-31 Thread David Sterba
On Thu, May 25, 2017 at 02:21:50PM +0800, Qu Wenruo wrote:
> Introduce a wrapper to recover raid56 data.
> 
> The logical is the same with kernel one, but with different interfaces,
> since kernel ones cares the performance while in btrfs we don't care
> that much.

Can you please rephrase this paragraph? I'll commit the patch as-is for
now but want replace the changelog with an improved one. 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 01/20] btrfs-progs: raid56: Introduce raid56 header for later recovery usage

2017-05-31 Thread David Sterba
On Thu, May 25, 2017 at 02:21:46PM +0800, Qu Wenruo wrote:
> @@ -0,0 +1,28 @@
> +/*
> + * Copyright (C) 2017 Fujitsu.  All rights reserved.

Please use just plain GPL v2 header in newly created files. The history
of who modified what in the given file is stored in git. I can't simply
remove the copyright notice from existing files without checking first,
but I can stop addding more.

> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + */
> +
> +#ifndef _BTRFS_PROGS_RAID56_H
> +#define _BTRFS_PROGS_RAID56_H

The pattern for the ifndef macro name is:

"__BTRFS_" + descriptive header name + "_H__"

just look to other files.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Debugging Deadlocks?

2017-05-31 Thread David Sterba
On Tue, May 30, 2017 at 09:12:39AM -0700, Sargun Dhillon wrote:
> We've been running BtrFS for a couple months now in production on
> several clusters. We're running on Canonical's 4.8 kernel, and
> currently, in the process of moving to our own patchset atop vanilla
> 4.10+. I'm glad to say it's been a fairly good experience for us. Bar
> some performance issues, it's been largely smooth sailing.

Yay, thanks for the feedback.

> There has been one class of persistent issues that has been plaguing
> our cluster is deadlocks. We've seen a fair number of issues where
> there are some number of background threads and user threads are in
> the process of performing operations where some are waiting to start a
> transaction, and at least one background thread or user thread is in
> the process of committing a transaction. Unfortunately, these
> situations are ending in deadlocks, where no threads are making
> progress.

In such situations, save stacks of all processes (/proc/PID/stack). I
don't want to play terminology here, so by a deadlock I could understand
a system that's making progress so slow that's effectively stuck. This
could happen if the files are freamgented, so eg. traversing extents
takes locks and has a lot of work before it unlocks. Add some extent
sharing and updating references, this adds some points where the threads
just wait.

The stacktraces could give an idea of what kind of hang it is.

> We've talked about a couple ideas internally, like adding the ability
> to timeout transactions, abort commits or start_transactions which are
> taking too long, and adding more debugging to get insights into the
> state of the filesystem. Unfortunately, since our usage and knowledge
> of BtrFS is still somewhat nascent, we're unsure of what is the right
> investment.

There's a kernel-wide hung task detection, but I think a similar
mechanism around just the transaction commits would be useful, as a
debugging option.

There are number of ways how a transaction can be blocked though, so
we'd need to choose the starting point. Extent-related locks, waiting
for writes, other locks, the intenral transactional logic (and possibly
more).

> I'm curious, are other people seeing deadlocks crop up in production
> often? How are you going about debugging them, and are there any good
> pieces of advice on avoiding these for production workloads?

I have seen hangs with kernel 4.9 a while back triggered by a
long-running iozone stress test, but 4.8 was not affected, and 4.10+
worked fine again. I don't know if/which btrfs patches the 'canonical
4.8' kernel has, so this might not be related.

As for deadlocks (double taken lock, lock inversion), I haven't seen
them for a long time. The testing kernels run with lockdep, so we should
be able to see them early. You could try to run turn lockdep on if the
performance penalty is still acceptable for you.  But there are still
cases that lockdep does not cover IIRC, due to the higher-level
semantics of the various btrfs trees and locking of extent buffers.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs wiki account

2017-05-31 Thread David Sterba
On Wed, May 31, 2017 at 01:46:17PM +0200, David Sterba wrote:
> On Wed, May 31, 2017 at 07:35:24PM +0800, chou Dai wrote:
> > Hi,I wanted to add some content and edit some content to wiki,but I
> > made a registration two week ago(Account Name:Dai chou) . Now, I still
> > cannot register, through registration request expired already. And
> > Btrfs wiki says"Just drop a mail on linux kernel btrfs mailing list
> > with an appropriate subject to get attention.". So, I come to here.
> > thanks.
> 
> Account approved.
> 
> There are 2 wiki admins who can approve the accounts.

For the record, I see in the user creating log that Andre responds to
the requests within a reasonable time.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs wiki account

2017-05-31 Thread David Sterba
On Wed, May 31, 2017 at 07:35:24PM +0800, chou Dai wrote:
> Hi,I wanted to add some content and edit some content to wiki,but I
> made a registration two week ago(Account Name:Dai chou) . Now, I still
> cannot register, through registration request expired already. And
> Btrfs wiki says"Just drop a mail on linux kernel btrfs mailing list
> with an appropriate subject to get attention.". So, I come to here.
> thanks.

Account approved.

There are 2 wiki admins who can approve the accounts. I used to get
emails on new requests but this does not seem to work now. I'd have to
manually check if there are pending requests, so asking in the
mailinglist should be the last option.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Btrfs wiki account

2017-05-31 Thread chou Dai
Hi,I wanted to add some content and edit some content to wiki,but I
made a registration two week ago(Account Name:Dai chou) . Now, I still
cannot register, through registration request expired already. And
Btrfs wiki says"Just drop a mail on linux kernel btrfs mailing list
with an appropriate subject to get attention.". So, I come to here.
thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Btrfs: fix invalid extent maps due to hole punching

2017-05-31 Thread fdmanana
From: Filipe Manana 

While punching a hole in a range that is not aligned with the sector size
(currently the same as the page size) we can end up leaving an extent map
in memory with a length that is smaller then the sector size, which is
not expected and can lead to problems. This issue is easily detected
after the patch from commit a7e3b975a0f9 ("Btrfs: fix reported number of
inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the
following for example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ xfs_io -c "pwrite -S 0xaa -b 100K 0 100K" /mnt/foo
  $ xfs_io -c "fpunch 60K 90K" /mnt/foo
  $ xfs_io -c "pwrite -S 0xbb -b 100K 50K 100K" /mnt/foo
  $ xfs_io -c "pwrite -S 0xcc -b 50K 100K 50K" /mnt/foo
  $ umount /mnt

After the unmount operation we can see several warnings emmitted due to
underflows related to space reservation counters:

[ 2837.443299] [ cut here ]
[ 2837.447395] WARNING: CPU: 8 PID: 2474 at fs/btrfs/inode.c:9444 
btrfs_destroy_inode+0xe8/0x27e [btrfs]
[ 2837.452108] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse 
parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev 
tpm button se
rio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 
async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
libcrc32c crc32c_gene
ric raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic 
virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.458389] CPU: 8 PID: 2474 Comm: umount Tainted: GW   
4.10.0-rc8-btrfs-next-43+ #1
[ 2837.459754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.462379] Call Trace:
[ 2837.462379]  dump_stack+0x68/0x92
[ 2837.462379]  __warn+0xc2/0xdd
[ 2837.462379]  warn_slowpath_null+0x1d/0x1f
[ 2837.462379]  btrfs_destroy_inode+0xe8/0x27e [btrfs]
[ 2837.462379]  destroy_inode+0x3d/0x55
[ 2837.462379]  evict+0x177/0x17e
[ 2837.462379]  dispose_list+0x50/0x71
[ 2837.462379]  evict_inodes+0x132/0x141
[ 2837.462379]  generic_shutdown_super+0x3f/0xeb
[ 2837.462379]  kill_anon_super+0x12/0x1c
[ 2837.462379]  btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.462379]  deactivate_locked_super+0x30/0x68
[ 2837.462379]  deactivate_super+0x36/0x39
[ 2837.462379]  cleanup_mnt+0x58/0x76
[ 2837.462379]  __cleanup_mnt+0x12/0x14
[ 2837.462379]  task_work_run+0x77/0x9b
[ 2837.462379]  prepare_exit_to_usermode+0x9d/0xc5
[ 2837.462379]  syscall_return_slowpath+0x196/0x1b9
[ 2837.462379]  entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.462379] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.462379] RSP: 002b:7ffdd0d8de58 EFLAGS: 0246 ORIG_RAX: 
00a6
[ 2837.462379] RAX:  RBX: 556f76a39060 RCX: 7f3ef3e6b9a7
[ 2837.462379] RDX: 0001 RSI:  RDI: 556f76a3f910
[ 2837.462379] RBP: 556f76a3f910 R08: 556f76a3e670 R09: 0015
[ 2837.462379] R10: 06b4 R11: 0246 R12: 7f3ef436ce64
[ 2837.462379] R13:  R14: 556f76a39240 R15: 7ffdd0d8e0e0
[ 2837.519355] ---[ end trace e79345fe24b30b8d ]---
[ 2837.596256] [ cut here ]
[ 2837.597625] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5699 
btrfs_free_block_groups+0x246/0x3eb [btrfs]
[ 2837.603547] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse 
parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev 
tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 
raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom 
sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio 
e1000 scsi_mod floppy
[ 2837.659372] CPU: 8 PID: 2474 Comm: umount Tainted: GW   
4.10.0-rc8-btrfs-next-43+ #1
[ 2837.663359] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.663359] Call Trace:
[ 2837.663359]  dump_stack+0x68/0x92
[ 2837.663359]  __warn+0xc2/0xdd
[ 2837.663359]  warn_slowpath_null+0x1d/0x1f
[ 2837.663359]  btrfs_free_block_groups+0x246/0x3eb [btrfs]
[ 2837.663359]  close_ctree+0x1dd/0x2e1 [btrfs]
[ 2837.663359]  ? evict_inodes+0x132/0x141
[ 2837.663359]  btrfs_put_super+0x15/0x17 [btrfs]
[ 2837.663359]  generic_shutdown_super+0x6a/0xeb
[ 2837.663359]  kill_anon_super+0x12/0x1c
[ 2837.663359]  btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.663359]  deactivate_locked_super+0x30/0x68
[ 2837.663359]  deactivate_super+0x36/0x39
[ 2837.663359]  cleanup_mnt+0x58/0x76
[ 2837.663359]  __cleanup_mnt+0x12/0x14
[ 2837.663359]  task_work_run+0x77/0x9b
[ 2837.663359]  prepare_exit_to_usermode+0x9d/0xc5
[ 2837.663359]  syscall_return_slowpath+0x196/0x1b9
[ 2837.663359]  entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.663359] RIP: 

Re: [PATCH] Btrfs: fix invalid extent maps due to hole punching

2017-05-31 Thread Filipe Manana
On Tue, May 30, 2017 at 9:50 PM, Omar Sandoval  wrote:
> On Sun, May 28, 2017 at 05:31:53PM +0100, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>
> [snip]
>
> Hey, Filipe,
>
> I saw this warning and tried to apply your patch, but it doesn't apply
> cleanly (seems to conflict with 9986277e0e4c ("Btrfs: handle only
> applicable errors returned by btrfs_get_extent")).

Yes, it was based on an older branch.
I'll rebase it. Thanks.

>
>> Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a 
>> already existed hole.")
>> Cc: 
>> Signed-off-by: Filipe Manana 
>> ---
>>  fs/btrfs/file.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index f7d022bc7998..2645d820422c 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -2390,10 +2390,12 @@ static int fill_holes(struct btrfs_trans_handle 
>> *trans,
>>   */
>>  static int find_first_non_hole(struct inode *inode, u64 *start, u64 *len)
>>  {
>> + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>   struct extent_map *em;
>>   int ret = 0;
>>
>> - em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, *start, *len, 0);
>> + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, *start,
>> +   round_up(*len, fs_info->sectorsize), 0);
>>   if (IS_ERR_OR_NULL(em)) {
>>   if (!em)
>>   ret = -ENOMEM;
>> --
>> 2.11.0
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/10] xfs: nowait aio support

2017-05-31 Thread Jan Kara
On Tue 30-05-17 11:13:29, Goldwyn Rodrigues wrote:
> > Btw, can you write a small blurb up for the man page to document these
> > Ñ•emantics in man-page like language?
> > 
> 
> Yes, but which man page would it belong to?
> Should it be a subsection of errors in io_getevents/io_submit. We don't
> want to add ERRORS to io_getevents() because it would be the return
> value of the io_getevents call, and not the ones in the iocb structure.
> Should it be a new man page, say for iocb(7/8)?

I think you should extend the manpage for io_submit(8). There you can add
definition of struct iocb in 'DESCRIPTION' section explaining at least the
most common fields. You can also explain there which flags can be passed
and what are they intended to do.

You can also expand EAGAIN error description to specifically mention that
in case of NOWAIT aio EAGAIN can be returned if io submission would block.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH] btrfs: qgroup: Fix hang when using inode_cache and qgroup

2017-05-31 Thread Qu Wenruo
Commit 48a89bc4f2ce ("btrfs: qgroups: Retry after commit on getting EDQUOT")
is causing hang, with the following backtrace:

Call Trace:
 __schedule+0x374/0xaf0
 schedule+0x3d/0x90
 wait_for_commit+0x4a/0x80 [btrfs]
 ? wake_atomic_t_function+0x60/0x60
 btrfs_commit_transaction+0xe0/0xa10 [btrfs]  <<< Here
 ? start_transaction+0xad/0x510 [btrfs]
 qgroup_reserve+0x1f0/0x350 [btrfs]
 btrfs_qgroup_reserve_data+0xf8/0x2f0 [btrfs]
 ? _raw_spin_unlock+0x27/0x40
 btrfs_check_data_free_space+0x6d/0xb0 [btrfs]
 btrfs_delalloc_reserve_space+0x25/0x70 [btrfs]
 btrfs_save_ino_cache+0x402/0x650 [btrfs]
 commit_fs_roots+0xb7/0x170 [btrfs]
 btrfs_commit_transaction+0x425/0xa10 [btrfs] <<< And here
 qgroup_reserve+0x1f0/0x350 [btrfs]
 btrfs_qgroup_reserve_data+0xf8/0x2f0 [btrfs]
 ? _raw_spin_unlock+0x27/0x40
 btrfs_check_data_free_space+0x6d/0xb0 [btrfs]
 btrfs_delalloc_reserve_space+0x25/0x70 [btrfs]
 btrfs_direct_IO+0x1c5/0x3b0 [btrfs]
 generic_file_direct_write+0xab/0x150
 btrfs_file_write_iter+0x243/0x530 [btrfs]
 __vfs_write+0xc9/0x120
 vfs_write+0xcb/0x1f0
 SyS_pwrite64+0x79/0x90
 entry_SYSCALL_64_fastpath+0x18/0xad

The problem is that, inode_cache will be written in commit_fs_roots(),
which is called in btrfs_commit_transaction().

And when it fails to reserve enough data space, qgroup_reserve() will
try to call btrfs_commit_transaction() again, then we are waiting for
ourselves.

The patch will introduce can_retry parameter for qgroup_reserve(),
allowing related callers to avoid deadly commit transaction deadlock.

Now for space cache inode, we will not allow qgroup retry, so it will
not cause deadlock.

Fixes: 48a89bc4f2ce ("btrfs: qgroups: Retry after commit on getting EDQUOT")
Cc: Goldwyn Rodrigues 
Signed-off-by: Qu Wenruo 
---
Commit  48a89bc4f2ce ("btrfs: qgroups: Retry after commit on getting EDQUOT")
is not only causing such deadlock, but also screwing up qgroup reserved
space for even generic test cases.

I'm afraid we may need to revert that commit if we can't find a good way
to fix the newly caused qgroup meta reserved space underflow.
(Unlike old bug which is qgroup data reserved space underflow, this time
the commit is causing new metadata space underflow).
---
 fs/btrfs/extent-tree.c |  7 +--
 fs/btrfs/qgroup.c  | 15 ++-
 fs/btrfs/qgroup.h  |  2 +-
 fs/btrfs/transaction.c |  2 +-
 4 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e390451c72e6..15e7fcf3a7ab 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5817,7 +5817,7 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root 
*root,
if (test_bit(BTRFS_FS_QUOTA_ENABLED, _info->flags)) {
/* One for parent inode, two for dir entries */
num_bytes = 3 * fs_info->nodesize;
-   ret = btrfs_qgroup_reserve_meta(root, num_bytes, true);
+   ret = btrfs_qgroup_reserve_meta(root, num_bytes, true, true);
if (ret)
return ret;
} else {
@@ -5945,6 +5945,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode 
*inode, u64 num_bytes)
u64 to_free = 0;
unsigned dropped;
bool release_extra = false;
+   bool can_qgroup_retry = true;
 
/* If we are a free space inode we need to not flush since we will be in
 * the middle of a transaction commit.  We also don't need the delalloc
@@ -5957,6 +5958,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode 
*inode, u64 num_bytes)
if (btrfs_is_free_space_inode(inode)) {
flush = BTRFS_RESERVE_NO_FLUSH;
delalloc_lock = false;
+   can_qgroup_retry = false;
} else if (current->journal_info) {
flush = BTRFS_RESERVE_FLUSH_LIMIT;
}
@@ -5987,7 +5989,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode 
*inode, u64 num_bytes)
 
if (test_bit(BTRFS_FS_QUOTA_ENABLED, _info->flags)) {
ret = btrfs_qgroup_reserve_meta(root,
-   nr_extents * fs_info->nodesize, true);
+   nr_extents * fs_info->nodesize, true,
+   can_qgroup_retry);
if (ret)
goto out_fail;
}
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index deffbeb74a0b..3859acb03ae9 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2322,7 +2322,8 @@ static bool qgroup_check_limits(const struct btrfs_qgroup 
*qg, u64 num_bytes)
return true;
 }
 
-static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes, bool enforce)
+static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes, bool enforce,
+ bool can_retry)
 {
struct btrfs_root *quota_root;
struct btrfs_qgroup *qgroup;
@@ -2364,7 +2365,8 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
num_bytes, bool 

Re: Debugging Deadlocks?

2017-05-31 Thread Duncan
Sargun Dhillon posted on Tue, 30 May 2017 09:12:39 -0700 as excerpted:

> We've been running BtrFS for a couple months now in production on
> several clusters. We're running on Canonical's 4.8 kernel, and
> currently, in the process of moving to our own patchset atop vanilla
> 4.10+. I'm glad to say it's been a fairly good experience for us. Bar
> some performance issues, it's been largely smooth sailing.
> 
> There has been one class of persistent issues that has been plaguing our
> cluster is deadlocks.

Being just a list regular and btrfs (personal) user, not a dev or big-
time production user, I can't say I've seen a deadlocks problem either 
here or reported in significant numbers on-list, but beyond that I can't 
help there.

I'm replying, however, regarding your kernel choices.

Good for getting off kernel 4.8, as in mainline kernel terms that's only 
a short-term stable release and support has now ended.

But I'm slightly concerned with your kernel 4.10+ choice on production 
clusters.  4.9 is the most recent mainline and therefore btrfs LTS kernel 
series, and as such, what I was expecting.

Now don't get me wrong, 4.10+ is appropriate ATM as well, and if you're 
planning to stay current, within the 2-latest-current-kernel-cycles list 
recommendation, I'd consider it preferred.

However, most large-scale production deployments tend to prefer a 
somewhat slower upgrade cycle than that, in which case 4.9 is preferred 
as the latest mainline LTS series.

As far as LTS series go, this list tries to support the latest two LTS 
series, as it does the latest two current stable series.  While that's 
rather shorter than the LTS series support in general, it's in keeping 
with the fact that btrfs remains still stabilizing and as such under 
heavy development, tho it's far more stable than it was back in the 
kernel 3.x or early 4.x era.

At present that means 4.9 and the previous 4.4, altho in practice 4.4 was 
long enough ago that we prefer 4.9 unless there's some definite reason 
it's not going to work for you.

But you're not talking as old as 4.4 in any case, so it's a question of 
4.9 LTS and staying with that series for awhile, or 4.10+, but upgrading 
every 10 weeks or so as a new kernel series is released and the second-
back, now 4.10 as 4.11 is the newest, becomes the third back and thus 
slips out of both mainline stable release and btrfs list primary support 
range.

If you're comfortable with a ten-week upgrade cycle on the scale you're 
running in production, then by all means, go 4.10 or 4.11 at this point 
and do the upgrades, as that's preferable here for those where it's 
acceptable, but if not, then I'd strongly recommend the 4.9 LTS series 
for now, and upgrading LTS kernel series once a year or whatever, after 
the next LTS series comes out and has had a release or two to shake out 
the early bugs.

OTOH if there's something you really need 4.10 for but would otherwise 
prefer LTS, then yes, go current now and try to do the 10-week cycle, 
until the next LTS, then if desired stick with it and drop back to annual 
or whatever LTS series upgrades.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html