[Cluster-devel] [GFS2 PATCH] GFS2: Made logd daemon take into account log demand

2017-01-05 Thread Bob Peterson
Hi,

Before this patch, the logd daemon only tried to flush things when
the number of pinned log blocks exceeded a certain threshold. But when
we're deleting very large files, that may require a huge number of
journal blocks, and that demand, in turn, may exceed the threshold.
This patch takes that demand into account.

Signed-off-by: Bob Peterson 
---
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index a6a3389..00d8dc2 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -794,6 +794,7 @@ struct gfs2_sbd {
atomic_t sd_log_thresh1;
atomic_t sd_log_thresh2;
atomic_t sd_log_blks_free;
+   atomic_t sd_log_blks_needed;
wait_queue_head_t sd_log_waitq;
wait_queue_head_t sd_logd_waitq;
 
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index e58ccef0..4df349c 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -349,6 +349,7 @@ int gfs2_log_reserve(struct gfs2_sbd *sdp, unsigned int blks)
if (gfs2_assert_warn(sdp, blks) ||
gfs2_assert_warn(sdp, blks <= sdp->sd_jdesc->jd_blocks))
return -EINVAL;
+   atomic_add(blks, &sdp->sd_log_blks_needed);
 retry:
free_blocks = atomic_read(&sdp->sd_log_blks_free);
if (unlikely(free_blocks <= wanted)) {
@@ -370,6 +371,7 @@ retry:
wake_up(&sdp->sd_reserving_log_wait);
goto retry;
}
+   atomic_sub(blks, &sdp->sd_log_blks_needed);
trace_gfs2_log_blocks(sdp, -blks);
 
/*
@@ -891,13 +893,16 @@ void gfs2_log_shutdown(struct gfs2_sbd *sdp)
 
 static inline int gfs2_jrnl_flush_reqd(struct gfs2_sbd *sdp)
 {
-   return (atomic_read(&sdp->sd_log_pinned) >= atomic_read(&sdp->sd_log_thresh1));
+   return (atomic_read(&sdp->sd_log_pinned) +
+   atomic_read(&sdp->sd_log_blks_needed) >=
+   atomic_read(&sdp->sd_log_thresh1));
 }
 
 static inline int gfs2_ail_flush_reqd(struct gfs2_sbd *sdp)
 {
unsigned int used_blocks = sdp->sd_jdesc->jd_blocks - atomic_read(&sdp->sd_log_blks_free);
-   return used_blocks >= atomic_read(&sdp->sd_log_thresh2);
+   return used_blocks + atomic_read(&sdp->sd_log_blks_needed) >=
+   atomic_read(&sdp->sd_log_thresh2);
 }
 
 /**
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index ff72ac6..86281a9 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -683,6 +683,7 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
goto fail_jindex;
}
 
+   atomic_set(&sdp->sd_log_blks_needed, 0);
if (sdp->sd_args.ar_spectator) {
sdp->sd_jdesc = gfs2_jdesc_find(sdp, 0);
atomic_set(&sdp->sd_log_blks_free, sdp->sd_jdesc->jd_blocks);



[Cluster-devel] [GFS2 PATCH v3] GFS2: Limit number of transaction blocks requested for truncates

2017-01-05 Thread Bob Peterson
Hi,

On 16 December and 19 December, I posted predecessor patches for
this problem. I've subsequently uncovered problems during testing,
so this is a third version of the patch.

It's not that the patch was buggy: the 3/4ths threshold was okay,
in theory, but it uncovered another problem related to when the
logd daemon thinks (or doesn't think) a journal flush is required.
Function gfs2_ail_flush_reqd uses a journal threshold value of
4/5ths of the journal. If we use a threshold of 3/4ths, we can get
into situations where the logd daemon doesn't think a flush is
necessary even though we may need the log flushed in order to
continue working. By using the same threshold, the log keeps
moving properly in most cases.

This version 3, then, eliminates the predecessor's separate value
and calculation in favor of the same log threshold already used
by gfs2_ail_flush_reqd. This avoids the problem in most cases,
but does not eliminate it entirely. I've written a follow-up patch
for that which I'll post soon.
---
GFS2: Limit number of transaction blocks requested for truncates

This patch limits the number of transaction blocks requested during
file truncates. If we have very large multi-terabyte files, and want
to delete or truncate them, they might span so many resource groups
that we overflow the journal blocks, and cause an assert failure.
By limiting the number of blocks in the transaction, we prevent this
overflow and give other running processes time to do transactions.

The limiting factor I chose is sd_log_thresh2 which is currently
set to 4/5ths of the journal. This same ratio is used in function
gfs2_ail_flush_reqd to determine when a log flush is required.
If we make the maximum value less than this, we can get into an
infinite hang whereby the log stops moving because the number of
used blocks is less than the threshold and the iterative loop
needs more, but since we're under the threshold, the log daemon
never starts any IO on the log.

Signed-off-by: Bob Peterson 
---
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 645721f..cbad0ef 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -720,6 +720,7 @@ static int do_strip(struct gfs2_inode *ip, struct buffer_head *dibh,
 {
struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
struct gfs2_rgrp_list rlist;
+   struct gfs2_trans *tr;
u64 bn, bstart;
u32 blen, btotal;
__be64 *p;
@@ -728,6 +729,7 @@ static int do_strip(struct gfs2_inode *ip, struct buffer_head *dibh,
unsigned int revokes = 0;
int x;
int error;
+   int jblocks_rqsted;
 
error = gfs2_rindex_update(sdp);
if (error)
@@ -791,12 +793,17 @@ static int do_strip(struct gfs2_inode *ip, struct buffer_head *dibh,
if (gfs2_rs_active(&ip->i_res)) /* needs to be done with the rgrp glock held */
gfs2_rs_deltree(&ip->i_res);
 
-   error = gfs2_trans_begin(sdp, rg_blocks + RES_DINODE +
-RES_INDIRECT + RES_STATFS + RES_QUOTA,
-revokes);
+restart:
+   jblocks_rqsted = rg_blocks + RES_DINODE +
+   RES_INDIRECT + RES_STATFS + RES_QUOTA +
+   gfs2_struct2blk(sdp, revokes, sizeof(u64));
+   if (jblocks_rqsted > atomic_read(&sdp->sd_log_thresh2))
+   jblocks_rqsted = atomic_read(&sdp->sd_log_thresh2);
+   error = gfs2_trans_begin(sdp, jblocks_rqsted, revokes);
if (error)
goto out_rg_gunlock;
 
+   tr = current->journal_info;
down_write(&ip->i_rw_mutex);
 
gfs2_trans_add_meta(ip->i_gl, dibh);
@@ -810,6 +817,16 @@ static int do_strip(struct gfs2_inode *ip, struct buffer_head *dibh,
if (!*p)
continue;
 
+   /* check for max reasonable journal transaction blocks */
+   if (tr->tr_num_buf_new + RES_STATFS +
+   RES_QUOTA >= atomic_read(&sdp->sd_log_thresh2)) {
+   if (rg_blocks >= tr->tr_num_buf_new)
+   rg_blocks -= tr->tr_num_buf_new;
+   else
+   rg_blocks = 0;
+   break;
+   }
+
bn = be64_to_cpu(*p);
 
if (bstart + blen == bn)
@@ -827,6 +844,9 @@ static int do_strip(struct gfs2_inode *ip, struct buffer_head *dibh,
*p = 0;
gfs2_add_inode_blocks(&ip->i_inode, -1);
}
+   if (p == bottom)
+   rg_blocks = 0;
+
if (bstart) {
__gfs2_free_blocks(ip, bstart, blen, metadata);
btotal += blen;
@@ -844,6 +864,9 @@ static int do_strip(struct gfs2_inode *ip, struct buffer_head *dibh,
 
gfs2_trans_end(sdp);
 
+   if (rg_blocks)
+   goto restart;
+
 out_rg_gunlock:
gfs2_glock_dq_m(rlist.rl_rgrps, rlist.rl_ghs);
 out_rlist:



[Cluster-devel] [PATCH v1 00/54] block: support multipage bvec

2017-01-05 Thread Ming Lei
Hi,

This patchset brings multipage bvecs into the block layer. Basic
xfstests (-a auto) runs over virtio-blk/virtio-scsi show no
regression, so it should be good enough to show the approach now,
and any comments are welcome!

1) what is multipage bvec?

A multipage bvec means that one 'struct bio_vec' can hold
multiple physically contiguous pages, instead of the single
page the Linux kernel has used for a long time.
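
For illustration only (this sketch is not part of the patchset): the
bvec fields, as defined in include/linux/bvec.h, stay the same; a
multipage bvec only relaxes their meaning so that bv_len may cover
more than one physically contiguous page:

	struct bio_vec {
		struct page	*bv_page;	/* first page of the segment */
		unsigned int	bv_len;		/* bytes covered; may exceed
						 * PAGE_SIZE with multipage bvecs */
		unsigned int	bv_offset;	/* byte offset into bv_page */
	};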

2) why is multipage bvec introduced?

Kent proposed the idea[1] first. 

As system RAM becomes much bigger than before, and huge pages,
transparent huge pages and memory compaction are widely used at
the same time, it is now quite common to see physically contiguous
pages coming from filesystems in I/O.
On the other hand, from the block layer's view, it isn't
necessary to store the intermediate pages in the bvec; it is
enough to store just the physically contiguous 'segment'.

Also, huge pages are being brought to filesystems[2] and we can
do I/O a hugepage at a time[3], which requires that one bio can
transfer at least one huge page at a time. It turns out it isn't
flexible to simply change BIO_MAX_PAGES[3]. Multipage bvecs fit
this case very well.

With multipage bvec:

- bio size can be increased, which should improve some
high-bandwidth I/O cases in theory[4].

- Inside the block layer, both bio splitting and sg mapping can
become more efficient than before by traversing the physically
contiguous 'segment' instead of each page.

- there is the possibility of improving the memory footprint
of bvec usage in the future.

3) how is multipage bvec implemented in this patchset?

The 1st 9 patches comment on some special cases. As we saw,
most cases turn out to be safe for multipage bvecs; only fs/buffer,
MD and btrfs need to be dealt with. Both fs/buffer and btrfs are
handled in the following patches based on some new block APIs for
multipage bvecs.

Since a little more work is involved in cleaning up MD, this patchset
introduces QUEUE_FLAG_NO_MP for it, so that component can still
see/use singlepage bvecs. In the future, once the cleanup is done, the
flag can be killed.

The 2nd part (patches 23 ~ 54) implements multipage bvecs in the block layer:

- put all the tricks into the bvec/bio/rq iterators; as long as
drivers and filesystems use these standard iterators, they are happy
with multipage bvecs

- bio_for_each_segment_all() changes
this helper passes a pointer to each bvec directly to the user, so
it has to be changed. Two new helpers (bio_for_each_segment_all_sp()
and bio_for_each_segment_all_mp()) are introduced.

Current bio_for_each_segment_all() users are also converted to the
above two (see the sketch after this list).

- bio_clone() changes
By default bio_clone() still clones a new bio the multipage bvec
way. A single page version of bio_clone() is also introduced for
some special cases where only singlepage bvecs are used for the
newly cloned bio (bio bounce, ...)

- btrfs cleanup
just three patches to avoid direct access to the bvec table.
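
To make the iterator change concrete, here is a minimal sketch of the
per-page (_sp) conversion. It only mirrors the shape of the gfs2 change
in patch 43/54 later in this series; example_end_io is a made-up name
used purely for illustration:

	static void example_end_io(struct bio *bio)
	{
		struct bio_vec *bvec;
		struct bvec_iter_all bia;
		int i;

		/* the _sp helper still hands the loop body one page at a
		 * time, even when the underlying bvecs are multipage */
		bio_for_each_segment_all_sp(bvec, bio, i, bia)
			put_page(bvec->bv_page);
	}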

These patches can be found in the following git tree:

https://github.com/ming1/linux/commits/mp-bvec-0.6-v4.10-rc

Thanks to Christoph for looking at the early version and providing
very good suggestions, such as introducing bio_init_with_vec_table(),
removing other unnecessary helpers, and so on.

TODO:
- cleanup direct access to bvec table for MD

V1:
- against v4.10-rc1; some cleanups from V0 are already in -linus
- handle queue_virt_boundary() in the mp bvec change and make NVMe happy
- further BTRFS cleanup
- remove QUEUE_FLAG_SPLIT_MP
- rename the two new helpers of bio_for_each_segment_all()
- fix bounce conversion
- address comments on V0

[1], http://marc.info/?l=linux-kernel&m=141680246629547&w=2
[2], https://patchwork.kernel.org/patch/9451523/
[3], http://marc.info/?t=14773544711&r=1&w=2
[4], http://marc.info/?l=linux-mm&m=147745525801433&w=2


Ming Lei (54):
  block: drbd: comment on direct access bvec table
  block: loop: comment on direct access to bvec table
  kernel/power/swap.c: comment on direct access to bvec table
  mm: page_io.c: comment on direct access to bvec table
  fs/buffer: comment on direct access to bvec table
  f2fs: f2fs_read_end_io: comment on direct access to bvec table
  bcache: comment on direct access to bvec table
  block: comment on bio_alloc_pages()
  block: comment on bio_iov_iter_get_pages()
  block: introduce flag QUEUE_FLAG_NO_MP
  md: set NO_MP for request queue of md
  dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE
  block: comments on bio_for_each_segment[_all]
  block: introduce multipage/single page bvec helpers
  block: implement sp version of bvec iterator helpers
  block: introduce bio_for_each_segment_mp()
  block: introduce bio_clone_sp()
  bvec_iter: introduce BVEC_ITER_ALL_INIT
  block: bounce: avoid direct access to bvec table
  block: bounce: don't access bio->bi_io_vec in copy_to_high_bio_irq
  block: introduce bio_can_convert_to_sp()
  block: bounce: convert multipage bvecs into singlepage
  bcache: handle bio_clone() & bvec updating 

Re: [Cluster-devel] [PATCH 2/9 v2] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

2017-01-05 Thread Brian Foster
On Fri, Dec 16, 2016 at 04:40:41PM +0100, Michal Hocko wrote:
> Updated patch after Mike noticed a BUG_ON when KM_NOLOCKDEP is used.
> ---
> From 1497e713e11639157aef21cae29052cb3dc7ab44 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Thu, 15 Dec 2016 13:06:43 +0100
> Subject: [PATCH] xfs: introduce and use KM_NOLOCKDEP to silence reclaim
>  lockdep false positives
> 
> Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
> KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
> also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
> KM_NOFS tags to keep lockdep happy") and use the new flag for them
> instead. There is really no reason to make these allocations contexts
> weaker just because of the lockdep which even might not be enabled
> in most cases.
> 

Hi Michal,

I haven't gone back to fully grok b17cb364dbbb ("xfs: fix missing
KM_NOFS tags to keep lockdep happy"), so I'm not really familiar with
the original problem. FWIW, there was another KM_NOFS instance added by
that commit in xlog_cil_prepare_log_vecs() that is now in
xlog_cil_alloc_shadow_bufs(). Perhaps Dave can confirm whether the
original issue still applies..?

Brian

> Changes since v1
> - check for KM_NOLOCKDEP in kmem_flags_convert to not hit sanity BUG_ON
>   as per Mike Galbraith
> 
> Signed-off-by: Michal Hocko 
> ---
>  fs/xfs/kmem.h| 6 +-
>  fs/xfs/libxfs/xfs_da_btree.c | 4 ++--
>  fs/xfs/xfs_buf.c | 2 +-
>  fs/xfs/xfs_dir2_readdir.c| 2 +-
>  4 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 689f746224e7..d5d634ef1f7f 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -33,6 +33,7 @@ typedef unsigned __bitwise xfs_km_flags_t;
>  #define KM_NOFS  ((__force xfs_km_flags_t)0x0004u)
>  #define KM_MAYFAIL   ((__force xfs_km_flags_t)0x0008u)
>  #define KM_ZERO  ((__force xfs_km_flags_t)0x0010u)
> +#define KM_NOLOCKDEP ((__force xfs_km_flags_t)0x0020u)
>  
>  /*
>   * We use a special process flag to avoid recursive callbacks into
> @@ -44,7 +45,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  {
>   gfp_t   lflags;
>  
> - BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO));
> + BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP));
>  
>   if (flags & KM_NOSLEEP) {
>   lflags = GFP_ATOMIC | __GFP_NOWARN;
> @@ -57,6 +58,9 @@ kmem_flags_convert(xfs_km_flags_t flags)
>   if (flags & KM_ZERO)
>   lflags |= __GFP_ZERO;
>  
> + if (flags & KM_NOLOCKDEP)
> + lflags |= __GFP_NOLOCKDEP;
> +
>   return lflags;
>  }
>  
> diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
> index f2dc1a950c85..b8b5f6914863 100644
> --- a/fs/xfs/libxfs/xfs_da_btree.c
> +++ b/fs/xfs/libxfs/xfs_da_btree.c
> @@ -2429,7 +2429,7 @@ xfs_buf_map_from_irec(
>  
>   if (nirecs > 1) {
>   map = kmem_zalloc(nirecs * sizeof(struct xfs_buf_map),
> -   KM_SLEEP | KM_NOFS);
> +   KM_SLEEP | KM_NOLOCKDEP);
>   if (!map)
>   return -ENOMEM;
>   *mapp = map;
> @@ -2488,7 +2488,7 @@ xfs_dabuf_map(
>*/
>   if (nfsb != 1)
>   irecs = kmem_zalloc(sizeof(irec) * nfsb,
> - KM_SLEEP | KM_NOFS);
> + KM_SLEEP | KM_NOLOCKDEP);
>  
>   nirecs = nfsb;
>   error = xfs_bmapi_read(dp, (xfs_fileoff_t)bno, nfsb, irecs,
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 7f0a01f7b592..f31ae592dcae 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1785,7 +1785,7 @@ xfs_alloc_buftarg(
>  {
>   xfs_buftarg_t   *btp;
>  
> - btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOFS);
> + btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOLOCKDEP);
>  
>   btp->bt_mount = mp;
>   btp->bt_dev =  bdev->bd_dev;
> diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> index 003a99b83bd8..033ed65d7ce6 100644
> --- a/fs/xfs/xfs_dir2_readdir.c
> +++ b/fs/xfs/xfs_dir2_readdir.c
> @@ -503,7 +503,7 @@ xfs_dir2_leaf_getdents(
>   length = howmany(bufsize + geo->blksize, (1 << geo->fsblog));
>   map_info = kmem_zalloc(offsetof(struct xfs_dir2_leaf_map_info, map) +
>   (length * sizeof(struct xfs_bmbt_irec)),
> -KM_SLEEP | KM_NOFS);
> +KM_SLEEP | KM_NOLOCKDEP);
>   map_info->map_size = length;
>  
>   /*
> -- 
> 2.10.2
> 
> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Cluster-devel] [PATCH v1 43/54] gfs2: convert to bio_for_each_segment_all_sp()

2017-01-05 Thread Ming Lei
Signed-off-by: Ming Lei 
---
 fs/gfs2/lops.c| 3 ++-
 fs/gfs2/meta_io.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index b1f9144b42c7..ddbd1f772cdb 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -208,13 +208,14 @@ static void gfs2_end_log_write(struct bio *bio)
struct bio_vec *bvec;
struct page *page;
int i;
+   struct bvec_iter_all bia;
 
if (bio->bi_error) {
sdp->sd_log_error = bio->bi_error;
fs_err(sdp, "Error %d writing to log\n", bio->bi_error);
}
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all_sp(bvec, bio, i, bia) {
page = bvec->bv_page;
if (page_has_buffers(page))
gfs2_end_log_write_bh(sdp, bvec, bio->bi_error);
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 49db8ef13fdf..317cc8ed74ce 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -190,8 +190,9 @@ static void gfs2_meta_read_endio(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all bia;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all_sp(bvec, bio, i, bia) {
struct page *page = bvec->bv_page;
struct buffer_head *bh = page_buffers(page);
unsigned int len = bvec->bv_len;
-- 
2.7.4



Re: [Cluster-devel] [PATCH 3/9] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

2017-01-05 Thread Brian Foster
On Thu, Dec 15, 2016 at 03:07:09PM +0100, Michal Hocko wrote:
> From: Michal Hocko 
> 
> xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> some time ago. We would like to make this concept more generic and use
> it for other filesystems as well. Let's start by giving the flag a
> more genric name PF_MEMALLOC_NOFS which is in line with an exiting

Typos: generic   existing

> PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> step before we introduce a full API for it as xfs uses the flag directly
> anyway.
> 
> This patch doesn't introduce any functional change.
> 
> Signed-off-by: Michal Hocko 
> ---

Otherwise seems fine to me:

Reviewed-by: Brian Foster 

>  fs/xfs/kmem.c |  4 ++--
>  fs/xfs/kmem.h |  2 +-
>  fs/xfs/libxfs/xfs_btree.c |  2 +-
>  fs/xfs/xfs_aops.c |  6 +++---
>  fs/xfs/xfs_trans.c| 12 ++--
>  include/linux/sched.h |  2 ++
>  6 files changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 339c696bbc01..a76a05dae96b 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
>* the filesystem here and potentially deadlocking.
>*/
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>   noio_flag = memalloc_noio_save();
>  
>   lflags = kmem_flags_convert(flags);
>   ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
>  
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>   memalloc_noio_restore(noio_flag);
>  
>   return ptr;
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index ea3984091d58..e40ddd12900b 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -51,7 +51,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
>   lflags = GFP_ATOMIC | __GFP_NOWARN;
>   } else {
>   lflags = GFP_KERNEL | __GFP_NOWARN;
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>   lflags &= ~__GFP_FS;
>   }
>  
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 21e6a6ab6b9a..a2672ba4dc33 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
>   struct xfs_btree_split_args *args = container_of(work,
>   struct xfs_btree_split_args, 
> work);
>   unsigned long   pflags;
> - unsigned long   new_pflags = PF_FSTRANS;
> + unsigned long   new_pflags = PF_MEMALLOC_NOFS;
>  
>   /*
>* we are in a transaction context here, but may also be doing work
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 0f56fcd3a5d5..61ca9f9c5a12 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
>* We hand off the transaction to the completion thread now, so
>* clear the flag here.
>*/
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>   return 0;
>  }
>  
> @@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
>* thus we need to mark ourselves as being in a transaction manually.
>* Similarly for freeze protection.
>*/
> - current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>   __sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
>  
>   /* we abort the update if there was an IO error */
> @@ -1015,7 +1015,7 @@ xfs_do_writepage(
>* Given that we do not allow direct reclaim to call us, we should
>* never be called while in a filesystem transaction.
>*/
> - if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
> + if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>   goto redirty;
>  
>   /*
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 70f42ea86dfb..f5969c8274fc 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -134,7 +134,7 @@ xfs_trans_reserve(
>   bool rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
>  
>   /* Mark this thread as being in a transaction */
> - current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  
>   /*
>* Attempt to reserve the needed disk blocks by decrementing
> @@ -144,7 +144,7 @@ 

Re: [Cluster-devel] [PATCH 5/9] xfs: use memalloc_nofs_{save, restore} instead of memalloc_noio*

2017-01-05 Thread Brian Foster
On Thu, Dec 15, 2016 at 03:07:11PM +0100, Michal Hocko wrote:
> From: Michal Hocko 
> 
> kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
> API to prevent from reclaim recursion into the fs because vmalloc can
> invoke unconditional GFP_KERNEL allocations and these functions might be
> called from the NOFS contexts. The memalloc_noio_save will enforce
> GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
> unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
> provide exactly what we need here - implicit GFP_NOFS context.
> 
> Signed-off-by: Michal Hocko 
> ---
>  fs/xfs/kmem.c| 10 +-
>  fs/xfs/xfs_buf.c |  8 
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f31ae592dcae..5c6f9bd4d8be 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -441,17 +441,17 @@ _xfs_buf_map_pages(
>   bp->b_addr = NULL;
>   } else {
>   int retried = 0;
> - unsigned noio_flag;
> + unsigned nofs_flag;
>  
>   /*
>* vm_map_ram() will allocate auxillary structures (e.g.
>* pagetables) with GFP_KERNEL, yet we are likely to be under
>* GFP_NOFS context here. Hence we need to tell memory reclaim
> -  * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> +  * that we are in such a context via PF_MEMALLOC_NOFS to prevent
>* memory reclaim re-entering the filesystem here and
>* potentially deadlocking.
>*/
> - noio_flag = memalloc_noio_save();
> + nofs_flag = memalloc_nofs_save();
>   do {
>   bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
>   -1, PAGE_KERNEL);
> @@ -459,7 +459,7 @@ _xfs_buf_map_pages(
>   break;
>   vm_unmap_aliases();
>   } while (retried++ <= 1);
> - memalloc_noio_restore(noio_flag);
> + memalloc_noio_restore(nofs_flag);

memalloc_nofs_restore() ?

Brian

>  
>   if (!bp->b_addr)
>   return -ENOMEM;
> -- 
> 2.10.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [Cluster-devel] [PATCH 0/9 v2] scope GFP_NOFS api

2017-01-05 Thread Mike Galbraith
On Thu, 2016-12-15 at 15:07 +0100, Michal Hocko wrote:
> Hi,
> I have posted the previous version here [1]. Since then I have added a
> support to suppress reclaim lockdep warnings (__GFP_NOLOCKDEP) to allow
> removing GFP_NOFS usage motivated by the lockdep false positives. On top
> of that I've tried to convert few KM_NOFS usages to use the new flag in
> the xfs code base. This would need a review from somebody familiar with
> xfs of course.

The wild ass guess below prevents the xfs explosion below when running
ltp zram tests.

---
 fs/xfs/kmem.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -45,7 +45,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 {
gfp_t   lflags;
 
-   BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO));
+   BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP));
 
if (flags & KM_NOSLEEP) {
lflags = GFP_ATOMIC | __GFP_NOWARN;

[  108.775501] [ cut here ]
[  108.775503] kernel BUG at fs/xfs/kmem.h:48!
[  108.775504] invalid opcode:  [#1] SMP
[  108.775505] Dumping ftrace buffer:
[  108.775508](ftrace buffer empty)
[  108.775508] Modules linked in: xfs(E) libcrc32c(E) btrfs(E) xor(E) 
raid6_pq(E) zram(E) ebtable_filter(E) ebtables(E) fuse(E) bridge(E) stp(E) 
llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) xt_tcpudp(E) 
nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) ipt_REJECT(E) 
iptable_raw(E) iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) 
nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) 
xt_conntrack(E) nf_conntrack(E) ip6table_filter(E) ip6_tables(E) x_tables(E) 
nls_iso8859_1(E) nls_cp437(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) 
coretemp(E) kvm_intel(E) nfsd(E) kvm(E) auth_rpcgss(E) nfs_acl(E) lockd(E) 
snd_hda_codec_realtek(E) snd_hda_codec_hdmi(E) pl2303(E) grace(E) 
snd_hda_codec_generic(E) usbserial(E) irqbypass(E) snd_hda_intel(E) 
snd_hda_codec(E)
[  108.775523]  snd_hwdep(E) sunrpc(E) snd_hda_core(E) crct10dif_pclmul(E) 
mei_me(E) mei(E) serio_raw(E) snd_pcm(E) crc32_pclmul(E) snd_timer(E) 
crc32c_intel(E) aesni_intel(E) aes_x86_64(E) crypto_simd(E) iTCO_wdt(E) 
iTCO_vendor_support(E) lpc_ich(E) mfd_core(E) snd(E) soundcore(E) joydev(E) 
fan(E) shpchp(E) tpm_infineon(E) cryptd(E) battery(E) thermal(E) pcspkr(E) 
glue_helper(E) usblp(E) intel_smartconnect(E) i2c_i801(E) efivarfs(E) 
hid_logitech_hidpp(E) hid_logitech_dj(E) hid_generic(E) usbhid(E) nouveau(E) 
wmi(E) i2c_algo_bit(E) ahci(E) libahci(E) drm_kms_helper(E) syscopyarea(E) 
sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ehci_pci(E) xhci_pci(E) ttm(E) 
ehci_hcd(E) xhci_hcd(E) r8169(E) libata(E) mii(E) drm(E) usbcore(E) fjes(E) 
video(E) button(E) af_packet(E) sd_mod(E) vfat(E) fat(E) ext4(E) crc16(E)
[  108.775540]  jbd2(E) mbcache(E) dm_mod(E) loop(E) sg(E) scsi_mod(E) 
autofs4(E)
[  108.775544] CPU: 5 PID: 4495 Comm: mount Tainted: GE   
4.10.0-master #4
[  108.775545] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
09/23/2013
[  108.775546] task: 8803f9e54e00 task.stack: c900018fc000
[  108.775565] RIP: 0010:kmem_flags_convert.part.0+0x4/0x6 [xfs]
[  108.775565] RSP: 0018:c900018ffcd8 EFLAGS: 00010202
[  108.775566] RAX: 8803f630a800 RBX: 8803f6b2 RCX: 1000
[  108.775567] RDX: 1000 RSI: 0031 RDI: 00b0
[  108.775568] RBP: c900018ffcd8 R08: 00019fe0 R09: 8803f6b2
[  108.775568] R10: 0005 R11: 00010641 R12: 88041e21ea00
[  108.775569] R13: 8803f6b2 R14:  R15: fff4
[  108.775570] FS:  7f1cbee9e840() GS:88041ed4() 
knlGS:
[  108.775571] CS:  0010 DS:  ES:  CR0: 80050033
[  108.775571] CR2: 7f5d4b6ed000 CR3: 0003fbf13000 CR4: 001406e0
[  108.775572] Call Trace:
[  108.775588]  kmem_alloc+0x100/0x100 [xfs]
[  108.775591]  ? kstrndup+0x49/0x60
[  108.775605]  xfs_alloc_buftarg+0x23/0xd0 [xfs]
[  108.775619]  xfs_open_devices+0x8c/0x170 [xfs]
[  108.775621]  ? sb_set_blocksize+0x1d/0x50
[  108.775633]  xfs_fs_fill_super+0x234/0x580 [xfs]
[  108.775635]  mount_bdev+0x184/0x1c0
[  108.775647]  ? xfs_test_remount_options.isra.15+0x60/0x60 [xfs]
[  108.775658]  xfs_fs_mount+0x15/0x20 [xfs]
[  108.775659]  mount_fs+0x15/0x90
[  108.775661]  vfs_kern_mount+0x67/0x130
[  108.775663]  do_mount+0x190/0xbd0
[  108.775664]  ? memdup_user+0x42/0x60
[  108.775665]  SyS_mount+0x83/0xd0
[  108.775668]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[  108.775669] RIP: 0033:0x7f1cbe7bf78a
[  108.775669] RSP: 002b:7ffc99215198 EFLAGS: 0202 ORIG_RAX: 
00a5
[  108.775670] RAX: ffda RBX: 7f1cbeabb3b8 RCX: 7f1cbe7bf78a
[  108.775671] RDX: 55715a611690 RSI: 55715a60d270 RDI: 55715a60d2d0
[  108.775671] RBP: 55715a60d120 R08: 

Re: [Cluster-devel] [PATCH 0/9 v2] scope GFP_NOFS api

2017-01-05 Thread Mike Galbraith
On Fri, 2016-12-16 at 16:35 +0100, Michal Hocko wrote:
> On Fri 16-12-16 16:05:58, Mike Galbraith wrote:
> > On Thu, 2016-12-15 at 15:07 +0100, Michal Hocko wrote:
> > > Hi,
> > > I have posted the previous version here [1]. Since then I have added a
> > > support to suppress reclaim lockdep warnings (__GFP_NOLOCKDEP) to allow
> > > removing GFP_NOFS usage motivated by the lockdep false positives. On top
> > > of that I've tried to convert few KM_NOFS usages to use the new flag in
> > > the xfs code base. This would need a review from somebody familiar with
> > > xfs of course.
> > 
> > The wild ass guess below prevents the xfs explosion below when running
> > ltp zram tests.
> 
> Yes this looks correct. Thanks for noticing. I will fold it to the
> patch2. Thanks for testing Mike!

I had ulterior motives, was hoping you might have made the irksome RT
gripe below just _go away_, as staring at it ain't working out ;-)

[ 1441.309006] =
[ 1441.309006] [ INFO: possible irq lock inversion dependency detected ]
[ 1441.309007] 4.10.0-rt9-rt #11 Tainted: GE  
[ 1441.309007] -
[ 1441.309008] kswapd0/165 just changed the state of lock:
[ 1441.309009]  (&journal->j_state_lock){+.+.-.}, at: [] 
jbd2_complete_transaction+0x20/0x90 [jbd2]
[ 1441.309017] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1441.309017]  (&tb->tb6_lock){+.+.+.}
[ 1441.309018] and interrupts could create inverse lock ordering between them.
[ 1441.309018] other info that might help us debug this:
[ 1441.309018] Chain exists of: &journal->j_state_lock --> 
&journal->j_list_lock --> &tb->tb6_lock
[ 1441.309019]  Possible interrupt unsafe locking scenario:
[ 1441.309019]CPU0CPU1
[ 1441.309019]
[ 1441.309019]   lock(&tb->tb6_lock);
[ 1441.309020]local_irq_disable();
[ 1441.309020]lock(&journal->j_state_lock);
[ 1441.309020]lock(&journal->j_list_lock);
[ 1441.309021]   <Interrupt>
[ 1441.309021] lock(&journal->j_state_lock);
[ 1441.309021] *** DEADLOCK ***
[ 1441.309022] 2 locks held by kswapd0/165:
[ 1441.309022]  #0:  (shrinker_rwsem){+.+...}, at: [] 
shrink_slab+0x7a/0x6c0
[ 1441.309027]  #1:  (&type->s_umount_key#29){+.+.+.}, at: [] 
trylock_super+0x1b/0x50
[ 1441.309030] the shortest dependencies between 2nd lock and 1st lock:
[ 1441.309031]-> (&tb->tb6_lock){+.+.+.} ops: 271 {
[ 1441.309032]   HARDIRQ-ON-W at:
[ 1441.309035] [] __lock_acquire+0x938/0x1770
[ 1441.309036] [] lock_acquire+0xd4/0x270
[ 1441.309039] [] rt_write_lock+0x31/0x40
[ 1441.309041] [] __ip6_ins_rt+0x33/0x70
[ 1441.309043] [] ip6_route_add+0x81/0xd0
[ 1441.309044] [] addrconf_prefix_route+0x133/0x1d0
[ 1441.309046] [] inet6_addr_add+0x1eb/0x250
[ 1441.309047] [] inet6_rtm_newaddr+0x33b/0x410
[ 1441.309049] [] rtnetlink_rcv_msg+0x95/0x220
[ 1441.309051] [] netlink_rcv_skb+0xa7/0xc0
[ 1441.309053] [] rtnetlink_rcv+0x28/0x30
[ 1441.309054] [] netlink_unicast+0x143/0x1f0
[ 1441.309055] [] netlink_sendmsg+0x322/0x3a0
[ 1441.309057] [] sock_sendmsg+0x38/0x50
[ 1441.309058] [] SYSC_sendto+0xf6/0x170
[ 1441.309060] [] SyS_sendto+0xe/0x10
[ 1441.309061] [] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309061]   SOFTIRQ-ON-W at:
[ 1441.309063] [] __lock_acquire+0x283/0x1770
[ 1441.309064] [] lock_acquire+0xd4/0x270
[ 1441.309064] [] rt_write_lock+0x31/0x40
[ 1441.309065] [] __ip6_ins_rt+0x33/0x70
[ 1441.309067] [] ip6_route_add+0x81/0xd0
[ 1441.309067] [] addrconf_prefix_route+0x133/0x1d0
[ 1441.309068] [] inet6_addr_add+0x1eb/0x250
[ 1441.309069] [] inet6_rtm_newaddr+0x33b/0x410
[ 1441.309071] [] rtnetlink_rcv_msg+0x95/0x220
[ 1441.309073] [] netlink_rcv_skb+0xa7/0xc0
[ 1441.309074] [] rtnetlink_rcv+0x28/0x30
[ 1441.309075] [] netlink_unicast+0x143/0x1f0
[ 1441.309077] [] netlink_sendmsg+0x322/0x3a0
[ 1441.309078] [] sock_sendmsg+0x38/0x50
[ 1441.309079] [] SYSC_sendto+0xf6/0x170
[ 1441.309080] [] SyS_sendto+0xe/0x10
[ 1441.309081] [] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309081]   RECLAIM_FS-ON-W at:
[ 1441.309082] [] mark_held_locks+0x66/0x90
[ 1441.309084] [] lockdep_trace_alloc+0xd8/0x120
[ 1441.309085] [] kmem_cache_alloc_node+0x36/0x310
[ 1441.309086] [] __alloc_skb+0x4e/0x280
[ 1441.309088] [] inet6_rt_notify+0x5c/0x130
[ 1441.309089] [] fib6_add+0x56b/0xa30
[ 1441.309090] [] __ip6_ins_rt+0x48/0x70
[ 1441.309091] [] ip6_route_add+0x81/0xd0
[ 1441.309092] [] addrconf_prefix_route+0x133/0x1d0
[ 1441.309093] [] inet6_addr_add+0x1eb/0x250
[ 1441.309094] [] inet6_rtm_newaddr+0x33b/0x410
[ 1441.309096] [] rtnetlink_rcv_msg+0x95/0x220
[ 1441.309097] [] netlink_rcv_skb+0xa7/0xc0
[ 1441.309098] [] rtnetlink_rcv+0x28/0x30
[ 1441.309099] [] netlink_unicast+0x143/0x1f0
[ 1441.309100] [] netlink_sendmsg+0x322/0x3a0
[ 1441.309102] [] sock_sendmsg+0x38/0x50
[ 1441.309103] [] SYSC_sendto+0xf6/0x170
[ 1441.309104] [] SyS_sendto+0xe/0x10
[ 1441.309105] [] 

Re: [Cluster-devel] [PATCH 2/9] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

2017-01-05 Thread Darrick J. Wong
On Tue, Dec 20, 2016 at 08:24:13AM +1100, Dave Chinner wrote:
> On Thu, Dec 15, 2016 at 03:07:08PM +0100, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
> > KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
> > also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
> > KM_NOFS tags to keep lockdep happy") and use the new flag for them
> > instead. There is really no reason to make these allocations contexts
> > weaker just because of the lockdep which even might not be enabled
> > in most cases.
> > 
> > Signed-off-by: Michal Hocko 
> 
> I'd suggest that it might be better to drop this patch for now -
> it's not necessary for the context flag changeover but does
> introduce a risk of regressions if the conversion is wrong.

I was just about to write in that while I didn't see anything obviously
wrong with the NOFS removals, I also don't know for sure that we can't
end up recursively in those code paths (specifically the directory
traversal thing).

--D

> Hence I think this is better as a completely separate series
> which audits and changes all the unnecessary KM_NOFS allocations
> in one go. I've never liked whack-a-mole style changes like this -
> do it once, do it properly
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html