Re: [PATCH] Add prctl support for controlling PF_MEMALLOC V2

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 12:27:29, Mike Christie wrote:
> On 10/23/2019 02:11 AM, Michal Hocko wrote:
> > On Wed 23-10-19 07:43:44, Dave Chinner wrote:
> >> On Tue, Oct 22, 2019 at 06:33:10PM +0200, Michal Hocko wrote:
> > 
> > Thanks for more clarification regarding PF_LESS_THROTTLE.
> > 
> > [...]
> >>> PF_IO_FLUSHER would mean that the user
> >>> context is a part of the IO path and therefore there are certain reclaim
> >>> recursion restrictions.
> >>
> >> If PF_IO_FLUSHER just maps to PF_LESS_THROTTLE|PF_MEMALLOC_NOIO,
> >> then I'm not sure we need a new definition. Maybe that's the ptrace
> >> flag name, but in the kernel we don't need a PF_IO_FLUSHER process
> >> flag...
> > 
> > Yes, the internal implementation would do something like that. I was
> > more interested in the user space visible API at this stage. Something
> > generic enough because exporting MEMALLOC flags is just a bad idea IMHO
> > (especially PF_MEMALLOC).
> 
> Do you mean we would do something like:
> 
> prctl()
> 
> case PF_SET_IO_FLUSHER:
> current->flags |= PF_MEMALLOC_NOIO;
> 

yes, along with PF_LESS_THROTTLE.
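Untested sketch of what I have in mind - a single high level flag which
internally maps to the existing task flags (keeping the capability check
from your patch; PF_SET_IO_FLUSHER is just a working name at this point):

	case PF_SET_IO_FLUSHER:
		if (!capable(CAP_SYS_ADMIN))
			return -EPERM;
		/* the task is a part of the IO path, restrict reclaim recursion */
		current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
		break;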

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] Add prctl support for controlling PF_MEMALLOC V2

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 07:43:44, Dave Chinner wrote:
> On Tue, Oct 22, 2019 at 06:33:10PM +0200, Michal Hocko wrote:

Thanks for more clarification regarding PF_LESS_THROTTLE.

[...]
> > PF_IO_FLUSHER would mean that the user
> > context is a part of the IO path and therefore there are certain reclaim
> > recursion restrictions.
> 
> If PF_IO_FLUSHER just maps to PF_LESS_THROTTLE|PF_MEMALLOC_NOIO,
> then I'm not sure we need a new definition. Maybe that's the ptrace
> flag name, but in the kernel we don't need a PF_IO_FLUSHER process
> flag...

Yes, the internal implementation would do something like that. I was
more interested in the user space visible API at this stage. Something
generic enough because exporting MEMALLOC flags is just a bad idea IMHO
(especially PF_MEMALLOC).

> > > >> This patch allows the userspace daemons to set the PF_MEMALLOC* flags
> > > >> with prctl during their initialization so later allocations cannot
> > > >> call back into them.
> > > > 
> > > > TBH I am not really happy to export these to the userspace. They are
> > > > an internal implementation detail and the userspace shouldn't really
> > > 
> > > They care in these cases, because block/fs drivers must be able to make
> > > forward progress during writes. To meet this guarantee kernel block
> > > drivers use mempools and memalloc/GFP flags.
> > > 
> > > For these userspace components of the block/fs drivers they already do
> > > things normal daemons do not to meet that guarantee like mlock their
> > > memory, disable oom killer, and preallocate resources they have control
> > > over. They have no control over reclaim like the kernel drivers do so
> > > it's easy for us to deadlock when memory gets low.
> > 
> > OK, fair enough. How much control do they really need though? Is a
> > single PF_IO_FLUSHER as explained above (essentially implying GFP_NOIO
> > context) sufficient?
> 
> I think some of these userspace processes work at the filesystem
> level and so really only need GFP_NOFS allocation (fuse), while
> others work at the block device level (iscsi, nbd) so need GFP_NOIO
> allocation. So there's definitely an argument for providing both...

The main question is whether providing more APIs is really necessary. Is
there any real problem with giving them only PF_IO_FLUSHER and letting both
groups use this one? It will imply more reclaim restrictions for the purely
FS based ones, but is this a practical problem? If yes, we can always add
PF_FS_$FOO later on.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] Add prctl support for controlling PF_MEMALLOC V2

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 11:13:20, Mike Christie wrote:
> On 10/22/2019 06:24 AM, Michal Hocko wrote:
> > On Mon 21-10-19 16:41:37, Mike Christie wrote:
> >> There are several storage drivers like dm-multipath, iscsi, tcmu-runner,
> >> and nbd that have userspace components that can run in the IO path. For
> >> example, iscsi and nbd's userspace daemons may need to recreate a socket
> >> and/or send IO on it, and dm-multipath's daemon multipathd may need to
> >> send IO to figure out the state of paths and re-set them up.
> >>
> >> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
> >> memalloc_*_save/restore functions to control the allocation behavior,
> >> but for userspace we would end up hitting an allocation that ended up
> >> writing data back to the same device we are trying to allocate for.
> > 
> > Which code paths are we talking about here? Any ioctl or is this a
> > general syscall path? Can we mark the process in a more generic way?
> 
> It depends on the daemon. The common ones, for example iscsi and nbd,
> need network related calls like sendmsg, recvmsg, socket, etc.
> tcmu-runner could need the network ones and also read and write when it
> does IO to a FS or device. dm-multipath needs the sg io ioctls.

OK, so there is not a clear kernel entry point that could be explicitly
annotated. This would imply a per task context. This is important
information. And I am wondering how those usecases ever worked in the
first place. This is not a minor detail.
 
> > E.g. we have PF_LESS_THROTTLE (used by nfsd). It doesn't affect the
> > reclaim recursion but it shows a pattern that doesn't really exhibit
> > too many internals. Maybe we need PF_IO_FLUSHER or similar?
> 
> I am not familiar with PF_IO_FLUSHER. If it prevents the recursion
> problem then please send me details and I will look into it for the next
> posting.

PF_IO_FLUSHER doesn't exist. I just wanted to point out that similarly
to PF_LESS_THROTTLE it should be a more high level per task flag rather
than something as low level as a direct control of gfp allocation
context. PF_LESS_THROTTLE simply says that the task is a part of the
reclaim process and therefore shouldn't be subject to normal
throttling - whatever that means. PF_IO_FLUSHER would mean that the user
context is a part of the IO path and therefore there are certain reclaim
recursion restrictions.
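For comparison, this is the scoped interface the kernel side uses today
(memalloc_noio_save/restore); the per task flag discussed here would
essentially turn the whole task lifetime into such a scope rather than an
explicit region (illustrative snippet only):

	unsigned int noio_flags;

	noio_flags = memalloc_noio_save();
	/* any allocation in this scope implicitly behaves as GFP_NOIO */
	buf = kmalloc(size, GFP_KERNEL);
	memalloc_noio_restore(noio_flags);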
 
> >> This patch allows the userspace daemons to set the PF_MEMALLOC* flags
> >> with prctl during their initialization so later allocations cannot
> >> call back into them.
> > 
> > TBH I am not really happy to export these to the userspace. They are
> > an internal implementation detail and the userspace shouldn't really
> 
> They care in these cases, because block/fs drivers must be able to make
> forward progress during writes. To meet this guarantee kernel block
> drivers use mempools and memalloc/GFP flags.
> 
> For these userspace components of the block/fs drivers they already do
> things normal daemons do not to meet that guarantee like mlock their
> memory, disable oom killer, and preallocate resources they have control
> over. They have no control over reclaim like the kernel drivers do so
> > it's easy for us to deadlock when memory gets low.

OK, fair enough. How much control do they really need though? Is a
single PF_IO_FLUSHER as explained above (essentially implying GFP_NOIO
context) sufficient?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] Add prctl support for controlling PF_MEMALLOC V2

2019-10-22 Thread Michal Hocko
On Mon 21-10-19 16:41:37, Mike Christie wrote:
> There are several storage drivers like dm-multipath, iscsi, tcmu-runner,
> and nbd that have userspace components that can run in the IO path. For
> example, iscsi and nbd's userspace daemons may need to recreate a socket
> and/or send IO on it, and dm-multipath's daemon multipathd may need to
> send IO to figure out the state of paths and re-set them up.
> 
> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
> memalloc_*_save/restore functions to control the allocation behavior,
> but for userspace we would end up hitting an allocation that ended up
> writing data back to the same device we are trying to allocate for.

Which code paths are we talking about here? Any ioctl or is this a
general syscall path? Can we mark the process in a more generic way?
E.g. we have PF_LESS_THROTTLE (used by nfsd). It doesn't affect the
reclaim recursion but it shows a pattern that doesn't really exhibit
too many internals. Maybe we need PF_IO_FLUSHER or similar?

> This patch allows the userspace daemons to set the PF_MEMALLOC* flags
> with prctl during their initialization so later allocations cannot
> call back into them.

TBH I am not really happy to export these to the userspace. They are
an internal implementation detail and the userspace shouldn't really
care. So if this is really necessary then we need very good arguments
and documentation to make the usage clear.
 
> Signed-off-by: Mike Christie 
> ---
> 
> V2:
> - Use prctl instead of procfs.
> - Add support for NOFS for fuse.
> - Check permissions.
> 
>  include/uapi/linux/prctl.h |  8 ++++++++
>  kernel/sys.c               | 44 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 52 insertions(+)
> 
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 7da1b37b27aa..6f6b3af6633a 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -234,4 +234,12 @@ struct prctl_mm_map {
>  #define PR_GET_TAGGED_ADDR_CTRL  56
>  # define PR_TAGGED_ADDR_ENABLE   (1UL << 0)
>  
> +/* Control reclaim behavior when allocating memory */
> +#define PR_SET_MEMALLOC  57
> +#define PR_GET_MEMALLOC  58
> +#define PR_MEMALLOC_SET_NOIO (1UL << 0)
> +#define PR_MEMALLOC_CLEAR_NOIO   (1UL << 1)
> +#define PR_MEMALLOC_SET_NOFS (1UL << 2)
> +#define PR_MEMALLOC_CLEAR_NOFS   (1UL << 3)
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index a611d1d58c7d..34fedc9fc7e4 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2486,6 +2486,50 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  			return -EINVAL;
>  		error = GET_TAGGED_ADDR_CTRL();
>  		break;
> +	case PR_SET_MEMALLOC:
> +		if (!capable(CAP_SYS_ADMIN))
> +			return -EPERM;
> +
> +		if (arg3 || arg4 || arg5)
> +			return -EINVAL;
> +
> +		switch (arg2) {
> +		case PR_MEMALLOC_SET_NOIO:
> +			if (current->flags & PF_MEMALLOC_NOFS)
> +				return -EINVAL;
> +
> +			current->flags |= PF_MEMALLOC_NOIO;
> +			break;
> +		case PR_MEMALLOC_CLEAR_NOIO:
> +			current->flags &= ~PF_MEMALLOC_NOIO;
> +			break;
> +		case PR_MEMALLOC_SET_NOFS:
> +			if (current->flags & PF_MEMALLOC_NOIO)
> +				return -EINVAL;
> +
> +			current->flags |= PF_MEMALLOC_NOFS;
> +			break;
> +		case PR_MEMALLOC_CLEAR_NOFS:
> +			current->flags &= ~PF_MEMALLOC_NOFS;
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +		break;
> +	case PR_GET_MEMALLOC:
> +		if (!capable(CAP_SYS_ADMIN))
> +			return -EPERM;
> +
> +		if (arg2 || arg3 || arg4 || arg5)
> +			return -EINVAL;
> +
> +		if (current->flags & PF_MEMALLOC_NOIO)
> +			error = PR_MEMALLOC_SET_NOIO;
> +		else if (current->flags & PF_MEMALLOC_NOFS)
> +			error = PR_MEMALLOC_SET_NOFS;
> +		else
> +			error = 0;
> +		break;
>  	default:
>  		error = -EINVAL;
>  		break;
> -- 
> 2.20.1
> 
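For the record, my understanding of the intended usage on the daemon side
(with the constants from the patch above; just a sketch) is:

	#include <err.h>
	#include <sys/prctl.h>
	#include <linux/prctl.h>	/* the patched header providing PR_SET_MEMALLOC */

	/* during daemon initialization, before any IO is handled */
	if (prctl(PR_SET_MEMALLOC, PR_MEMALLOC_SET_NOIO, 0, 0, 0))
		err(1, "PR_SET_MEMALLOC");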

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 2/2] mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)

2019-09-30 Thread Michal Hocko
On Mon 23-09-19 18:36:32, Vlastimil Babka wrote:
> On 8/26/19 1:16 PM, Vlastimil Babka wrote:
> > In most configurations, kmalloc() happens to return naturally aligned (i.e.
> > aligned to the block size itself) blocks for power of two sizes. That means
> > some kmalloc() users might unknowingly rely on that alignment, until stuff
> > breaks when the kernel is built with e.g.  CONFIG_SLUB_DEBUG or CONFIG_SLOB,
> > and blocks stop being aligned. Then developers have to devise workarounds such
> > as own kmem caches with specified alignment [1], which is not always practical,
> > as recently evidenced in [2].
> > 
> > The topic has been discussed at LSF/MM 2019 [3]. Adding a 'kmalloc_aligned()'
> > variant would not help with code unknowingly relying on the implicit alignment.
> > For slab implementations it would either require creating more kmalloc caches,
> > or allocate a larger size and only give back part of it. That would be
> > wasteful, especially with a generic alignment parameter (in contrast with a
> > fixed alignment to size).
> > 
> > Ideally we should provide to mm users what they need without difficult
> > workarounds or own reimplementations, so let's make the kmalloc() alignment to
> > size explicitly guaranteed for power-of-two sizes under all configurations.
> > What does this mean for the three available allocators?
> > 
> > * SLAB object layout happens to be mostly unchanged by the patch. The
> >   implicitly provided alignment could be compromised with CONFIG_DEBUG_SLAB due
> >   to redzoning, however SLAB disables redzoning for caches with alignment
> >   larger than unsigned long long. Practically on at least x86 this includes
> >   kmalloc caches as they use cache line alignment, which is larger than that.
> >   Still, this patch ensures alignment on all arches and cache sizes.
> > 
> > * SLUB layout is also unchanged unless redzoning is enabled through
> >   CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache. With
> >   this patch, explicit alignment is guaranteed with redzoning as well. This
> >   will result in more memory being wasted, but that should be acceptable in a
> >   debugging scenario.
> > 
> > * SLOB has no implicit alignment so this patch adds it explicitly for
> >   kmalloc(). The potential downside is increased fragmentation. While
> >   pathological allocation scenarios are certainly possible, in my testing,
> >   after booting a x86_64 kernel+userspace with virtme, around 16MB memory
> >   was consumed by slab pages both before and after the patch, with difference
> >   in the noise.
> > 
> > [1] 
> > https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.le...@c-s.fr/
> > [2] 
> > https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming@redhat.com/
> > [3] https://lwn.net/Articles/787740/
> > 
> > Signed-off-by: Vlastimil Babka 
> 
> So if anyone thinks this is a good idea, please express it (preferably
> in a formal way such as Acked-by), otherwise it seems the patch will be
> dropped (due to a private NACK, apparently).

Sigh.

The existing code working around the lack of an alignment guarantee just shows
that this is necessary. And there wasn't any real technical argument
against it, except for highly theoretical optimizations/a new allocator that
would be constrained by the guarantee.

Therefore
Acked-by: Michal Hocko 

-- 
Michal Hocko
SUSE Labs


5.3-rc-8 hung task in IO (was: Re: lot of MemAvailable but falling cache and raising PSI)

2019-09-11 Thread Michal Hocko
> [55705.736285] INFO: task rsync:9830 blocked for more than 1087 seconds.
> [55705.736999]   Not tainted 5.3.0-rc8 #1
> [55705.737694] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [55705.738411] rsync   D0  9830   9829 0x4002
> [55705.739072] Call Trace:
> [55705.739455]  ? __schedule+0x3cf/0x680
> [55705.739837]  ? bit_wait+0x50/0x50
> [55705.740215]  schedule+0x39/0xa0
> [55705.740610]  io_schedule+0x12/0x40
> [55705.741243]  bit_wait_io+0xd/0x50
> [55705.741897]  __wait_on_bit+0x66/0x90
> [55705.742524]  ? bit_wait+0x50/0x50
> [55705.743131]  out_of_line_wait_on_bit+0x8b/0xb0
> [55705.743750]  ? init_wait_var_entry+0x40/0x40
> [55705.744128]  lock_extent_buffer_for_io+0x10b/0x2c0 [btrfs]
> [55705.744766]  btree_write_cache_pages+0x17d/0x350 [btrfs]
> [55705.745440]  ? btrfs_set_token_32+0x72/0x130 [btrfs]
> [55705.746118]  ? merge_state.part.47+0x3f/0x160 [btrfs]
> [55705.746753]  do_writepages+0x1a/0x60
> [55705.747411]  __filemap_fdatawrite_range+0xc8/0x100
> [55705.748106]  ? convert_extent_bit+0x2e8/0x580 [btrfs]
> [55705.748807]  btrfs_write_marked_extents+0x141/0x160 [btrfs]
> [55705.749495]  btrfs_write_and_wait_transaction.isra.26+0x58/0xb0 [btrfs]
> [55705.750190]  ? btrfs_commit_transaction+0x752/0x9d0 [btrfs]
> [55705.750890]  btrfs_commit_transaction+0x752/0x9d0 [btrfs]
> [55705.751580]  ? btrfs_log_dentry_safe+0x54/0x70 [btrfs]
> [55705.752293]  btrfs_sync_file+0x395/0x3e0 [btrfs]
> [55705.752981]  ? retarget_shared_pending+0x70/0x70
> [55705.753686]  do_fsync+0x38/0x60
> [55705.754340]  __x64_sys_fdatasync+0x13/0x20
> [55705.755012]  do_syscall_64+0x55/0x1a0
> [55705.755678]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [55705.756375] RIP: 0033:0x7f1db3fc85f0
> [55705.757042] Code: Bad RIP value.
> [55705.757690] RSP: 002b:7ffe6f827db8 EFLAGS: 0246 ORIG_RAX:
> 004b
> [55705.758300] RAX: ffda RBX: 0001 RCX:
> 7f1db3fc85f0
> [55705.758678] RDX: 7f1db4aa6060 RSI: 0003 RDI:
> 0001
> [55705.759107] RBP: 0001 R08:  R09:
> 81c492ca
> [55705.759785] R10: 0008 R11: 0246 R12:
> 0028
> [55705.760471] R13: 7ffe6f827e40 R14:  R15:
> 
> [55826.570182] INFO: task rsync:9830 blocked for more than 1208 seconds.
> [55826.571349]   Not tainted 5.3.0-rc8 #1
> [55826.572469] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [55826.573618] rsync   D0  9830   9829 0x4002
> [55826.574790] Call Trace:
> [55826.575932]  ? __schedule+0x3cf/0x680
> [55826.577079]  ? bit_wait+0x50/0x50
> [55826.578233]  schedule+0x39/0xa0
> [55826.579350]  io_schedule+0x12/0x40
> [55826.580451]  bit_wait_io+0xd/0x50
> [55826.581527]  __wait_on_bit+0x66/0x90
> [55826.582596]  ? bit_wait+0x50/0x50
> [55826.583178]  out_of_line_wait_on_bit+0x8b/0xb0
> [55826.583550]  ? init_wait_var_entry+0x40/0x40
> [55826.583953]  lock_extent_buffer_for_io+0x10b/0x2c0 [btrfs]
> [55826.584356]  btree_write_cache_pages+0x17d/0x350 [btrfs]
> [55826.584755]  ? btrfs_set_token_32+0x72/0x130 [btrfs]
> [55826.585155]  ? merge_state.part.47+0x3f/0x160 [btrfs]
> [55826.585547]  do_writepages+0x1a/0x60
> [55826.585937]  __filemap_fdatawrite_range+0xc8/0x100
> [55826.586352]  ? convert_extent_bit+0x2e8/0x580 [btrfs]
> [55826.586761]  btrfs_write_marked_extents+0x141/0x160 [btrfs]
> [55826.587171]  btrfs_write_and_wait_transaction.isra.26+0x58/0xb0 [btrfs]
> [55826.587581]  ? btrfs_commit_transaction+0x752/0x9d0 [btrfs]
> [55826.587990]  btrfs_commit_transaction+0x752/0x9d0 [btrfs]
> [55826.588406]  ? btrfs_log_dentry_safe+0x54/0x70 [btrfs]
> [55826.588818]  btrfs_sync_file+0x395/0x3e0 [btrfs]
> [55826.589219]  ? retarget_shared_pending+0x70/0x70
> [55826.589617]  do_fsync+0x38/0x60
> [55826.590011]  __x64_sys_fdatasync+0x13/0x20
> [55826.590411]  do_syscall_64+0x55/0x1a0
> [55826.590798]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [55826.591185] RIP: 0033:0x7f1db3fc85f0
> [55826.591572] Code: Bad RIP value.
> [55826.591952] RSP: 002b:7ffe6f827db8 EFLAGS: 0246 ORIG_RAX:
> 004b
> [55826.592347] RAX: ffda RBX: 0001 RCX:
> 7f1db3fc85f0
> [55826.592743] RDX: 7f1db4aa6060 RSI: 0003 RDI:
> 0001
> [55826.593143] RBP: 0001 R08:  R09:
> 81c492ca
> [55826.593543] R10: 0008 R11: 0246 R12:
> 0028
> [55826.593941] R13: 7ffe6f827e40 R14:  R15:
> 
> 
> 
> Greets,
> Stefan

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 2/2] mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)

2019-08-29 Thread Michal Hocko
On Wed 28-08-19 12:46:08, Matthew Wilcox wrote:
> On Wed, Aug 28, 2019 at 06:45:07PM +, Christopher Lameter wrote:
[...]
> > be suprising and it limits the optimizations that slab allocators may use
> > for optimizing data use. The SLOB allocator was designed in such a way
> > that data wastage is limited. The changes here sabotage that goal and show
> > that future slab allocators may be similarly constrained with the
> > exceptional alignments implemented. Additional debugging features etc etc
> > must all support the exceptional alignment requirements.
> 
> While I sympathise with the poor programmer who has to write the
> fourth implementation of the sl*b interface, it's more for the pain of
> picking a new letter than the pain of needing to honour the alignment
> of allocations.
> 
> There are many places in the kernel which assume alignment.  They break
> when it's not supplied.  I believe we have a better overall system if
> the MM developers provide stronger guarantees than the MM consumers have
> to work around only weak guarantees.

I absolutely agree. A hypothetical benefit of a new implementation
doesn't outweigh the complexity the existing code has to jump through or,
worse, is not aware of, so it gets broken silently. My general experience
is that the latter is more likely with the large variety of drivers we have
in the tree and the odd things they do in general.
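To illustrate the kind of silent reliance I mean (an illustrative snippet,
not taken from any particular driver):

	/*
	 * A buffer used for 512B granularity block IO is expected to be aligned
	 * to the logical block size. This only works as long as kmalloc(512)
	 * returns a 512B aligned object, which is not guaranteed e.g. with SLUB
	 * debugging or SLOB without the patch under discussion.
	 */
	buf = kmalloc(512, GFP_KERNEL);
	WARN_ON(!IS_ALIGNED((unsigned long)buf, 512));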
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites

2019-08-07 Thread Michal Hocko
On Wed 07-08-19 10:37:26, Jan Kara wrote:
> On Fri 02-08-19 12:14:09, John Hubbard wrote:
> > On 8/2/19 7:52 AM, Jan Kara wrote:
> > > On Fri 02-08-19 07:24:43, Matthew Wilcox wrote:
> > > > On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote:
> > > > > On Fri 02-08-19 11:12:44, Michal Hocko wrote:
> > > > > > On Thu 01-08-19 19:19:31, john.hubb...@gmail.com wrote:
> > > > > > [...]
> > > > > > > 2) Convert all of the call sites for get_user_pages*(), to
> > > > > > > invoke put_user_page*(), instead of put_page(). This involves dozens of
> > > > > > > call sites, and will take some time.
> > > > > > 
> > > > > > How do we make sure this is the case and it will remain the case in the
> > > > > > future? There must be some automagic to enforce/check that. It is simply
> > > > > > not manageable to do it every now and then because then 3) will simply
> > > > > > never be safe.
> > > > > > 
> > > > > > Have you considered coccinelle or some other scripted way to do the
> > > > > > transition? I have no idea how to deal with future changes that would
> > > > > > break the balance though.
> > 
> > Hi Michal,
> > 
> > Yes, I've thought about it, and coccinelle falls a bit short (it's not smart
> > enough to know which put_page()'s to convert). However, there is a debug
> > option planned: a yet-to-be-posted commit [1] uses struct page extensions
> > (obviously protected by CONFIG_DEBUG_GET_USER_PAGES_REFERENCES) to add
> > a redundant counter. That allows:
> > 
> > void __put_page(struct page *page)
> > {
> > ...
> > /* Someone called put_page() instead of put_user_page() */
> > WARN_ON_ONCE(atomic_read(&page_ext->pin_count) > 0);
> > 
> > > > > 
> > > > > Yeah, that's why I've been suggesting at LSF/MM that we may need to create
> > > > > a gup wrapper - say vaddr_pin_pages() - and track which sites dropping
> > > > > references got converted by using this wrapper instead of gup. The
> > > > > counterpart would then be more logically named as unpin_page() or whatever
> > > > > instead of put_user_page().  Sure this is not completely foolproof (you can
> > > > > create new callsite using vaddr_pin_pages() and then just drop refs using
> > > > > put_page()) but I suppose it would be a high enough barrier for missed
> > > > > conversions... Thoughts?
> > 
> > The debug option above is still a bit simplistic in its implementation
> > (and maybe not taking full advantage of the data it has), but I think
> > it's preferable, because it monitors the "core" and WARNs.
> > 
> > Instead of the wrapper, I'm thinking: documentation and the passage of
> > time, plus the debug option (perhaps enhanced--probably once I post it
> > someone will notice opportunities), yes?
> 
> So I think your debug option and my suggested renaming serve a bit
> different purposes (and thus both make sense). If you do the renaming, you
> can just grep to see unconverted sites. Also when someone merges new GUP
> user (unaware of the new rules) while you switch GUP to use pins instead of
> ordinary references, you'll get compilation error in case of renaming
> instead of hard to debug refcount leak without the renaming. And such
> conflict is almost bound to happen given the size of GUP patch set... Also
> the renaming serves against the "coding inertia" - i.e., GUP is around for
> ages so people just use it without checking any documentation or comments.
> After switching how GUP works, what used to be correct isn't anymore so
> renaming the function serves as a warning that something has really
> changed.

Fully agreed!

> Your refcount debug patches are good to catch bugs in the conversions done
> but that requires you to be able to excercise the code path in the first
> place which may require particular HW or so, and you also have to enable
> the debug option which means you already aim at verifying the GUP
> references are treated properly.
> 
>   Honza
> 
> -- 
> Jan Kara 
> SUSE Labs, CR

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites

2019-08-02 Thread Michal Hocko
On Thu 01-08-19 19:19:31, john.hubb...@gmail.com wrote:
[...]
> 2) Convert all of the call sites for get_user_pages*(), to
> invoke put_user_page*(), instead of put_page(). This involves dozens of
> call sites, and will take some time.

How do we make sure this is the case and it will remain the case in the
future? There must be some automagic to enforce/check that. It is simply
not manageable to do it every now and then because then 3) will simply
never be safe.

Have you considered coccinelle or some other scripted way to do the
transition? I have no idea how to deal with future changes that would
break the balance though.
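Just to make sure we are talking about the same thing, each call site
conversion boils down to something like this (sketch):

	npages = get_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);
	if (npages <= 0)
		return npages;
	/* ... use the pages ... */
	for (i = 0; i < npages; i++)
		put_user_page(pages[i]);	/* formerly put_page(pages[i]) */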
-- 
Michal Hocko
SUSE Labs


Re: [LSF/MM TOPIC] guarantee natural alignment for kmalloc()?

2019-04-25 Thread Michal Hocko
On Thu 25-04-19 04:33:59, Matthew Wilcox wrote:
> On Thu, Apr 11, 2019 at 06:28:19AM -0700, Matthew Wilcox wrote:
> > On Thu, Apr 11, 2019 at 02:52:08PM +0200, Vlastimil Babka wrote:
> > > In the session I hope to resolve the question whether this is indeed the
> > > right thing to do for all kmalloc() users, without an explicit alignment
> > > requests, and if it's worth the potentially worse
> > > performance/fragmentation it would impose on a hypothetical new slab
> > > implementation for which it wouldn't be optimal to split power-of-two
> > > sized pages into power-of-two-sized objects (or whether there are any
> > > other downsides).
> > 
> > I think this is exactly the kind of discussion that LSFMM is for!  It's
> > really a whole-system question; is Linux better-off having the flexibility
> > for allocators to return non-power-of-two aligned memory, or allowing
> > consumers of the kmalloc API to assume that "sufficiently large" memory
> > is naturally aligned.
> 
> This has been scheduled for only the MM track.  I think at least the
> filesystem people should be involved in this discussion since it's for
> their benefit.

Agreed. I have marked it as a MM/IO/FS track, we just haven't added it
to the schedule that way. I still plan to go over all topics again and
consolidate the current (very preliminary) schedule. Thanks for catching
this.

> Do we have an lsf-discuss mailing list this year?  Might be good to
> coordinate arrivals / departures for taxi sharing purposes.

Yes, the list should be established AFAIK, with the same address as last
year's.

-- 
Michal Hocko
SUSE Labs


Re: [Lsf-pc] LSF/MM 2019: Call for Proposals (UPDATED!)

2019-02-14 Thread Michal Hocko
On Thu 07-02-19 08:35:06, Jens Axboe wrote:
[...]
> 2) Requests to attend the summit for those that are not proposing a
> topic should be sent to:
> 
>   lsf...@lists.linux-foundation.org
> 
> Please summarize what expertise you will bring to the meeting, and
> what you would like to discuss. Please also tag your email with
> [LSF/MM ATTEND] and send it as a new thread so there is less chance of
> it getting lost.

Just a reminder. Please do not forget to send the ATTEND request and
make it clear which track you would like to participate in the most.
It is quite common that people only send topic proposals without an
explicit ATTEND request or do not mention the track.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: Block layer use of __GFP flags

2018-04-09 Thread Michal Hocko
On Mon 09-04-18 15:03:45, Bart Van Assche wrote:
> On Mon, 2018-04-09 at 11:00 +0200, Michal Hocko wrote:
> > On Mon 09-04-18 04:46:22, Bart Van Assche wrote:
> > [...]
> > [...]
> > > diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
> > > index ad8a125defdd..3ddb464b72e6 100644
> > > --- a/drivers/ide/ide-pm.c
> > > +++ b/drivers/ide/ide-pm.c
> > > @@ -91,7 +91,7 @@ int generic_ide_resume(struct device *dev)
> > >  
> > >   memset(&rqpm, 0, sizeof(rqpm));
> > >   rq = blk_get_request_flags(drive->queue, REQ_OP_DRV_IN,
> > > -BLK_MQ_REQ_PREEMPT);
> > > +BLK_MQ_REQ_PREEMPT, __GFP_RECLAIM);
> > 
> > Is there any reason to use __GFP_RECLAIM directly? I guess you wanted to
> > have GFP_NOIO semantics, right? So why not be explicit about that? The same
> > goes for the other instances of this flag in the patch.
> 
> Hello Michal,
> 
> Thanks for the review. The use of __GFP_RECLAIM in this code (which was
> called __GFP_WAIT in the past) predates the git history.

Yeah, __GFP_WAIT -> __GFP_RECLAIM was a pseudo automated change IIRC.
Anyway GFP_NOIO should be pretty much equivalent and self explanatory.
__GFP_RECLAIM is more of an internal thing than something to be used as
a plain gfp mask.

Sure, there is no real need to change that but if you want to make the
code more neat and self explanatory I would go with GFP_NOIO.
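I.e. something like this (sketch, reusing the signature from your patch):

	rq = blk_get_request_flags(drive->queue, REQ_OP_DRV_IN,
				   BLK_MQ_REQ_PREEMPT, GFP_NOIO);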

Just my 2c
-- 
Michal Hocko
SUSE Labs


Re: Block layer use of __GFP flags

2018-04-09 Thread Michal Hocko
On Mon 09-04-18 04:46:22, Bart Van Assche wrote:
[...]
[...]
> diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
> index ad8a125defdd..3ddb464b72e6 100644
> --- a/drivers/ide/ide-pm.c
> +++ b/drivers/ide/ide-pm.c
> @@ -91,7 +91,7 @@ int generic_ide_resume(struct device *dev)
>  
>   memset(&rqpm, 0, sizeof(rqpm));
>   rq = blk_get_request_flags(drive->queue, REQ_OP_DRV_IN,
> -BLK_MQ_REQ_PREEMPT);
> +BLK_MQ_REQ_PREEMPT, __GFP_RECLAIM);

Is there any reason to use __GFP_RECLAIM directly? I guess you wanted to
have GFP_NOIO semantics, right? So why not be explicit about that? The same
goes for the other instances of this flag in the patch.
-- 
Michal Hocko
SUSE Labs


Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem

2018-03-14 Thread Michal Hocko
On Tue 06-03-18 20:28:59, Tetsuo Handa wrote:
> Laura Abbott wrote:
> > On 02/26/2018 06:28 AM, Michal Hocko wrote:
> > > On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> > >> Hi,
> > >>
> > >> The Fedora arm-32 build VMs have a somewhat long standing problem
> > >> of hanging when running mkfs.ext4 with a bunch of processes stuck
> > >> in D state. This has been seen as far back as 4.13 but is still
> > >> present on 4.14:
> > >>
> > > [...]
> > >> This looks like everything is blocked on the writeback completing but
> > >> the writeback has been throttled. According to the infra team, this problem
> > >> is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> > >> https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> > >> quite match since this seems to be completely stuck. Any suggestions to
> > >> narrow the problem down?
> > > 
> > > How much dirtyable memory does the system have? We do allow only lowmem
> > > to be dirtyable by default on 32b highmem systems. Maybe you have the
> > > lowmem mostly consumed by the kernel memory. Have you tried to enable
> > > highmem_is_dirtyable?
> > > 
> > 
> > Setting highmem_is_dirtyable did fix the problem. The infrastructure
> > people seemed satisfied enough with this (and are happy to have the
> > machines back).
> 
> That's good.
> 
> > I'll see if they are willing to run a few more tests
> > to get some more state information.
> 
> Well, I'm far from understanding what is happening in your case, but I'm
> interested in other threads which were trying to allocate memory. Therefore,
> I appreciate if they can take SysRq-m + SysRq-t than SysRq-w (as described
> at http://akari.osdn.jp/capturing-kernel-messages.html ).
> 
> Code which assumes that kswapd can make progress can get stuck when kswapd
> is blocked somewhere. And wbt_wait() seems to change behavior based on
> current_is_kswapd(). If everyone is waiting for kswapd but kswapd cannot
> make progress, I worry that it leads to hangups like your case.

Tetsuo, could you stop this finally, pretty please? This is a
well known limitation of 32b architectures with more than 4G. The lowmem
can only handle 896MB of memory and that can be filled up with other
kernel allocations. Stalled writeback is _usually_ a result of only a
little dirtyable memory being left in the lowmem. We cannot simply
allow highmem to be dirtyable by default due to reasons explained in
other email.

I can imagine that it is hard for you to grasp that not everything is
"silent hang during OOM" but there are other things going on in the VM.
-- 
Michal Hocko
SUSE Labs


Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem

2018-03-14 Thread Michal Hocko
On Mon 05-03-18 13:04:24, Laura Abbott wrote:
> On 02/26/2018 06:28 AM, Michal Hocko wrote:
> > On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> > > Hi,
> > > 
> > > The Fedora arm-32 build VMs have a somewhat long standing problem
> > > of hanging when running mkfs.ext4 with a bunch of processes stuck
> > > in D state. This has been seen as far back as 4.13 but is still
> > > present on 4.14:
> > > 
> > [...]
> > > This looks like everything is blocked on the writeback completing but
> > > the writeback has been throttled. According to the infra team, this problem
> > > is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> > > https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> > > quite match since this seems to be completely stuck. Any suggestions to
> > > narrow the problem down?
> > 
> > How much dirtyable memory does the system have? We do allow only lowmem
> > to be dirtyable by default on 32b highmem systems. Maybe you have the
> > lowmem mostly consumed by the kernel memory. Have you tried to enable
> > highmem_is_dirtyable?
> > 
> 
> Setting highmem_is_dirtyable did fix the problem. The infrastructure
> people seemed satisfied enough with this (and are happy to have the
> machines back). I'll see if they are willing to run a few more tests
> to get some more state information.

Please be aware that highmem_is_dirtyable is not for free. There are
some code paths which can only allocate from lowmem (e.g. block device
AFAIR) and those could fill up the whole lowmem without any throttling.
-- 
Michal Hocko
SUSE Labs


Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem

2018-02-26 Thread Michal Hocko
On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> Hi,
> 
> The Fedora arm-32 build VMs have a somewhat long standing problem
> of hanging when running mkfs.ext4 with a bunch of processes stuck
> in D state. This has been seen as far back as 4.13 but is still
> present on 4.14:
> 
[...]
> This looks like everything is blocked on the writeback completing but
> the writeback has been throttled. According to the infra team, this problem
> is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> quite match since this seems to be completely stuck. Any suggestions to
> narrow the problem down?

How much dirtyable memory does the system have? We do allow only lowmem
to be dirtyable by default on 32b highmem systems. Maybe you have the
lowmem mostly consumed by the kernel memory. Have you tried to enable
highmem_is_dirtyable?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] aio: Add memcg accounting of user used data

2017-12-06 Thread Michal Hocko
On Wed 06-12-17 15:36:56, Kirill Tkhai wrote:
> On 06.12.2017 15:23, Michal Hocko wrote:
> > On Tue 05-12-17 13:00:54, Kirill Tkhai wrote:
> > [...]
> >> This meets the problem in case of many containers
> >> are used on the hardware node. Since aio_max_nr is
> >> a global limit, any container may occupy the whole
> >> available aio requests, and to deprive others the
> >> possibility to use aio at all. The situation may
> >> happen because of evil intentions of the container's
> >> user or because of the program error, when the user
> >> makes this occasionally
> > 
> > I am sorry to beat more on this but finally got around to
> > http://lkml.kernel.org/r/17b22d53-ad3d-1ba8-854f-fc2a43d86...@virtuozzo.com
> > and read the above paragraph once again. I can see how accounting to
> > a memcg helps to reduce the memory footprint but I fail to see how it
> > helps the above scenario. Could you clarify wow you set up a limit to
> > prevent anybody from DoSing other containers by depleting aio_max_nr?
> The memcg limit allows increasing aio_max_nr and the accounting guarantees a
> container can't exceed the limit. You may configure the limit and aio_max_nr
> in such a way that all containers' requests in sum never reach aio_max_nr.

So you are essentially saying that you make aio_max_nr unlimited and
rely on the memory consumption tracking by memcg, right? Are there any
downsides (e.g. clogging the AIO subsystem)?

Please make sure that all this is in the changelog.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] aio: Add memcg accounting of user used data

2017-12-06 Thread Michal Hocko
On Tue 05-12-17 13:00:54, Kirill Tkhai wrote:
[...]
> This meets the problem in case of many containers
> are used on the hardware node. Since aio_max_nr is
> a global limit, any container may occupy the whole
> available aio requests, and to deprive others the
> possibility to use aio at all. The situation may
> happen because of evil intentions of the container's
> user or because of the program error, when the user
> makes this occasionally

I am sorry to beat more on this but finally got around to
http://lkml.kernel.org/r/17b22d53-ad3d-1ba8-854f-fc2a43d86...@virtuozzo.com
and read the above paragraph once again. I can see how accounting to
a memcg helps to reduce the memory footprint but I fail to see how it
helps the above scenario. Could you clarify how you set up a limit to
prevent anybody from DoSing other containers by depleting aio_max_nr?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] aio: Add memcg accounting of user used data

2017-12-06 Thread Michal Hocko
On Tue 05-12-17 19:02:00, Kirill Tkhai wrote:
> On 05.12.2017 18:43, Michal Hocko wrote:
> > On Tue 05-12-17 18:34:59, Kirill Tkhai wrote:
> >> On 05.12.2017 18:15, Michal Hocko wrote:
> >>> On Tue 05-12-17 13:00:54, Kirill Tkhai wrote:
> >>>> Currently, number of available aio requests may be
> >>>> limited only globally. There are two sysctl variables
> >>>> aio_max_nr and aio_nr, which implement the limitation
> >>>> and request accounting. They help to avoid
> >>>> the situation, when all the memory is eaten in-flight
> >>>> requests, which are written by slow block device,
> >>>> and which can't be reclaimed by shrinker.
> >>>>
> >>>> This meets the problem in case of many containers
> >>>> are used on the hardware node. Since aio_max_nr is
> >>>> a global limit, any container may occupy the whole
> >>>> available aio requests, and to deprive others the
> >>>> possibility to use aio at all. The situation may
> >>>> happen because of evil intentions of the container's
> >>>> user or because of the program error, when the user
> >>>> makes this occasionally
> >>>>
> >>>> The patch allows to fix the problem. It adds memcg
> >>>> accounting of user used aio data (the biggest is
> >>>> the bunch of aio_kiocb; ring buffer is the second
> >>>> biggest), so a user of a certain memcg won't be able
> >>>> to allocate more aio requests memory, then the cgroup
> >>>> allows, and he will bumped into the limit.
> >>>
> >>> So what happens when we hit the hard limit and oom kill somebody?
> >>> Are those charged objects somehow bound to a process context?
> >>
> >> There is exit_aio() called from __mmput(), which waits till
> >> the charged objects complete and decrement reference counter.
> > 
> > OK, so it is bound to _a_ process context. The oom killer will not know
> > about which process has consumed those objects but the effect will be at
> > least reduced to a memcg.
> > 
> >> If there was a problem with oom in memcg, there would be
> >> the same problem on global oom, as it can be seen there is
> >> no __GFP_NOFAIL flags anywhere in aio code.
> >>
> >> But it seems everything is safe.
> > 
> > Could you share your testing scenario and the way the system behaved
> > during heavy aio?
> > 
> > I am not saying the patch is wrong, I am just trying to understand all
> > the consequences.
> 
> My test is a simple program, which creates an aio context and then starts
> an infinite io_submit() cycle. I've tested the cases when certain stages
> fail: io_setup() meets oom, io_submit() meets oom, io_getevents() meets
> oom. This was simply tested by inserting sleep() before the stage, and
> moving the task to an appropriate cgroup with a low memory limit. In most
> cases, I get bash killed (I moved it to the cgroup too). Also, I've executed
> the test in parallel.
> 
> If you want I can send you the source code, but I don't think it will be
> easy to use it if you are not the author.

Well, not really. I was merely interested in the testing scenario,
mainly to see how the system behaved, because memcg hitting the hard
limit will OOM kill something only if the failing charge is from the
page fault path. All kernel allocations therefore return with ENOMEM.
We are also not considering per task charged kernel memory, and
therefore a small task constantly allocating kernel memory can put the
whole cgroup down. As I've said, this is something that _should_ be OK
because the bad behavior is isolated within the cgroup.

If that is something that is expected behavior for your usecase then OK.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] aio: Add memcg accounting of user used data

2017-12-05 Thread Michal Hocko
On Tue 05-12-17 18:34:59, Kirill Tkhai wrote:
> On 05.12.2017 18:15, Michal Hocko wrote:
> > On Tue 05-12-17 13:00:54, Kirill Tkhai wrote:
> >> Currently, number of available aio requests may be
> >> limited only globally. There are two sysctl variables
> >> aio_max_nr and aio_nr, which implement the limitation
> >> and request accounting. They help to avoid
> >> the situation, when all the memory is eaten in-flight
> >> requests, which are written by slow block device,
> >> and which can't be reclaimed by shrinker.
> >>
> >> This meets the problem in case of many containers
> >> are used on the hardware node. Since aio_max_nr is
> >> a global limit, any container may occupy the whole
> >> available aio requests, and to deprive others the
> >> possibility to use aio at all. The situation may
> >> happen because of evil intentions of the container's
> >> user or because of the program error, when the user
> >> makes this occasionally
> >>
> >> The patch allows to fix the problem. It adds memcg
> >> accounting of user used aio data (the biggest is
> >> the bunch of aio_kiocb; ring buffer is the second
> >> biggest), so a user of a certain memcg won't be able
> >> to allocate more aio requests memory, then the cgroup
> >> allows, and he will bumped into the limit.
> > 
> > So what happens when we hit the hard limit and oom kill somebody?
> > Are those charged objects somehow bound to a process context?
> 
> There is exit_aio() called from __mmput(), which waits till
> the charged objects complete and decrement reference counter.

OK, so it is bound to _a_ process context. The oom killer will not know
about which process has consumed those objects but the effect will be at
least reduced to a memcg.

> If there was a problem with oom in memcg, there would be
> the same problem on global oom, as it can be seen there is
> no __GFP_NOFAIL flags anywhere in aio code.
> 
> But it seems everything is safe.

Could you share your testing scenario and the way the system behaved
during heavy aio?

I am not saying the patch is wrong, I am just trying to understand all
the consequences.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] aio: Add memcg accounting of user used data

2017-12-05 Thread Michal Hocko
On Tue 05-12-17 13:00:54, Kirill Tkhai wrote:
> Currently, number of available aio requests may be
> limited only globally. There are two sysctl variables
> aio_max_nr and aio_nr, which implement the limitation
> and request accounting. They help to avoid
> the situation, when all the memory is eaten in-flight
> requests, which are written by slow block device,
> and which can't be reclaimed by shrinker.
> 
> This meets the problem in case of many containers
> are used on the hardware node. Since aio_max_nr is
> a global limit, any container may occupy the whole
> available aio requests, and to deprive others the
> possibility to use aio at all. The situation may
> happen because of evil intentions of the container's
> user or because of the program error, when the user
> makes this occasionally
> 
> The patch allows to fix the problem. It adds memcg
> accounting of user used aio data (the biggest is
> the bunch of aio_kiocb; ring buffer is the second
> biggest), so a user of a certain memcg won't be able
> to allocate more aio requests memory, then the cgroup
> allows, and he will bumped into the limit.

So what happens when we hit the hard limit and oom kill somebody?
Are those charged objects somehow bound to a process context?
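Just so we are on the same page, I assume the accounting itself boils down
to tagging the user triggered allocations with __GFP_ACCOUNT, e.g. for the
per request objects (a sketch, not a quote from the patch):

	/* aio_get_req(): charge the request to the allocating task's memcg */
	req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL | __GFP_ACCOUNT);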
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 0/4] lockdep/crossrelease: Apply crossrelease to page locks

2017-12-05 Thread Michal Hocko
On Mon 04-12-17 23:45:17, Matthew Wilcox wrote:
> On Tue, Dec 05, 2017 at 03:19:46PM +0900, Byungchul Park wrote:
> > On 12/5/2017 2:46 PM, Byungchul Park wrote:
> > > On 12/5/2017 2:30 PM, Matthew Wilcox wrote:
> > > > On Mon, Dec 04, 2017 at 02:16:19PM +0900, Byungchul Park wrote:
> > > > > For now, wait_for_completion() / complete() works with lockdep, add
> > > > > lock_page() / unlock_page() and its family to lockdep support.
> > > > > 
> > > > > Changes from v1
> > > > >   - Move lockdep_map_cross outside of page_ext to make it flexible
> > > > >   - Prevent allocating lockdep_map per page by default
> > > > >   - Add a boot parameter allowing the allocation for debugging
> > > > > 
> > > > > Byungchul Park (4):
> > > > >    lockdep: Apply crossrelease to PG_locked locks
> > > > >    lockdep: Apply lock_acquire(release) on __Set(__Clear)PageLocked
> > > > >    lockdep: Move data of CONFIG_LOCKDEP_PAGELOCK from page to page_ext
> > > > >    lockdep: Add a boot parameter enabling to track page locks using
> > > > >  lockdep and disable it by default
> > > > 
> > > > I don't like the way you've structured this patch series; first adding
> > > > the lockdep map to struct page, then moving it to page_ext.
> > > 
> > > Hello,
> > > 
> > > I will make them into one patch.
> > 
> > I've thought it more.
> > 
> > Actually I did it because I thought I'd better make it into two
> > patches since it makes reviewers easier to review. It doesn't matter
> > which one I choose, but I prefer to split it.
> 
> I don't know whether it's better to make it all one patch or split it
> into multiple patches.  But it makes no sense to introduce it in struct
> page, then move it to struct page_ext.

I would tend to agree. It is not like anybody would like to apply only
the first part alone. Adding the necessary infrastructure to page_ext is
not such a big deal.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/3] lockdep: Apply crossrelease to PG_locked locks

2017-11-24 Thread Michal Hocko
On Fri 24-11-17 12:02:36, Byungchul Park wrote:
> On Thu, Nov 16, 2017 at 02:07:46PM +0100, Michal Hocko wrote:
> > On Thu 16-11-17 21:48:05, Byungchul Park wrote:
> > > On 11/16/2017 9:02 PM, Michal Hocko wrote:
> > > > for each struct page. So you are doubling the size. Who is going to
> > > > enable this config option? You are moving this to page_ext in a later
> > > > patch which is a good step but it doesn't go far enough because this
> > > > still consumes those resources. Is there any problem to make this
> > > > kernel command line controllable? Something we do for page_owner for
> > > > example?
> > > 
> > > Sure. I will add it.
> > > 
> > > > Also it would be really great if you could give us some measures about
> > > > the runtime overhead. I do not expect it to be very large but this is
> > > 
> > > The major overhead would come from the amount of additional memory
> > > consumption for 'lockdep_map's.
> > 
> > yes
> > 
> > > Do you want me to measure the overhead by the additional memory
> > > consumption?
> > > 
> > > Or do you expect another overhead?
> > 
> > I would be also interested how much impact this has on performance. I do
> > not expect it would be too large but having some numbers for cache cold
> > parallel kbuild or other heavy page lock workloads.
> 
> Hello Michal,
> 
> I measured 'cache cold parallel kbuild' on my qemu machine. The result
> varies much so I cannot confirm, but I think there's no meaningful
> difference between before and after applying crossrelease to page locks.
> 
> Actually, I expect little overhead in lock_page() and unlock_page() even
> after applying crossreleas to page locks, but only expect a bit overhead
> by additional memory consumption for 'lockdep_map's per page.
> 
> I run the following instructions within "QEMU x86_64 4GB memory 4 cpus":
> 
>make clean
>echo 3 > drop_caches
>time make -j4

Maybe FS people will help you find a more representative workload. E.g.
linear cache cold file read should be good as well. Maybe there are some
tests in fstests (or whatever they call xfstests these days).

> The results are:
> 
># w/o page lock tracking
> 
>At the 1st try,
>real 5m28.105s
>user 17m52.716s
>sys  3m8.871s
> 
>At the 2nd try,
>real 5m27.023s
>user 17m50.134s
>sys  3m9.289s
> 
>At the 3rd try,
>real 5m22.837s
>user 17m34.514s
>sys  3m8.097s
> 
># w/ page lock tracking
> 
>At the 1st try,
>real 5m18.158s
>user 17m18.200s
>sys  3m8.639s
> 
>At the 2nd try,
>real 5m19.329s
>user 17m19.982s
>sys  3m8.345s
> 
>At the 3rd try,
>real 5m19.626s
>user 17m21.363s
>sys  3m9.869s
> 
> I think there's no meaningful difference on my small machine.

Yeah, this doesn't seem to indicate anything. Maybe moving the build to
shmem to rule out IO cost would tell more. But as I've said previously,
I do not really expect this would be very visible. It was more a
matter of my curiosity than an acceptance requirement. I think it is
much more important to make this runtime configurable because almost
nobody is going to enable the feature if it is build time only. The cost
is just too high.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/3] lockdep: Apply crossrelease to PG_locked locks

2017-11-16 Thread Michal Hocko
On Thu 16-11-17 21:48:05, Byungchul Park wrote:
> On 11/16/2017 9:02 PM, Michal Hocko wrote:
> > for each struct page. So you are doubling the size. Who is going to
> > enable this config option? You are moving this to page_ext in a later
> > patch which is a good step but it doesn't go far enough because this
> > still consumes those resources. Is there any problem to make this
> > kernel command line controllable? Something we do for page_owner for
> > example?
> 
> Sure. I will add it.
> 
> > Also it would be really great if you could give us some measures about
> > the runtime overhead. I do not expect it to be very large but this is
> 
> The major overhead would come from the amount of additional memory
> consumption for 'lockdep_map's.

yes

> Do you want me to measure the overhead by the additional memory
> consumption?
> 
> Or do you expect another overhead?

I would be also interested how much impact this has on performance. I do
not expect it would be too large but having some numbers for cache cold
parallel kbuild or other heavy page lock workloads.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/3] lockdep: Apply crossrelease to PG_locked locks

2017-11-16 Thread Michal Hocko
[I have only briefly looked at patches so I might have missed some
details.]

On Thu 16-11-17 12:14:25, Byungchul Park wrote:
> Although lock_page() and its family can cause deadlock, lockdep have not
> worked with them, because unlock_page() might be called in a different
> context from the acquire context, which violated lockdep's assumption.
>
> Now CONFIG_LOCKDEP_CROSSRELEASE has been introduced, lockdep can work
> with page locks.

I definitely agree that debugging page_lock deadlocks is a major PITA
but your implementation seems prohibitively too expensive.

[...]
> @@ -218,6 +222,10 @@ struct page {
>  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>   int _last_cpupid;
>  #endif
> +
> +#ifdef CONFIG_LOCKDEP_PAGELOCK
> + struct lockdep_map_cross map;
> +#endif
>  }

now you are adding 
struct lockdep_map_cross {
        struct lockdep_map map;                          /*     0    40 */
        struct cross_lock  xlock;                        /*    40    56 */
        /* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */

        /* size: 96, cachelines: 2, members: 2 */
        /* last cacheline: 32 bytes */
};

for each struct page. So you are doubling the size. Who is going to
enable this config option? You are moving this to page_ext in a later
patch which is a good step but it doesn't go far enough because this
still consumes those resources. Is there any problem with making this
kernel command line controllable? Something like we do for page_owner, for
example?
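E.g. a boot time toggle in the page_owner style (a sketch with made up
names, untested):

	static bool lockdep_pagelock_enabled;

	static int __init early_lockdep_pagelock_param(char *buf)
	{
		return kstrtobool(buf, &lockdep_pagelock_enabled);
	}
	early_param("lockdep_pagelock", early_lockdep_pagelock_param);

and only allocate the per page lockdep_map storage from the page_ext init
path when the parameter is set.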

Also it would be really great if you could give us some measures about
the runtime overhead. I do not expect it to be very large but this is
something people are usually interested in when enabling debugging
features.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4/4] mtd: nand: nandsim: convert to memalloc_noreclaim_*()

2017-04-06 Thread Michal Hocko
On Thu 06-04-17 09:33:44, Adrian Hunter wrote:
> On 05/04/17 14:39, Vlastimil Babka wrote:
> > On 04/05/2017 01:36 PM, Richard Weinberger wrote:
> >> Michal,
> >>
> >> Am 05.04.2017 um 13:31 schrieb Michal Hocko:
> >>> On Wed 05-04-17 09:47:00, Vlastimil Babka wrote:
> >>>> Nandsim has own functions set_memalloc() and clear_memalloc() for robust
> >>>> setting and clearing of PF_MEMALLOC. Replace them by the new generic helpers.
> >>>> No functional change.
> >>>
> >>> This one smells like an abuser. Why the hell should read/write path
> >>> touch memory reserves at all!
> >>
> >> Could be. Let's ask Adrian, AFAIK he wrote that code.
> >> Adrian, can you please clarify why nandsim needs to play with PF_MEMALLOC?
> > 
> > I was thinking about it and concluded that since the simulator can be
> > used as a block device where reclaimed pages go to, writing the data out
> > is a memalloc operation. Then reading can be called as part of r-m-w
> > cycle, so reading as well.
> 
> IIRC it was to avoid getting stuck with nandsim waiting on memory reclaim
> and memory reclaim waiting on nandsim.

I've got lost in the indirection. Could you describe how reclaim would
get stuck waiting on these paths, please?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] loop: Add PF_LESS_THROTTLE to block/loop device thread.

2017-04-05 Thread Michal Hocko
On Thu 06-04-17 12:23:51, NeilBrown wrote:
[...]
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 0ecb6461ed81..95679d988725 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -847,10 +847,12 @@ static void loop_unprepare_queue(struct loop_device *lo)
>  static int loop_prepare_queue(struct loop_device *lo)
>  {
>   kthread_init_worker(&lo->worker);
> - lo->worker_task = kthread_run(kthread_worker_fn,
> + lo->worker_task = kthread_create(kthread_worker_fn,
>   &lo->worker, "loop%d", lo->lo_number);
>   if (IS_ERR(lo->worker_task))
>   return -ENOMEM;
> + lo->worker_task->flags |= PF_LESS_THROTTLE;
> + wake_up_process(lo->worker_task);
>   set_user_nice(lo->worker_task, MIN_NICE);
>   return 0;

This should work for the current implementation because kthread_create
will return only after the full initialization has been done. No idea
whether we can rely on that in future. I also think it would be cleaner
to set the flag on current and keep the current semantic that only
current changes its flags.

So while I do not have a strong opinion on this, I think defining a loop
specific thread function which sets PF_LESS_THROTTLE as the first thing
is more elegant and less error prone long term. A short comment explaining
why we use the flag there would also be preferred.
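Something along these lines is what I mean (untested sketch):

	static int loop_kthread_worker_fn(void *worker_ptr)
	{
		/* loop is in the write out path of its backing file */
		current->flags |= PF_LESS_THROTTLE;
		return kthread_worker_fn(worker_ptr);
	}

and then loop_prepare_queue() keeps using kthread_run(), just with
loop_kthread_worker_fn instead of kthread_worker_fn.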

I will leave the decision to you.

Thanks.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4/4] mtd: nand: nandsim: convert to memalloc_noreclaim_*()

2017-04-05 Thread Michal Hocko
On Wed 05-04-17 13:39:16, Vlastimil Babka wrote:
> On 04/05/2017 01:36 PM, Richard Weinberger wrote:
> > Michal,
> > 
> > Am 05.04.2017 um 13:31 schrieb Michal Hocko:
> >> On Wed 05-04-17 09:47:00, Vlastimil Babka wrote:
> >>> Nandsim has its own functions set_memalloc() and clear_memalloc() for robust
> >>> setting and clearing of PF_MEMALLOC. Replace them with the new generic helpers.
> >>> No functional change.
> >>
> >> This one smells like an abuser. Why the hell should the read/write path
> >> touch memory reserves at all!
> > 
> > Could be. Let's ask Adrian, AFAIK he wrote that code.
> > Adrian, can you please clarify why nandsim needs to play with PF_MEMALLOC?
> 
> I was thinking about it and concluded that since the simulator can be
> used as a block device where reclaimed pages go to, writing the data out
> is a memalloc operation. Then reading can be called as part of an r-m-w
> cycle, so reading as well. But it would be great if somebody more
> knowledgeable confirmed this.

then this deserves a big fat comment explaining all the details,
including how the complete depletion of reserves is prevented.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4/4] mtd: nand: nandsim: convert to memalloc_noreclaim_*()

2017-04-05 Thread Michal Hocko
On Wed 05-04-17 09:47:00, Vlastimil Babka wrote:
> Nandsim has its own functions set_memalloc() and clear_memalloc() for robust
> setting and clearing of PF_MEMALLOC. Replace them with the new generic helpers.
> No functional change.

This one smells like an abuser. Why the hell should the read/write path
touch memory reserves at all!

> 
> Signed-off-by: Vlastimil Babka 
> Cc: Boris Brezillon 
> Cc: Richard Weinberger 
> ---
>  drivers/mtd/nand/nandsim.c | 29 +
>  1 file changed, 9 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/mtd/nand/nandsim.c b/drivers/mtd/nand/nandsim.c
> index cef818f535ed..03a0d057bf2f 100644
> --- a/drivers/mtd/nand/nandsim.c
> +++ b/drivers/mtd/nand/nandsim.c
> @@ -40,6 +40,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1368,31 +1369,18 @@ static int get_pages(struct nandsim *ns, struct file 
> *file, size_t count, loff_t
>   return 0;
>  }
>  
> -static int set_memalloc(void)
> -{
> - if (current->flags & PF_MEMALLOC)
> - return 0;
> - current->flags |= PF_MEMALLOC;
> - return 1;
> -}
> -
> -static void clear_memalloc(int memalloc)
> -{
> - if (memalloc)
> - current->flags &= ~PF_MEMALLOC;
> -}
> -
>  static ssize_t read_file(struct nandsim *ns, struct file *file, void *buf, 
> size_t count, loff_t pos)
>  {
>   ssize_t tx;
> - int err, memalloc;
> + int err;
> + unsigned int noreclaim_flag;
>  
>   err = get_pages(ns, file, count, pos);
>   if (err)
>   return err;
> - memalloc = set_memalloc();
> + noreclaim_flag = memalloc_noreclaim_save();
>   tx = kernel_read(file, pos, buf, count);
> - clear_memalloc(memalloc);
> + memalloc_noreclaim_restore(noreclaim_flag);
>   put_pages(ns);
>   return tx;
>  }
> @@ -1400,14 +1388,15 @@ static ssize_t read_file(struct nandsim *ns, struct 
> file *file, void *buf, size_
>  static ssize_t write_file(struct nandsim *ns, struct file *file, void *buf, 
> size_t count, loff_t pos)
>  {
>   ssize_t tx;
> - int err, memalloc;
> + int err;
> + unsigned int noreclaim_flag;
>  
>   err = get_pages(ns, file, count, pos);
>   if (err)
>   return err;
> - memalloc = set_memalloc();
> + noreclaim_flag = memalloc_noreclaim_save();
>   tx = kernel_write(file, buf, count, pos);
> - clear_memalloc(memalloc);
> + memalloc_noreclaim_restore(noreclaim_flag);
>   put_pages(ns);
>   return tx;
>  }
> -- 
> 2.12.2

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/4] mm: introduce memalloc_noreclaim_{save,restore}

2017-04-05 Thread Michal Hocko
On Wed 05-04-17 09:46:58, Vlastimil Babka wrote:
> The previous patch has shown that simply setting and clearing PF_MEMALLOC in
> current->flags can result in wrongly clearing a pre-existing PF_MEMALLOC flag
> and potentially lead to recursive reclaim. Let's introduce helpers that support
> proper nesting by saving the previous state of the flag, similar to the
> existing memalloc_noio_* and memalloc_nofs_* helpers. Convert existing
> setting/clearing of PF_MEMALLOC within mm to the new helpers.
> 
> There are no known issues with the converted code, but the change makes it
> more robust.
> 
> Suggested-by: Michal Hocko 
> Signed-off-by: Vlastimil Babka 

One could argue that tsk_restore_flags() could be extended to provide
tsk_set_flags() and use it for all allocation related PF flags. I do not
have a strong opinion on that, but an explicit API sounds a bit better to
me because it is easier to follow (at least for me). If others think that
a generic API would be better then I won't have any objections. Anyway
this looks good to me.
Acked-by: Michal Hocko 
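
As a usage illustration only (not part of the patch), the save/restore
pair nests safely where a bare set/clear of PF_MEMALLOC would not:

	unsigned int outer, inner;

	outer = memalloc_noreclaim_save();	/* flag was clear -> outer == 0 */
	/* ... */
	inner = memalloc_noreclaim_save();	/* already set -> inner == PF_MEMALLOC */
	/* allocations here will not recurse into direct reclaim */
	memalloc_noreclaim_restore(inner);	/* PF_MEMALLOC stays set */
	/* ... */
	memalloc_noreclaim_restore(outer);	/* PF_MEMALLOC cleared again */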

> ---
>  include/linux/sched/mm.h | 12 
>  mm/page_alloc.c  | 11 ++-
>  mm/vmscan.c  | 17 +++--
>  3 files changed, 29 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 9daabe138c99..2b24a6974847 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -191,4 +191,16 @@ static inline void memalloc_nofs_restore(unsigned int 
> flags)
>   current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
>  }
>  
> +static inline unsigned int memalloc_noreclaim_save(void)
> +{
> + unsigned int flags = current->flags & PF_MEMALLOC;
> + current->flags |= PF_MEMALLOC;
> + return flags;
> +}
> +
> +static inline void memalloc_noreclaim_restore(unsigned int flags)
> +{
> + current->flags = (current->flags & ~PF_MEMALLOC) | flags;
> +}
> +
>  #endif /* _LINUX_SCHED_MM_H */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b84e6ffbe756..037e32dccd7a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3288,15 +3288,15 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned 
> int order,
>   enum compact_priority prio, enum compact_result *compact_result)
>  {
>   struct page *page;
> - unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
> + unsigned int noreclaim_flag;
>  
>   if (!order)
>   return NULL;
>  
> - current->flags |= PF_MEMALLOC;
> + noreclaim_flag = memalloc_noreclaim_save();
>   *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>   prio);
> - current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;
> + memalloc_noreclaim_restore(noreclaim_flag);
>  
>   if (*compact_result <= COMPACT_INACTIVE)
>   return NULL;
> @@ -3443,12 +3443,13 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
>  {
>   struct reclaim_state reclaim_state;
>   int progress;
> + unsigned int noreclaim_flag;
>  
>   cond_resched();
>  
>   /* We now go into synchronous reclaim */
>   cpuset_memory_pressure_bump();
> - current->flags |= PF_MEMALLOC;
> + noreclaim_flag = memalloc_noreclaim_save();
>   lockdep_set_current_reclaim_state(gfp_mask);
>   reclaim_state.reclaimed_slab = 0;
>   current->reclaim_state = &reclaim_state;
> @@ -3458,7 +3459,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
>  
>   current->reclaim_state = NULL;
>   lockdep_clear_current_reclaim_state();
> - current->flags &= ~PF_MEMALLOC;
> + memalloc_noreclaim_restore(noreclaim_flag);
>  
>   cond_resched();
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 58615bb27f2f..ff63b91a0f48 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2992,6 +2992,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct 
> mem_cgroup *memcg,
>   struct zonelist *zonelist;
>   unsigned long nr_reclaimed;
>   int nid;
> + unsigned int noreclaim_flag;
>   struct scan_control sc = {
>   .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>   .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
> @@ -3018,9 +3019,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct 
> mem_cgroup *memcg,
>   sc.gfp_mask,
>   sc.reclaim_idx);
>  
> - current->flags

Re: [PATCH 3/4] treewide: convert PF_MEMALLOC manipulations to new helpers

2017-04-05 Thread Michal Hocko
On Wed 05-04-17 09:46:59, Vlastimil Babka wrote:
> We now have memalloc_noreclaim_{save,restore} helpers for robust setting and
> clearing of PF_MEMALLOC. Let's convert the code which was using the generic
> tsk_restore_flags(). No functional change.

It would be really great to revisit why those places outside of the mm
proper really need this flag. I know this is a painful exercise but I
wouldn't be surprised if there were abusers there.

> Signed-off-by: Vlastimil Babka 
> Cc: Josef Bacik 
> Cc: Lee Duncan 
> Cc: Chris Leech 
> Cc: "David S. Miller" 
> Cc: Eric Dumazet 

Acked-by: Michal Hocko 

> ---
>  drivers/block/nbd.c  | 7 ---
>  drivers/scsi/iscsi_tcp.c | 7 ---
>  net/core/dev.c   | 7 ---
>  net/core/sock.c  | 7 ---
>  4 files changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index 03ae72985c79..929fc548c7fb 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -210,7 +211,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, 
> int send,
>   struct socket *sock = nbd->socks[index]->sock;
>   int result;
>   struct msghdr msg;
> - unsigned long pflags = current->flags;
> + unsigned int noreclaim_flag;
>  
>   if (unlikely(!sock)) {
>   dev_err_ratelimited(disk_to_dev(nbd->disk),
> @@ -221,7 +222,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, 
> int send,
>  
>   msg.msg_iter = *iter;
>  
> - current->flags |= PF_MEMALLOC;
> + noreclaim_flag = memalloc_noreclaim_save();
>   do {
>   sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
>   msg.msg_name = NULL;
> @@ -244,7 +245,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, 
> int send,
>   *sent += result;
>   } while (msg_data_left(&msg));
>  
> - tsk_restore_flags(current, pflags, PF_MEMALLOC);
> + memalloc_noreclaim_restore(noreclaim_flag);
>  
>   return result;
>  }
> diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
> index 4228aba1f654..4842fc0e809d 100644
> --- a/drivers/scsi/iscsi_tcp.c
> +++ b/drivers/scsi/iscsi_tcp.c
> @@ -30,6 +30,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -371,10 +372,10 @@ static inline int iscsi_sw_tcp_xmit_qlen(struct 
> iscsi_conn *conn)
>  static int iscsi_sw_tcp_pdu_xmit(struct iscsi_task *task)
>  {
>   struct iscsi_conn *conn = task->conn;
> - unsigned long pflags = current->flags;
> + unsigned int noreclaim_flag;
>   int rc = 0;
>  
> - current->flags |= PF_MEMALLOC;
> + noreclaim_flag = memalloc_noreclaim_save();
>  
>   while (iscsi_sw_tcp_xmit_qlen(conn)) {
>   rc = iscsi_sw_tcp_xmit(conn);
> @@ -387,7 +388,7 @@ static int iscsi_sw_tcp_pdu_xmit(struct iscsi_task *task)
>   rc = 0;
>   }
>  
> - tsk_restore_flags(current, pflags, PF_MEMALLOC);
> + memalloc_noreclaim_restore(noreclaim_flag);
>   return rc;
>  }
>  
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fde8b3f7136b..e0705a126b24 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -81,6 +81,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -4227,7 +4228,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>   int ret;
>  
>   if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
> - unsigned long pflags = current->flags;
> + unsigned int noreclaim_flag;
>  
>   /*
>* PFMEMALLOC skbs are special, they should
> @@ -4238,9 +4239,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>* Use PF_MEMALLOC as this saves us from propagating the 
> allocation
>* context down to all allocation sites.
>*/
> - current->flags |= PF_MEMALLOC;
> + noreclaim_flag = memalloc_noreclaim_save();
>   ret = __netif_receive_skb_core(skb, true);
> - tsk_restore_flags(current, pflags, PF_MEMALLOC);
> + memalloc_noreclaim_restore(noreclaim_flag);
>   } else
>   ret = __netif_receive_skb_core(skb, false);
>  
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 392f9b6f96e2..0b2d06b4c308 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -102,6 +102,7 @@
>  #inc

Re: [PATCH 1/4] mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC

2017-04-05 Thread Michal Hocko
On Wed 05-04-17 09:46:57, Vlastimil Babka wrote:
> The function __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent
> deadlock during page migration by lock_page() (see the comment in
> __unmap_and_move()). Then it unconditionally clears the flag, which can clear a
> pre-existing PF_MEMALLOC flag and result in recursive reclaim. This was not a
> problem until commit a8161d1ed609 ("mm, page_alloc: restructure direct
> compaction handling in slowpath"), because direct compaction was called only
> after direct reclaim, which was skipped when the PF_MEMALLOC flag was set.
> 
> Even now it's only a theoretical issue, as the new callsite of
> __alloc_pages_direct_compact() is reached only for costly orders and when
> gfp_pfmemalloc_allowed() is true, which means either __GFP_NOMEMALLOC is in
> gfp_flags or in_interrupt() is true. There is no such known context, but let's
> play it safe and make __alloc_pages_direct_compact() robust for cases where
> PF_MEMALLOC is already set.
> 
> Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling 
> in slowpath")
> Reported-by: Andrey Ryabinin 
> Signed-off-by: Vlastimil Babka 
> Cc: 

Acked-by: Michal Hocko 
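
Just to illustrate the hazard being fixed (a sketch, not the actual call
sites):

	/* the caller already runs with PF_MEMALLOC set */
	current->flags |= PF_MEMALLOC;
	/* ... */
		/* __alloc_pages_direct_compact() before this patch: */
		current->flags |= PF_MEMALLOC;
		*compact_result = try_to_compact_pages(gfp_mask, order,
						       alloc_flags, ac, prio);
		current->flags &= ~PF_MEMALLOC;
	/*
	 * ... back in the caller the flag is now gone, so allocations made
	 * here may recurse into direct reclaim again.
	 */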

> ---
>  mm/page_alloc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589f8be53be..b84e6ffbe756 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3288,6 +3288,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned 
> int order,
>   enum compact_priority prio, enum compact_result *compact_result)
>  {
>   struct page *page;
> + unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
>  
>   if (!order)
>   return NULL;
> @@ -3295,7 +3296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned 
> int order,
>   current->flags |= PF_MEMALLOC;
>   *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>   prio);
> - current->flags &= ~PF_MEMALLOC;
> + current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;
>  
>   if (*compact_result <= COMPACT_INACTIVE)
>   return NULL;
> -- 
> 2.12.2

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] loop: Add PF_LESS_THROTTLE to block/loop device thread.

2017-04-05 Thread Michal Hocko
On Wed 05-04-17 09:19:27, Michal Hocko wrote:
> On Wed 05-04-17 14:33:50, NeilBrown wrote:
[...]
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 0ecb6461ed81..44b3506fd086 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -852,6 +852,7 @@ static int loop_prepare_queue(struct loop_device *lo)
> > if (IS_ERR(lo->worker_task))
> > return -ENOMEM;
> > set_user_nice(lo->worker_task, MIN_NICE);
> > +   lo->worker_task->flags |= PF_LESS_THROTTLE;
> > return 0;
> 
> As mentioned elsewhere, PF flags should be updated only on the current
> task, otherwise there is a potential rmw race. Is this safe? The code
> runs concurrently with the worker thread.

I believe you need something like this instead
---
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f347285c67ec..07b2a909e4fb 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -844,10 +844,16 @@ static void loop_unprepare_queue(struct loop_device *lo)
kthread_stop(lo->worker_task);
 }
 
+int loop_kthread_worker_fn(void *worker_ptr)
+{
+   current->flags |= PF_LESS_THROTTLE;
+   return kthread_worker_fn(worker_ptr);
+}
+
 static int loop_prepare_queue(struct loop_device *lo)
 {
kthread_init_worker(&lo->worker);
-   lo->worker_task = kthread_run(kthread_worker_fn,
+   lo->worker_task = kthread_run(loop_kthread_worker_fn,
&lo->worker, "loop%d", lo->lo_number);
if (IS_ERR(lo->worker_task))
return -ENOMEM;
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] loop: Add PF_LESS_THROTTLE to block/loop device thread.

2017-04-05 Thread Michal Hocko
On Wed 05-04-17 14:33:50, NeilBrown wrote:
> 
> When a filesystem is mounted from a loop device, writes are
> throttled by balance_dirty_pages() twice: once when writing
> to the filesystem and once when the loop_handle_cmd() writes
> to the backing file.  This double-throttling can trigger
> positive feedback loops that create significant delays.  The
> throttling at the lower level is seen by the upper level as
> a slow device, so it throttles extra hard.
> 
> The PF_LESS_THROTTLE flag was created to handle exactly this
> circumstance, though with an NFS filesystem mounted from a
> local NFS server.  It reduces the throttling on the lower
> layer so that it can proceed largely unthrottled.
> 
> To demonstrate this, create a filesystem on a loop device
> and write (e.g. with dd) several large files which combine
> to consume significantly more than the limit set by
> /proc/sys/vm/dirty_ratio or dirty_bytes.  Measure the total
> time taken.
> 
> When I do this directly on a device (no loop device) the
> total time for several runs (mkfs, mount, write 200 files,
> umount) is fairly stable: 28-35 seconds.
> When I do this over a loop device the times are much worse
> and less stable.  52-460 seconds.  Half below 100 seconds,
> half above.
> When I apply this patch, the times become stable again,
> though not as fast as the no-loop-back case: 53-72 seconds.
> 
> There may be room for further improvement as the total overhead still
> seems too high, but this is a big improvement.
> 
> Reviewed-by: Christoph Hellwig 
> Acked-by: Michal Hocko 
> Signed-off-by: NeilBrown 
> ---
> 
> I moved where the flag is set, thanks to a suggestion from
> Ming Lei.
> I've preserved the *-by: tags I was offered despite the code
> being different, as the concept is identical.
> 
> Thanks,
> NeilBrown
> 
> 
>  drivers/block/loop.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 0ecb6461ed81..44b3506fd086 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -852,6 +852,7 @@ static int loop_prepare_queue(struct loop_device *lo)
>   if (IS_ERR(lo->worker_task))
>   return -ENOMEM;
>   set_user_nice(lo->worker_task, MIN_NICE);
> + lo->worker_task->flags |= PF_LESS_THROTTLE;
>   return 0;

As mentioned elsewhere, PF flags should be updated only on the current
task, otherwise there is a potential rmw race. Is this safe? The code
runs concurrently with the worker thread.
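
To illustrate the race: task->flags updates are plain, non-atomic
read-modify-write sequences, so a concurrent update of the worker's own
flags can be lost (PF_SOMETHING below is just a placeholder, not a real
flag):

	/* loop_prepare_queue() on CPU A:	the worker thread on CPU B:	*/
	/*					old = current->flags;		*/
	lo->worker_task->flags |= PF_LESS_THROTTLE;
	/*					current->flags = old | PF_SOMETHING; */

	/* CPU B's store was computed from the stale 'old' value, so the
	 * PF_LESS_THROTTLE bit just set by CPU A is silently lost */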


-- 
Michal Hocko
SUSE Labs


Re: [PATCH] loop: Add PF_LESS_THROTTLE to block/loop device thread.

2017-04-04 Thread Michal Hocko
On Mon 03-04-17 11:18:51, NeilBrown wrote:
> 
> When a filesystem is mounted from a loop device, writes are
> throttled by balance_dirty_pages() twice: once when writing
> to the filesystem and once when the loop_handle_cmd() writes
> to the backing file.  This double-throttling can trigger
> positive feedback loops that create significant delays.  The
> throttling at the lower level is seen by the upper level as
> a slow device, so it throttles extra hard.
> 
> The PF_LESS_THROTTLE flag was created to handle exactly this
> circumstance, though with an NFS filesystem mounted from a
> local NFS server.  It reduces the throttling on the lower
> layer so that it can proceed largely unthrottled.
> 
> To demonstrate this, create a filesystem on a loop device
> and write (e.g. with dd) several large files which combine
> to consume significantly more than the limit set by
> /proc/sys/vm/dirty_ratio or dirty_bytes.  Measure the total
> time taken.
> 
> When I do this directly on a device (no loop device) the
> total time for several runs (mkfs, mount, write 200 files,
> umount) is fairly stable: 28-35 seconds.
> When I do this over a loop device the times are much worse
> and less stable.  52-460 seconds.  Half below 100 seconds,
> half above.
> When I apply this patch, the times become stable again,
> though not as fast as the no-loop-back case: 53-72 seconds.
> 
> There may be room for further improvement as the total overhead still
> seems too high, but this is a big improvement.

Yes this makes sense to me

> Signed-off-by: NeilBrown 

Acked-by: Michal Hocko 

one nit below

> ---
>  drivers/block/loop.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 0ecb6461ed81..a7e1dd215fc2 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1694,8 +1694,11 @@ static void loop_queue_work(struct kthread_work *work)
>  {
>   struct loop_cmd *cmd =
>   container_of(work, struct loop_cmd, work);
> + int oldflags = current->flags & PF_LESS_THROTTLE;
>  
> + current->flags |= PF_LESS_THROTTLE;
>   loop_handle_cmd(cmd);
> + current->flags = (current->flags & ~PF_LESS_THROTTLE) | oldflags;

we have a helper for this: tsk_restore_flags(). It is not used
consistently, and maybe we want a dedicated API like we have for the
scope NOIO/NOFS, but that is a separate thing. I would find
tsk_restore_flags() easier to read.
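
I.e., just as a sketch of what I mean for the hunk above (the same
pattern as the existing tsk_restore_flags() users):

	unsigned long pflags = current->flags;

	current->flags |= PF_LESS_THROTTLE;
	loop_handle_cmd(cmd);
	tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);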

-- 
Michal Hocko
SUSE Labs


Re: LSF/MM 2017: Call for Proposals

2016-12-08 Thread Michal Hocko
On Thu 08-12-16 07:30:43, James Bottomley wrote:
> On Thu, 2016-12-08 at 13:26 +0100, Michal Hocko wrote:
> > On Wed 07-12-16 06:57:06, James Bottomley wrote:
> > [...]
> > > Just on this point, since there seems to be a lot of confusion: lsf
> > > -pc
> > > is the list for contacting the programme committee, so you cannot
> > > subscribe to it.
> > > 
> > > There is no -discuss equivalent, like kernel summit has, because we
> > > expect you to cc the relevant existing mailing list and have the
> > > discussion there instead rather than expecting people to subscribe
> > > to a
> > > new list.
> > 
> > There used to be l...@lists.linux-foundation.org. Is it not active
> > anymore?
> 
> Not until we get the list of invitees sorted out.  And then it's only
> for stuff actually to do with lsf (like logistics, bof or session
> requests or other meetups).  Any technical stuff should still be cc'd
> to the relevant mailing list.

OK, I tend to forget about that. Thanks for the clarification!
-- 
Michal Hocko
SUSE Labs


Re: LSF/MM 2017: Call for Proposals

2016-12-08 Thread Michal Hocko
On Wed 07-12-16 06:57:06, James Bottomley wrote:
[...]
> Just on this point, since there seems to be a lot of confusion: lsf-pc
> is the list for contacting the programme committee, so you cannot
> subscribe to it.
> 
> There is no -discuss equivalent, like kernel summit has, because we
> expect you to cc the relevant existing mailing list and have the
> discussion there instead rather than expecting people to subscribe to a
> new list.

There used to be l...@lists.linux-foundation.org. Is it not active
anymore?

-- 
Michal Hocko
SUSE Labs