Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-23 Thread Suren Baghdasaryan
On Tue, Dec 22, 2020 at 11:57 PM Christoph Hellwig  wrote:
>
> On Tue, Dec 22, 2020 at 09:48:43AM -0800, Suren Baghdasaryan wrote:
> > Thanks for the feedback! The use case is userspace memory reaping
> > similar to oom-reaper. Detailed justification is here:
> > https://lore.kernel.org/linux-mm/20201124053943.1684874-1-sur...@google.com
>
> Given that this new variant of process_madvise
>
>   a) does not work on an address range

True, however I can see other madvise flavors that could be used on
the entire process. For example process_madvise(MADV_PAGEOUT) could be
used to "shrink" an entire inactive background process.

>   b) is destructive

I agree that memory reaping might be the only case when a destructive
process_madvise() makes sense. Unless the target process is dying, a
destructive process_madvise() would need coordination with the target
process, and if it's coordinated then the target might as well call
normal madvise() itself.

>   c) doesn't share much code at all with the rest of process_madvise

It actually does reuse a considerable part of the code, but the same
code can be refactored and reused either way.

>
> Why not add a proper separate syscall?

I think my answer to (a) is one justification for allowing
process_madvise() to operate on the entire process. Also MADV_DONTNEED
seems quite suitable for this operation.
Considering the above answers, are you still leaning towards a separate syscall?

>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-22 Thread Christoph Hellwig
On Tue, Dec 22, 2020 at 09:48:43AM -0800, Suren Baghdasaryan wrote:
> Thanks for the feedback! The use case is userspace memory reaping
> similar to oom-reaper. Detailed justification is here:
> https://lore.kernel.org/linux-mm/20201124053943.1684874-1-sur...@google.com

Given that this new variant of process_madvise

  a) does not work on an address range
  b) is destructive
  c) doesn't share much code at all with the rest of process_madvise

Why not add a proper separate syscall?


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-22 Thread Suren Baghdasaryan
On Tue, Dec 22, 2020 at 9:48 AM Suren Baghdasaryan  wrote:
>
> On Tue, Dec 22, 2020 at 5:44 AM Christoph Hellwig  wrote:
> >
> > On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> > > > Can we just use one element in iovec to indicate entire address rather
> > > > than using up the reserved flags?
> > > >
> > > > struct iovec {
> > > > .iov_base = NULL,
> > > > .iov_len = (~(size_t)0),
> > > > };
> > >
> > > In addition to Suren's objections, I think it's also worth considering
> > > how this looks in terms of compat API. If a compat process does
> > > process_madvise() on another compat process, it would be specifying
> > > the maximum 32-bit number, rather than the maximum 64-bit number, so
> > > you'd need special code to catch that case, which would be ugly.
> > >
> > > And when a compat process uses this API on a non-compat process, it
> > > semantically gets really weird: The actual address range covered would
> > > be larger than the address range specified.
> > >
> > > And if we want different access checks for the two flavors in the
> > > future, gating that different behavior on special values in the iovec
> > > would feel too magical to me.
> > >
> > > And the length value SIZE_MAX doesn't really make sense anyway because
> > > the length of the whole address space would be SIZE_MAX+1, which you
> > > can't express.
> > >
> > > So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> > > a magic number here.
> >
> > Yes, using SIZE_MAX is a horrible interface in this case.  I'm not
> > a huge fan of a flag either.  What is the use case for the madvise
> > to all of a processes address space anyway?
>
> Thanks for the feedback! The use case is userspace memory reaping
> similar to oom-reaper. Detailed justification is here:
> https://lore.kernel.org/linux-mm/20201124053943.1684874-1-sur...@google.com

Actually this post in the most informative and includes test results:
https://lore.kernel.org/linux-api/cajucfpgz1kpm3g1gzh+09z7aowkg05qsammisj7h5mdmrrr...@mail.gmail.com/


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-22 Thread Suren Baghdasaryan
On Tue, Dec 22, 2020 at 5:44 AM Christoph Hellwig  wrote:
>
> On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> > > Can we just use one element in iovec to indicate entire address rather
> > > than using up the reserved flags?
> > >
> > > struct iovec {
> > > .iov_base = NULL,
> > > .iov_len = (~(size_t)0),
> > > };
> >
> > In addition to Suren's objections, I think it's also worth considering
> > how this looks in terms of compat API. If a compat process does
> > process_madvise() on another compat process, it would be specifying
> > the maximum 32-bit number, rather than the maximum 64-bit number, so
> > you'd need special code to catch that case, which would be ugly.
> >
> > And when a compat process uses this API on a non-compat process, it
> > semantically gets really weird: The actual address range covered would
> > be larger than the address range specified.
> >
> > And if we want different access checks for the two flavors in the
> > future, gating that different behavior on special values in the iovec
> > would feel too magical to me.
> >
> > And the length value SIZE_MAX doesn't really make sense anyway because
> > the length of the whole address space would be SIZE_MAX+1, which you
> > can't express.
> >
> > So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> > a magic number here.
>
> Yes, using SIZE_MAX is a horrible interface in this case.  I'm not
> a huge fan of a flag either.  What is the use case for the madvise
> to all of a processes address space anyway?

Thanks for the feedback! The use case is userspace memory reaping
similar to oom-reaper. Detailed justification is here:
https://lore.kernel.org/linux-mm/20201124053943.1684874-1-sur...@google.com


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-22 Thread Christoph Hellwig
On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> > Can we just use one element in iovec to indicate entire address rather
> > than using up the reserved flags?
> >
> > struct iovec {
> > .iov_base = NULL,
> > .iov_len = (~(size_t)0),
> > };
> 
> In addition to Suren's objections, I think it's also worth considering
> how this looks in terms of compat API. If a compat process does
> process_madvise() on another compat process, it would be specifying
> the maximum 32-bit number, rather than the maximum 64-bit number, so
> you'd need special code to catch that case, which would be ugly.
> 
> And when a compat process uses this API on a non-compat process, it
> semantically gets really weird: The actual address range covered would
> be larger than the address range specified.
> 
> And if we want different access checks for the two flavors in the
> future, gating that different behavior on special values in the iovec
> would feel too magical to me.
> 
> And the length value SIZE_MAX doesn't really make sense anyway because
> the length of the whole address space would be SIZE_MAX+1, which you
> can't express.
> 
> So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> a magic number here.

Yes, using SIZE_MAX is a horrible interface in this case.  I'm not
a huge fan of a flag either.  What is the use case for the madvise
to all of a processes address space anyway?


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-11 Thread Jann Horn
On Sat, Dec 12, 2020 at 12:01 AM Minchan Kim  wrote:
> On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> > +CC Christoph Hellwig for opinions on compat
> >
> > On Thu, Nov 26, 2020 at 12:22 AM Minchan Kim  wrote:
> > > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > > process_madvise requires a vector of address ranges to be provided for
> > > > its operations. When an advice should be applied to the entire process,
> > > > the caller process has to obtain the list of VMAs of the target process
> > > > by reading the /proc/pid/maps or some other way. The cost of this
> > > > operation grows linearly with increasing number of VMAs in the target
> > > > process. Even constructing the input vector can be non-trivial when
> > > > target process has several thousands of VMAs and the syscall is being
> > > > issued during high memory pressure period when new allocations for such
> > > > a vector would only worsen the situation.
> > > > In the case when advice is being applied to the entire memory space of
> > > > the target process, this creates an extra overhead.
> > > > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > > advise a memory range of the target process. For now, to keep it simple,
> > > > only the entire process memory range is supported, vec and vlen inputs
> > > > in this mode are ignored and can be NULL and 0.
> > > > Instead of returning the number of bytes that advice was successfully
> > > > applied to, the syscall in this mode returns 0 on success. This is due
> > > > to the fact that the number of bytes would not be useful for the caller
> > > > that does not know the amount of memory the call is supposed to affect.
> > > > Besides, the ssize_t return type can be too small to hold the number of
> > > > bytes affected when the operation is applied to a large memory range.
> > >
> > > Can we just use one element in iovec to indicate entire address rather
> > > than using up the reserved flags?
> > >
> > > struct iovec {
> > > .iov_base = NULL,
> > > .iov_len = (~(size_t)0),
> > > };
> >
> > In addition to Suren's objections, I think it's also worth considering
> > how this looks in terms of compat API. If a compat process does
> > process_madvise() on another compat process, it would be specifying
> > the maximum 32-bit number, rather than the maximum 64-bit number, so
> > you'd need special code to catch that case, which would be ugly.
> >
> > And when a compat process uses this API on a non-compat process, it
> > semantically gets really weird: The actual address range covered would
> > be larger than the address range specified.
> >
> > And if we want different access checks for the two flavors in the
> > future, gating that different behavior on special values in the iovec
> > would feel too magical to me.
> >
> > And the length value SIZE_MAX doesn't really make sense anyway because
> > the length of the whole address space would be SIZE_MAX+1, which you
> > can't express.
> >
> > So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> > a magic number here.
>
> Can't we simply pass NULL as iovec as special id, then?

AFAIK in theory NULL can be a valid userspace pointer (although that
basically never happens and, on MMU systems, requires root to
explicitly do it). On the other hand, there are some parts of the UAPI
that already special-case NULL, so maybe that's considered acceptable?

Also, special-casing NULL slightly increases the chance that userspace
messes up and accidentally triggers completely different behavior
because an allocation failed or something like that.

So while I'm not strongly against using NULL here, I don't exactly
like the idea.


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-11 Thread Minchan Kim
On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> +CC Christoph Hellwig for opinions on compat
> 
> On Thu, Nov 26, 2020 at 12:22 AM Minchan Kim  wrote:
> > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > process_madvise requires a vector of address ranges to be provided for
> > > its operations. When an advice should be applied to the entire process,
> > > the caller process has to obtain the list of VMAs of the target process
> > > by reading the /proc/pid/maps or some other way. The cost of this
> > > operation grows linearly with increasing number of VMAs in the target
> > > process. Even constructing the input vector can be non-trivial when
> > > target process has several thousands of VMAs and the syscall is being
> > > issued during high memory pressure period when new allocations for such
> > > a vector would only worsen the situation.
> > > In the case when advice is being applied to the entire memory space of
> > > the target process, this creates an extra overhead.
> > > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > advise a memory range of the target process. For now, to keep it simple,
> > > only the entire process memory range is supported, vec and vlen inputs
> > > in this mode are ignored and can be NULL and 0.
> > > Instead of returning the number of bytes that advice was successfully
> > > applied to, the syscall in this mode returns 0 on success. This is due
> > > to the fact that the number of bytes would not be useful for the caller
> > > that does not know the amount of memory the call is supposed to affect.
> > > Besides, the ssize_t return type can be too small to hold the number of
> > > bytes affected when the operation is applied to a large memory range.
> >
> > Can we just use one element in iovec to indicate entire address rather
> > than using up the reserved flags?
> >
> > struct iovec {
> > .iov_base = NULL,
> > .iov_len = (~(size_t)0),
> > };
> 
> In addition to Suren's objections, I think it's also worth considering
> how this looks in terms of compat API. If a compat process does
> process_madvise() on another compat process, it would be specifying
> the maximum 32-bit number, rather than the maximum 64-bit number, so
> you'd need special code to catch that case, which would be ugly.
> 
> And when a compat process uses this API on a non-compat process, it
> semantically gets really weird: The actual address range covered would
> be larger than the address range specified.
> 
> And if we want different access checks for the two flavors in the
> future, gating that different behavior on special values in the iovec
> would feel too magical to me.
> 
> And the length value SIZE_MAX doesn't really make sense anyway because
> the length of the whole address space would be SIZE_MAX+1, which you
> can't express.
> 
> So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> a magic number here.

Can't we simply pass NULL as iovec as special id, then?


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-11 Thread Jann Horn
+CC Christoph Hellwig for opinions on compat

On Thu, Nov 26, 2020 at 12:22 AM Minchan Kim  wrote:
> On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > process_madvise requires a vector of address ranges to be provided for
> > its operations. When an advice should be applied to the entire process,
> > the caller process has to obtain the list of VMAs of the target process
> > by reading the /proc/pid/maps or some other way. The cost of this
> > operation grows linearly with increasing number of VMAs in the target
> > process. Even constructing the input vector can be non-trivial when
> > target process has several thousands of VMAs and the syscall is being
> > issued during high memory pressure period when new allocations for such
> > a vector would only worsen the situation.
> > In the case when advice is being applied to the entire memory space of
> > the target process, this creates an extra overhead.
> > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > advise a memory range of the target process. For now, to keep it simple,
> > only the entire process memory range is supported, vec and vlen inputs
> > in this mode are ignored and can be NULL and 0.
> > Instead of returning the number of bytes that advice was successfully
> > applied to, the syscall in this mode returns 0 on success. This is due
> > to the fact that the number of bytes would not be useful for the caller
> > that does not know the amount of memory the call is supposed to affect.
> > Besides, the ssize_t return type can be too small to hold the number of
> > bytes affected when the operation is applied to a large memory range.
>
> Can we just use one element in iovec to indicate entire address rather
> than using up the reserved flags?
>
> struct iovec {
> .iov_base = NULL,
> .iov_len = (~(size_t)0),
> };

In addition to Suren's objections, I think it's also worth considering
how this looks in terms of compat API. If a compat process does
process_madvise() on another compat process, it would be specifying
the maximum 32-bit number, rather than the maximum 64-bit number, so
you'd need special code to catch that case, which would be ugly.

And when a compat process uses this API on a non-compat process, it
semantically gets really weird: The actual address range covered would
be larger than the address range specified.

And if we want different access checks for the two flavors in the
future, gating that different behavior on special values in the iovec
would feel too magical to me.

And the length value SIZE_MAX doesn't really make sense anyway because
the length of the whole address space would be SIZE_MAX+1, which you
can't express.

So I'm in favor of a new flag, and strongly against using SIZE_MAX as
a magic number here.


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-07 Thread Suren Baghdasaryan
On Mon, Nov 30, 2020 at 11:01 AM Suren Baghdasaryan  wrote:
>
> On Wed, Nov 25, 2020 at 3:43 PM Minchan Kim  wrote:
> >
> > On Wed, Nov 25, 2020 at 03:23:40PM -0800, Suren Baghdasaryan wrote:
> > > On Wed, Nov 25, 2020 at 3:13 PM Minchan Kim  wrote:
> > > >
> > > > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > > > process_madvise requires a vector of address ranges to be provided for
> > > > > its operations. When an advice should be applied to the entire 
> > > > > process,
> > > > > the caller process has to obtain the list of VMAs of the target 
> > > > > process
> > > > > by reading the /proc/pid/maps or some other way. The cost of this
> > > > > operation grows linearly with increasing number of VMAs in the target
> > > > > process. Even constructing the input vector can be non-trivial when
> > > > > target process has several thousands of VMAs and the syscall is being
> > > > > issued during high memory pressure period when new allocations for 
> > > > > such
> > > > > a vector would only worsen the situation.
> > > > > In the case when advice is being applied to the entire memory space of
> > > > > the target process, this creates an extra overhead.
> > > > > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > > > advise a memory range of the target process. For now, to keep it 
> > > > > simple,
> > > > > only the entire process memory range is supported, vec and vlen inputs
> > > > > in this mode are ignored and can be NULL and 0.
> > > > > Instead of returning the number of bytes that advice was successfully
> > > > > applied to, the syscall in this mode returns 0 on success. This is due
> > > > > to the fact that the number of bytes would not be useful for the 
> > > > > caller
> > > > > that does not know the amount of memory the call is supposed to 
> > > > > affect.
> > > > > Besides, the ssize_t return type can be too small to hold the number 
> > > > > of
> > > > > bytes affected when the operation is applied to a large memory range.
> > > >
> > > > Can we just use one element in iovec to indicate entire address rather
> > > > than using up the reserved flags?
> > > >
> > > > struct iovec {
> > > > .iov_base = NULL,
> > > > .iov_len = (~(size_t)0),
> > > > };
> > > >
> > > > Furthermore, it would be applied for other syscalls where have support
> > > > iovec if we agree on it.
> > > >
> > >
> > > The flag also changes the return value semantics. If we follow your
> > > suggestion we should also agree that in this mode the return value
> > > will be 0 on success and negative otherwise instead of the number of
> > > bytes madvise was applied to.
> >
> > Well, return value will depends on the each API. If the operation is
> > desruptive, it should return the right size affected by the API but
> > would be okay with 0 or error, otherwise.
>
> I'm fine with dropping the flag, I just thought with the flag it would
> be more explicit that this is a special mode operating on ranges. This
> way the patch also becomes simpler.
> Andrew, Michal, Christian, what do you think about such API? Should I
> change the API this way / keep the flag / change it in some other way?


Friendly ping to get some feedback on the proposed API please.


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-11-30 Thread Suren Baghdasaryan
On Wed, Nov 25, 2020 at 3:43 PM Minchan Kim  wrote:
>
> On Wed, Nov 25, 2020 at 03:23:40PM -0800, Suren Baghdasaryan wrote:
> > On Wed, Nov 25, 2020 at 3:13 PM Minchan Kim  wrote:
> > >
> > > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > > process_madvise requires a vector of address ranges to be provided for
> > > > its operations. When an advice should be applied to the entire process,
> > > > the caller process has to obtain the list of VMAs of the target process
> > > > by reading the /proc/pid/maps or some other way. The cost of this
> > > > operation grows linearly with increasing number of VMAs in the target
> > > > process. Even constructing the input vector can be non-trivial when
> > > > target process has several thousands of VMAs and the syscall is being
> > > > issued during high memory pressure period when new allocations for such
> > > > a vector would only worsen the situation.
> > > > In the case when advice is being applied to the entire memory space of
> > > > the target process, this creates an extra overhead.
> > > > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > > advise a memory range of the target process. For now, to keep it simple,
> > > > only the entire process memory range is supported, vec and vlen inputs
> > > > in this mode are ignored and can be NULL and 0.
> > > > Instead of returning the number of bytes that advice was successfully
> > > > applied to, the syscall in this mode returns 0 on success. This is due
> > > > to the fact that the number of bytes would not be useful for the caller
> > > > that does not know the amount of memory the call is supposed to affect.
> > > > Besides, the ssize_t return type can be too small to hold the number of
> > > > bytes affected when the operation is applied to a large memory range.
> > >
> > > Can we just use one element in iovec to indicate entire address rather
> > > than using up the reserved flags?
> > >
> > > struct iovec {
> > > .iov_base = NULL,
> > > .iov_len = (~(size_t)0),
> > > };
> > >
> > > Furthermore, it would be applied for other syscalls where have support
> > > iovec if we agree on it.
> > >
> >
> > The flag also changes the return value semantics. If we follow your
> > suggestion we should also agree that in this mode the return value
> > will be 0 on success and negative otherwise instead of the number of
> > bytes madvise was applied to.
>
> Well, return value will depends on the each API. If the operation is
> desruptive, it should return the right size affected by the API but
> would be okay with 0 or error, otherwise.

I'm fine with dropping the flag, I just thought with the flag it would
be more explicit that this is a special mode operating on ranges. This
way the patch also becomes simpler.
Andrew, Michal, Christian, what do you think about such API? Should I
change the API this way / keep the flag / change it in some other way?


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-11-25 Thread Minchan Kim
On Wed, Nov 25, 2020 at 03:23:40PM -0800, Suren Baghdasaryan wrote:
> On Wed, Nov 25, 2020 at 3:13 PM Minchan Kim  wrote:
> >
> > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > process_madvise requires a vector of address ranges to be provided for
> > > its operations. When an advice should be applied to the entire process,
> > > the caller process has to obtain the list of VMAs of the target process
> > > by reading the /proc/pid/maps or some other way. The cost of this
> > > operation grows linearly with increasing number of VMAs in the target
> > > process. Even constructing the input vector can be non-trivial when
> > > target process has several thousands of VMAs and the syscall is being
> > > issued during high memory pressure period when new allocations for such
> > > a vector would only worsen the situation.
> > > In the case when advice is being applied to the entire memory space of
> > > the target process, this creates an extra overhead.
> > > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > advise a memory range of the target process. For now, to keep it simple,
> > > only the entire process memory range is supported, vec and vlen inputs
> > > in this mode are ignored and can be NULL and 0.
> > > Instead of returning the number of bytes that advice was successfully
> > > applied to, the syscall in this mode returns 0 on success. This is due
> > > to the fact that the number of bytes would not be useful for the caller
> > > that does not know the amount of memory the call is supposed to affect.
> > > Besides, the ssize_t return type can be too small to hold the number of
> > > bytes affected when the operation is applied to a large memory range.
> >
> > Can we just use one element in iovec to indicate entire address rather
> > than using up the reserved flags?
> >
> > struct iovec {
> > .iov_base = NULL,
> > .iov_len = (~(size_t)0),
> > };
> >
> > Furthermore, it would be applied for other syscalls where have support
> > iovec if we agree on it.
> >
> 
> The flag also changes the return value semantics. If we follow your
> suggestion we should also agree that in this mode the return value
> will be 0 on success and negative otherwise instead of the number of
> bytes madvise was applied to.

Well, return value will depends on the each API. If the operation is
desruptive, it should return the right size affected by the API but
would be okay with 0 or error, otherwise.


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-11-25 Thread Suren Baghdasaryan
On Wed, Nov 25, 2020 at 3:13 PM Minchan Kim  wrote:
>
> On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > process_madvise requires a vector of address ranges to be provided for
> > its operations. When an advice should be applied to the entire process,
> > the caller process has to obtain the list of VMAs of the target process
> > by reading the /proc/pid/maps or some other way. The cost of this
> > operation grows linearly with increasing number of VMAs in the target
> > process. Even constructing the input vector can be non-trivial when
> > target process has several thousands of VMAs and the syscall is being
> > issued during high memory pressure period when new allocations for such
> > a vector would only worsen the situation.
> > In the case when advice is being applied to the entire memory space of
> > the target process, this creates an extra overhead.
> > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > advise a memory range of the target process. For now, to keep it simple,
> > only the entire process memory range is supported, vec and vlen inputs
> > in this mode are ignored and can be NULL and 0.
> > Instead of returning the number of bytes that advice was successfully
> > applied to, the syscall in this mode returns 0 on success. This is due
> > to the fact that the number of bytes would not be useful for the caller
> > that does not know the amount of memory the call is supposed to affect.
> > Besides, the ssize_t return type can be too small to hold the number of
> > bytes affected when the operation is applied to a large memory range.
>
> Can we just use one element in iovec to indicate entire address rather
> than using up the reserved flags?
>
> struct iovec {
> .iov_base = NULL,
> .iov_len = (~(size_t)0),
> };
>
> Furthermore, it would be applied for other syscalls where have support
> iovec if we agree on it.
>

The flag also changes the return value semantics. If we follow your
suggestion we should also agree that in this mode the return value
will be 0 on success and negative otherwise instead of the number of
bytes madvise was applied to.

> >
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  arch/alpha/include/uapi/asm/mman.h   |  4 ++
> >  arch/mips/include/uapi/asm/mman.h|  4 ++
> >  arch/parisc/include/uapi/asm/mman.h  |  4 ++
> >  arch/xtensa/include/uapi/asm/mman.h  |  4 ++
> >  fs/io_uring.c|  2 +-
> >  include/linux/mm.h   |  3 +-
> >  include/uapi/asm-generic/mman-common.h   |  4 ++
> >  mm/madvise.c | 47 +---
> >  tools/include/uapi/asm-generic/mman-common.h |  4 ++
> >  9 files changed, 67 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h 
> > b/arch/alpha/include/uapi/asm/mman.h
> > index a18ec7f63888..54588d2f5406 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -79,4 +79,8 @@
> >  #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> >PKEY_DISABLE_WRITE)
> >
> > +/* process_madvise flags */
> > +#define PMADV_FLAG_RANGE 0x1 /* advice for all VMAs in the range */
> > +#define PMADV_FLAG_MASK  (PMADV_FLAG_RANGE)
> > +
> >  #endif /* __ALPHA_MMAN_H__ */
> > diff --git a/arch/mips/include/uapi/asm/mman.h 
> > b/arch/mips/include/uapi/asm/mman.h
> > index 57dc2ac4f8bd..af94f38a3a9d 100644
> > --- a/arch/mips/include/uapi/asm/mman.h
> > +++ b/arch/mips/include/uapi/asm/mman.h
> > @@ -106,4 +106,8 @@
> >  #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> >PKEY_DISABLE_WRITE)
> >
> > +/* process_madvise flags */
> > +#define PMADV_FLAG_RANGE 0x1 /* advice for all VMAs in the range */
> > +#define PMADV_FLAG_MASK  (PMADV_FLAG_RANGE)
> > +
> >  #endif /* _ASM_MMAN_H */
> > diff --git a/arch/parisc/include/uapi/asm/mman.h 
> > b/arch/parisc/include/uapi/asm/mman.h
> > index ab78cba446ed..ae644c493991 100644
> > --- a/arch/parisc/include/uapi/asm/mman.h
> > +++ b/arch/parisc/include/uapi/asm/mman.h
> > @@ -77,4 +77,8 @@
> >  #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> >PKEY_DISABLE_WRITE)
> >
> > +/* process_madvise flags */
> > +#define PMADV_FLAG_RANGE 0x1 /* advice for all VMAs in the range */
> > +#define PMADV_FLAG_MASK  (PMADV_FLAG_RANGE)
> > +
> >  #endif /* __PARISC_MMAN_H__ */
> > diff --git a/arch/xtensa/include/uapi/asm/mman.h 
> > b/arch/xtensa/include/uapi/asm/mman.h
> > index e5e643752947..934cdd11abff 100644
> > --- a/arch/xtensa/include/uapi/asm/mman.h
> > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > @@ -114,4 +114,8 @@
> >  #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> >PKEY_DISABLE_WRITE)
> >
> > +/* process_madvise flags */
> > +#define 

Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-11-25 Thread Minchan Kim
On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> process_madvise requires a vector of address ranges to be provided for
> its operations. When an advice should be applied to the entire process,
> the caller process has to obtain the list of VMAs of the target process
> by reading the /proc/pid/maps or some other way. The cost of this
> operation grows linearly with increasing number of VMAs in the target
> process. Even constructing the input vector can be non-trivial when
> target process has several thousands of VMAs and the syscall is being
> issued during high memory pressure period when new allocations for such
> a vector would only worsen the situation.
> In the case when advice is being applied to the entire memory space of
> the target process, this creates an extra overhead.
> Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> advise a memory range of the target process. For now, to keep it simple,
> only the entire process memory range is supported, vec and vlen inputs
> in this mode are ignored and can be NULL and 0.
> Instead of returning the number of bytes that advice was successfully
> applied to, the syscall in this mode returns 0 on success. This is due
> to the fact that the number of bytes would not be useful for the caller
> that does not know the amount of memory the call is supposed to affect.
> Besides, the ssize_t return type can be too small to hold the number of
> bytes affected when the operation is applied to a large memory range.

Can we just use one element in iovec to indicate entire address rather
than using up the reserved flags?

struct iovec {
.iov_base = NULL,
.iov_len = (~(size_t)0),
};

Furthermore, it would be applied for other syscalls where have support
iovec if we agree on it.

> 
> Signed-off-by: Suren Baghdasaryan 
> ---
>  arch/alpha/include/uapi/asm/mman.h   |  4 ++
>  arch/mips/include/uapi/asm/mman.h|  4 ++
>  arch/parisc/include/uapi/asm/mman.h  |  4 ++
>  arch/xtensa/include/uapi/asm/mman.h  |  4 ++
>  fs/io_uring.c|  2 +-
>  include/linux/mm.h   |  3 +-
>  include/uapi/asm-generic/mman-common.h   |  4 ++
>  mm/madvise.c | 47 +---
>  tools/include/uapi/asm-generic/mman-common.h |  4 ++
>  9 files changed, 67 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/alpha/include/uapi/asm/mman.h 
> b/arch/alpha/include/uapi/asm/mman.h
> index a18ec7f63888..54588d2f5406 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -79,4 +79,8 @@
>  #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
>PKEY_DISABLE_WRITE)
>  
> +/* process_madvise flags */
> +#define PMADV_FLAG_RANGE 0x1 /* advice for all VMAs in the range */
> +#define PMADV_FLAG_MASK  (PMADV_FLAG_RANGE)
> +
>  #endif /* __ALPHA_MMAN_H__ */
> diff --git a/arch/mips/include/uapi/asm/mman.h 
> b/arch/mips/include/uapi/asm/mman.h
> index 57dc2ac4f8bd..af94f38a3a9d 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -106,4 +106,8 @@
>  #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
>PKEY_DISABLE_WRITE)
>  
> +/* process_madvise flags */
> +#define PMADV_FLAG_RANGE 0x1 /* advice for all VMAs in the range */
> +#define PMADV_FLAG_MASK  (PMADV_FLAG_RANGE)
> +
>  #endif /* _ASM_MMAN_H */
> diff --git a/arch/parisc/include/uapi/asm/mman.h 
> b/arch/parisc/include/uapi/asm/mman.h
> index ab78cba446ed..ae644c493991 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -77,4 +77,8 @@
>  #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
>PKEY_DISABLE_WRITE)
>  
> +/* process_madvise flags */
> +#define PMADV_FLAG_RANGE 0x1 /* advice for all VMAs in the range */
> +#define PMADV_FLAG_MASK  (PMADV_FLAG_RANGE)
> +
>  #endif /* __PARISC_MMAN_H__ */
> diff --git a/arch/xtensa/include/uapi/asm/mman.h 
> b/arch/xtensa/include/uapi/asm/mman.h
> index e5e643752947..934cdd11abff 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -114,4 +114,8 @@
>  #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
>PKEY_DISABLE_WRITE)
>  
> +/* process_madvise flags */
> +#define PMADV_FLAG_RANGE 0x1 /* advice for all VMAs in the range */
> +#define PMADV_FLAG_MASK  (PMADV_FLAG_RANGE)
> +
>  #endif /* _XTENSA_MMAN_H */
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index a8c136a1cf4e..508c48b998ee 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -4118,7 +4118,7 @@ static int io_madvise(struct io_kiocb *req, bool 
> force_nonblock)
>   if (force_nonblock)
>   return -EAGAIN;
>  
> - ret = do_madvise(current->mm, ma->addr,