Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words

2012-11-29 Thread Mike Hommey
On Fri, Nov 02, 2012 at 09:59:07PM +0100, Michael Kerrisk wrote:
> John,
> 
> A question on one point:
> 
> On Wed, Oct 3, 2012 at 12:38 AM, John Stultz  wrote:
> > On 10/02/2012 12:39 AM, NeilBrown wrote:
> [...]
> >>   The SIGBUS interface could have some merit if it really reduces
> >> overhead.  I
> >>   worry about app bugs that could result from the non-deterministic
> >>   behaviour.   A range could get unmapped while it is in use and testing
> >> for
> >>   the case of "get a SIGBUS half way though accessing something" would not
> >>   be straight forward (SIGBUS on first step of access should be easy).
> >>   I guess that is up to the app writer, but I have never liked anything
> >> about
> >>   the signal interface and encouraging further use doesn't feel wise.
> >
> > Initially I didn't like the idea, but have warmed considerably to it. Mainly
> > due to the concern that the constant unmark/access/mark pattern would be too
> > much overhead, and having a lazy method will be much nicer for performance.
> > But yes, at the cost of additional complexity of handling the signal,
> > marking the faulted address range as non-volatile, restoring the data and
> > continuing.
> 
> At a finer level of detail, how do you see this as happening in the
> application. I mean: in the general case, repopulating the purged
> volatile page would have to be done outside the signal handler (I
> think, because async-signal-safety considerations would preclude too
> much complex stuff going on inside the handler). That implies
> longjumping out of the handler, repopulating the pages with data, and
> then restarting whatever work was being done when the SIGBUS was
> generated.

There are different strategies that can be used to repopulate the pages,
within or outside the signal handler, and I'd say it's not that
important of a detail.

That being said, if the kernel could be helpful and avoid people
shooting themselves in the foot, that would be great, too.

I don't know how feasible this would be, but being able to get the
notification on a signalfd in a dedicated thread would certainly improve
things (I guess other use cases of SIGSEGV/SIGBUS handlers could
appreciate something like this). The kernel would pause the faulting
thread while sending the notification on the signalfd, and the notified
thread would be allowed to resume the faulting thread when it's done
doing its job.

Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words

2012-11-03 Thread Michael Kerrisk
[CC += linux-api, since this is an API change.]

Hi John,

A couple of other questions that occurred to me...

What are the expected/planned semantics of volatile ranges for mlocked
pages? I noticed that Minchan's patch series
(https://lwn.net/Articles/522154/) gives an error on attempt to mark
locked pages as volatile (which seems sensible). I didn't see anything
similar in your patches. Perhaps it's not easy to do because of the
non-VMA-based implementation? Something to think about.

On Wed, Oct 3, 2012 at 12:38 AM, John Stultz  wrote:
> On 10/02/2012 12:39 AM, NeilBrown wrote:
>>
>> On Fri, 28 Sep 2012 23:16:30 -0400 John Stultz 
>> wrote:
>>
>>   For example, allowing sub-page volatile region seems to be above and
>> beyond
>>   the call of duty.  You cannot mmap sub-pages, so why should they be
>> volatile?
>
> Although if someone marked a page and a half as volatile, would it be
> reasonable to throw away the second half of that second page? That seems
> unexpected to me. So we're really only marking the whole pages specified as
> volatile, similar to how FALLOC_FL_PUNCH_HOLE behaves.
>
> But if it happens that the adjacent range is also a partial page, we can
> possibly coalesce them into a purgeable whole page. I think it makes sense,
> especially from a userland point of view and wasn't really complicated to
> add.

I must confess that I'm puzzled by this facility to mark sub-page
ranges as well. What's the use case? What I'm thinking is: the
goal of volatile ranges is to help improve system performance by
freeing up a (sizeable) block of pages. Why then would the user care
too much about marking with sub-page granularity, or that such ranges
might be merged? After all, the system calls to do this marking are
expensive, and so for performance reasons, I suppose that a process
would like to keep those system calls to a minimum.

[...]

>>   I think discarding whole ranges at a time is very sensible, and so
>> merging
>>   adjacent ranges is best avoided.  If you require page-aligned ranges
>> this
>>   becomes trivial - is that right?
>
> True. If we avoid coalescing non-whole page ranges, keeping non-overlapping
> ranges independent is fairly easy.

Regarding coalescing of adjacent ranges. Here's one possible argument
against it (Jake Edge alerted me to this). If an application marked
adjacent ranges using separate system calls, that might be an
indication that the application intends to have different access
patterns against the two ranges: one frequent, the other rare. In that
case, I suppose it would be better if the ranges were not merged.

Cheers,

Michael

-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/


Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words

2012-11-02 Thread Michael Kerrisk
John,

A question on one point:

On Wed, Oct 3, 2012 at 12:38 AM, John Stultz  wrote:
> On 10/02/2012 12:39 AM, NeilBrown wrote:
[...]
>>   The SIGBUS interface could have some merit if it really reduces
>> overhead.  I
>>   worry about app bugs that could result from the non-deterministic
>>   behaviour.   A range could get unmapped while it is in use and testing
>> for
>>   the case of "get a SIGBUS half way though accessing something" would not
>>   be straight forward (SIGBUS on first step of access should be easy).
>>   I guess that is up to the app writer, but I have never liked anything
>> about
>>   the signal interface and encouraging further use doesn't feel wise.
>
> Initially I didn't like the idea, but have warmed considerably to it. Mainly
> due to the concern that the constant unmark/access/mark pattern would be too
> much overhead, and having a lazy method will be much nicer for performance.
> But yes, at the cost of additional complexity of handling the signal,
> marking the faulted address range as non-volatile, restoring the data and
> continuing.

At a finer level of detail, how do you see this as happening in the
application. I mean: in the general case, repopulating the purged
volatile page would have to be done outside the signal handler (I
think, because async-signal-safety considerations would preclude too
much complex stuff going on inside the handler). That implies
longjumping out of the handler, repopulating the pages with data, and
then restarting whatever work was being done when the SIGBUS was
generated.

Cheers,

Michael


Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words

2012-10-09 Thread Minchan Kim
On Tue, Oct 09, 2012 at 02:30:03PM -0700, John Stultz wrote:
> On 10/09/2012 01:07 AM, Mike Hommey wrote:
> >Note it doesn't have to be a vs. situation. madvise could be an
> >additional way to interface with volatile ranges on a given fd.
> >
> >That is, madvise doesn't have to mean anonymous memory. As a matter of
> >fact, MADV_WILLNEED/MADV_DONTNEED are usually used on mmaped files.
> >Similarly, there could be a way to use madvise to mark volatile ranges,
> >without the application having to track what memory ranges are
> >associated to what part of what file, which the kernel already tracks.
> 
> Good point. We could add madvise() interface, but limit it only to
> mmapped tmpfs files, in parallel with the fallocate() interface.
> 
> However, I would like to think through how MADV_MARK_VOLATILE with
> purely anonymous memory could work, before starting that approach.
> That and Neil's point that having an identical kernel interface
> restricted to tmpfs, only as a convenience to userland in switching
> from virtual address to/from mmapped file offset may be better left
> to a userland library.

How about this?

The scenario I imagine for the madvise semantics is as follows.

1) Anonymous pages
Assume there is an allocator library that manages an mmapped reserved pool.
If the pool has lots of free memory that isn't used by anyone, the allocator
can unmap part of it, but unmapping isn't cheap because the kernel has to
zap all the PTEs for pages in the range. Yet if we avoid the unmap, the VM
will needlessly swap out that range, which holds nothing but garbage, when
memory pressure happens. If the allocator marks the range volatile instead,
we can avoid the unnecessary swap-out and even reclaim the pages with no
swap at all. The only thing the allocator has to do is unmark the range
before handing it out to the user again.

2) File pages (NOT tmpfs)
We can reclaim volatile file pages easily, without recycling them through
the LRU, even if they were accessed recently. The difference from DONTNEED
is that DONTNEED always moves pages to the tail of the inactive LRU so they
are reclaimed early, while the VOLATILE semantic leaves them where they are
and, when aging brings them to the tail of the LRU, reclaims them without
considering recent use. That's because they may be unmarked for use sooner
or later, and we can't predict the cost of recreating the object.

So the reclaim preference is: NORMAL < VOLATILE < DONTNEED


> 
> thanks
> -john
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org

-- 
Kind regards,
Minchan Kim


Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words

2012-10-09 Thread John Stultz

On 10/09/2012 01:07 AM, Mike Hommey wrote:

Note it doesn't have to be a vs. situation. madvise could be an
additional way to interface with volatile ranges on a given fd.

That is, madvise doesn't have to mean anonymous memory. As a matter of
fact, MADV_WILLNEED/MADV_DONTNEED are usually used on mmaped files.
Similarly, there could be a way to use madvise to mark volatile ranges,
without the application having to track what memory ranges are
associated to what part of what file, which the kernel already tracks.


Good point. We could add madvise() interface, but limit it only to 
mmapped tmpfs files, in parallel with the fallocate() interface.


However, I would like to think through how MADV_MARK_VOLATILE with 
purely anonymous memory could work, before starting that approach. That 
and Neil's point that having an identical kernel interface restricted to 
tmpfs, only as a convenience to userland in switching from virtual 
address to/from mmapped file offset may be better left to a userland 
library.


thanks
-john



Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words

2012-10-09 Thread Mike Hommey
On Fri, Sep 28, 2012 at 11:16:30PM -0400, John Stultz wrote:
> fd based interfaces vs madvise:
>   In talking with Taras Glek, he pointed out that for his
>   needs, the fd based interface is a little annoying, as it
>   requires getting access to a tmpfs file and mmapping it in,
>   then instead of just referencing a pointer to the data he
>   wants to mark volatile, he has to calculate the offset from
>   start of the mmap and pass those file offsets to the interface.
>   Instead he mentioned that using something like madvise would be
>   much nicer, since they could just pass a pointer to the object
>   in memory they want to make volatile and avoid the extra work.
> 
>   I'm not opposed to adding an madvise interface for this as
>   well, but since we have an existing use case with Android's
>   ashmem, I want to make sure we support this existing behavior.
>   Specifically, as with ashmem, applications can be sharing
>   these tmpfs fds, and so file-relative volatile ranges make
>   more sense if you need to coordinate what data is volatile
>   between two applications.
> 
>   Also, while I agree that having an madvise interface for
>   volatile ranges would be nice, it does open up some more
>   complex implementation issues, since with files, there is a
>   fixed relationship between pages and the files' address_space
>   mapping, where you can't have pages shared between different
>   mappings. This makes it easy to hang the volatile-range tree
>   off of the mapping (well, indirectly via a hash table). With
>   general anonymous memory, pages can be shared between multiple
>   processes, and as far as I understand, don't have any grouping
>   structure we could use to determine if the page is in a
>   volatile range or not. We would also need to determine more
>   complex questions like: What are the semantics of volatility
>   with copy-on-write pages?  I'm hoping to investigate this
>   idea more deeply soon so I can be sure whatever is pushed has
>   a clear plan of how to address this idea. Further thoughts
>   here would be appreciated.

Note it doesn't have to be a vs. situation. madvise could be an
additional way to interface with volatile ranges on a given fd.

That is, madvise doesn't have to mean anonymous memory. As a matter of
fact, MADV_WILLNEED/MADV_DONTNEED are usually used on mmaped files.
Similarly, there could be a way to use madvise to mark volatile ranges,
without the application having to track what memory ranges are
associated to what part of what file, which the kernel already tracks.

Mike


Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words

2012-10-08 Thread Minchan Kim
On Mon, Oct 08, 2012 at 06:25:07PM -0700, John Stultz wrote:
> On 10/07/2012 11:25 PM, Minchan Kim wrote:
> >Hi John,
> >
> >On Fri, Sep 28, 2012 at 11:16:30PM -0400, John Stultz wrote:
> >>After Kernel Summit and Plumbers, I wanted to consider all the various
> >>side-discussions and try to summarize my current thoughts here along
> >>with sending out my current implementation for review.
> >>
> >>Also: I'm going on four weeks of paternity leave in the very near
> >>(but non-deterministic) future. So while I hope I still have time
> >>for some discussion, I may have to deal with fussier complaints
> >>than yours. :)  In any case, you'll have more time to chew on
> >>the idea and come up with amazing suggestions. :)
> >>
> >>
> >>General Interface semantics:
> >>--
> >>
> >>The high level interface I've been pushing has so far stayed fairly
> >>consistent:
> >>
> >>Application marks a range of data as volatile. Volatile data may
> >>be purged at any time. Accessing volatile data is undefined, so
> >>applications should not do so. If the application wants to access
> >>data in a volatile range, it should mark it as non-volatile. If any
> >>of the pages in the range being marked non-volatile had been purged,
> >>the kernel will return an error, notifying the application that the
> >>data was lost.
> >>
> >>But one interesting new tweak on this design, suggested by the Taras
> >>Glek and others at Mozilla, is as follows:
> >>
> >>Instead of leaving volatile data access as being undefined, when
> >>accessing volatile data, either the data expected will be returned
> >>if it has not been purged, or the application will get a SIGBUS when
> >>it accesses volatile data that has been purged.
> >>
> >>Everything else remains the same (error on marking non-volatile
> >>if data was purged, etc). This model allows applications to avoid
> >>having to unmark volatile data when it wants to access it, then
> >>immediately re-mark it as volatile when it's done. It is in effect
> >Just out of curiosity.
> >Why should the application re-mark it as volatile again?
> >It was already a volatile range, and the application doesn't receive
> >any signal while it uses that range. So I think it doesn't need to
> >re-mark.
> 
> Not totally sure I understand your question clearly.
> 
> So assuming one has a large cache of independently accessed objects,
> this mark-nonvolatile/access/mark-volatile pattern is useful if you
> don't want to have to deal with handling the SIGBUS.
> 
> For instance, if when accessing the data (say uncompressed image
> data), you are passing it to a library (to do something like an
> image filter, in place), where you don't want the library's access
> of the data to cause an unexpected SIGBUS that would be difficult to
> recover from.

I was just confused by your wording.
AFAIUC, you mean the following:

1) mark volatile
2) access pages in the range until SIGBUS happens
3) when SIGBUS happens, unmark volatile
4) access pages in the range
5) when done, re-mark it as volatile

I agree with this model.

> 
> 
> >>"lazy" with its marking, allowing the kernel to hit it with a signal
> >>when it gets unlucky and touches purged data. From the signal handler,
> >>the application can note the address it faulted on, unmark the range,
> >>and regenerate the needed data before returning to execution.
> >I like this model if plumbers really want it.
> 
> I think it makes sense.  Also it avoids a hole in the earlier
> semantics: If accessing volatile data is undefined, and you might
> get the data, or you might get zeros, there's the problem of writes
> that occur on purged ranges (Credit to Mike Hommey for pointing
> this out in his critique of Android's ashmem).  If an application
> writes data to a purged range, the range continues to be considered
> volatile, and since neighboring data may still be purged, the entire
> set is considered purged.  Because of this, we don't end up purging
> the data again (at least with the shrinker method).
> 
> By adding the SIGBUS on access of purged pages, it cleans up the
> semantics nicely.
> 
> 
> 
> >
> >>However, if applications don't want to deal with handling the
> >>SIGBUS, they can use the more straightforward (but more costly)
> >>unmark/access/mark pattern in the same way as my earlier proposals.
> >>
> >>This allows folks to balance the cost vs complexity in their
> >>application appropriately.
> >>
> >>So that's a general overview of how the idea I'm proposing could
> >>be used.
> >My idea is that we don't need to move all pages in the range
> >to the tail of the LRU (or a new LRU list). Just move one page of
> >the range to the tail. Then, when the reclaimer starts looking for
> >a victim page, it can tell the page is volatile somehow
> >(e.g., with a new LRU list we know it easily; otherwise we could
> >use a new VMA flag - VM_VOLATILE - plus a tweak to
> >page_check_references) and isolate all pages of the range

Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words

2012-10-08 Thread John Stultz

On 10/07/2012 11:25 PM, Minchan Kim wrote:

Hi John,

On Fri, Sep 28, 2012 at 11:16:30PM -0400, John Stultz wrote:

After Kernel Summit and Plumbers, I wanted to consider all the various
side-discussions and try to summarize my current thoughts here along
with sending out my current implementation for review.

Also: I'm going on four weeks of paternity leave in the very near
(but non-deterministic) future. So while I hope I still have time
for some discussion, I may have to deal with fussier complaints
than yours. :)  In any case, you'll have more time to chew on
the idea and come up with amazing suggestions. :)


General Interface semantics:
--

The high level interface I've been pushing has so far stayed fairly
consistent:

Application marks a range of data as volatile. Volatile data may
be purged at any time. Accessing volatile data is undefined, so
applications should not do so. If the application wants to access
data in a volatile range, it should mark it as non-volatile. If any
of the pages in the range being marked non-volatile had been purged,
the kernel will return an error, notifying the application that the
data was lost.

But one interesting new tweak on this design, suggested by Taras
Glek and others at Mozilla, is as follows:

Instead of leaving volatile data access as being undefined, when
accessing volatile data, either the data expected will be returned
if it has not been purged, or the application will get a SIGBUS when
it accesses volatile data that has been purged.

Everything else remains the same (error on marking non-volatile
if data was purged, etc). This model allows applications to avoid
having to unmark volatile data when they want to access it, then
immediately re-mark it as volatile when they're done. It is in effect

Just out of curiosity:
why should the application re-mark it as volatile again?
The range was already volatile and the application didn't receive
any signal while it was using it, so I think it doesn't need to
re-mark it.


Not totally sure I understand your question clearly.

So assuming one has a large cache of independently accessed objects, 
this mark-nonvolatile/access/mark-volatile pattern is useful if you 
don't want to have to deal with handling the SIGBUS.


For instance, if when accessing the data (say uncompressed image data), 
you are passing it to a library (to do something like an image filter, 
in place), where you don't want the library's access of the data to 
cause an unexpected SIGBUS that would be difficult to recover from.




"lazy" with its marking, allowing the kernel to hit it with a signal
when it gets unlucky and touches purged data. From the signal handler,
the application can note the address it faulted on, unmark the range,
and regenerate the needed data before returning to execution.

I like this model if plumbers really want it.


I think it makes sense.  Also it avoids a hole in the earlier semantics: 
If accessing volatile data is undefined, and you might get the data, or 
you might get zeros, there's the problem of writes that occur on purged 
ranges (Credit to Mike Hommey for pointing this out in his critique of 
Android's ashmem).  If an application writes data to a purged range, the 
range continues to be considered volatile, and since neighboring data 
may still be purged, the entire set is considered purged.  Because of 
this, we don't end up purging the data again (at least with the shrinker 
method).


By adding the SIGBUS on access of purged pages, it cleans up the 
semantics nicely.







However, if applications don't want to deal with handling the
SIGBUS, they can use the more straightforward (but more costly)
unmark/access/mark pattern in the same way as my earlier proposals.

This allows folks to balance the cost vs complexity in their
application appropriately.

So that's a general overview of how the idea I'm proposing could
be used.

My idea is that we don't need to move all pages in the range
to the tail of the LRU (or a new LRU list). Just move one page of
the range to the tail. Then, when the reclaimer starts looking for
a victim page, it can tell the page is volatile somehow
(e.g., with a new LRU list we know it easily; otherwise we could
use a new VMA flag - VM_VOLATILE - plus a tweak to
page_check_references), isolate all pages of the range from the
middle of the LRU list, and reclaim them all at once.
So the cost of marking is just (the search cost of finding an
in-memory page of the range + the cost of moving one page within
the LRU). It means we move the cost from mark/unmark time to
the reclaim point.


So this general idea of moving a single page to represent the entire 
range has been mentioned before (I think Neil also suggested something 
similar).


But what happens if we are creating a large volatile range, most of 
which hasn't been accessed in a long time, while the chosen flag page 
has been accessed recently?  In this case, we might 

Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words

2012-10-08 Thread Minchan Kim
Hi John,

On Fri, Sep 28, 2012 at 11:16:30PM -0400, John Stultz wrote:
> 
> After Kernel Summit and Plumbers, I wanted to consider all the various
> side-discussions and try to summarize my current thoughts here along
> with sending out my current implementation for review.
> 
> Also: I'm going on four weeks of paternity leave in the very near
> (but non-deterministic) future. So while I hope I still have time
> for some discussion, I may have to deal with fussier complaints
> than yours. :)  In any case, you'll have more time to chew on
> the idea and come up with amazing suggestions. :)
> 
> 
> General Interface semantics:
> --
> 
> The high level interface I've been pushing has so far stayed fairly
> consistent:
> 
> Application marks a range of data as volatile. Volatile data may
> be purged at any time. Accessing volatile data is undefined, so
> applications should not do so. If the application wants to access
> data in a volatile range, it should mark it as non-volatile. If any
> of the pages in the range being marked non-volatile had been purged,
> the kernel will return an error, notifying the application that the
> data was lost.
> 
> But one interesting new tweak on this design, suggested by Taras
> Glek and others at Mozilla, is as follows:
> 
> Instead of leaving volatile data access as being undefined, when
> accessing volatile data, either the data expected will be returned
> if it has not been purged, or the application will get a SIGBUS when
> it accesses volatile data that has been purged.
> 
> Everything else remains the same (error on marking non-volatile
> if data was purged, etc). This model allows applications to avoid
> having to unmark volatile data when they want to access it, then
> immediately re-mark it as volatile when they're done. It is in effect

Just out of curiosity:
why should the application re-mark it as volatile again?
The range was already volatile and the application didn't receive
any signal while it was using it, so I think it doesn't need to
re-mark it.


> "lazy" with its marking, allowing the kernel to hit it with a signal
> when it gets unlucky and touches purged data. From the signal handler,
> the application can note the address it faulted on, unmark the range,
> and regenerate the needed data before returning to execution.

I like this model if plumbers really want it.

> 
> Since this approach avoids the more explicit unmark/access/mark
> pattern, it avoids the extra overhead required to ensure data is
> non-volatile before being accessed.

I have an idea to reduce the overhead.
See below.

> 
> However, if applications don't want to deal with handling the
> SIGBUS, they can use the more straightforward (but more costly)
> unmark/access/mark pattern in the same way as my earlier proposals.
> 
> This allows folks to balance the cost vs complexity in their
> application appropriately.
> 
> So that's a general overview of how the idea I'm proposing could
> be used.

My idea is that we don't need to move all pages in the range
to the tail of the LRU (or a new LRU list). Just move one page of
the range to the tail. Then, when the reclaimer starts looking for
a victim page, it can tell the page is volatile somehow
(e.g., with a new LRU list we know it easily; otherwise we could
use a new VMA flag - VM_VOLATILE - plus a tweak to
page_check_references), isolate all pages of the range from the
middle of the LRU list, and reclaim them all at once.
So the cost of marking is just (the search cost of finding an
in-memory page of the range + the cost of moving one page within
the LRU). It means we move the cost from mark/unmark time to
the reclaim point.

> 
> 
> 
> Specific Interface semantics:
> -
> 
> Here are some of the open questions about how the user interface
> should look:
> 
> fadvise vs fallocate:
> 
>   So while originally I used fadvise, currently my
>   implementation uses fallocate(fd, FALLOC_FL_MARK_VOLATILE,
>   start, len) to mark a range as volatile and fallocate(fd,
>   FALLOC_FL_UNMARK_VOLATILE, start, len) to unmark ranges.
> 
>   During kernel summit, the question was brought up whether
>   fallocate was really the right interface to be using, or whether
>   fadvise would be better. To me fadvise makes a little more sense,
>   but earlier it was pointed out that marking data ranges as
>   volatile could also be seen as a type of cancellable and lazy
>   hole-punching, so from that perspective fallocate might make
>   more sense.  This is still an open question and I'd appreciate
>   further input here.
> 
> tmpfs vs non-shmem filesystems:
>   Android's ashmem primarily provides a way to get unlinked
>   tmpfs fds that can be shared between applications. It's
>   just an additional feature that those pages can be "unpinned",
>   or marked volatile in my terminology. Thus in implementing
>   volatile ranges, I've focused on getting it to work on tmpfs
>   file descriptors.  However,


Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words

2012-10-02 Thread John Stultz

On 10/02/2012 12:39 AM, NeilBrown wrote:

On Fri, 28 Sep 2012 23:16:30 -0400 John Stultz  wrote:


After Kernel Summit and Plumbers, I wanted to consider all the various
side-discussions and try to summarize my current thoughts here along
with sending out my current implementation for review.

Also: I'm going on four weeks of paternity leave in the very near
(but non-deterministic) future. So while I hope I still have time
for some discussion, I may have to deal with fussier complaints
then yours. :)  In any case, you'll have more time to chew on
the idea and come up with amazing suggestions. :)

  I wonder if you are trying to please everyone and risking pleasing no-one?
  Well, maybe not quite that extreme, but you can't please all the people all
  the time.
So while I do agree that I won't be able to please everyone, especially 
when it comes to how this interface is implemented internally, I do want 
to make sure that the userland interface really does make sense and 
isn't limited by my own short-sightedness.  :)



  For example, allowing sub-page volatile region seems to be above and beyond
  the call of duty.  You cannot mmap sub-pages, so why should they be volatile?
Although if someone marked a page and a half as volatile, would it be 
reasonable to throw away the second half of that second page? That seems 
unexpected to me. So we're really only marking the whole pages specified 
as volatile, similar to how FALLOC_FL_PUNCH_HOLE behaves.


But if it happens that the adjacent range is also a partial page, we can 
possibly coalesce them into a purgeable whole page. I think it makes 
sense, especially from a userland point of view, and wasn't really 
complicated to add.



  Similarly the suggestion of using madvise - while tempting - is probably a
  minority interest and can probably be managed with library code.  I'm glad
  you haven't pursued it.
For now I see this as a lower priority, but it's something I'd like to 
investigate. Depending on tmpfs has issues, since there's no quota 
support, so having a user-writable tmpfs partition mounted is a DoS 
opening, especially on low-memory systems.



  I think discarding whole ranges at a time is very sensible, and so merging
  adjacent ranges is best avoided.  If you require page-aligned ranges this
  becomes trivial - is that right?
True. If we avoid coalescing non-whole page ranges, keeping 
non-overlapping ranges independent is fairly easy.


But it is also easy to avoid coalescing in all cases except when 
multiple sub-page ranges can be coalesced together.


In other words, we mark whole page portions of the range as volatile, 
and keep the sub-page portions separate. So non-page aligned ranges 
would possibly consist of three independent ranges, with the middle one 
as the only one marked volatile. Should those non-whole-page ranges be 
adjacent to other non-whole-page ranges, they could be coalesced. Since 
the coalesced edge ranges would be marked volatile after the full range, 
we would also avoid purging the edge pages, which would invalidate two 
unpurged ranges.


Alternatively, we can never coalesce and only mark whole pages in single 
ranges as volatile. It doesn't really make it more complex.


But again, these are implementation details.

The main point is I think at the user-interface level, allowing userland 
to provide non-page aligned ranges is valid. What we do with those 
non-page aligned chunks is up to the kernel/implementation, but I think 
we should be conservative and be sure never to purge non-volatile data.



  I wonder if the oldest page/oldest range issue can be defined way by
  requiring apps the touch the first page in a range when they touch the range.
  Then the age of a range is the age of the first page.  Non-initial pages
  could even be kept off the free list  though that might confuse NUMA
  page reclaim if a range had pages from different nodes.
Not sure I followed this. Are you suggesting keeping non-initial ranges 
off the vmscan LRU lists entirely?


Another approach that was suggested that sounds similar is touching all 
the pages when we mark them as volatile, so they are all close to each 
other in the active/inactive list. Then when the vmscan 
shrink_lru_list() code runs it would purge the pages together (although 
it might only purge half a range if there wasn't the need for more 
memory).   But again, these page-based solutions have much higher 
algorithmic complexity (O(n) - with respect to pages marked) and overhead.




  Application to non-tmpfs files seems very unclear and so probably best
  avoided.
  If I understand you correctly, then you have suggested both that a volatile
  range would be a "lazy hole punch" and a "don't let this get written to disk
  yet" flag.  It cannot really be both.  The former sounds like fallocate,
  the latter like fadvise.
I don't think I see the exclusivity aspect. If we say "Dear kernel, you 
may punch a hole at this offset in this file whenever you want 

Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words

2012-10-02 Thread Taras Glek

On 9/28/2012 8:16 PM, John Stultz wrote:


There are two rough approaches that I have tried so far:

1) Managing volatile range objects, in a tree or list, which are then
purged using a shrinker

2) Page based management, where pages marked volatile are moved to
a new LRU list and are purged from there.



1) This patchset implements the shrinker-based approach. In many ways it
is simpler, but it does have a few drawbacks.  Basically, when marking a
range as volatile, we create a range object and add it to an rbtree.
This allows us to quickly find ranges, given an address in
the file.  We also add each range object to the tail of a filesystem-
global linked list, which acts as an LRU, allowing us to quickly find
the least recently created volatile range. We then use a shrinker
callback to trigger purging, where we'll select the range at the head
of the LRU list, purge the data, mark the range object as purged,
and remove it from the LRU list.

This allows fairly efficient behavior, as marking and unmarking
a range are both O(log n) operations with respect to the number of
ranges, to insert and remove from the tree.  Selecting a range to
purge is O(1), and we purge entire ranges in
least-recently-marked-volatile order.

The drawback with this approach is that it uses a shrinker, and is thus
NUMA-unaware. We track the virtual addresses of the pages in the file,
so we have no sense of which physical pages we're using, nor of
which node those pages may be on. So it's possible on a multi-node
system that when one node is under pressure, we'd purge volatile
ranges that are all on a different node, in effect throwing data away
without helping anything. This is clearly non-ideal for NUMA systems.

One idea I discussed with Michel Lespinasse is that we might improve
this by providing the shrinker some node context, then keeping track
in each range of which node its first page is on. That
way we would be sure to free at least one page on the node under
pressure when purging that range.


2) The second, page-based approach was also tried. In
this case, when we marked a range as volatile, the pages in that range
were moved to a new LRU list, LRU_VOLATILE, in vmscan.c.  This provided
a page LRU list that could be used to free pages before looking at
the LRU_INACTIVE_FILE/ANONYMOUS lists.

This integrates the feature deeper in the mm code, which is nice,
especially as we have an LRU_VOLATILE list for each numa node. Thus
under pressure we won't purge ranges that are entirely on a different
node, as is possible with the other approach.

However, this approach is more costly.  When marking a range
as volatile, we have to migrate every page in that range to the
LRU_VOLATILE list, and similarly on unmarking we have to move each
page back. This ends up being O(n) with respect to the number of
pages in the range being marked or unmarked. Similarly, when purging,
we let the scanning code select a page off the LRU, then we have to
map it back to its volatile range so we can purge the entire range,
making purging a more expensive O(log n) operation with respect to
the number of ranges.

This is a particular concern as applications that want to mark and
unmark data as volatile with fine granularity will likely be calling
these operations frequently, adding quite a bit of overhead. This
makes it less likely that applications will choose to volunteer data
as volatile to the system.

With the new lazy SIGBUS notification, applications using
the SIGBUS method would avoid having to mark and unmark data when
accessing it, so this overhead may be less of a concern. However, for
cases where applications don't want to deal with SIGBUS and would
rather have the more deterministic behavior of the unmark/access/mark
pattern, the performance is a concern.
Unfortunately, approach 1 is not useful for our use-case. It would mean 
that we are continuously re-decompressing frequently used parts of 
libxul.so under memory pressure (which happens pretty often on 
limited-RAM devices).



Taras

ps. John, I really appreciate movement on this. We really need this to 
improve Firefox memory usage and startup speed on low-memory devices. It 
will be great to have Firefox start faster and respond to memory pressure 
better on desktop Linux too.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words

2012-10-02 Thread NeilBrown
On Fri, 28 Sep 2012 23:16:30 -0400 John Stultz  wrote:

> 
> After Kernel Summit and Plumbers, I wanted to consider all the various
> side-discussions and try to summarize my current thoughts here along
> with sending out my current implementation for review.
> 
> Also: I'm going on four weeks of paternity leave in the very near
> (but non-deterministic) future. So while I hope I still have time
> for some discussion, I may have to deal with fussier complaints
> then yours. :)  In any case, you'll have more time to chew on
> the idea and come up with amazing suggestions. :)

Hi John,

 I wonder if you are trying to please everyone and risking pleasing no-one?
 Well, maybe not quite that extreme, but you can't please all the people all
 the time.

 For example, allowing sub-page volatile regions seems to be above and beyond
 the call of duty.  You cannot mmap sub-pages, so why should they be volatile?

 Similarly the suggestion of using madvise - while tempting - is probably a
 minority interest and can probably be managed with library code.  I'm glad
 you haven't pursued it.

 I think discarding whole ranges at a time is very sensible, and so merging
 adjacent ranges is best avoided.  If you require page-aligned ranges this
 becomes trivial - is that right?

 I wonder if the oldest page/oldest range issue can be defined away by
 requiring apps to touch the first page in a range when they touch the range.
 Then the age of a range is the age of the first page.  Non-initial pages
 could even be kept off the free list, though that might confuse NUMA
 page reclaim if a range had pages from different nodes.


 Application to non-tmpfs files seems very unclear and so probably best
 avoided.
 If I understand you correctly, then you have suggested both that a volatile
 range would be a "lazy hole punch" and a "don't let this get written to disk
 yet" flag.  It cannot really be both.  The former sounds like fallocate,
 the latter like fadvise.
 I think the latter sounds more like the general purpose of volatile ranges,
 but I also suspect that some journalling filesystems might be uncomfortable
 providing a guarantee like that.  So I would suggest firmly stating that it
 is a tmpfs-only feature.  If someone wants something vaguely similar for
 other filesystems, let them implement it separately.


 The SIGBUS interface could have some merit if it really reduces overhead.  I
 worry about app bugs that could result from the non-deterministic
 behaviour.   A range could get unmapped while it is in use, and testing for
 the case of "get a SIGBUS halfway through accessing something" would not
 be straightforward (SIGBUS on the first step of access should be easy).
 I guess that is up to the app writer, but I have never liked anything about
 the signal interface and encouraging further use doesn't feel wise.

 That's my 2c worth for now.  Keep up the good work,

NeilBrown








