Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
On Fri, Nov 02, 2012 at 09:59:07PM +0100, Michael Kerrisk wrote:
> John,
>
> A question on one point:
>
> On Wed, Oct 3, 2012 at 12:38 AM, John Stultz wrote:
> > On 10/02/2012 12:39 AM, NeilBrown wrote:
> [...]
> >> The SIGBUS interface could have some merit if it really reduces
> >> overhead. I worry about app bugs that could result from the
> >> non-deterministic behaviour. A range could get unmapped while it is
> >> in use, and testing for the case of "get a SIGBUS halfway through
> >> accessing something" would not be straightforward (SIGBUS on the
> >> first step of access should be easy). I guess that is up to the app
> >> writer, but I have never liked anything about the signal interface,
> >> and encouraging further use doesn't feel wise.
> >
> > Initially I didn't like the idea, but have warmed considerably to it.
> > Mainly due to the concern that the constant unmark/access/mark
> > pattern would be too much overhead; having a lazy method will be much
> > nicer for performance. But yes, at the cost of the additional
> > complexity of handling the signal, marking the faulted address range
> > as non-volatile, restoring the data, and continuing.
>
> At a finer level of detail, how do you see this happening in the
> application? I mean: in the general case, repopulating the purged
> volatile page would have to be done outside the signal handler (I
> think, because async-signal-safety considerations would preclude too
> much complex stuff going on inside the handler). That implies
> longjmp'ing out of the handler, repopulating the pages with data, and
> then restarting whatever work was being done when the SIGBUS was
> generated.

There are different strategies that can be used to repopulate the pages,
within or outside the signal handler, and I'd say it's not that
important a detail. That being said, if the kernel could be helpful and
avoid people shooting themselves in the foot, that would be great, too.
I don't know how possible this would be, but being able to get the
notification on a signalfd in a dedicated thread would certainly improve
things (I guess other use cases of SIGSEGV/SIGBUS handlers could
appreciate something like this). The kernel would pause the faulting
thread while sending the notification on the signalfd, and the notified
thread would be allowed to resume the faulting thread when it's done
doing its job.

Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
[CC += linux-api, since this is an API change.]

Hi John,

A couple of other questions that occurred to me...

What are the expected/planned semantics of volatile ranges for mlocked
pages? I noticed that Minchan's patch series
(https://lwn.net/Articles/522154/) gives an error on an attempt to mark
locked pages as volatile (which seems sensible). I didn't see anything
similar in your patches. Perhaps it's not easy to do because of the
non-VMA-based implementation? Something to think about.

On Wed, Oct 3, 2012 at 12:38 AM, John Stultz wrote:
> On 10/02/2012 12:39 AM, NeilBrown wrote:
>>
>> On Fri, 28 Sep 2012 23:16:30 -0400 John Stultz wrote:
>>
>> For example, allowing sub-page volatile regions seems to be above and
>> beyond the call of duty. You cannot mmap sub-pages, so why should they
>> be volatile?
>
> Although if someone marked a page and a half as volatile, would it be
> reasonable to throw away the second half of that second page? That
> seems unexpected to me. So we're really only marking the whole pages
> specified as volatile, similar to how FALLOC_FL_PUNCH_HOLE behaves.
>
> But if it happens that the adjacent range is also a partial page, we
> can possibly coalesce them into a purgable whole page. I think it makes
> sense, especially from a userland point of view, and wasn't really
> complicated to add.

I must confess that I'm puzzled by this facility to mark sub-page ranges
as well. What's the use case? What I'm thinking is: the goal of volatile
ranges is to help improve system performance by freeing up a (sizeable)
block of pages. Why then would the user care too much about marking with
sub-page granularity, or that such ranges might be merged? After all,
the system calls to do this marking are expensive, and so for
performance reasons, I suppose that a process would like to keep those
system calls to a minimum.

[...]

>> I think discarding whole ranges at a time is very sensible, and so
>> merging adjacent ranges is best avoided.
>> If you require page-aligned ranges this becomes trivial - is that
>> right?
>
> True. If we avoid coalescing non-whole-page ranges, keeping
> non-overlapping ranges independent is fairly easy.

Regarding coalescing of adjacent ranges, here's one possible argument
against it (Jake Edge alerted me to this). If an application marked
adjacent ranges using separate system calls, that might be an indication
that the application intends to have different access patterns against
the two ranges: one frequent, the other rare. In that case, I suppose it
would be better if the ranges were not merged.

Cheers,

Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
John,

A question on one point:

On Wed, Oct 3, 2012 at 12:38 AM, John Stultz wrote:
> On 10/02/2012 12:39 AM, NeilBrown wrote:
[...]
>> The SIGBUS interface could have some merit if it really reduces
>> overhead. I worry about app bugs that could result from the
>> non-deterministic behaviour. A range could get unmapped while it is in
>> use, and testing for the case of "get a SIGBUS halfway through
>> accessing something" would not be straightforward (SIGBUS on the first
>> step of access should be easy). I guess that is up to the app writer,
>> but I have never liked anything about the signal interface, and
>> encouraging further use doesn't feel wise.
>
> Initially I didn't like the idea, but have warmed considerably to it.
> Mainly due to the concern that the constant unmark/access/mark pattern
> would be too much overhead; having a lazy method will be much nicer for
> performance. But yes, at the cost of the additional complexity of
> handling the signal, marking the faulted address range as non-volatile,
> restoring the data, and continuing.

At a finer level of detail, how do you see this happening in the
application? I mean: in the general case, repopulating the purged
volatile page would have to be done outside the signal handler (I think,
because async-signal-safety considerations would preclude too much
complex stuff going on inside the handler). That implies longjmp'ing out
of the handler, repopulating the pages with data, and then restarting
whatever work was being done when the SIGBUS was generated.

Cheers,

Michael
Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
On Tue, Oct 09, 2012 at 02:30:03PM -0700, John Stultz wrote:
> On 10/09/2012 01:07 AM, Mike Hommey wrote:
> > Note it doesn't have to be a vs. situation. madvise could be an
> > additional way to interface with volatile ranges on a given fd.
> >
> > That is, madvise doesn't have to mean anonymous memory. As a matter
> > of fact, MADV_WILLNEED/MADV_DONTNEED are usually used on mmapped
> > files. Similarly, there could be a way to use madvise to mark
> > volatile ranges, without the application having to track what memory
> > ranges are associated with what part of what file, which the kernel
> > already tracks.
>
> Good point. We could add an madvise() interface, but limit it only to
> mmapped tmpfs files, in parallel with the fallocate() interface.
>
> However, I would like to think through how MADV_MARK_VOLATILE with
> purely anonymous memory could work before starting that approach.
> That, and Neil's point that having an identical kernel interface
> restricted to tmpfs, only as a convenience to userland in switching
> from virtual address to/from mmapped file offset, may be better left
> to a userland library.

How about this? The scenario I imagine for the madvise semantics is as
follows.

1) Anonymous pages

Assume there is some allocator library which manages an mmapped reserved
pool. If it has lots of free memory which isn't used by anyone, it can
unmap part of the reserved pool, but unmap isn't cheap because the
kernel has to zap all ptes of the pages in the range. But if we avoid
the unmap, the VM would swap out that range - which holds nothing but
garbage - when memory pressure happens. If we mark that range volatile,
we can avoid the unnecessary swap-out and even reclaim the pages with no
swap. The only thing the allocator has to do is unmark the range before
handing it out to the user.

2) File pages (NOT tmpfs)

We can reclaim volatile file pages easily, without recycling the LRU,
even though they were accessed recently. The difference from DONTNEED is
that DONTNEED always moves pages to the tail of the inactive LRU to
reclaim them early, but the VOLATILE semantic leaves them as they are,
without moving them to the tail, and reclaims them without considering
recently-used state when they reach the tail of the LRU by aging -
because they may be unmarked sooner or later for use, and we can't
estimate the cost of recreating the object.

So reclaim preference: NORMAL < VOLATILE < DONTNEED

> thanks
> -john

--
Kind regards,
Minchan Kim
Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
On 10/09/2012 01:07 AM, Mike Hommey wrote:
> Note it doesn't have to be a vs. situation. madvise could be an
> additional way to interface with volatile ranges on a given fd.
>
> That is, madvise doesn't have to mean anonymous memory. As a matter of
> fact, MADV_WILLNEED/MADV_DONTNEED are usually used on mmapped files.
> Similarly, there could be a way to use madvise to mark volatile ranges,
> without the application having to track what memory ranges are
> associated with what part of what file, which the kernel already
> tracks.

Good point. We could add an madvise() interface, but limit it only to
mmapped tmpfs files, in parallel with the fallocate() interface.

However, I would like to think through how MADV_MARK_VOLATILE with
purely anonymous memory could work before starting that approach. That,
and Neil's point that having an identical kernel interface restricted to
tmpfs, only as a convenience to userland in switching from virtual
address to/from mmapped file offset, may be better left to a userland
library.

thanks
-john
Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
On Fri, Sep 28, 2012 at 11:16:30PM -0400, John Stultz wrote:
> fd based interfaces vs madvise:
>
> In talking with Taras Glek, he pointed out that for his needs, the
> fd-based interface is a little annoying, as it requires having to get
> access to a tmpfs file and mmap it in; then, instead of just
> referencing a pointer to the data he wants to mark volatile, he has to
> calculate the offset from the start of the mmap and pass those file
> offsets to the interface. Instead he mentioned that using something
> like madvise would be much nicer, since they could just pass a pointer
> to the object in memory they want to make volatile and avoid the extra
> work.
>
> I'm not opposed to adding an madvise interface for this as well, but
> since we have an existing use case with Android's ashmem, I want to
> make sure we support this existing behavior. Specifically, as with
> ashmem, applications can be sharing these tmpfs fds, and so
> file-relative volatile ranges make more sense if you need to
> coordinate what data is volatile between two applications.
>
> Also, while I agree that having an madvise interface for volatile
> ranges would be nice, it does open up some more complex implementation
> issues, since with files there is a fixed relationship between pages
> and the file's address_space mapping, where you can't have pages
> shared between different mappings. This makes it easy to hang the
> volatile-range tree off of the mapping (well, indirectly via a hash
> table). With general anonymous memory, pages can be shared between
> multiple processes, and as far as I understand, don't have any
> grouping structure we could use to determine if the page is in a
> volatile range or not. We would also need to answer more complex
> questions like: what are the semantics of volatility with
> copy-on-write pages? I'm hoping to investigate this idea more deeply
> soon so I can be sure whatever is pushed has a clear plan of how to
> address this idea. Further thoughts here would be appreciated.

Note it doesn't have to be a vs. situation. madvise could be an
additional way to interface with volatile ranges on a given fd.

That is, madvise doesn't have to mean anonymous memory. As a matter of
fact, MADV_WILLNEED/MADV_DONTNEED are usually used on mmapped files.
Similarly, there could be a way to use madvise to mark volatile ranges,
without the application having to track what memory ranges are
associated with what part of what file, which the kernel already tracks.

Mike
Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
On Mon, Oct 08, 2012 at 06:25:07PM -0700, John Stultz wrote:
> On 10/07/2012 11:25 PM, Minchan Kim wrote:
> > Hi John,
> >
> > On Fri, Sep 28, 2012 at 11:16:30PM -0400, John Stultz wrote:
> >> After Kernel Summit and Plumbers, I wanted to consider all the
> >> various side-discussions and try to summarize my current thoughts
> >> here, along with sending out my current implementation for review.
> >>
> >> Also: I'm going on four weeks of paternity leave in the very near
> >> (but non-deterministic) future. So while I hope I still have time
> >> for some discussion, I may have to deal with fussier complaints
> >> than yours. :) In any case, you'll have more time to chew on the
> >> idea and come up with amazing suggestions. :)
> >>
> >> General Interface semantics:
> >> --
> >>
> >> The high-level interface I've been pushing has so far stayed fairly
> >> consistent:
> >>
> >> The application marks a range of data as volatile. Volatile data
> >> may be purged at any time. Accessing volatile data is undefined, so
> >> applications should not do so. If the application wants to access
> >> data in a volatile range, it should mark it as non-volatile. If any
> >> of the pages in the range being marked non-volatile had been
> >> purged, the kernel will return an error, notifying the application
> >> that the data was lost.
> >>
> >> But one interesting new tweak on this design, suggested by Taras
> >> Glek and others at Mozilla, is as follows:
> >>
> >> Instead of leaving volatile data access as undefined: when
> >> accessing volatile data, either the data expected will be returned
> >> if it has not been purged, or the application will get a SIGBUS
> >> when it accesses volatile data that has been purged.
> >>
> >> Everything else remains the same (error on marking non-volatile if
> >> data was purged, etc.). This model allows applications to avoid
> >> having to unmark volatile data when they want to access it, then
> >> immediately re-mark it as volatile when they're done. It is in
> >> effect
> > Just out of curiosity: why should the application re-mark it as
> > volatile again? It has already been a volatile range, and the
> > application doesn't receive any signal while it uses that range. So
> > I think it doesn't need to re-mark.
>
> Not totally sure I understand your question clearly.
>
> So assuming one has a large cache of independently accessed objects,
> this mark-nonvolatile/access/mark-volatile pattern is useful if you
> don't want to have to deal with handling the SIGBUS.
>
> For instance, when accessing the data (say, uncompressed image data),
> you are passing it to a library (to do something like an image filter,
> in place), where you don't want the library's access of the data to
> cause an unexpected SIGBUS that would be difficult to recover from.

I was just confused by your wording. AFAIUC, you mean the following:

1) mark volatile
2) access pages in the range until SIGBUS happens
3) when SIGBUS happens, unmark volatile
4) access pages in the range
5) when it's done, re-mark it as volatile

I agree with this model.

> >> "lazy" with its marking, allowing the kernel to hit it with a
> >> signal when it gets unlucky and touches purged data. From the
> >> signal handler, the application can note the address it faulted on,
> >> unmark the range, and regenerate the needed data before returning
> >> to execution.
> > I like this model if plumbers really want it.
>
> I think it makes sense. Also it avoids a hole in the earlier
> semantics: if accessing volatile data is undefined, and you might get
> the data, or you might get zeros, there's the problem of writes that
> occur on purged ranges (credits to Mike Hommey for pointing this out
> in his critique of Android's ashmem). If an application writes data to
> a purged range, the range continues to be considered volatile, and
> since neighboring data may still be purged, the entire set is
> considered purged. Because of this, we don't end up purging the data
> again (at least with the shrinker method).
>
> By adding the SIGBUS on access of purged pages, it cleans up the
> semantics nicely.
>
> >> However, if applications don't want to deal with handling the
> >> SIGBUS, they can use the more straightforward (but more costly)
> >> unmark/access/mark pattern in the same way as my earlier proposals.
> >>
> >> This allows folks to balance cost vs. complexity in their
> >> application appropriately.
> >>
> >> So that's a general overview of how the idea I'm proposing could be
> >> used.
> > My idea is that we don't need to move all pages in the range to the
> > tail of the LRU or to a new LRU list. Just move one page in the
> > range to the tail of the LRU or the new LRU list. And when the
> > reclaimer starts to find a victim page, it can know this page is
> > volatile by something (e.g., if we use a new LRU list, we can know
> > it easily; otherwise, we can use a new VMA flag - VM_VOLATILE - and
> > we can know it easily by a tweak to page_check_references) and
> > isolate all pages of the range [...]
Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
On 10/07/2012 11:25 PM, Minchan Kim wrote:
> Hi John,
>
> On Fri, Sep 28, 2012 at 11:16:30PM -0400, John Stultz wrote:
>> After Kernel Summit and Plumbers, I wanted to consider all the various
>> side-discussions and try to summarize my current thoughts here along
>> with sending out my current implementation for review.
>>
>> Also: I'm going on four weeks of paternity leave in the very near
>> (but non-deterministic) future. So while I hope I still have time
>> for some discussion, I may have to deal with fussier complaints
>> than yours. :) In any case, you'll have more time to chew on
>> the idea and come up with amazing suggestions. :)
>>
>> General Interface semantics:
>> --
>>
>> The high-level interface I've been pushing has so far stayed fairly
>> consistent:
>>
>> The application marks a range of data as volatile. Volatile data may
>> be purged at any time. Accessing volatile data is undefined, so
>> applications should not do so. If the application wants to access
>> data in a volatile range, it should mark it as non-volatile. If any
>> of the pages in the range being marked non-volatile had been purged,
>> the kernel will return an error, notifying the application that the
>> data was lost.
>>
>> But one interesting new tweak on this design, suggested by Taras
>> Glek and others at Mozilla, is as follows:
>>
>> Instead of leaving volatile data access as undefined: when
>> accessing volatile data, either the expected data will be returned
>> if it has not been purged, or the application will get a SIGBUS when
>> it accesses volatile data that has been purged.
>>
>> Everything else remains the same (error on marking non-volatile
>> if data was purged, etc). This model allows applications to avoid
>> having to unmark volatile data when they want to access it, then
>> immediately re-mark it as volatile when they're done. It is in effect
>
> Just out of curiosity.
> Why should the application re-mark it as volatile again?
> It has already been a volatile range, and the application doesn't
> receive any signal while it uses that range. So I think it doesn't
> need to re-mark.

Not totally sure I understand your question clearly.

So assuming one has a large cache of independently accessed objects,
this mark-nonvolatile/access/mark-volatile pattern is useful if you
don't want to have to deal with handling the SIGBUS.

For instance, if when accessing the data (say uncompressed image
data), you are passing it to a library (to do something like an
image filter, in place), where you don't want the library's access
of the data to cause an unexpected SIGBUS that would be difficult to
recover from.

>> "lazy" with its marking, allowing the kernel to hit it with a signal
>> when it gets unlucky and touches purged data. From the signal handler,
>> the application can note the address it faulted on, unmark the range,
>> and regenerate the needed data before returning to execution.
>
> I like this model if plumbers really want it.

I think it makes sense. Also it avoids a hole in the earlier
semantics: if accessing volatile data is undefined, and you might
get the data, or you might get zeros, there's the problem of writes
that occur on purged ranges (credit to Mike Hommey for pointing
this out in his critique of Android's ashmem). If an application
writes data to a purged range, the range continues to be considered
volatile, and since neighboring data may still be purged, the entire
set is considered purged. Because of this, we don't end up purging
the data again (at least with the shrinker method).

By adding the SIGBUS on access of purged pages, it cleans up the
semantics nicely.

>> However, if applications don't want to deal with handling the
>> SIGBUS, they can use the more straightforward (but more costly)
>> unmark/access/mark pattern in the same way as my earlier proposals.
>>
>> This allows folks to balance cost vs. complexity in their
>> application appropriately.
>>
>> So that's a general overview of how the idea I'm proposing could
>> be used.
>
> My idea is that we don't need to move all pages in the range to the
> tail of the LRU or a new LRU list. Just move one page in the range to
> the tail of the LRU or a new LRU list. And when the reclaimer starts
> to find a victim page, it can tell the page is volatile by something
> (e.g., if we use a new LRU list, we can know it easily; otherwise,
> we can use the VMA's new flag - VM_VOLATILE - and know it easily by
> tweaking page_check_references) and isolate all pages of the range
> in the middle of the LRU list and reclaim them all at once. So the
> cost of marking is just (the search cost for finding an in-memory
> page of the range + moving a page between LRUs or from the middle to
> the tail). It means we can move the cost from mark/unmark time to
> reclaim time.

So this general idea of moving a single page to represent the entire
range has been mentioned before (I think Neil also suggested something
similar). But what happens if we are creating a large volatile range,
most of which hasn't been accessed in a long time, while the chosen
flag page has been accessed recently? In this case, we might
Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
Hi John,

On Fri, Sep 28, 2012 at 11:16:30PM -0400, John Stultz wrote:
> After Kernel Summit and Plumbers, I wanted to consider all the various
> side-discussions and try to summarize my current thoughts here along
> with sending out my current implementation for review.
>
> Also: I'm going on four weeks of paternity leave in the very near
> (but non-deterministic) future. So while I hope I still have time
> for some discussion, I may have to deal with fussier complaints
> than yours. :) In any case, you'll have more time to chew on
> the idea and come up with amazing suggestions. :)
>
> General Interface semantics:
> --
>
> The high-level interface I've been pushing has so far stayed fairly
> consistent:
>
> The application marks a range of data as volatile. Volatile data may
> be purged at any time. Accessing volatile data is undefined, so
> applications should not do so. If the application wants to access
> data in a volatile range, it should mark it as non-volatile. If any
> of the pages in the range being marked non-volatile had been purged,
> the kernel will return an error, notifying the application that the
> data was lost.
>
> But one interesting new tweak on this design, suggested by Taras
> Glek and others at Mozilla, is as follows:
>
> Instead of leaving volatile data access as undefined: when
> accessing volatile data, either the expected data will be returned
> if it has not been purged, or the application will get a SIGBUS when
> it accesses volatile data that has been purged.
>
> Everything else remains the same (error on marking non-volatile
> if data was purged, etc). This model allows applications to avoid
> having to unmark volatile data when they want to access it, then
> immediately re-mark it as volatile when they're done. It is in effect

Just out of curiosity.
Why should the application re-mark it as volatile again?
It has already been a volatile range, and the application doesn't
receive any signal while it uses that range. So I think it doesn't
need to re-mark.

> "lazy" with its marking, allowing the kernel to hit it with a signal
> when it gets unlucky and touches purged data. From the signal handler,
> the application can note the address it faulted on, unmark the range,
> and regenerate the needed data before returning to execution.

I like this model if plumbers really want it.

> Since this approach avoids the more explicit unmark/access/mark
> pattern, it avoids the extra overhead required to ensure data is
> non-volatile before being accessed.

I have an idea to reduce the overhead. See below.

> However, if applications don't want to deal with handling the
> SIGBUS, they can use the more straightforward (but more costly)
> unmark/access/mark pattern in the same way as my earlier proposals.
>
> This allows folks to balance cost vs. complexity in their
> application appropriately.
>
> So that's a general overview of how the idea I'm proposing could
> be used.

My idea is that we don't need to move all pages in the range to the
tail of the LRU or a new LRU list. Just move one page in the range to
the tail of the LRU or a new LRU list. And when the reclaimer starts
to find a victim page, it can tell the page is volatile by something
(e.g., if we use a new LRU list, we can know it easily; otherwise,
we can use the VMA's new flag - VM_VOLATILE - and know it easily by
tweaking page_check_references) and isolate all pages of the range in
the middle of the LRU list and reclaim them all at once. So the cost
of marking is just (the search cost for finding an in-memory page of
the range + moving a page between LRUs or from the middle to the
tail). It means we can move the cost from mark/unmark time to reclaim
time.
> Specific Interface semantics:
> -
>
> Here are some of the open questions about how the user interface
> should look:
>
> fadvise vs fallocate:
>
> So while originally I used fadvise, currently my
> implementation uses fallocate(fd, FALLOC_FL_MARK_VOLATILE,
> start, len) to mark a range as volatile and fallocate(fd,
> FALLOC_FL_UNMARK_VOLATILE, start, len) to unmark ranges.
>
> During kernel summit, the question was brought up whether fallocate
> was really the right interface to be using, or if fadvise
> would be better. To me fadvise makes a little more sense,
> but earlier it was pointed out that marking data ranges as
> volatile could also be seen as a type of cancellable and lazy
> hole-punching, so from that perspective fallocate might make
> more sense. This is still an open question and I'd appreciate
> further input here.
>
> tmpfs vs non-shmem filesystems:
>
> Android's ashmem primarily provides a way to get unlinked
> tmpfs fds that can be shared between applications. It's
> just an additional feature that those pages can be "unpinned",
> or marked volatile in my terminology. Thus in implementing
> volatile ranges, I've focused on getting it to work on tmpfs
> file descriptors.
Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
On 10/02/2012 12:39 AM, NeilBrown wrote:
> On Fri, 28 Sep 2012 23:16:30 -0400 John Stultz wrote:
>> After Kernel Summit and Plumbers, I wanted to consider all the various
>> side-discussions and try to summarize my current thoughts here along
>> with sending out my current implementation for review.
>>
>> Also: I'm going on four weeks of paternity leave in the very near
>> (but non-deterministic) future. So while I hope I still have time
>> for some discussion, I may have to deal with fussier complaints
>> than yours. :) In any case, you'll have more time to chew on
>> the idea and come up with amazing suggestions. :)
>
> I wonder if you are trying to please everyone and risking pleasing
> no-one? Well, maybe not quite that extreme, but you can't please all
> the people all the time.

So while I do agree that I won't be able to please everyone, especially
when it comes to how this interface is implemented internally, I do
want to make sure that the userland interface really does make sense
and isn't limited by my own short-sightedness. :)

> For example, allowing sub-page volatile regions seems to be above and
> beyond the call of duty. You cannot mmap sub-pages, so why should they
> be volatile?

Although if someone marked a page and a half as volatile, would it be
reasonable to throw away the second half of that second page? That
seems unexpected to me. So we're really only marking the whole pages
specified as volatile, similar to how FALLOC_FL_PUNCH_HOLE behaves.
But if it happens that the adjacent range is also a partial page, we
can possibly coalesce them into a purgeable whole page. I think it
makes sense, especially from a userland point of view, and wasn't
really complicated to add.

> Similarly the suggestion of using madvise - while tempting - is
> probably a minority interest and can probably be managed with library
> code. I'm glad you haven't pursued it.

For now I see this as a lower priority, but it's something I'd like to
investigate, as depending on tmpfs has issues: since there's no quota
support, having a user-writable tmpfs partition mounted is a DoS
opening, especially on low-memory systems.

> I think discarding whole ranges at a time is very sensible, and so
> merging adjacent ranges is best avoided. If you require page-aligned
> ranges this becomes trivial - is that right?

True. If we avoid coalescing non-whole-page ranges, keeping
non-overlapping ranges independent is fairly easy. But it is also easy
to avoid coalescing in all cases except when multiple sub-page ranges
can be coalesced together. In other words, we mark the whole-page
portions of the range as volatile, and keep the sub-page portions
separate. So non-page-aligned ranges would possibly consist of three
independent ranges, with the middle one as the only one marked
volatile. Should those non-whole-page ranges be adjacent to other
non-whole-page ranges, they could be coalesced. Since the coalesced
edge ranges would be marked volatile after the full range, we would
also avoid purging the edge pages, which would invalidate two unpurged
ranges.

Alternatively, we can never coalesce and only mark the whole pages in
single ranges as volatile. It doesn't really make it more complex. But
again, these are implementation details. The main point is that, at the
user-interface level, I think allowing userland to provide
non-page-aligned ranges is valid. What we do with those
non-page-aligned chunks is up to the kernel/implementation, but I think
we should be conservative and be sure never to purge non-volatile data.

> I wonder if the oldest page/oldest range issue can be defined away by
> requiring apps to touch the first page in a range when they touch the
> range. Then the age of a range is the age of the first page.
> Non-initial pages could even be kept off the free list, though that
> might confuse NUMA page reclaim if a range had pages from different
> nodes.

Not sure I followed this. Are you suggesting keeping non-initial pages
off the vmscan LRU lists entirely?

Another approach that was suggested, which sounds similar, is touching
all the pages when we mark them as volatile, so they are all close to
each other in the active/inactive list. Then when the vmscan
shrink_lru_list() code runs, it would purge the pages together
(although it might only purge half a range if there wasn't the need
for more memory).

But again, these page-based solutions have much higher algorithmic
complexity (O(n) with respect to pages marked) and overhead.

> Application to non-tmpfs files seems very unclear and so probably best
> avoided. If I understand you correctly, then you have suggested both
> that a volatile range would be a "lazy hole punch" and a "don't let
> this get written to disk yet" flag. It cannot really be both. The
> former sounds like fallocate, the latter like fadvise.

I don't think I see the exclusivity aspect. If we say "Dear kernel, you
may punch a hole at this offset in this file whenever you want
Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
On 9/28/2012 8:16 PM, John Stultz wrote:
> There are two rough approaches that I have tried so far:
>
> 1) Managing volatile range objects, in a tree or list, which are then
> purged using a shrinker
>
> 2) Page-based management, where pages marked volatile are moved to a
> new LRU list and are purged from there.
>
> 1) This patchset implements the shrinker-based approach. In many ways
> it is simpler, but it does have a few drawbacks. Basically, when
> marking a range as volatile, we create a range object and add it to an
> rbtree. This allows us to quickly find ranges, given an address in the
> file. We also add each range object to the tail of a filesystem-global
> linked list, which acts as an LRU allowing us to quickly find the
> least recently created volatile range. We then use a shrinker callback
> to trigger purging, where we'll select the range at the head of the
> LRU list, purge the data, mark the range object as purged, and remove
> it from the LRU list.
>
> This allows fairly efficient behavior, as marking and unmarking a
> range are both O(log n) operations, with respect to the number of
> ranges, to insert and remove from the tree. Purging the range is also
> O(1) to select the range, and we purge the entire range in
> least-recently-marked-volatile order.
>
> The drawback with this approach is that it uses a shrinker, and is
> thus NUMA-unaware. We track the virtual address of the pages in the
> file, so we don't have a sense of what physical pages we're using, nor
> which nodes those pages may be on. So it's possible on a multi-node
> system that when one node was under pressure, we'd purge volatile
> ranges that are all on a different node, in effect throwing data away
> without helping anything. This is clearly non-ideal for NUMA systems.
>
> One idea I discussed with Michel Lespinasse is that this might be
> something we could improve by providing the shrinker some node
> context, then keeping track in each range of the node its first page
> is on. That way we would be sure to at least free up one page on the
> node under pressure when purging that range.
>
> 2) The second approach, which was more page-based, was also tried. In
> this case, when we marked a range as volatile, the pages in that range
> were moved to a new LRU list, LRU_VOLATILE, in vmscan.c. This provided
> a page LRU list that could be used to free pages before looking at the
> LRU_INACTIVE_FILE/ANONYMOUS lists.
>
> This integrates the feature deeper in the mm code, which is nice,
> especially as we have an LRU_VOLATILE list for each NUMA node. Thus
> under pressure we won't purge ranges that are entirely on a different
> node, as is possible with the other approach.
>
> However, this approach is more costly. When marking a range as
> volatile, we have to migrate every page in that range to the
> LRU_VOLATILE list, and similarly on unmarking we have to move each
> page back. This ends up being O(n) with respect to the number of pages
> in the range we're marking or unmarking. Similarly, when purging, we
> let the scanning code select a page off the LRU, then we have to map
> it back to the volatile range so we can purge the entire range, making
> it a more expensive O(log n) operation (with respect to the number of
> ranges).
>
> This is a particular concern as applications that want to mark and
> unmark data as volatile with fine granularity will likely be calling
> these operations frequently, adding quite a bit of overhead. This
> makes it less likely that applications will choose to volunteer data
> as volatile to the system.
>
> However, with the new lazy SIGBUS notification, applications using the
> SIGBUS method would avoid having to mark and unmark data when
> accessing it, so this overhead may be less of a concern. However, for
> cases where applications don't want to deal with the SIGBUS and would
> rather have the more deterministic behavior of the unmark/access/mark
> pattern, the performance is a concern.

Unfortunately, approach 1 is not useful for our use-case.
It'll mean that we are continuously re-decompressing frequently used parts of libxul.so under memory pressure(which is pretty often on limited ram devices). Taras ps. John, I really appreciate movement on this. We really need this to improve Firefox memory usage + startup speed on low memory devices. Will be great to have Firefox start faster+ respond to memory pressure better on desktop Linux too. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
On Fri, 28 Sep 2012 23:16:30 -0400 John Stultz wrote:
> After Kernel Summit and Plumbers, I wanted to consider all the various
> side-discussions and try to summarize my current thoughts here along
> with sending out my current implementation for review.
>
> Also: I'm going on four weeks of paternity leave in the very near
> (but non-deterministic) future. So while I hope I still have time
> for some discussion, I may have to deal with fussier complaints
> than yours. :) In any case, you'll have more time to chew on
> the idea and come up with amazing suggestions. :)

Hi John,
I wonder if you are trying to please everyone and risking pleasing no-one? Well, maybe not quite that extreme, but you can't please all the people all the time.

For example, allowing sub-page volatile regions seems to be above and beyond the call of duty. You cannot mmap sub-pages, so why should they be volatile?

Similarly the suggestion of using madvise - while tempting - is probably a minority interest and can probably be managed with library code. I'm glad you haven't pursued it.

I think discarding whole ranges at a time is very sensible, and so merging adjacent ranges is best avoided. If you require page-aligned ranges this becomes trivial - is that right?

I wonder if the oldest page/oldest range issue can be defined away by requiring apps to touch the first page in a range when they touch the range. Then the age of a range is the age of the first page. Non-initial pages could even be kept off the free list, though that might confuse NUMA page reclaim if a range had pages from different nodes.

Application to non-tmpfs files seems very unclear and so probably best avoided. If I understand you correctly, then you have suggested both that a volatile range would be a "lazy hole punch" and a "don't let this get written to disk yet" flag. It cannot really be both. The former sounds like fallocate, the latter like fadvise.

I think the latter sounds more like the general purpose of volatile ranges, but I also suspect that some journalling filesystems might be uncomfortable providing a guarantee like that. So I would suggest firmly stating that it is a tmpfs-only feature. If someone wants something vaguely similar for other filesystems, let them implement it separately.

The SIGBUS interface could have some merit if it really reduces overhead. I worry about app bugs that could result from the non-deterministic behaviour. A range could get unmapped while it is in use, and testing for the case of "get a SIGBUS halfway through accessing something" would not be straightforward (SIGBUS on the first step of access should be easy). I guess that is up to the app writer, but I have never liked anything about the signal interface and encouraging further use doesn't feel wise.

That's my 2c worth for now.
Keep up the good work,

NeilBrown
Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
On 10/02/2012 12:39 AM, NeilBrown wrote:
> On Fri, 28 Sep 2012 23:16:30 -0400 John Stultz wrote:
>> After Kernel Summit and Plumbers, I wanted to consider all the various
>> side-discussions and try to summarize my current thoughts here along
>> with sending out my current implementation for review.
>>
>> Also: I'm going on four weeks of paternity leave in the very near
>> (but non-deterministic) future. So while I hope I still have time
>> for some discussion, I may have to deal with fussier complaints
>> than yours. :) In any case, you'll have more time to chew on
>> the idea and come up with amazing suggestions. :)
>
> I wonder if you are trying to please everyone and risking pleasing
> no-one? Well, maybe not quite that extreme, but you can't please all
> the people all the time.

So while I do agree that I won't be able to please everyone, especially when it comes to how this interface is implemented internally, I do want to make sure that the userland interface really does make sense and isn't limited by my own short-sightedness. :)

> For example, allowing sub-page volatile regions seems to be above and
> beyond the call of duty. You cannot mmap sub-pages, so why should they
> be volatile?

Although if someone marked a page and a half as volatile, would it be reasonable to throw away the second half of that second page? That seems unexpected to me. So we're really only marking the whole pages specified as volatile, similar to how FALLOC_FL_PUNCH_HOLE behaves. But if it happens that the adjacent range is also a partial page, we can possibly coalesce them into a purgeable whole page. I think it makes sense, especially from a userland point of view, and it wasn't really complicated to add.

> Similarly the suggestion of using madvise - while tempting - is
> probably a minority interest and can probably be managed with library
> code. I'm glad you haven't pursued it.

For now I see this as a lower priority, but it's something I'd like to investigate, as depending on tmpfs has issues: there's no quota support, so having a user-writable tmpfs partition mounted is a DoS opening, especially on low-memory systems.

> I think discarding whole ranges at a time is very sensible, and so
> merging adjacent ranges is best avoided. If you require page-aligned
> ranges this becomes trivial - is that right?

True. If we avoid coalescing non-whole-page ranges, keeping non-overlapping ranges independent is fairly easy. But it is also easy to avoid coalescing in all cases except when multiple sub-page ranges can be coalesced together. In other words, we mark the whole-page portions of the range as volatile, and keep the sub-page portions separate. So a non-page-aligned range would possibly consist of three independent ranges, with the middle one as the only one marked volatile. Should those non-whole-page ranges be adjacent to other non-whole-page ranges, they could be coalesced. Since the coalesced edge ranges would be marked volatile after the full range, we would also avoid purging the edge pages, which would invalidate two unpurged ranges. Alternatively, we could never coalesce and only mark whole pages in single ranges as volatile. It doesn't really make it more complex. But again, these are implementation details. The main point is I think at the user-interface level, allowing userland to provide non-page-aligned ranges is valid. What we do with those non-page-aligned chunks is up to the kernel/implementation, but I think we should be conservative and be sure never to purge non-volatile data.

> I wonder if the oldest page/oldest range issue can be defined away by
> requiring apps to touch the first page in a range when they touch the
> range. Then the age of a range is the age of the first page.
> Non-initial pages could even be kept off the free list though that
> might confuse NUMA page reclaim if a range had pages from different
> nodes.

Not sure I followed this. Are you suggesting keeping non-initial ranges off the vmscan LRU lists entirely?

Another approach that was suggested, which sounds similar, is touching all the pages when we mark them as volatile, so they are all close to each other in the active/inactive list. Then when the vmscan shrink_lru_list() code runs it would purge the pages together (although it might only purge half a range if there wasn't the need for more memory). But again, these page-based solutions have much higher algorithmic complexity (O(n) with respect to pages marked) and overhead.

> Application to non-tmpfs files seems very unclear and so probably best
> avoided. If I understand you correctly, then you have suggested both
> that a volatile range would be a "lazy hole punch" and a "don't let
> this get written to disk yet" flag. It cannot really be both. The
> former sounds like fallocate, the latter like fadvise.

I don't think I see the exclusivity aspect. If we say "Dear kernel, you may punch a hole at this offset in this file whenever you want