Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Fri, May 08, 2015 at 11:02:28PM -0400, Rik van Riel wrote:
> On 05/08/2015 09:14 PM, Linus Torvalds wrote:
> > On Fri, May 8, 2015 at 9:59 AM, Rik van Riel wrote:
> >>
> >> However, for persistent memory, all of the files will be "in memory".
> >
> > Yes. However, I doubt you will find a very sane rw filesystem that
> > then also makes them contiguous and aligns them at 2MB boundaries.
> >
> > Anything is possible, I guess, but things like that are *hard*. The
> > fragmentation issues etc make it a really challenging thing.
>
> The TLB performance bonus of accessing the large files with
> large pages may make it worthwhile to solve that hard problem.

FWIW, for DAX the filesystem allocation side is already mostly solved - this
is just an allocation alignment hint, analogous to RAID stripe alignment. We
don't need to reinvent the wheel here.

i.e. on XFS, use a 2MB stripe unit for the fs and a 2MB extent size hint for
files you want to use large pages on, and you'll get 2MB sized and aligned
allocations from the filesystem for as long as there are such freespace
regions available.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
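A sketch of the setup Dave describes, using standard xfsprogs commands; the device path and mount point are placeholders, not anything from the thread:

```sh
# 1. Make the filesystem with a 2MB stripe unit (stripe width = 1 unit),
#    so the allocator prefers 2MB-aligned extents.
mkfs.xfs -d su=2m,sw=1 /dev/pmem0

# 2. Set a 2MB extent size hint on a directory; new files created in it
#    inherit the hint and get 2MB sized/aligned allocations while such
#    freespace regions remain.
xfs_io -c "extsize 2m" /mnt/pmem/bigfiles
```

The hint can also be set per-file with the same `xfs_io` command on an individual (still-empty) file.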
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Fri, May 8, 2015 at 8:02 PM, Rik van Riel wrote:
>
> The TLB performance bonus of accessing the large files with
> large pages may make it worthwhile to solve that hard problem.

Very few people can actually measure that TLB advantage on systems with good
TLBs. It's largely a myth, fed by some truly crappy TLB fill systems
(particularly sw-filled TLBs on some early RISC CPUs, but even "modern" CPUs
sometimes have glass jaws here because they can't prefetch TLB entries or do
concurrent page table walks etc).

There are *very* few loads that actually have the kinds of access patterns
where TLB accesses dominate - or are even noticeable - compared to the normal
memory access costs. That is doubly true with file-backed storage.

The main reason you get TLB costs to be noticeable is with very sparse access
patterns, where you hit as many TLB entries as you hit pages. That simply
doesn't happen with file mappings. Really.

The whole thing about TLB advantages of hugepages is this almost entirely
made-up stupid myth. You almost have to make up the benchmark for it (_that_
part is easy) to even see it.

Linus
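The sparse-vs-dense distinction Linus describes can be made concrete with a toy C sketch. The buffer sizes are arbitrary and this only illustrates the two access patterns; it is not a calibrated benchmark:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define NPAGES 4096              /* 16MB: far more pages than TLB entries */

/* Sparse pattern: one byte per page, so every access needs a distinct
 * TLB entry - the rare case where TLB misses can actually show up. */
static uint64_t sparse_sum(const unsigned char *buf)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < (size_t)NPAGES * PAGE_SIZE; i += PAGE_SIZE)
        sum += buf[i];
    return sum;
}

/* Dense pattern (what file access usually looks like): every byte of a
 * few pages, so one TLB fill is amortized over 4096 accesses. */
static uint64_t dense_sum(const unsigned char *buf)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < 4 * PAGE_SIZE; i++)
        sum += buf[i];
    return sum;
}
```

Both functions touch the same order of bytes per iteration; only the number of distinct pages touched differs, which is exactly the variable that determines TLB pressure.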
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On 05/08/2015 09:14 PM, Linus Torvalds wrote:
> On Fri, May 8, 2015 at 9:59 AM, Rik van Riel wrote:
>>
>> However, for persistent memory, all of the files will be "in memory".
>
> Yes. However, I doubt you will find a very sane rw filesystem that
> then also makes them contiguous and aligns them at 2MB boundaries.
>
> Anything is possible, I guess, but things like that are *hard*. The
> fragmentation issues etc make it a really challenging thing.

The TLB performance bonus of accessing the large files with large pages may
make it worthwhile to solve that hard problem.

> And if they aren't aligned big contiguous allocations, then they
> aren't relevant from any largepage cases. You'll still have to map
> them 4k at a time etc.

Absolutely, but we only need the 4k struct pages when the files are mapped. I
suspect a lot of the files will just sit around idle, without being used.

I am not convinced that the idea I wrote down earlier in this thread is
worthwhile now, but it may turn out to be at some point in the future. It all
depends on how much data people store on DAX filesystems, and how many files
they have open at once.

--
All rights reversed
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Fri, May 8, 2015 at 9:59 AM, Rik van Riel wrote:
>
> However, for persistent memory, all of the files will be "in memory".

Yes. However, I doubt you will find a very sane rw filesystem that then also
makes them contiguous and aligns them at 2MB boundaries.

Anything is possible, I guess, but things like that are *hard*. The
fragmentation issues etc make it a really challenging thing.

And if they aren't aligned big contiguous allocations, then they aren't
relevant from any largepage cases. You'll still have to map them 4k at a time
etc.

Linus
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
> "Linus" == Linus Torvalds writes:

Linus> On Fri, May 8, 2015 at 7:40 AM, John Stoffel wrote:
>>
>> Now go and look at your /home or /data/ or /work areas, where the
>> endusers are actually keeping their day to day work. Photos, mp3,
>> design files, source code, object code littered around, etc.

Linus> However, the big files in that list are almost immaterial from a
Linus> caching standpoint.

Linus> Caching source code is a big deal - just try not doing it and
Linus> you'll figure it out. And the kernel C source files used to
Linus> have a median size around 4k.

Caching any files is a big deal, and if I'm doing batch edits of large jpegs,
won't they get cached as well?

Linus> The big files in your home directory? Let me make an educated
Linus> guess. Very few to *none* of them are actually in your page
Linus> cache right now. And you'd never even care if they ever made
Linus> it into your page cache *at*all*. Much less whether you could
Linus> ever cache them using large pages using some very fancy cache.

Hmm... probably not, honestly, since I'm not at home and not using the system
actively right now. But I can see situations where being able to mix
different page sizes efficiently might be a good thing.

Linus> There are big files that care about caches, but they tend to be
Linus> binaries, and for other reasons (things like randomization) you
Linus> would never want to use largepages for those anyway.

Or large design files, like my users at $WORK use, which can be 4Gb in size
for a large design, which is ASIC chip layout work. So I'm a little bit in
the minority there. And yes, I do have other users with millions of itty
bitty files as well.

Linus> So from a page cache standpoint, I think the 4kB size still
Linus> matters. A *lot*. largepages are a complete red herring, and
Linus> will continue to be so pretty much forever (anonymous
Linus> largepages perhaps less so).

I think in the future, being able to efficiently mix page sizes will become
useful, if only to lower the memory overhead of keeping track of large
numbers of pages.

John
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On 05/08/2015 11:54 AM, Linus Torvalds wrote:
> On Fri, May 8, 2015 at 7:40 AM, John Stoffel wrote:
>>
>> Now go and look at your /home or /data/ or /work areas, where the
>> endusers are actually keeping their day to day work. Photos, mp3,
>> design files, source code, object code littered around, etc.
>
> However, the big files in that list are almost immaterial from a
> caching standpoint.
>
> The big files in your home directory? Let me make an educated guess.
> Very few to *none* of them are actually in your page cache right now.
> And you'd never even care if they ever made it into your page cache
> *at*all*. Much less whether you could ever cache them using large
> pages using some very fancy cache.

However, for persistent memory, all of the files will be "in memory". Not
instantiating the 4kB struct pages for 2MB areas that are not currently being
accessed with small files may make a difference.

For dynamically allocated 4kB page structs, we need some way to discover
where they are. It may make sense, from a simplicity point of view, to have
one mechanism that works both for pmem and for normal system memory.

I agree that 4kB granularity needs to continue to work pretty much forever,
though. As long as people continue creating text files, they will just not be
very large.

--
All rights reversed
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Fri, May 08, 2015 at 08:54:06AM -0700, Linus Torvalds wrote:
> However, the big files in that list are almost immaterial from a
> caching standpoint.

.git/objects/pack/* caching matters a lot, though...
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Fri, May 8, 2015 at 7:40 AM, John Stoffel wrote:
>
> Now go and look at your /home or /data/ or /work areas, where the
> endusers are actually keeping their day to day work. Photos, mp3,
> design files, source code, object code littered around, etc.

However, the big files in that list are almost immaterial from a caching
standpoint.

Caching source code is a big deal - just try not doing it and you'll figure
it out. And the kernel C source files used to have a median size around 4k.

The big files in your home directory? Let me make an educated guess. Very few
to *none* of them are actually in your page cache right now. And you'd never
even care if they ever made it into your page cache *at*all*. Much less
whether you could ever cache them using large pages using some very fancy
cache.

There are big files that care about caches, but they tend to be binaries, and
for other reasons (things like randomization) you would never want to use
largepages for those anyway.

So from a page cache standpoint, I think the 4kB size still matters. A
*lot*. largepages are a complete red herring, and will continue to be so
pretty much forever (anonymous largepages perhaps less so).

Linus
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On 05/08/2015 10:05 AM, Ingo Molnar wrote:
> * Rik van Riel wrote:
>> Memory trends point in one direction, file size trends in another.
>>
>> For persistent memory, we would not need 4kB page struct pages
>> unless memory from a particular area was in small files AND those
>> files were being actively accessed. [...]
>
> Average file size on my system's /usr is 12.5K:
>
> triton:/usr> ( echo -n $(echo $(find . -type f -printf "%s\n") | sed 's/ /+/g' | bc); echo -n "/"; find . -type f -printf "%s\n" | wc -l; ) | bc
> 12502
>
>> [...] Large files (mapped in 2MB chunks) or inactive small files
>> would not need the 4kB page structs around.
>
> ... they are the utter uncommon case. 4K is here to stay, and for a
> very long time - until humans use computers I suspect.

There's a bit of an 80/20 thing going on, though. The average file size may
be small, but most data is used by large files.

Additionally, a 2MB pmem area that has no small files on it that are
currently open will also not need 4kB page structs. A system with 2TB of pmem
might still only have a few thousand small files open at any point in
time. The rest of the memory is either in large files, or in small files that
have not been opened recently.

We can reclaim the struct pages of 4kB pages that are not currently in use.

--
All rights reversed
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
> "Ingo" == Ingo Molnar writes:

Ingo> * Rik van Riel wrote:
>> The disadvantage is pretty obvious too: 4kB pages would no longer be
>> the fast case, with an indirection. I do not know how much of an
>> issue that would be, or whether it even makes sense for 4kB pages to
>> continue being the fast case going forward.

Ingo> I strongly disagree that 4kB does not matter as much: it is _the_
Ingo> bread and butter of 99% of Linux usecases. 4kB isn't going away
Ingo> anytime soon - THP might look nice in benchmarks, but it does not
Ingo> matter nearly as much in practice and for filesystems and IO it's
Ingo> absolutely crazy to think about 2MB granularity.

Ingo> Having said that, I don't think a single jump of indirection is a big
Ingo> issue - except for the present case where all the pmem IO space is
Ingo> mapped non-cacheable. Write-through caching patches are in the works
Ingo> though, and that should make it plenty fast.

>> Memory trends point in one direction, file size trends in another.
>>
>> For persistent memory, we would not need 4kB page struct pages
>> unless memory from a particular area was in small files AND those
>> files were being actively accessed. [...]

Ingo> Average file size on my system's /usr is 12.5K:

Ingo> triton:/usr> ( echo -n $(echo $(find . -type f -printf "%s\n") | sed 's/ /+/g' | bc); echo -n "/"; find . -type f -printf "%s\n" | wc -l; ) | bc
Ingo> 12502

Now go and look at your /home or /data/ or /work areas, where the endusers
are actually keeping their day to day work. Photos, mp3, design files, source
code, object code littered around, etc.

Now I also have 12Tb filesystems with 30+ million files in them, which just
*suck* for backup, esp incrementals. I have one monster with 85+ million
files (time to get beat on users again ...) which needs to be pruned.

So I'm not arguing against you, I'm just saying you need better, more
representative numbers across more day to day work.

Running this exact same command against my home directory gets: 528989

So I'm not arguing one way or another... just providing numbers.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Rik van Riel wrote:

> The disadvantage is pretty obvious too: 4kB pages would no longer be
> the fast case, with an indirection. I do not know how much of an
> issue that would be, or whether it even makes sense for 4kB pages to
> continue being the fast case going forward.

I strongly disagree that 4kB does not matter as much: it is _the_ bread and
butter of 99% of Linux usecases. 4kB isn't going away anytime soon - THP
might look nice in benchmarks, but it does not matter nearly as much in
practice, and for filesystems and IO it's absolutely crazy to think about 2MB
granularity.

Having said that, I don't think a single jump of indirection is a big issue -
except for the present case where all the pmem IO space is mapped
non-cacheable. Write-through caching patches are in the works though, and
that should make it plenty fast.

> Memory trends point in one direction, file size trends in another.
>
> For persistent memory, we would not need 4kB page struct pages
> unless memory from a particular area was in small files AND those
> files were being actively accessed. [...]

Average file size on my system's /usr is 12.5K:

triton:/usr> ( echo -n $(echo $(find . -type f -printf "%s\n") | sed 's/ /+/g' | bc); echo -n "/"; find . -type f -printf "%s\n" | wc -l; ) | bc
12502

> [...] Large files (mapped in 2MB chunks) or inactive small files
> would not need the 4kB page structs around.

... they are the utter uncommon case. 4K is here to stay, and for a very long
time - until humans use computers, I suspect.

But I don't think the 2MB metadata chunking is wrong per se.

Thanks,

Ingo
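For what it's worth, the same statistic can be computed without the sed/bc gymnastics in Ingo's pipeline. A hypothetical C version (it scans a single directory level, unlike the recursive `find`, so it is an illustration of the computation rather than a drop-in replacement):

```c
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/* Mean size in bytes of the regular files directly under `dir`;
 * returns 0 if there are none or the directory cannot be opened. */
static uint64_t avg_file_size(const char *dir)
{
    uint64_t total = 0, nfiles = 0;
    char path[4096];
    struct stat sb;
    struct dirent *de;
    DIR *d = opendir(dir);

    if (!d)
        return 0;
    while ((de = readdir(d)) != NULL) {
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        /* Count regular files only; "." and ".." fail S_ISREG. */
        if (stat(path, &sb) == 0 && S_ISREG(sb.st_mode)) {
            total += (uint64_t)sb.st_size;
            nfiles++;
        }
    }
    closedir(d);
    return nfiles ? total / nfiles : 0;
}
```

A shorter shell equivalent of Ingo's recursive pipeline would be `find . -type f -printf "%s\n" | awk '{s+=$1; n++} END {print int(s/n)}'`.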
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On 05/07/2015 03:11 PM, Ingo Molnar wrote:
> Stable, global page-struct descriptors are a given for real RAM, where
> we allocate a struct page for every page in nice, large, mostly linear
> arrays.
>
> We'd really need that for pmem too, to get the full power of struct
> page: and that means allocating them in nice, large, predictable
> places - such as on the device itself ...
>
> It might even be 'scattered' across the device, with 64 byte struct
> page size we can pack 64 descriptors into a single page, so every 65
> pages we could have a page-struct page.
>
> Finding a pmem page's struct page would thus involve rounding it
> modulo 65 and reading that page.
>
> The problem with that is fourfold:
>
> - that we now turn a very kernel internal API and data structure into
>   an ABI. If struct page grows beyond 64 bytes it's a problem.
>
> - on bootup (or device discovery time) we'd have to initialize all
>   the page structs. We could probably do this in a hierarchical way,
>   by dividing continuous pmem ranges into power-of-two groups of
>   blocks, and organizing them like the buddy allocator does.
>
> - 1.5% of storage space lost.
>
> - will wear-leveling properly migrate these 'hot' pages around?

MST and I have been doing some thinking about how to address some of the
issues above.

One way could be to invert the PG_compound logic we have today, by allocating
one struct page for every PMD / THP sized area (2MB on x86), and dynamically
allocating struct pages for the 4kB pages inside, only if the area gets
split. They can be freed again when the area is not being accessed in 4kB
chunks.

That way we would always look at the struct page for the 2MB area first, and
if the PG_split bit is set, we look at the array of dynamically allocated
struct pages for this area.

The advantages are obvious: boot time memory overhead and initialization time
are reduced by a factor 512. CPUs could also take a whole 2MB area in order
to do CPU-local 4kB allocations, defragmentation policies may become a little
clearer, etc...

The disadvantage is pretty obvious too: 4kB pages would no longer be the fast
case, with an indirection. I do not know how much of an issue that would be,
or whether it even makes sense for 4kB pages to continue being the fast case
going forward. Memory trends point in one direction, file size trends in
another.

For persistent memory, we would not need 4kB page struct pages unless memory
from a particular area was in small files AND those files were being actively
accessed. Large files (mapped in 2MB chunks) or inactive small files would
not need the 4kB page structs around.

--
All rights reversed
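The lookup Rik describes can be modeled in userspace C roughly as follows. All structure and field names here are invented for illustration; this is a toy model, not the kernel's struct page:

```c
#include <stdint.h>
#include <stdlib.h>

#define SMALL_SHIFT 12                                  /* 4kB base pages */
#define HUGE_SHIFT  21                                  /* 2MB areas */
#define SUBPAGES (1u << (HUGE_SHIFT - SMALL_SHIFT))     /* 512 per area */

struct toy_page {
    uint64_t pfn;
};

struct huge_desc {
    struct toy_page head;       /* always present: one per 2MB area */
    struct toy_page *subpages;  /* NULL until the area is split */
};

/* Split a 2MB area: allocate its 512 per-4kB descriptors on demand.
 * (In Rik's scheme they could be freed again once 4kB access stops.) */
static int split_area(struct huge_desc *d)
{
    if (d->subpages)
        return 0;
    d->subpages = calloc(SUBPAGES, sizeof(*d->subpages));
    if (!d->subpages)
        return -1;
    for (unsigned i = 0; i < SUBPAGES; i++)
        d->subpages[i].pfn = d->head.pfn + i;
    return 0;
}

/* pfn -> descriptor: check the 2MB head first, indirect only if split. */
static struct toy_page *lookup(struct huge_desc *areas, uint64_t pfn)
{
    struct huge_desc *d = &areas[pfn >> (HUGE_SHIFT - SMALL_SHIFT)];

    if (!d->subpages)
        return &d->head;               /* unsplit: one descriptor per 2MB */
    return &d->subpages[pfn & (SUBPAGES - 1)];
}
```

The factor-512 saving falls out directly: an unsplit 2MB area carries one descriptor instead of 512, and the indirection cost Rik worries about is the extra pointer chase in the split branch of `lookup()`.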
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Fri, May 08, 2015 at 11:26:01AM +0200, Ingo Molnar wrote:
>
> * Al Viro wrote:
>
> > On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote:
> >
> > > So if code does iov_iter_get_pages_alloc() on a user address that
> > > has a real struct page behind it - and some other code does a
> > > regular get_user_pages() on it, we'll have two sets of struct page
> > > descriptors, the 'real' one, and a fake allocated one, right?
> >
> > Huh? iov_iter_get_pages() is given an array of pointers to struct
> > page, which it fills with what it finds. iov_iter_get_pages_alloc()
> > *allocates* such an array, fills that with what it finds and gives
> > the allocated array to caller.
> >
> > We are not allocating any struct page instances in either of those.
>
> Ah, stupid me - thanks for the explanation!

My fault, actually - this "pages array" should've been either "'pages' array"
or "array of pointers to struct page".
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Al Viro wrote:

> On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote:
>
> > So if code does iov_iter_get_pages_alloc() on a user address that
> > has a real struct page behind it - and some other code does a
> > regular get_user_pages() on it, we'll have two sets of struct page
> > descriptors, the 'real' one, and a fake allocated one, right?
>
> Huh? iov_iter_get_pages() is given an array of pointers to struct
> page, which it fills with what it finds. iov_iter_get_pages_alloc()
> *allocates* such an array, fills that with what it finds and gives
> the allocated array to caller.
>
> We are not allocating any struct page instances in either of those.

Ah, stupid me - thanks for the explanation!

Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote:
> same as iov_iter_get_pages(), except that pages array is allocated
> (kmalloc if possible, vmalloc if that fails) and left for caller to
> free. Lustre and NFS ->direct_IO() switched to it.
>
> Signed-off-by: Al Viro
>
> So if code does iov_iter_get_pages_alloc() on a user address that has
> a real struct page behind it - and some other code does a regular
> get_user_pages() on it, we'll have two sets of struct page
> descriptors, the 'real' one, and a fake allocated one, right?

Huh? iov_iter_get_pages() is given an array of pointers to struct page, which
it fills with what it finds. iov_iter_get_pages_alloc() *allocates* such an
array, fills that with what it finds and gives the allocated array to caller.

We are not allocating any struct page instances in either of those.
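A toy userspace model of the contrast Al is pointing out. The names and signatures here are invented for illustration and deliberately simplified; the real kernel iov_iter functions differ. What matters is that neither variant creates any page objects, they only hand back pointers to existing ones:

```c
#include <stdlib.h>

struct page { int id; };   /* stand-in: the pages already exist elsewhere */

/* iov_iter_get_pages()-style: the caller owns the pointer array. */
static size_t fill_pages(struct page **pages, struct page *found, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pages[i] = &found[i];      /* point at existing pages, no copies */
    return n;
}

/* _alloc-style: only the *array of pointers* is allocated here, and
 * ownership of that array transfers to the caller, who must free it. */
static size_t fill_pages_alloc(struct page ***pagesp, struct page *found,
                               size_t n)
{
    struct page **pages = malloc(n * sizeof(*pages));

    if (!pages)
        return 0;
    for (size_t i = 0; i < n; i++)
        pages[i] = &found[i];
    *pagesp = pages;
    return n;
}
```

Ingo's confusion was reading "pages array is allocated" as allocating page descriptors; in both variants the only allocation is (at most) the array of pointers.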
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote: same as iov_iter_get_pages(), except that pages array is allocated (kmalloc if possible, vmalloc if that fails) and left for caller to free. Lustre and NFS -direct_IO() switched to it. Signed-off-by: Al Viro v...@zeniv.linux.org.uk So if code does iov_iter_get_pages_alloc() on a user address that has a real struct page behind it - and some other code does a regular get_user_pages() on it, we'll have two sets of struct page descriptors, the 'real' one, and a fake allocated one, right? Huh? iov_iter_get_pages() is given an array of pointers to struct page, which it fills with what it finds. iov_iter_get_pages_alloc() *allocates* such an array, fills that with what it finds and gives the allocated array to caller. We are not allocating any struct page instances in either of those. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Al Viro v...@zeniv.linux.org.uk wrote: On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote: So if code does iov_iter_get_pages_alloc() on a user address that has a real struct page behind it - and some other code does a regular get_user_pages() on it, we'll have two sets of struct page descriptors, the 'real' one, and a fake allocated one, right? Huh? iov_iter_get_pages() is given an array of pointers to struct page, which it fills with what it finds. iov_iter_get_pages_alloc() *allocates* such an array, fills that with what it finds and gives the allocated array to caller. We are not allocating any struct page instances in either of those. Ah, stupid me - thanks for the explanation! Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Fri, May 08, 2015 at 11:26:01AM +0200, Ingo Molnar wrote: * Al Viro v...@zeniv.linux.org.uk wrote: On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote: So if code does iov_iter_get_pages_alloc() on a user address that has a real struct page behind it - and some other code does a regular get_user_pages() on it, we'll have two sets of struct page descriptors, the 'real' one, and a fake allocated one, right? Huh? iov_iter_get_pages() is given an array of pointers to struct page, which it fills with what it finds. iov_iter_get_pages_alloc() *allocates* such an array, fills that with what it finds and gives the allocated array to caller. We are not allocating any struct page instances in either of those. Ah, stupid me - thanks for the explanation! My fault, actually - this pages array should've been either 'pages' array or array of pointers to struct page. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On 05/08/2015 10:05 AM, Ingo Molnar wrote: * Rik van Riel r...@redhat.com wrote: Memory trends point in one direction, file size trends in another. For persistent memory, we would not need 4kB page struct pages unless memory from a particular area was in small files AND those files were being actively accessed. [...] Average file size on my system's /usr is 12.5K: triton:/usr ( echo -n $(echo $(find . -type f -printf %s\n) | sed 's/ /+/g' | bc); echo -n /; find . -type f -printf %s\n | wc -l; ) | bc 12502 [...] Large files (mapped in 2MB chunks) or inactive small files would not need the 4kB page structs around. ... they are the utter uncommon case. 4K is here to stay, and for a very long time - until humans use computers I suspect. There's a bit of an 80/20 thing going on, though. The average file size may be small, but most data is used by large files. Additionally, a 2MB pmem area that has no small files on it that are currently open will also not need 4kB page structs. A system with 2TB of pmem might still only have a few thousand small files open at any point in time. The rest of the memory is either in large files, or in small files that have not been opened recently. We can reclaim the struct pages of 4kB pages that are not currently in use. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Rik van Riel r...@redhat.com wrote: The disadvantage is pretty obvious too: 4kB pages would no longer be the fast case, with an indirection. I do not know how much of an issue that would be, or whether it even makes sense for 4kB pages to continue being the fast case going forward. I strongly disagree that 4kB does not matter as much: it is _the_ bread and butter of 99% of Linux usecases. 4kB isn't going away anytime soon - THP might look nice in benchmarks, but it does not matter nearly as much in practice and for filesystems and IO it's absolutely crazy to think about 2MB granularity. Having said that, I don't think a single jump of indirection is a big issue - except for the present case where all the pmem IO space is mapped non-cacheable. Write-through caching patches are in the works though, and that should make it plenty fast. Memory trends point in one direction, file size trends in another. For persistent memory, we would not need 4kB page struct pages unless memory from a particular area was in small files AND those files were being actively accessed. [...] Average file size on my system's /usr is 12.5K: triton:/usr ( echo -n $(echo $(find . -type f -printf %s\n) | sed 's/ /+/g' | bc); echo -n /; find . -type f -printf %s\n | wc -l; ) | bc 12502 [...] Large files (mapped in 2MB chunks) or inactive small files would not need the 4kB page structs around. ... they are the utter uncommon case. 4K is here to stay, and for a very long time - until humans use computers I suspect. But I don't think the 2MB metadata chunking is wrong per se. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
Ingo == Ingo Molnar mi...@kernel.org writes: Ingo * Rik van Riel r...@redhat.com wrote: The disadvantage is pretty obvious too: 4kB pages would no longer be the fast case, with an indirection. I do not know how much of an issue that would be, or whether it even makes sense for 4kB pages to continue being the fast case going forward. Ingo I strongly disagree that 4kB does not matter as much: it is _the_ Ingo bread and butter of 99% of Linux usecases. 4kB isn't going away Ingo anytime soon - THP might look nice in benchmarks, but it does not Ingo matter nearly as much in practice and for filesystems and IO it's Ingo absolutely crazy to think about 2MB granularity. Ingo Having said that, I don't think a single jump of indirection is a big Ingo issue - except for the present case where all the pmem IO space is Ingo mapped non-cacheable. Write-through caching patches are in the works Ingo though, and that should make it plenty fast. Memory trends point in one direction, file size trends in another. For persistent memory, we would not need 4kB page struct pages unless memory from a particular area was in small files AND those files were being actively accessed. [...] Ingo Average file size on my system's /usr is 12.5K: Ingo triton:/usr ( echo -n $(echo $(find . -type f -printf '%s\n') | Ingo sed 's/ /+/g' | bc); echo -n /; find . -type f -printf '%s\n' Ingo | wc -l; ) | bc 12502 Now go and look at your /home or /data/ or /work areas, where the endusers are actually keeping their day to day work. Photos, mp3, design files, source code, object code littered around, etc. Now I also have 12Tb filesystems with 30+ million files in them, which just *suck* for backup, esp incrementals. I have one monster with 85+ million files (time to get beat on users again ...) which needs to be pruned. So I'm not arguing against you, I'm just saying you need better, more representative numbers across more day to day work.
Running this exact same command against my home directory gets: 528989 So I'm not arguing one way or another... just providing numbers.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On 05/07/2015 03:11 PM, Ingo Molnar wrote: Stable, global page-struct descriptors are a given for real RAM, where we allocate a struct page for every page in nice, large, mostly linear arrays. We'd really need that for pmem too, to get the full power of struct page: and that means allocating them in nice, large, predictable places - such as on the device itself ... It might even be 'scattered' across the device, with 64 byte struct page size we can pack 64 descriptors into a single page, so every 65 pages we could have a page-struct page. Finding a pmem page's struct page would thus involve rounding it modulo 65 and reading that page. The problem with that is fourfold: - that we now turn a very kernel internal API and data structure into an ABI. If struct page grows beyond 64 bytes it's a problem. - on bootup (or device discovery time) we'd have to initialize all the page structs. We could probably do this in a hierarchical way, by dividing continuous pmem ranges into power-of-two groups of blocks, and organizing them like the buddy allocator does. - 1.5% of storage space lost. - will wear-leveling properly migrate these 'hot' pages around? MST and I have been doing some thinking about how to address some of the issues above. One way could be to invert the PG_compound logic we have today, by allocating one struct page for every PMD / THP sized area (2MB on x86), and dynamically allocating struct pages for the 4kB pages inside only if the area gets split. They can be freed again when the area is not being accessed in 4kB chunks. That way we would always look at the struct page for the 2MB area first, and if the PG_split bit is set, we look at the array of dynamically allocated struct pages for this area. The advantages are obvious: boot time memory overhead and initialization time are reduced by a factor of 512. CPUs could also take a whole 2MB area in order to do CPU-local 4kB allocations, defragmentation policies may become a little clearer, etc...
The disadvantage is pretty obvious too: 4kB pages would no longer be the fast case, with an indirection. I do not know how much of an issue that would be, or whether it even makes sense for 4kB pages to continue being the fast case going forward. Memory trends point in one direction, file size trends in another. For persistent memory, we would not need 4kB page struct pages unless memory from a particular area was in small files AND those files were being actively accessed. Large files (mapped in 2MB chunks) or inactive small files would not need the 4kB page structs around. -- All rights reversed
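The inverted PG_compound scheme Rik describes can be modeled in a few lines of Python; names like `AreaDescriptor`, `SubPage`, and the `pg_split` flag are illustrative stand-ins, not kernel structures:

```python
PAGES_PER_PMD = 512  # 4 kB pages in one 2 MB (PMD-sized) area on x86

class AreaDescriptor:
    """One descriptor per 2 MB area; 4 kB descriptors exist only if split."""
    def __init__(self):
        self.pg_split = False   # the proposed PG_split bit
        self.subpages = None    # lazily allocated 4 kB descriptors

class SubPage:
    """Stand-in for a dynamically allocated 4 kB struct page."""
    def __init__(self, index):
        self.index = index

def split_area(area):
    """Allocate the 4 kB descriptor array on first small-page use."""
    if not area.pg_split:
        area.subpages = [SubPage(i) for i in range(PAGES_PER_PMD)]
        area.pg_split = True

def lookup(areas, pfn):
    """Always look at the 2 MB descriptor first; follow the dynamically
    allocated array only when the split bit is set."""
    area = areas[pfn // PAGES_PER_PMD]
    if not area.pg_split:
        return area
    return area.subpages[pfn % PAGES_PER_PMD]
```

The indirection Rik worries about is visible in `lookup()`: the 4kB case now costs an extra test and pointer chase on every translation, which is exactly what makes it no longer the fast path.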
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Fri, May 08, 2015 at 08:54:06AM -0700, Linus Torvalds wrote: However, the big files in that list are almost immaterial from a caching standpoint. .git/objects/pack/* caching matters a lot, though...
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On 05/08/2015 11:54 AM, Linus Torvalds wrote: On Fri, May 8, 2015 at 7:40 AM, John Stoffel j...@stoffel.org wrote: Now go and look at your /home or /data/ or /work areas, where the endusers are actually keeping their day to day work. Photos, mp3, design files, source code, object code littered around, etc. However, the big files in that list are almost immaterial from a caching standpoint. The big files in your home directory? Let me make an educated guess. Very few to *none* of them are actually in your page cache right now. And you'd never even care if they ever made it into your page cache *at*all*. Much less whether you could ever cache them using large pages using some very fancy cache. However, for persistent memory, all of the files will be in memory. Not instantiating the 4kB struct pages for 2MB areas that are not currently being accessed with small files may make a difference. For dynamically allocated 4kB page structs, we need some way to discover where they are. It may make sense, from a simplicity point of view, to have one mechanism that works both for pmem and for normal system memory. I agree that 4kB granularity needs to continue to work pretty much forever, though. As long as people continue creating text files, they will just not be very large. -- All rights reversed
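The "may make a difference" claim is easy to put numbers on. A back-of-the-envelope calculation for the 2TB pmem system mentioned earlier, assuming the 64-byte struct page size discussed elsewhere in the thread:

```python
STRUCT_PAGE = 64           # assumed struct page size in bytes
PAGE_4K = 4 << 10          # 4 kB page
AREA_2M = 2 << 20          # 2 MB (PMD-sized) area
PMEM = 2 << 40             # the 2 TB pmem system from the example

def descriptor_bytes(mem_bytes, granularity):
    """Metadata cost of keeping one descriptor per `granularity` of memory."""
    return (mem_bytes // granularity) * STRUCT_PAGE

per_4k_page = descriptor_bytes(PMEM, PAGE_4K)   # descriptors for every 4 kB page
per_2m_area = descriptor_bytes(PMEM, AREA_2M)   # descriptors per 2 MB area only
```

This works out to 32 GiB of struct pages for full 4kB coverage versus 64 MiB for per-2MB descriptors: the factor-of-512 reduction in boot-time overhead cited earlier in the thread.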
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Fri, May 8, 2015 at 7:40 AM, John Stoffel j...@stoffel.org wrote: Now go and look at your /home or /data/ or /work areas, where the endusers are actually keeping their day to day work. Photos, mp3, design files, source code, object code littered around, etc. However, the big files in that list are almost immaterial from a caching standpoint. Caching source code is a big deal - just try not doing it and you'll figure it out. And the kernel C source files used to have a median size around 4k. The big files in your home directory? Let me make an educated guess. Very few to *none* of them are actually in your page cache right now. And you'd never even care if they ever made it into your page cache *at*all*. Much less whether you could ever cache them using large pages using some very fancy cache. There are big files that care about caches, but they tend to be binaries, and for other reasons (things like randomization) you would never want to use largepages for those anyway. So from a page cache standpoint, I think the 4kB size still matters. A *lot*. largepages are a complete red herring, and will continue to be so pretty much forever (anonymous largepages perhaps less so). Linus
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
Linus == Linus Torvalds torva...@linux-foundation.org writes: Linus On Fri, May 8, 2015 at 7:40 AM, John Stoffel j...@stoffel.org wrote: Now go and look at your /home or /data/ or /work areas, where the endusers are actually keeping their day to day work. Photos, mp3, design files, source code, object code littered around, etc. Linus However, the big files in that list are almost immaterial from a Linus caching standpoint. Linus Caching source code is a big deal - just try not doing it and Linus you'll figure it out. And the kernel C source files used to Linus have a median size around 4k. Caching any files is a big deal, and if I'm doing batch edits of large jpegs, won't they get cached as well? Linus The big files in your home directory? Let me make an educated Linus guess. Very few to *none* of them are actually in your page Linus cache right now. And you'd never even care if they ever made Linus it into your page cache *at*all*. Much less whether you could Linus ever cache them using large pages using some very fancy cache. Hmm... probably not honestly, since I'm not at home and not using the system actively right now. But I can see situations where being able to mix different page sizes efficiently might be a good thing. Linus There are big files that care about caches, but they tend to be Linus binaries, and for other reasons (things like randomization) you Linus would never want to use largepages for those anyway. Or large design files, like my users at $WORK use, which can be 4Gb in size for a large design, which is ASIC chip layout work. So I'm a little bit in the minority there. And yes I do have other users with millions of itty bitty files as well. Linus So from a page cache standpoint, I think the 4kB size still Linus matters. A *lot*. largepages are a complete red herring, and Linus will continue to be so pretty much forever (anonymous Linus largepages perhaps less so). 
I think in the future, being able to efficiently mix page sizes will become useful, if only to lower the memory overhead of keeping track of large numbers of pages. John
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On 05/08/2015 09:14 PM, Linus Torvalds wrote: On Fri, May 8, 2015 at 9:59 AM, Rik van Riel r...@redhat.com wrote: However, for persistent memory, all of the files will be in memory. Yes. However, I doubt you will find a very sane rw filesystem that then also makes them contiguous and aligns them at 2MB boundaries. Anything is possible, I guess, but things like that are *hard*. The fragmentation issues etc cause it to a really challenging thing. The TLB performance bonus of accessing the large files with large pages may make it worthwhile to solve that hard problem. And if they aren't aligned big contiguous allocations, then they aren't relevant from any largepage cases. You'll still have to map them 4k at a time etc. Absolutely, but we only need the 4k struct pages when the files are mapped. I suspect a lot of the files will just sit around idle, without being used. I am not convinced that the idea I wrote down earlier in this thread is worthwhile now, but it may turn out to be at some point in the future. It all depends on how much data people store on DAX filesystems, and how many files they have open at once. -- All rights reversed
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Fri, May 8, 2015 at 8:02 PM, Rik van Riel r...@redhat.com wrote: The TLB performance bonus of accessing the large files with large pages may make it worthwhile to solve that hard problem. Very few people can actually measure that TLB advantage on systems with good TLB's. It's largely a myth, fed by some truly crappy TLB fill systems (particularly sw-filled TLB's on some early RISC CPU's, but even modern CPU's sometimes have glass jaws here because they can't prefetch TLB entries or do concurrent page table walks etc). There are *very* few loads that actually have the kinds of access patterns where TLB accesses dominate - or are even noticeable - compared to the normal memory access costs. That is doubly true with file-backed storage. The main reason you get TLB costs to be noticeable is with very sparse access patterns, where you hit as many TLB entries as you hit pages. That simply doesn't happen with file mappings. Really. The whole thing about TLB advantages of hugepages is this almost entirely made-up stupid myth. You almost have to make up the benchmark for it (_that_ part is easy) to even see it. Linus
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Fri, May 8, 2015 at 9:59 AM, Rik van Riel r...@redhat.com wrote: However, for persistent memory, all of the files will be in memory. Yes. However, I doubt you will find a very sane rw filesystem that then also makes them contiguous and aligns them at 2MB boundaries. Anything is possible, I guess, but things like that are *hard*. The fragmentation issues etc cause it to a really challenging thing. And if they aren't aligned big contiguous allocations, then they aren't relevant from any largepage cases. You'll still have to map them 4k at a time etc. Linus
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
Al, I was wondering about the struct page rules of iov_iter_get_pages_alloc(), used in various places. There's no documentation whatsoever in lib/iov_iter.c, nor in include/linux/uio.h, and the changelog that introduced it only says: commit 91f79c43d1b54d7154b118860d81b39bad07dfff Author: Al Viro Date: Fri Mar 21 04:58:33 2014 -0400 new helper: iov_iter_get_pages_alloc() same as iov_iter_get_pages(), except that pages array is allocated (kmalloc if possible, vmalloc if that fails) and left for caller to free. Lustre and NFS ->direct_IO() switched to it. Signed-off-by: Al Viro So if code does iov_iter_get_pages_alloc() on a user address that has a real struct page behind it - and some other code does a regular get_user_pages() on it, we'll have two sets of struct page descriptors, the 'real' one, and a fake allocated one, right? How does that work? Nobody else can ever discover these fake page structs, so they don't really serve any 'real' synchronization purpose other than the limited role of IO completion. Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 07, 2015 at 09:53:13PM +0200, Ingo Molnar wrote: > > * Ingo Molnar wrote: > > > > Is handling kernel pagefault on the vmemmap completely out of the > > > picture? So we would carve out a chunk of kernel address space > > > for those pfn and use it for vmemmap and handle pagefault on it. > > > > That's pretty clever. The page fault doesn't even have to do remote > > TLB shootdown, because it only establishes mappings - so it's pretty > > atomic, a bit like the minor vmalloc() area faults we are doing. > > > > Some sort of LRA (least recently allocated) scheme could unmap the > > area in chunks if it's beyond a certain size, to keep a limit on > > size. Done from the same context and would use remote TLB shootdown. > > > > The only limitation I can see is that such faults would have to be > > able to sleep, to do the allocation. So pfn_to_page() could not be > > used in arbitrary contexts. > > So another complication would be that we cannot just unmap such pages > when we want to recycle them, because the struct page in them might be > in use - so all struct page uses would have to refcount the underlying > page. We don't really do that today: code just looks up struct pages > and assumes they never go away. I still think this is doable. Like I said in another email, I think we should introduce a special pfn_to_page_dev|pmem|waffle|somethingyoulike() for places that are allowed to allocate the underlying struct page. For instance we can use a default page to back up all this special vmem range with some specially crafted struct page that says it is invalid memory (make this mapping read only so all writes to this special struct page are forbidden). Now once an authorized user comes along and needs a real struct page it triggers a page allocation that replaces the page full of fake invalid struct pages with a page of correct valid struct pages that can be manipulated by other parts of the kernel. 
So a regular pfn_to_page() would test against the special vmemmap and, if special, test the content of the struct page for some flag. If it's the invalid page flag it returns 0. But once a proper struct page is allocated then pfn_to_page() would return the struct page as expected. That way you will catch all invalid users of such a page, i.e. users that use the page after its lifetime is done. You will also limit the creation of the underlying proper struct page to only code that is legitimately allowed to ask for a proper struct page for a given pfn. Also you would get kernel write faults on the page full of fake struct pages and that would allow catching further wrong use. Anyway this is how I envision this and I think it would work for my usecase too (GPU it is for me :)) Cheers, Jérôme
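The two-lookup scheme Jérôme describes can be modeled compactly. This is a sketch only: the dict stands in for the faulted vmemmap range, the `valid` flag for the invalid-page marker, and `pfn_to_page_dev` for the hypothetical privileged helper:

```python
class Page:
    """Stand-in for a struct page descriptor."""
    def __init__(self, pfn, valid):
        self.pfn = pfn
        self.valid = valid

INVALID_PAGE = Page(None, valid=False)  # models the read-only template page
vmemmap = {}                            # pfn -> real descriptor, once allocated

def pfn_to_page(pfn):
    """Unprivileged lookup: returns None (the '0' in the text) while the
    pfn is only backed by the fake invalid descriptor, and the real
    descriptor once one has been allocated."""
    page = vmemmap.get(pfn, INVALID_PAGE)
    return page if page.valid else None

def pfn_to_page_dev(pfn):
    """Privileged variant: allowed to allocate the real struct page,
    replacing the fake invalid one on first use."""
    page = vmemmap.get(pfn, INVALID_PAGE)
    if not page.valid:
        page = Page(pfn, valid=True)
        vmemmap[pfn] = page
    return page
```

The useful property is visible in the model: unprivileged lookups of a never-allocated pfn fail loudly instead of returning a stale descriptor, which is how use-after-lifetime bugs get caught.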
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 10:43 AM, Linus Torvalds wrote: > On Thu, May 7, 2015 at 9:03 AM, Dan Williams wrote: >> >> Ok, I'll keep thinking about this and come back when we have a better >> story about passing mmap'd persistent memory around in userspace. > > Ok. And if we do decide to go with your kind of "__pfn" type, I'd > probably prefer that we encode the type in the low bits of the word > rather than compare against PAGE_OFFSET. On some architectures > PAGE_OFFSET is zero (admittedly probably not ones you'd care about), > but even on x86 it's a *lot* cheaper to test the low bit than it is to > compare against a big constant. > > We know "struct page *" is supposed to be at least aligned to at least > "unsigned long", so you'd have two bits of type information (and we > could easily make it three). With "0" being a real pointer, so that > you can use the pointer itself without masking. > > And the "hide type in low bits of pointer" is something we've done > quite a lot, so it's more "kernel coding style" anyway. Ok. Although __pfn_t also stores pfn values directly, which will consume those 2 bits, so we'll need to shift pfns up when storing.
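The low-bit encoding Linus describes, with Dan's caveat about shifting raw pfns up, can be sketched as follows; the tag values and helper names are made up for illustration, not the actual __pfn_t patch:

```python
TYPE_MASK = 0x3      # two low bits of type information
TYPE_PAGE = 0x0      # 0: the word is a struct page pointer, usable as-is
TYPE_PFN = 0x1       # hypothetical tag: the word carries a raw pfn

def pfn_t_from_page(page_ptr):
    """Pointers are word-aligned, so the low bits are free for the tag;
    with tag 0 the value needs no masking before use."""
    assert page_ptr & TYPE_MASK == 0
    return page_ptr | TYPE_PAGE

def pfn_t_from_pfn(pfn):
    """Raw pfns must be shifted up to make room for the tag bits,
    which is Dan's point about consuming those 2 bits."""
    return (pfn << 2) | TYPE_PFN

def is_page(v):
    """Cheap low-bit test, instead of a compare against PAGE_OFFSET."""
    return (v & TYPE_MASK) == TYPE_PAGE

def to_pfn(v):
    assert (v & TYPE_MASK) == TYPE_PFN
    return v >> 2
```

The cost trade-off Linus mentions is that `is_page()` compiles to a single test of the low bits, versus loading and comparing against a large PAGE_OFFSET constant.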
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Ingo Molnar wrote: > > Is handling kernel pagefault on the vmemmap completely out of the > > picture? So we would carve out a chunk of kernel address space > > for those pfn and use it for vmemmap and handle pagefault on it. > > That's pretty clever. The page fault doesn't even have to do remote > TLB shootdown, because it only establishes mappings - so it's pretty > atomic, a bit like the minor vmalloc() area faults we are doing. > > Some sort of LRA (least recently allocated) scheme could unmap the > area in chunks if it's beyond a certain size, to keep a limit on > size. Done from the same context and would use remote TLB shootdown. > > The only limitation I can see is that such faults would have to be > able to sleep, to do the allocation. So pfn_to_page() could not be > used in arbitrary contexts. So another complication would be that we cannot just unmap such pages when we want to recycle them, because the struct page in them might be in use - so all struct page uses would have to refcount the underlying page. We don't really do that today: code just looks up struct pages and assumes they never go away. Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Jerome Glisse wrote: > > So I think the main value of struct page is if everyone on the > > system sees the same struct page for the same pfn - not just the > > temporary IO instance. > > > > The idea of having very temporary struct page arrays misses the > > point I think: if struct page is used as essentially an IO sglist > > then most of the synchronization properties are lost: then we > > might as well use the real deal in that case and skip the dynamic > > allocation and use pfns directly and avoid the dynamic allocation > > overhead. > > > > Stable, global page-struct descriptors are a given for real RAM, > > where we allocate a struct page for every page in nice, large, > > mostly linear arrays. > > > > We'd really need that for pmem too, to get the full power of > > struct page: and that means allocating them in nice, large, > > predictable places - such as on the device itself ... > > Is handling kernel pagefault on the vmemmap completely out of the > picture? So we would carve out a chunk of kernel address space for > those pfn and use it for vmemmap and handle pagefault on it. That's pretty clever. The page fault doesn't even have to do remote TLB shootdown, because it only establishes mappings - so it's pretty atomic, a bit like the minor vmalloc() area faults we are doing. Some sort of LRA (least recently allocated) scheme could unmap the area in chunks if it's beyond a certain size, to keep a limit on size. Done from the same context and would use remote TLB shootdown. The only limitation I can see is that such faults would have to be able to sleep, to do the allocation. So pfn_to_page() could not be used in arbitrary contexts. > Again here I think that GPU folks would like a solution where they > can have a page struct but it would not be PMEM just device memory. > So if we can come up with something generic enough to serve both > purposes that would be better in my view. Yes. 
Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 11:40 AM, Ingo Molnar wrote: > > * Dan Williams wrote: > >> On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig wrote: >> > On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote: >> >> What is the primary thing that is driving this need? Do we have a very >> >> concrete example? >> > >> > FYI, I plan to implement RAID acceleration using nvdimms, and I >> > plan to use pages for that. The code just merged for 4.1 can easily >> > support page backing, and I plan to use that for now. This still >> > leaves support for the gigantic intel nvdimms discovered over EFI >> > out, but given that I don't have access to them, and I don't know >> > of any publicly available there's little I can do for now. But >> > adding on-demand allocated struct pages for them seems like the >> > easiest way forward. Boaz already has code to allocate pages for >> > them, although not on demand but at boot / plug in time. >> >> Hmmm, the capacities of persistent memory that would be assigned for >> a raid accelerator would be limited by diminishing returns. I.e. >> there seems to be no point to assign more than 8GB or so to the >> cache? [...] > > Why would that be the case? > > If it's not a temporary cache but a persistent cache that hosts all > the data even after writeback completes then going to huge sizes will > bring similar benefits to using a large, fast SSD disk on your > desktop... The larger, the better. And it also persists across > reboots. True, that's more "dm-cache" than "RAID accelerator", but point taken.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 07, 2015 at 09:11:07PM +0200, Ingo Molnar wrote: > > * Dave Hansen wrote: > > > On 05/07/2015 10:42 AM, Dan Williams wrote: > > > On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar wrote: > > >> * Dan Williams wrote: > > >> > > >> So is there anything fundamentally wrong about creating struct > > >> page backing at mmap() time (and making sure aliased mmaps share > > >> struct page arrays)? > > > > > > Something like "get_user_pages() triggers memory hotplug for > > > persistent memory", so they are actual real struct pages? Can we > > > do memory hotplug at that granularity? > > > > We've traditionally limited them to SECTION_SIZE granularity, which > > is 128MB IIRC. There are also assumptions in places that you can do > > page++ within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE. > > I really don't think that's very practical: memory hotplug is slow, > it's really not on the same abstraction level as mmap(), and the zone > data structures are also fundamentally very coarse: not just because > RAM ranges are huge, but also so that the pfn->page transformation > stays relatively simple and fast. > > > But, in all practicality, a lot of those places are in code like the > > buddy allocator. If your PTEs all have _PAGE_SPECIAL set and we're > > not ever expecting these fake 'struct page's to hit these code > > paths, it probably doesn't matter. > > > > You can probably get away with just allocating PAGE_SIZE worth of > > 'struct page' (which is 64) and mapping it in to vmemmap[]. The > > worst case is that you'll eat 1 page of space for each outstanding > > page of I/O. That's a lot better than 2MB of temporary 'struct > > page' space per page of I/O that it would take with a traditional > > hotplug operation. > > So I think the main value of struct page is if everyone on the system > sees the same struct page for the same pfn - not just the temporary IO > instance. 
> > The idea of having very temporary struct page arrays misses the point > I think: if struct page is used as essentially an IO sglist then most > of the synchronization properties are lost: then we might as well use > the real deal in that case and skip the dynamic allocation and use > pfns directly and avoid the dynamic allocation overhead. > > Stable, global page-struct descriptors are a given for real RAM, where > we allocate a struct page for every page in nice, large, mostly linear > arrays. > > We'd really need that for pmem too, to get the full power of struct > page: and that means allocating them in nice, large, predictable > places - such as on the device itself ... Is handling kernel pagefault on the vmemmap completely out of the picture? So we would carve out a chunk of kernel address space for those pfn and use it for vmemmap and handle pagefault on it. Again here I think that GPU folks would like a solution where they can have a page struct but it would not be PMEM just device memory. So if we can come up with something generic enough to serve both purposes that would be better in my view. Cheers, Jérôme
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Dave Hansen wrote: > On 05/07/2015 10:42 AM, Dan Williams wrote: > > On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar wrote: > >> * Dan Williams wrote: > >> > >> So is there anything fundamentally wrong about creating struct > >> page backing at mmap() time (and making sure aliased mmaps share > >> struct page arrays)? > > > > Something like "get_user_pages() triggers memory hotplug for > > persistent memory", so they are actual real struct pages? Can we > > do memory hotplug at that granularity? > > We've traditionally limited them to SECTION_SIZE granularity, which > is 128MB IIRC. There are also assumptions in places that you can do > page++ within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE. I really don't think that's very practical: memory hotplug is slow, it's really not on the same abstraction level as mmap(), and the zone data structures are also fundamentally very coarse: not just because RAM ranges are huge, but also so that the pfn->page transformation stays relatively simple and fast. > But, in all practicality, a lot of those places are in code like the > buddy allocator. If your PTEs all have _PAGE_SPECIAL set and we're > not ever expecting these fake 'struct page's to hit these code > paths, it probably doesn't matter. > > You can probably get away with just allocating PAGE_SIZE worth of > 'struct page' (which is 64) and mapping it in to vmemmap[]. The > worst case is that you'll eat 1 page of space for each outstanding > page of I/O. That's a lot better than 2MB of temporary 'struct > page' space per page of I/O that it would take with a traditional > hotplug operation. So I think the main value of struct page is if everyone on the system sees the same struct page for the same pfn - not just the temporary IO instance. 
The idea of having very temporary struct page arrays misses the point I think: if struct page is used as essentially an IO sglist then most of the synchronization properties are lost: then we might as well use the real deal in that case and skip the dynamic allocation and use pfns directly and avoid the dynamic allocation overhead. Stable, global page-struct descriptors are a given for real RAM, where we allocate a struct page for every page in nice, large, mostly linear arrays. We'd really need that for pmem too, to get the full power of struct page: and that means allocating them in nice, large, predictable places - such as on the device itself ... It might even be 'scattered' across the device, with 64 byte struct page size we can pack 64 descriptors into a single page, so every 65 pages we could have a page-struct page. Finding a pmem page's struct page would thus involve rounding it modulo 65 and reading that page. The problem with that is fourfold: - that we now turn a very kernel internal API and data structure into an ABI. If struct page grows beyond 64 bytes it's a problem. - on bootup (or device discovery time) we'd have to initialize all the page structs. We could probably do this in a hierarchical way, by dividing continuous pmem ranges into power-of-two groups of blocks, and organizing them like the buddy allocator does. - 1.5% of storage space lost. - will wear-leveling properly migrate these 'hot' pages around? The alternative would be some global interval-rbtree of struct page backed pmem ranges. Beyond the synchronization problems of such a data structure (which looks like a nightmare) I don't think it's even feasible: especially if there's a filesystem on the pmem device then the block allocations could be physically fragmented (and there's no fundamental reason why they couldn't be fragmented), so a continuous mmap() of a file on it will yield wildly fragmented device-pfn ranges, exploding the rbtree. 
Think 1 million node interval-rbtree with an average depth of 20: cachemiss country for even simple lookups - not to mention the freeing/recycling complexity of unused struct pages to not allow it to grow too large. I might be wrong though about all this :) Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Dan Williams wrote: > On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig wrote: > > On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote: > >> What is the primary thing that is driving this need? Do we have a very > >> concrete example? > > > > FYI, I plan to implement RAID acceleration using nvdimms, and I > > plan to use pages for that. The code just merged for 4.1 can easily > > support page backing, and I plan to use that for now. This still > > leaves support for the gigantic intel nvdimms discovered over EFI > > out, but given that I don't have access to them, and I don't know > > of any publicly available, there's little I can do for now. But > > adding on-demand allocated struct pages for them seems like the > > easiest way forward. Boaz already has code to allocate pages for > > them, although not on demand but at boot / plug-in time. > > Hmmm, the capacities of persistent memory that would be assigned for > a raid accelerator would be limited by diminishing returns. I.e. > there seems to be no point to assign more than 8GB or so to the > cache? [...] Why would that be the case? If it's not a temporary cache but a persistent cache that hosts all the data even after writeback completes then going to huge sizes will bring similar benefits to using a large, fast SSD disk on your desktop... The larger, the better. And it also persists across reboots. It could also host the RAID write-intent bitmap (the dirty stripes/chunks bitmap) for extra speedups. (This bitmap is pretty small, but important to speed up resyncs after crashes or power loss.) Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On 05/07/2015 10:42 AM, Dan Williams wrote: > On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar wrote: >> * Dan Williams wrote: >> So is there anything fundamentally wrong about creating struct page >> backing at mmap() time (and making sure aliased mmaps share struct >> page arrays)? > > Something like "get_user_pages() triggers memory hotplug for > persistent memory", so they are actual real struct pages? Can we do > memory hotplug at that granularity? We've traditionally limited them to SECTION_SIZE granularity, which is 128MB IIRC. There are also assumptions in places that you can do page++ within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE. But, in all practicality, a lot of those places are in code like the buddy allocator. If your PTEs all have _PAGE_SPECIAL set and we're not ever expecting these fake 'struct page's to hit these code paths, it probably doesn't matter. You can probably get away with just allocating PAGE_SIZE worth of 'struct page' (which is 64 of them) and mapping it in to vmemmap[]. The worst case is that you'll eat 1 page of space for each outstanding page of I/O. That's a lot better than 2MB of temporary 'struct page' space per page of I/O that it would take with a traditional hotplug operation.
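The 2MB figure above falls straight out of the arithmetic: a 128MB section contains 32768 4KB pages, each needing a 64-byte struct page. A quick sketch of the comparison, using the constants as stated in the mail (names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define MB              (1024ull * 1024)
#define PAGE_SIZE_B     4096ull
#define STRUCT_PAGE_B   64ull           /* per-page descriptor, as stated */
#define SECTION_SIZE_B  (128 * MB)      /* traditional hotplug granularity */

/* struct page space needed to hotplug one full section. */
static uint64_t section_struct_page_bytes(void)
{
        return SECTION_SIZE_B / PAGE_SIZE_B * STRUCT_PAGE_B;
}

/* Worst case for the on-demand scheme: one whole page of descriptors
 * per outstanding page of I/O. */
static uint64_t on_demand_worst_case_bytes(uint64_t outstanding_io_pages)
{
        return outstanding_io_pages * PAGE_SIZE_B;
}
```

So a single outstanding I/O page costs 4KB of descriptor space in the on-demand scheme versus 2MB of section metadata for a full hotplug.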
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Dan Williams wrote: > > That looks like a layering violation and a mistake to me. If we > > want to do direct (sector_t -> sector_t) IO, with no serialization > > worries, it should have its own (simple) API - which things like > > hierarchical RAID or RDMA APIs could use. > > I'm wrapped around the idea that __pfn_t *is* that simple api for > the tiered storage driver use case. [...] I agree. (see my previous mail) > [...] For RDMA I think we need struct page because I assume that > would be coordinated through a filesystem and truncate() is back in > play. So I don't think RDMA is necessarily special, it's just a weirdly programmed DMA request: - If it is used internally by an exclusively managed complex storage driver, then it can use low level block APIs and pfn_t. - If RDMA is exposed all the way to user-space (do we have such APIs?), allowing users to initiate RDMA IO into user buffers, then (the user visible) buffer needs struct page backing. (which in turn will then at some lower level convert to pfns.) That's true for both regular RAM pages and mmap()-ed persistent RAM pages as well. Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 9:03 AM, Dan Williams wrote: > > Ok, I'll keep thinking about this and come back when we have a better > story about passing mmap'd persistent memory around in userspace. Ok. And if we do decide to go with your kind of "__pfn" type, I'd probably prefer that we encode the type in the low bits of the word rather than compare against PAGE_OFFSET. On some architectures PAGE_OFFSET is zero (admittedly probably not ones you'd care about), but even on x86 it's a *lot* cheaper to test the low bit than it is to compare against a big constant. We know "struct page *" is supposed to be aligned to at least "unsigned long", so you'd have two bits of type information (and we could easily make it three). With "0" being a real pointer, so that you can use the pointer itself without masking. And the "hide type in low bits of pointer" is something we've done quite a lot, so it's more "kernel coding style" anyway. Linus
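The low-bit tagging Linus describes is the usual kernel trick: pointer-aligned values leave their low two bits free for type information, with "0" meaning a plain pointer so it can be used unmasked. A minimal userspace sketch, assuming a hypothetical two-bit encoding (the eventual in-tree __pfn_t layout was settled later and differs):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-bit type tag in the low bits of the word.
 * PFN_T_PAGE == 0 so a real struct page * needs no masking at all. */
enum pfn_t_type {
        PFN_T_PAGE = 0,   /* plain struct page *, at least long-aligned */
        PFN_T_PFN  = 1,   /* raw pfn, no struct page backing */
};

typedef struct { unsigned long val; } __pfn_t;   /* illustrative layout */

static __pfn_t pfn_t_from_pfn(unsigned long pfn)
{
        __pfn_t p = { (pfn << 2) | PFN_T_PFN };  /* tag in the low bits */
        return p;
}

static int pfn_t_is_page(__pfn_t p)
{
        return (p.val & 3) == PFN_T_PAGE;        /* cheap low-bit test */
}

static unsigned long pfn_t_to_pfn(__pfn_t p)
{
        return p.val >> 2;
}
```

The point of the encoding is exactly what the mail says: a single `and`/`test` on the low bits beats comparing against a large constant like PAGE_OFFSET, and the zero tag keeps the common pointer case free.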
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar wrote: > > * Dan Williams wrote: > >> > Anyway, I did want to say that while I may not be convinced about >> > the approach, I think the patches themselves don't look horrible. >> > I actually like your "__pfn_t". So while I (very obviously) have >> > some doubts about this approach, it may be that the most >> > convincing argument is just in the code. >> >> Ok, I'll keep thinking about this and come back when we have a >> better story about passing mmap'd persistent memory around in >> userspace. > > So is there anything fundamentally wrong about creating struct page > backing at mmap() time (and making sure aliased mmaps share struct > page arrays)? Something like "get_user_pages() triggers memory hotplug for persistent memory", so they are actual real struct pages? Can we do memory hotplug at that granularity?
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Dan Williams wrote: > > Anyway, I did want to say that while I may not be convinced about > > the approach, I think the patches themselves don't look horrible. > > I actually like your "__pfn_t". So while I (very obviously) have > > some doubts about this approach, it may be that the most > > convincing argument is just in the code. > > Ok, I'll keep thinking about this and come back when we have a > better story about passing mmap'd persistent memory around in > userspace. So is there anything fundamentally wrong about creating struct page backing at mmap() time (and making sure aliased mmaps share struct page arrays)? Because if that is done, then the DMA agent won't even know about the memory being persistent RAM. It's just a regular struct page, that happens to point to persistent RAM. Same goes for all the high level VM APIs, futexes, etc. Everything will Just Work. It will also be relatively fast: mmap() is a relative slowpath, comparatively. As far as RAID is concerned: that's a relatively easy situation, as there's only a single user of the devices, the RAID context that manages all component devices exclusively. Device to device DMA can use the block layer directly, i.e. most of the patches you've got here in this series, except: [PATCH v2 09/10] dax: convert to __pfn_t I think DAX mmap()s need struct page backing. I think there's a simple rule: if a page is visible to user-space via the MMU then it needs struct page backing. If it's "hidden", like behind a RAID abstraction, it probably doesn't. With the remaining patches a high level RAID driver ought to be able to send pfn-to-sector and sector-to-pfn requests to other block drivers, without any unnecessary struct page allocation overhead, right? 
As long as the pfn concept remains a clever way to reuse our ram<->sector interfaces to implement sector<->sector IO, in the cases where the IO has no serialization or MMU concerns, not using struct page and using pfn_t looks natural. The moment it starts reaching user space APIs, like in the DAX case, and especially if it becomes user-MMU visible, it's a mistake to not have struct page backing, I think. (In that sense the current DAX mmap() code is already a partial mistake.) Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 07, 2015 at 06:18:07PM +0200, Christoph Hellwig wrote: > On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote: > > What is the primary thing that is driving this need? Do we have a very > > concrete example? > > FYI, I plan to implement RAID acceleration using nvdimms, and I plan to > use pages for that. The code just merged for 4.1 can easily support page > backing, and I plan to use that for now. This still leaves support > for the gigantic intel nvdimms discovered over EFI out, but given that > I don't have access to them, and I don't know of any publicly available, > there's little I can do for now. But adding on-demand allocated struct > pages for them seems like the easiest way forward. Boaz already has > code to allocate pages for them, although not on demand but at boot / plug-in > time. I think other folks might be interested here; I am ccing Paul. For GPUs we are facing a similar issue of trying to present GPU memory to the kernel in a coherent way (coherent from the design and Linux kernel concept POV). Dynamically allocated struct pages might effectively be a solution that could be shared between the persistent memory and GPU folks. We could even enforce things like VMEMMAP and have a special region carveout where we can dynamically map/unmap backing pages for ranges of device pfns. This would also let us catch people trying to access such pages: we could add a set of new helpers like get_page_dev()/put_page_dev() ... and only the _dev versions would work on this new kind of memory; regular get_page()/put_page() would throw an error. This should ensure that only legitimate users reference such pages. One issue is that we could run out of kernel address space with 48 bits, but if such monstrous computers ever see the light of day they might consider using CPUs with more bits. Another issue is that we might care about 32-bit platforms too, but that's solvable at a small cost. 
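A toy model of the proposed get_page_dev()/put_page_dev() split, where the regular helpers reject device-backed pages. Everything here - the struct, the fields, the return conventions - is hypothetical; the mail only names the helpers, not their semantics:

```c
#include <assert.h>

/* Toy page: a flag marking device (pmem/GPU) backing plus a refcount. */
struct toy_page {
        int is_dev;
        int refcount;
};

/* Regular helper: refuses device pages, as the proposal suggests. */
static int get_page(struct toy_page *p)
{
        if (p->is_dev)
                return -1;      /* would be a WARN/BUG in a real kernel */
        p->refcount++;
        return 0;
}

/* _dev helper: the only sanctioned way to reference device pages. */
static int get_page_dev(struct toy_page *p)
{
        if (!p->is_dev)
                return -1;
        p->refcount++;
        return 0;
}

static int put_page_dev(struct toy_page *p)
{
        if (!p->is_dev || p->refcount <= 0)
                return -1;
        p->refcount--;
        return 0;
}
```

The design point is the one the mail makes: by forking the API, accidental uses of device memory through generic paths fail loudly instead of corrupting refcounts.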
Cheers, Jérôme
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig wrote: > On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote: >> What is the primary thing that is driving this need? Do we have a very >> concrete example? > > FYI, I plan to implement RAID acceleration using nvdimms, and I plan to > use pages for that. The code just merged for 4.1 can easily support page > backing, and I plan to use that for now. This still leaves support > for the gigantic intel nvdimms discovered over EFI out, but given that > I don't have access to them, and I don't know of any publicly available, > there's little I can do for now. But adding on-demand allocated struct > pages for them seems like the easiest way forward. Boaz already has > code to allocate pages for them, although not on demand but at boot / plug-in > time. Hmmm, the capacities of persistent memory that would be assigned for a raid accelerator would be limited by diminishing returns. I.e. there seems to be no point to assign more than 8GB or so to the cache? If that's the case the capacity argument loses some teeth, just "blkdev_get(FMODE_EXCL) + memory_hotplug a small capacity" and be done.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote: > What is the primary thing that is driving this need? Do we have a very > concrete example? FYI, I plan to implement RAID acceleration using nvdimms, and I plan to use pages for that. The code just merged for 4.1 can easily support page backing, and I plan to use that for now. This still leaves support for the gigantic intel nvdimms discovered over EFI out, but given that I don't have access to them, and I don't know of any publicly available, there's little I can do for now. But adding on-demand allocated struct pages for them seems like the easiest way forward. Boaz already has code to allocate pages for them, although not on demand but at boot / plug-in time.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 8:58 AM, Linus Torvalds wrote: > On Thu, May 7, 2015 at 8:40 AM, Dan Williams wrote: >> >> blkdev_get(FMODE_EXCL) is the protection in this case. > > Ugh. That looks like a horrible nasty big hammer that will bite us > badly some day. Since you'd have to hold it for the whole IO. But I > guess it at least works. Oh no, that wouldn't be per-I/O; that would be permanent, at configuration setup time, just like a raid member device. Something like: mdadm --create /dev/md0 --cache=/dev/pmem0p1 --storage=/dev/sda > Anyway, I did want to say that while I may not be convinced about the > approach, I think the patches themselves don't look horrible. I > actually like your "__pfn_t". So while I (very obviously) have some > doubts about this approach, it may be that the most convincing > argument is just in the code. Ok, I'll keep thinking about this and come back when we have a better story about passing mmap'd persistent memory around in userspace.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 8:40 AM, Dan Williams wrote: > > blkdev_get(FMODE_EXCL) is the protection in this case. Ugh. That looks like a horrible nasty big hammer that will bite us badly some day. Since you'd have to hold it for the whole IO. But I guess it at least works. Anyway, I did want to say that while I may not be convinced about the approach, I think the patches themselves don't look horrible. I actually like your "__pfn_t". So while I (very obviously) have some doubts about this approach, it may be that the most convincing argument is just in the code. Linus
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 7:42 AM, Ingo Molnar wrote: > > * Ingo Molnar wrote: > >> [...] >> >> For anything more complex, that maps any of this storage to >> user-space, or exposes it to higher level struct page based APIs, >> etc., where references matter and it's more of a cache with >> potentially multiple users, not an IO space, the natural API is >> struct page. > > Let me walk back on this: > >> I'd say that this particular series mostly addresses the 'pfn as >> sector_t' side of the equation, where persistent memory is IO space, >> not memory space, and as such it is the more natural and thus also >> the cheaper/faster approach. > > ... but that does not appear to be the case: this series replaces a > 'struct page' interface with a pure pfn interface for the express > purpose of being able to DMA to/from 'memory areas' that are not > struct page backed. > >> Linus probably disagrees? :-) > > [ and he'd disagree rightfully ;-) ] > > So what this patch set tries to achieve is (sector_t -> sector_t) IO > between storage devices (i.e. a rare and somewhat weird usecase), and > does it by squeezing one device's storage address into our formerly > struct page backed descriptor, via a pfn. > > That looks like a layering violation and a mistake to me. If we want > to do direct (sector_t -> sector_t) IO, with no serialization worries, > it should have its own (simple) API - which things like hierarchical > RAID or RDMA APIs could use. I'm wrapped around the idea that __pfn_t *is* that simple api for the tiered storage driver use case. For RDMA I think we need struct page because I assume that would be coordinated through a filesystem and truncate() is back in play. What does an alternative API look like? 
> If what we want to do is to support say an mmap() of a file on > persistent storage, and then read() into that file from another device > via DMA, then I think we should have allocated struct page backing at > mmap() time already, and all regular syscall APIs would 'just work' > from that point on - far above what page-less, pfn-based APIs can do. > > The temporary struct page backing can then be freed at munmap() time. Yes, passing around mmap()'d (DAX) persistent memory will need more than a __pfn_t.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 8:00 AM, Linus Torvalds wrote: > On Wed, May 6, 2015 at 7:36 PM, Dan Williams wrote: >> >> My pet concrete example is covered by __pfn_t. Referencing persistent >> memory in an md/dm hierarchical storage configuration. Setting aside >> the thrash to get existing block users to do "bvec_set_page(page)" >> instead of "bvec->page = page" the onus is on that md/dm >> implementation and backing storage device driver to operate on >> __pfn_t. That use case is simple because there is no use of page >> locking or refcounting in that path, just dma_map_page() and >> kmap_atomic(). > > So clarify for me: are you trying to make the IO stack in general be > able to use the persistent memory as a source (or destination) for IO > to _other_ devices, or are you talking about just internally shuffling > things around for something like RAID on top of persistent memory? > > Because I think those are two very different things. Yes, they are, and I am referring to the former, persistent memory as a source/destination to other devices. > For example, one of the things I worry about is for people doing IO > from persistent memory directly to some "slow stable storage" (aka > disk). That was what I thought you were aiming for: infrastructure so > that you can make a bio for a *disk* device contain a page list that > is the persistent memory. > > And I think that is a very dangerous operation to do, because the > persistent memory itself is going to have some filesystem on it, so > anything that looks up the persistent memory pages is *not* going to > have a stable pfn: the pfn will point to a fixed part of the > persistent memory, but the file that was there may be deleted and the > memory reassigned to something else. Indeed, truncate() in the absence of struct page has been a major hurdle for persistent memory enabling. But it does not impact this specific md/dm use case. 
md/dm will have taken an exclusive claim on an entire pmem block device (or partition), so there will be no competing with a filesystem. > That's the kind of thing that "struct page" helps with for normal IO > devices. It's both a source of serialization and indirection, so that > when somebody does a "truncate()" on a file, we don't end up doing IO > to random stale locations on the disk that got reassigned to another > file. > > So "struct page" is very fundamental. It's *not* just a "this is the > physical source/drain of the data you are doing IO on". > > So if you are looking at some kind of "zero-copy IO", where you can do > IO from a filesystem on persistent storage to *another* filesystem on > (say, a big rotational disk used for long-term storage) by just doing > a bio that targets the disk, but has the persistent memory as the > source memory, I really want to understand how you are going to > serialize this. > > So *that* is what I meant by "What is the primary thing that is > driving this need? Do we have a very concrete example?" > > I absolutely do *not* want to teach the bio subsystem to just > randomly be able to take the source/destination of the IO as being > some random pfn without knowing what the actual uses are and how these > IO's are generated in the first place. blkdev_get(FMODE_EXCL) is the protection in this case. > I was assuming that you wanted to do something where you mmap() the > persistent memory, and then write it out to another device (possibly > using aio_write()). But that really does require some kind of > serialization at a higher level, because you can't just look up the > pfn's in the page table and assume they are stable: they are *not* > stable. We want to get there eventually, but this patchset does not address that case. 
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 6, 2015 at 7:36 PM, Dan Williams wrote: > > My pet concrete example is covered by __pfn_t. Referencing persistent > memory in an md/dm hierarchical storage configuration. Setting aside > the thrash to get existing block users to do "bvec_set_page(page)" > instead of "bvec->page = page" the onus is on that md/dm > implementation and backing storage device driver to operate on > __pfn_t. That use case is simple because there is no use of page > locking or refcounting in that path, just dma_map_page() and > kmap_atomic(). So clarify for me: are you trying to make the IO stack in general be able to use the persistent memory as a source (or destination) for IO to _other_ devices, or are you talking about just internally shuffling things around for something like RAID on top of persistent memory? Because I think those are two very different things. For example, one of the things I worry about is for people doing IO from persistent memory directly to some "slow stable storage" (aka disk). That was what I thought you were aiming for: infrastructure so that you can make a bio for a *disk* device contain a page list that is the persistent memory. And I think that is a very dangerous operation to do, because the persistent memory itself is going to have some filesystem on it, so anything that looks up the persistent memory pages is *not* going to have a stable pfn: the pfn will point to a fixed part of the persistent memory, but the file that was there may be deleted and the memory reassigned to something else. That's the kind of thing that "struct page" helps with for normal IO devices. It's both a source of serialization and indirection, so that when somebody does a "truncate()" on a file, we don't end up doing IO to random stale locations on the disk that got reassigned to another file. So "struct page" is very fundamental. It's *not* just a "this is the physical source/drain of the data you are doing IO on". 
So if you are looking at some kind of "zero-copy IO", where you can do IO from a filesystem on persistent storage to *another* filesystem on (say, a big rotational disk used for long-term storage) by just doing a bio that targets the disk, but has the persistent memory as the source memory, I really want to understand how you are going to serialize this. So *that* is what I meant by "What is the primary thing that is driving this need? Do we have a very concrete example?" I absolutely do *not* want to teach the bio subsystem to just randomly be able to take the source/destination of the IO as being some random pfn without knowing what the actual uses are and how these IO's are generated in the first place. I was assuming that you wanted to do something where you mmap() the persistent memory, and then write it out to another device (possibly using aio_write()). But that really does require some kind of serialization at a higher level, because you can't just look up the pfn's in the page table and assume they are stable: they are *not* stable. Linus
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Ingo Molnar wrote: > [...] > > For anything more complex, that maps any of this storage to > user-space, or exposes it to higher level struct page based APIs, > etc., where references matter and it's more of a cache with > potentially multiple users, not an IO space, the natural API is > struct page. Let me walk back on this: > I'd say that this particular series mostly addresses the 'pfn as > sector_t' side of the equation, where persistent memory is IO space, > not memory space, and as such it is the more natural and thus also > the cheaper/faster approach. ... but that does not appear to be the case: this series replaces a 'struct page' interface with a pure pfn interface for the express purpose of being able to DMA to/from 'memory areas' that are not struct page backed. > Linus probably disagrees? :-) [ and he'd disagree rightfully ;-) ] So what this patch set tries to achieve is (sector_t -> sector_t) IO between storage devices (i.e. a rare and somewhat weird usecase), and does it by squeezing one device's storage address into our formerly struct page backed descriptor, via a pfn. That looks like a layering violation and a mistake to me. If we want to do direct (sector_t -> sector_t) IO, with no serialization worries, it should have its own (simple) API - which things like hierarchical RAID or RDMA APIs could use. If what we want to do is to support say an mmap() of a file on persistent storage, and then read() into that file from another device via DMA, then I think we should have allocated struct page backing at mmap() time already, and all regular syscall APIs would 'just work' from that point on - far above what page-less, pfn-based APIs can do. The temporary struct page backing can then be freed at munmap() time. And if the usage is pure fd based, we don't really have fd-to-fd APIs beyond the rarely used splice variants (and even those don't do pure cross-IO, they use a pipe as an intermediary), so there's no problem to solve I suspect. 
Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Dan Williams wrote: > > What is the primary thing that is driving this need? Do we have a > > very concrete example? > > My pet concrete example is covered by __pfn_t. Referencing > persistent memory in an md/dm hierarchical storage configuration. > Setting aside the thrash to get existing block users to do > "bvec_set_page(page)" instead of "bvec->page = page" the onus is on > that md/dm implementation and backing storage device driver to > operate on __pfn_t. That use case is simple because there is no use > of page locking or refcounting in that path, just dma_map_page() and > kmap_atomic(). The more difficult use case is precisely what Al > picked up on, O_DIRECT and RDMA. This patchset does nothing to > address those use cases outside of not needing a struct page when > they eventually craft a bio. So why not do a dual approach? There are code paths where the 'pfn' of a persistent device is mostly used as a sector_t equivalent of terabytes of storage, not as an index of a memory object. It's not an address to a cache, it's an index into a huge storage space - which happens to be (flash) RAM. For them using pfn_t seems natural and using struct page * is a strained (not to mention expensive) model. For more complex facilities, where persistent memory is used as a memory object, especially where the underlying device is true, infinitely writable RAM (not flash), treating it as a memory zone, or setting up dynamic struct page, would be the natural approach. (with the inevitable cost of setup/teardown in the latter case) I'd say that for anything where the dynamic struct page is torn down unconditionally after completion of only a single use, the natural API is probably pfn_t, not struct page. Any synchronization is handled at the block request layer already, and it's storage op synchronization, not memory access synchronization really. 
For anything more complex, that maps any of this storage to user-space, or exposes it to higher level struct page based APIs, etc., where references matter and it's more of a cache with potentially multiple users, not an IO space, the natural API is struct page. I'd say that this particular series mostly addresses the 'pfn as sector_t' side of the equation, where persistent memory is IO space, not memory space, and as such it is the more natural and thus also the cheaper/faster approach. Linus probably disagrees? :-) Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar mi...@kernel.org wrote: * Dan Williams dan.j.willi...@intel.com wrote: Anyway, I did want to say that while I may not be convinced about the approach, I think the patches themselves don't look horrible. I actually like your __pfn_t. So while I (very obviously) have some doubts about this approach, it may be that the most convincing argument is just in the code. Ok, I'll keep thinking about this and come back when we have a better story about passing mmap'd persistent memory around in userspace. So is there anything fundamentally wrong about creating struct page backing at mmap() time (and making sure aliased mmaps share struct page arrays)? Something like get_user_pages() triggers memory hotplug for persistent memory, so they are actual real struct pages? Can we do memory hotplug at that granularity?
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Ingo Molnar mi...@kernel.org wrote: Is handling kernel pagefault on the vmemmap completely out of the picture ? So we would carve out a chunk of kernel address space for those pfn and use it for vmemmap and handle pagefault on it. That's pretty clever. The page fault doesn't even have to do remote TLB shootdown, because it only establishes mappings - so it's pretty atomic, a bit like the minor vmalloc() area faults we are doing. Some sort of LRA (least recently allocated) scheme could unmap the area in chunks if it's beyond a certain size, to keep a limit on size. Done from the same context and would use remote TLB shootdown. The only limitation I can see is that such faults would have to be able to sleep, to do the allocation. So pfn_to_page() could not be used in arbitrary contexts. So another complication would be that we cannot just unmap such pages when we want to recycle them, because the struct page in them might be in use - so all struct page uses would have to refcount the underlying page. We don't really do that today: code just looks up struct pages and assumes they never go away. Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 07, 2015 at 09:11:07PM +0200, Ingo Molnar wrote: * Dave Hansen dave.han...@linux.intel.com wrote: On 05/07/2015 10:42 AM, Dan Williams wrote: On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar mi...@kernel.org wrote: * Dan Williams dan.j.willi...@intel.com wrote: So is there anything fundamentally wrong about creating struct page backing at mmap() time (and making sure aliased mmaps share struct page arrays)? Something like get_user_pages() triggers memory hotplug for persistent memory, so they are actual real struct pages? Can we do memory hotplug at that granularity? We've traditionally limited them to SECTION_SIZE granularity, which is 128MB IIRC. There are also assumptions in places that you can do page++ within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE. I really don't think that's very practical: memory hotplug is slow, it's really not on the same abstraction level as mmap(), and the zone data structures are also fundamentally very coarse: not just because RAM ranges are huge, but also so that the pfn-page transformation stays relatively simple and fast. But, in all practicality, a lot of those places are in code like the buddy allocator. If your PTEs all have _PAGE_SPECIAL set and we're not ever expecting these fake 'struct page's to hit these code paths, it probably doesn't matter. You can probably get away with just allocating PAGE_SIZE worth of 'struct page' (which is 64) and mapping it in to vmemmap[]. The worst case is that you'll eat 1 page of space for each outstanding page of I/O. That's a lot better than 2MB of temporary 'struct page' space per page of I/O that it would take with a traditional hotplug operation. So I think the main value of struct page is if everyone on the system sees the same struct page for the same pfn - not just the temporary IO instance. 
The idea of having very temporary struct page arrays misses the point I think: if struct page is used as essentially an IO sglist then most of the synchronization properties are lost: then we might as well use the real deal in that case and skip the dynamic allocation and use pfns directly and avoid the dynamic allocation overhead. Stable, global page-struct descriptors are a given for real RAM, where we allocate a struct page for every page in nice, large, mostly linear arrays. We'd really need that for pmem too, to get the full power of struct page: and that means allocating them in nice, large, predictable places - such as on the device itself ... Is handling kernel pagefault on the vmemmap completely out of the picture ? So we would carve out a chunk of kernel address space for those pfn and use it for vmemmap and handle pagefault on it. Again, here I think that GPU folks would like a solution where they can have a page struct but it would not be PMEM, just device memory. So if we can come up with something generic enough to serve both purposes that would be better in my view. Cheers, Jérôme
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 11:40 AM, Ingo Molnar mi...@kernel.org wrote: * Dan Williams dan.j.willi...@intel.com wrote: On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig h...@lst.de wrote: On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote: What is the primary thing that is driving this need? Do we have a very concrete example? FYI, I plan to implement RAID acceleration using nvdimms, and I plan to use pages for that. The code just merged for 4.1 can easily support page backing, and I plan to use that for now. This still leaves support for the gigantic Intel nvdimms discovered over EFI out, but given that I don't have access to them, and I don't know of any publicly available, there's little I can do for now. But adding on-demand allocated struct pages for them seems like the easiest way forward. Boaz already has code to allocate pages for them, although not on demand but at boot / plug-in time. Hmmm, the capacities of persistent memory that would be assigned for a raid accelerator would be limited by diminishing returns. I.e. there seems to be no point to assign more than 8GB or so to the cache? [...] Why would that be the case? If it's not a temporary cache but a persistent cache that hosts all the data even after writeback completes then going to huge sizes will bring similar benefits to using a large, fast SSD disk on your desktop... The larger, the better. And it also persists across reboots. True, that's more dm-cache than RAID accelerator, but point taken.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 10:43 AM, Linus Torvalds torva...@linux-foundation.org wrote: On Thu, May 7, 2015 at 9:03 AM, Dan Williams dan.j.willi...@intel.com wrote: Ok, I'll keep thinking about this and come back when we have a better story about passing mmap'd persistent memory around in userspace. Ok. And if we do decide to go with your kind of __pfn type, I'd probably prefer that we encode the type in the low bits of the word rather than compare against PAGE_OFFSET. On some architectures PAGE_OFFSET is zero (admittedly probably not ones you'd care about), but even on x86 it's a *lot* cheaper to test the low bit than it is to compare against a big constant. We know struct page * is supposed to be aligned to at least unsigned long, so you'd have two bits of type information (and we could easily make it three). With 0 being a real pointer, so that you can use the pointer itself without masking. And hiding type information in the low bits of a pointer is something we've done quite a lot, so it's more kernel coding style anyway. Ok. Although __pfn_t also stores pfn values directly, which will consume those 2 bits, so we'll need to shift pfns up when storing.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 07, 2015 at 09:53:13PM +0200, Ingo Molnar wrote: * Ingo Molnar mi...@kernel.org wrote: Is handling kernel pagefault on the vmemmap completely out of the picture ? So we would carve out a chunk of kernel address space for those pfn and use it for vmemmap and handle pagefault on it. That's pretty clever. The page fault doesn't even have to do remote TLB shootdown, because it only establishes mappings - so it's pretty atomic, a bit like the minor vmalloc() area faults we are doing. Some sort of LRA (least recently allocated) scheme could unmap the area in chunks if it's beyond a certain size, to keep a limit on size. Done from the same context and would use remote TLB shootdown. The only limitation I can see is that such faults would have to be able to sleep, to do the allocation. So pfn_to_page() could not be used in arbitrary contexts. So another complication would be that we cannot just unmap such pages when we want to recycle them, because the struct page in them might be in use - so all struct page uses would have to refcount the underlying page. We don't really do that today: code just looks up struct pages and assumes they never go away. I still think this is doable. Like I said in another email, I think we should introduce a special pfn_to_page_dev|pmem|waffle|somethingyoulike() for places that are allowed to allocate the underlying struct page. For instance we can use a default page to back all this special vmemmap range with some specially crafted struct page that says it is invalid memory (make this mapping read-only so all writes to this special struct page are forbidden). Now once an authorized user comes along and needs a real struct page, it triggers a page allocation that replaces the page full of fake invalid struct pages with a page of correct valid struct pages that can be manipulated by other parts of the kernel. So regular pfn_to_page() would test against the special vmemmap and, if special, test the content of struct page for some flag.
If it's the invalid page flag, it returns 0. But once a proper struct page is allocated, pfn_to_page() would return the struct page as expected. That way you will catch all invalid users of such pages, i.e. users that use the page after its lifetime is done. You will also limit the creation of the underlying proper struct page to only code that is legitimately allowed to ask for a proper struct page for a given pfn. Also you would get a kernel write fault on the page full of fake struct pages, and that would allow catching further wrong use. Anyway, this is how I envision this and I think it would work for my use case too (GPU it is for me :)) Cheers, Jérôme
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On 05/07/2015 10:42 AM, Dan Williams wrote: On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar mi...@kernel.org wrote: * Dan Williams dan.j.willi...@intel.com wrote: So is there anything fundamentally wrong about creating struct page backing at mmap() time (and making sure aliased mmaps share struct page arrays)? Something like get_user_pages() triggers memory hotplug for persistent memory, so they are actual real struct pages? Can we do memory hotplug at that granularity? We've traditionally limited them to SECTION_SIZE granularity, which is 128MB IIRC. There are also assumptions in places that you can do page++ within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE. But, in all practicality, a lot of those places are in code like the buddy allocator. If your PTEs all have _PAGE_SPECIAL set and we're not ever expecting these fake 'struct page's to hit these code paths, it probably doesn't matter. You can probably get away with just allocating PAGE_SIZE worth of 'struct page' (which is 64) and mapping it in to vmemmap[]. The worst case is that you'll eat 1 page of space for each outstanding page of I/O. That's a lot better than 2MB of temporary 'struct page' space per page of I/O that it would take with a traditional hotplug operation.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Dan Williams dan.j.willi...@intel.com wrote: On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig h...@lst.de wrote: On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote: What is the primary thing that is driving this need? Do we have a very concrete example? FYI, I plan to implement RAID acceleration using nvdimms, and I plan to use pages for that. The code just merged for 4.1 can easily support page backing, and I plan to use that for now. This still leaves support for the gigantic Intel nvdimms discovered over EFI out, but given that I don't have access to them, and I don't know of any publicly available, there's little I can do for now. But adding on-demand allocated struct pages for them seems like the easiest way forward. Boaz already has code to allocate pages for them, although not on demand but at boot / plug-in time. Hmmm, the capacities of persistent memory that would be assigned for a raid accelerator would be limited by diminishing returns. I.e. there seems to be no point to assign more than 8GB or so to the cache? [...] Why would that be the case? If it's not a temporary cache but a persistent cache that hosts all the data even after writeback completes then going to huge sizes will bring similar benefits to using a large, fast SSD disk on your desktop... The larger, the better. And it also persists across reboots. It could also host the RAID write intent bitmap (the dirty stripes/chunks bitmap) for extra speedups. (This bitmap is pretty small, but important to speed up resyncs after crashes or power loss.) Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Jerome Glisse j.gli...@gmail.com wrote: So I think the main value of struct page is if everyone on the system sees the same struct page for the same pfn - not just the temporary IO instance. The idea of having very temporary struct page arrays misses the point I think: if struct page is used as essentially an IO sglist then most of the synchronization properties are lost: then we might as well use the real deal in that case and skip the dynamic allocation and use pfns directly and avoid the dynamic allocation overhead. Stable, global page-struct descriptors are a given for real RAM, where we allocate a struct page for every page in nice, large, mostly linear arrays. We'd really need that for pmem too, to get the full power of struct page: and that means allocating them in nice, large, predictable places - such as on the device itself ... Is handling kernel pagefault on the vmemmap completely out of the picture ? So we would carve out a chunk of kernel address space for those pfn and use it for vmemmap and handle pagefault on it. That's pretty clever. The page fault doesn't even have to do remote TLB shootdown, because it only establishes mappings - so it's pretty atomic, a bit like the minor vmalloc() area faults we are doing. Some sort of LRA (least recently allocated) scheme could unmap the area in chunks if it's beyond a certain size, to keep a limit on size. Done from the same context and would use remote TLB shootdown. The only limitation I can see is that such faults would have to be able to sleep, to do the allocation. So pfn_to_page() could not be used in arbitrary contexts. Again, here I think that GPU folks would like a solution where they can have a page struct but it would not be PMEM, just device memory. So if we can come up with something generic enough to serve both purposes that would be better in my view. Yes.
Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 9:03 AM, Dan Williams dan.j.willi...@intel.com wrote: Ok, I'll keep thinking about this and come back when we have a better story about passing mmap'd persistent memory around in userspace. Ok. And if we do decide to go with your kind of __pfn type, I'd probably prefer that we encode the type in the low bits of the word rather than compare against PAGE_OFFSET. On some architectures PAGE_OFFSET is zero (admittedly probably not ones you'd care about), but even on x86 it's a *lot* cheaper to test the low bit than it is to compare against a big constant. We know struct page * is supposed to be aligned to at least unsigned long, so you'd have two bits of type information (and we could easily make it three). With 0 being a real pointer, so that you can use the pointer itself without masking. And hiding type information in the low bits of a pointer is something we've done quite a lot, so it's more kernel coding style anyway. Linus
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Dan Williams dan.j.willi...@intel.com wrote: That looks like a layering violation and a mistake to me. If we want to do direct (sector_t -> sector_t) IO, with no serialization worries, it should have its own (simple) API - which things like hierarchical RAID or RDMA APIs could use. I'm wrapped around the idea that __pfn_t *is* that simple API for the tiered storage driver use case. [...] I agree. (see my previous mail) [...] For RDMA I think we need struct page because I assume that would be coordinated through a filesystem and truncate() is back in play. So I don't think RDMA is necessarily special, it's just a weirdly programmed DMA request: - If it is used internally by an exclusively managed complex storage driver, then it can use low level block APIs and pfn_t. - If RDMA is exposed all the way to user-space (do we have such APIs?), allowing users to initiate RDMA IO into user buffers, then (the user visible) buffer needs struct page backing. (which in turn will then at some lower level convert to pfns.) That's true for both regular RAM pages and mmap()-ed persistent RAM pages as well. Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Dave Hansen dave.han...@linux.intel.com wrote: On 05/07/2015 10:42 AM, Dan Williams wrote: On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar mi...@kernel.org wrote: * Dan Williams dan.j.willi...@intel.com wrote: So is there anything fundamentally wrong about creating struct page backing at mmap() time (and making sure aliased mmaps share struct page arrays)? Something like get_user_pages() triggers memory hotplug for persistent memory, so they are actual real struct pages? Can we do memory hotplug at that granularity? We've traditionally limited them to SECTION_SIZE granularity, which is 128MB IIRC. There are also assumptions in places that you can do page++ within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE. I really don't think that's very practical: memory hotplug is slow, it's really not on the same abstraction level as mmap(), and the zone data structures are also fundamentally very coarse: not just because RAM ranges are huge, but also so that the pfn-page transformation stays relatively simple and fast. But, in all practicality, a lot of those places are in code like the buddy allocator. If your PTEs all have _PAGE_SPECIAL set and we're not ever expecting these fake 'struct page's to hit these code paths, it probably doesn't matter. You can probably get away with just allocating PAGE_SIZE worth of 'struct page' (which is 64) and mapping it in to vmemmap[]. The worst case is that you'll eat 1 page of space for each outstanding page of I/O. That's a lot better than 2MB of temporary 'struct page' space per page of I/O that it would take with a traditional hotplug operation. So I think the main value of struct page is if everyone on the system sees the same struct page for the same pfn - not just the temporary IO instance. 
The idea of having very temporary struct page arrays misses the point I think: if struct page is used as essentially an IO sglist then most of the synchronization properties are lost: then we might as well use the real deal in that case and skip the dynamic allocation and use pfns directly and avoid the dynamic allocation overhead. Stable, global page-struct descriptors are a given for real RAM, where we allocate a struct page for every page in nice, large, mostly linear arrays. We'd really need that for pmem too, to get the full power of struct page: and that means allocating them in nice, large, predictable places - such as on the device itself ... It might even be 'scattered' across the device, with 64 byte struct page size we can pack 64 descriptors into a single page, so every 65 pages we could have a page-struct page. Finding a pmem page's struct page would thus involve rounding it modulo 65 and reading that page. The problem with that is fourfold: - that we now turn a very kernel internal API and data structure into an ABI. If struct page grows beyond 64 bytes it's a problem. - on bootup (or device discovery time) we'd have to initialize all the page structs. We could probably do this in a hierarchical way, by dividing continuous pmem ranges into power-of-two groups of blocks, and organizing them like the buddy allocator does. - 1.5% of storage space lost. - will wear-leveling properly migrate these 'hot' pages around? The alternative would be some global interval-rbtree of struct page backed pmem ranges. Beyond the synchronization problems of such a data structure (which looks like a nightmare) I don't think it's even feasible: especially if there's a filesystem on the pmem device then the block allocations could be physically fragmented (and there's no fundamental reason why they couldn't be fragmented), so a continuous mmap() of a file on it will yield wildly fragmented device-pfn ranges, exploding the rbtree. 
Think 1 million node interval-rbtree with an average depth of 20: cachemiss country for even simple lookups - not to mention the freeing/recycling complexity of unused struct pages to not allow it to grow too large. I might be wrong though about all this :) Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
Al, I was wondering about the struct page rules of iov_iter_get_pages_alloc(), used in various places. There's no documentation whatsoever in lib/iov_iter.c, nor in include/linux/uio.h, and the changelog that introduced it only says: commit 91f79c43d1b54d7154b118860d81b39bad07dfff Author: Al Viro v...@zeniv.linux.org.uk Date: Fri Mar 21 04:58:33 2014 -0400 new helper: iov_iter_get_pages_alloc() same as iov_iter_get_pages(), except that pages array is allocated (kmalloc if possible, vmalloc if that fails) and left for caller to free. Lustre and NFS ->direct_IO() switched to it. Signed-off-by: Al Viro v...@zeniv.linux.org.uk So if code does iov_iter_get_pages_alloc() on a user address that has a real struct page behind it - and some other code does a regular get_user_pages() on it, we'll have two sets of struct page descriptors, the 'real' one, and a fake allocated one, right? How does that work? Nobody else can ever discover these fake page structs, so they don't really serve any 'real' synchronization purpose other than the limited role of IO completion. Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Ingo Molnar mi...@kernel.org wrote: [...] For anything more complex, that maps any of this storage to user-space, or exposes it to higher level struct page based APIs, etc., where references matter and it's more of a cache with potentially multiple users, not an IO space, the natural API is struct page. Let me walk back on this: I'd say that this particular series mostly addresses the 'pfn as sector_t' side of the equation, where persistent memory is IO space, not memory space, and as such it is the more natural and thus also the cheaper/faster approach. ... but that does not appear to be the case: this series replaces a 'struct page' interface with a pure pfn interface for the express purpose of being able to DMA to/from 'memory areas' that are not struct page backed. Linus probably disagrees? :-) [ and he'd disagree rightfully ;-) ] So what this patch set tries to achieve is (sector_t -> sector_t) IO between storage devices (i.e. a rare and somewhat weird use case), and does it by squeezing one device's storage address into our formerly struct page backed descriptor, via a pfn. That looks like a layering violation and a mistake to me. If we want to do direct (sector_t -> sector_t) IO, with no serialization worries, it should have its own (simple) API - which things like hierarchical RAID or RDMA APIs could use. If what we want to do is to support say an mmap() of a file on persistent storage, and then read() into that file from another device via DMA, then I think we should have allocated struct page backing at mmap() time already, and all regular syscall APIs would 'just work' from that point on - far above what page-less, pfn-based APIs can do. The temporary struct page backing can then be freed at munmap() time. And if the usage is pure fd based, we don't really have fd-to-fd APIs beyond the rarely used splice variants (and even those don't do pure cross-IO, they use a pipe as an intermediary), so there's no problem to solve I suspect.
Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 7:42 AM, Ingo Molnar mi...@kernel.org wrote: * Ingo Molnar mi...@kernel.org wrote: [...] For anything more complex, that maps any of this storage to user-space, or exposes it to higher level struct page based APIs, etc., where references matter and it's more of a cache with potentially multiple users, not an IO space, the natural API is struct page. Let me walk back on this: I'd say that this particular series mostly addresses the 'pfn as sector_t' side of the equation, where persistent memory is IO space, not memory space, and as such it is the more natural and thus also the cheaper/faster approach. ... but that does not appear to be the case: this series replaces a 'struct page' interface with a pure pfn interface for the express purpose of being able to DMA to/from 'memory areas' that are not struct page backed. Linus probably disagrees? :-) [ and he'd disagree rightfully ;-) ] So what this patch set tries to achieve is (sector_t -> sector_t) IO between storage devices (i.e. a rare and somewhat weird use case), and does it by squeezing one device's storage address into our formerly struct page backed descriptor, via a pfn. That looks like a layering violation and a mistake to me. If we want to do direct (sector_t -> sector_t) IO, with no serialization worries, it should have its own (simple) API - which things like hierarchical RAID or RDMA APIs could use. I'm wrapped around the idea that __pfn_t *is* that simple API for the tiered storage driver use case. For RDMA I think we need struct page because I assume that would be coordinated through a filesystem and truncate() is back in play. What does an alternative API look like?
If what we want to do is to support, say, an mmap() of a file on persistent storage, and then read() into that file from another device via DMA, then I think we should have allocated struct page backing at mmap() time already, and all regular syscall APIs would 'just work' from that point on - far above what page-less, pfn-based APIs can do. The temporary struct page backing can then be freed at munmap() time. Yes, passing around mmap()'d (DAX) persistent memory will need more than a __pfn_t.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote: What is the primary thing that is driving this need? Do we have a very concrete example? FYI, I plan to implement RAID acceleration using nvdimms, and I plan to use pages for that. The code just merged for 4.1 can easily support page backing, and I plan to use that for now. This still leaves out support for the gigantic Intel nvdimms discovered over EFI, but given that I don't have access to them, and I don't know of any publicly available, there's little I can do for now. But adding on-demand allocated struct pages for them seems like the easiest way forward. Boaz already has code to allocate pages for them, although not on demand but at boot / plug-in time.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 8:00 AM, Linus Torvalds torva...@linux-foundation.org wrote: On Wed, May 6, 2015 at 7:36 PM, Dan Williams dan.j.willi...@intel.com wrote: My pet concrete example is covered by __pfn_t. Referencing persistent memory in an md/dm hierarchical storage configuration. Setting aside the thrash to get existing block users to do bvec_set_page(page) instead of bvec->page = page the onus is on that md/dm implementation and backing storage device driver to operate on __pfn_t. That use case is simple because there is no use of page locking or refcounting in that path, just dma_map_page() and kmap_atomic(). So clarify for me: are you trying to make the IO stack in general be able to use the persistent memory as a source (or destination) for IO to _other_ devices, or are you talking about just internally shuffling things around for something like RAID on top of persistent memory? Because I think those are two very different things. Yes, they are, and I am referring to the former, persistent memory as a source/destination to other devices. For example, one of the things I worry about is for people doing IO from persistent memory directly to some slow stable storage (aka disk). That was what I thought you were aiming for: infrastructure so that you can make a bio for a *disk* device contain a page list that is the persistent memory. And I think that is a very dangerous operation to do, because the persistent memory itself is going to have some filesystem on it, so anything that looks up the persistent memory pages is *not* going to have a stable pfn: the pfn will point to a fixed part of the persistent memory, but the file that was there may be deleted and the memory reassigned to something else. Indeed, truncate() in the absence of struct page has been a major hurdle for persistent memory enabling. But it does not impact this specific md/dm use case.
md/dm will have taken an exclusive claim on an entire pmem block device (or partition), so there will be no competing with a filesystem. That's the kind of thing that struct page helps with for normal IO devices. It's both a source of serialization and indirection, so that when somebody does a truncate() on a file, we don't end up doing IO to random stale locations on the disk that got reassigned to another file. So struct page is very fundamental. It's *not* just a "this is the physical source/drain of the data you are doing IO on". So if you are looking at some kind of zero-copy IO, where you can do IO from a filesystem on persistent storage to *another* filesystem on (say, a big rotational disk used for long-term storage) by just doing a bio that targets the disk, but has the persistent memory as the source memory, I really want to understand how you are going to serialize this. So *that* is what I meant by "What is the primary thing that is driving this need? Do we have a very concrete example?" I absolutely do *not* want to teach the bio subsystem to just randomly be able to take the source/destination of the IO as being some random pfn without knowing what the actual uses are and how these IO's are generated in the first place. blkdev_get(FMODE_EXCL) is the protection in this case. I was assuming that you wanted to do something where you mmap() the persistent memory, and then write it out to another device (possibly using aio_write()). But that really does require some kind of serialization at a higher level, because you can't just look up the pfn's in the page table and assume they are stable: they are *not* stable. We want to get there eventually, but this patchset does not address that case.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 8:40 AM, Dan Williams dan.j.willi...@intel.com wrote: blkdev_get(FMODE_EXCL) is the protection in this case. Ugh. That looks like a horrible nasty big hammer that will bite us badly some day. Since you'd have to hold it for the whole IO. But I guess it at least works. Anyway, I did want to say that while I may not be convinced about the approach, I think the patches themselves don't look horrible. I actually like your __pfn_t. So while I (very obviously) have some doubts about this approach, it may be that the most convincing argument is just in the code. Linus
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 8:58 AM, Linus Torvalds torva...@linux-foundation.org wrote: On Thu, May 7, 2015 at 8:40 AM, Dan Williams dan.j.willi...@intel.com wrote: blkdev_get(FMODE_EXCL) is the protection in this case. Ugh. That looks like a horrible nasty big hammer that will bite us badly some day. Since you'd have to hold it for the whole IO. But I guess it at least works. Oh no, that wouldn't be per-I/O; that would be permanent at configuration setup time, just like a raid member device. Something like: mdadm --create /dev/md0 --cache=/dev/pmem0p1 --storage=/dev/sda Anyway, I did want to say that while I may not be convinced about the approach, I think the patches themselves don't look horrible. I actually like your __pfn_t. So while I (very obviously) have some doubts about this approach, it may be that the most convincing argument is just in the code. Ok, I'll keep thinking about this and come back when we have a better story about passing mmap'd persistent memory around in userspace.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 6, 2015 at 7:36 PM, Dan Williams dan.j.willi...@intel.com wrote: My pet concrete example is covered by __pfn_t. Referencing persistent memory in an md/dm hierarchical storage configuration. Setting aside the thrash to get existing block users to do bvec_set_page(page) instead of bvec->page = page the onus is on that md/dm implementation and backing storage device driver to operate on __pfn_t. That use case is simple because there is no use of page locking or refcounting in that path, just dma_map_page() and kmap_atomic(). So clarify for me: are you trying to make the IO stack in general be able to use the persistent memory as a source (or destination) for IO to _other_ devices, or are you talking about just internally shuffling things around for something like RAID on top of persistent memory? Because I think those are two very different things. For example, one of the things I worry about is for people doing IO from persistent memory directly to some slow stable storage (aka disk). That was what I thought you were aiming for: infrastructure so that you can make a bio for a *disk* device contain a page list that is the persistent memory. And I think that is a very dangerous operation to do, because the persistent memory itself is going to have some filesystem on it, so anything that looks up the persistent memory pages is *not* going to have a stable pfn: the pfn will point to a fixed part of the persistent memory, but the file that was there may be deleted and the memory reassigned to something else. That's the kind of thing that struct page helps with for normal IO devices. It's both a source of serialization and indirection, so that when somebody does a truncate() on a file, we don't end up doing IO to random stale locations on the disk that got reassigned to another file. So struct page is very fundamental. It's *not* just a "this is the physical source/drain of the data you are doing IO on".
So if you are looking at some kind of zero-copy IO, where you can do IO from a filesystem on persistent storage to *another* filesystem on (say, a big rotational disk used for long-term storage) by just doing a bio that targets the disk, but has the persistent memory as the source memory, I really want to understand how you are going to serialize this. So *that* is what I meant by "What is the primary thing that is driving this need? Do we have a very concrete example?" I absolutely do *not* want to teach the bio subsystem to just randomly be able to take the source/destination of the IO as being some random pfn without knowing what the actual uses are and how these IO's are generated in the first place. I was assuming that you wanted to do something where you mmap() the persistent memory, and then write it out to another device (possibly using aio_write()). But that really does require some kind of serialization at a higher level, because you can't just look up the pfn's in the page table and assume they are stable: they are *not* stable. Linus
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig h...@lst.de wrote: On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote: What is the primary thing that is driving this need? Do we have a very concrete example? FYI, I plan to implement RAID acceleration using nvdimms, and I plan to use pages for that. The code just merged for 4.1 can easily support page backing, and I plan to use that for now. This still leaves out support for the gigantic Intel nvdimms discovered over EFI, but given that I don't have access to them, and I don't know of any publicly available, there's little I can do for now. But adding on-demand allocated struct pages for them seems like the easiest way forward. Boaz already has code to allocate pages for them, although not on demand but at boot / plug-in time. Hmmm, the capacities of persistent memory that would be assigned for a raid accelerator would be limited by diminishing returns. I.e. there seems to be no point to assign more than 8GB or so to the cache? If that's the case the capacity argument loses some teeth, just blk_get(FMODE_EXCL) + memory_hotplug a small capacity and be done.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Thu, May 07, 2015 at 06:18:07PM +0200, Christoph Hellwig wrote: On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote: What is the primary thing that is driving this need? Do we have a very concrete example? FYI, I plan to implement RAID acceleration using nvdimms, and I plan to use pages for that. The code just merged for 4.1 can easily support page backing, and I plan to use that for now. This still leaves out support for the gigantic Intel nvdimms discovered over EFI, but given that I don't have access to them, and I don't know of any publicly available, there's little I can do for now. But adding on-demand allocated struct pages for them seems like the easiest way forward. Boaz already has code to allocate pages for them, although not on demand but at boot / plug-in time. I think other folks might be interested here; I am ccing Paul. For GPUs we are facing a similar issue of trying to present the GPU memory to the kernel in a coherent way (coherent from the design and Linux kernel concept POV). For this, dynamically allocated struct page might effectively be a solution that could be shared between persistent memory and GPU folks. We can even enforce things like VMEMMAP and have a special region carveout where we can dynamically map/unmap backing pages for a range of device pfns. This would also allow us to catch people trying to access such pages: we could add a set of new helpers like get_page_dev()/put_page_dev() ... and only the _dev version would work on this new kind of memory; regular get_page()/put_page() would throw an error. This should make sure only legitimate users are referencing such pages. An issue might be that we can run out of kernel address space with 48 bits, but if such monstrous computers ever see the light of day their owners might consider using CPUs with more bits. Another issue is that we might care for 32-bit platforms too, but that's solvable at a small cost.
Cheers, Jérôme
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
* Dan Williams dan.j.willi...@intel.com wrote: Anyway, I did want to say that while I may not be convinced about the approach, I think the patches themselves don't look horrible. I actually like your __pfn_t. So while I (very obviously) have some doubts about this approach, it may be that the most convincing argument is just in the code. Ok, I'll keep thinking about this and come back when we have a better story about passing mmap'd persistent memory around in userspace. So is there anything fundamentally wrong about creating struct page backing at mmap() time (and making sure aliased mmaps share struct page arrays)? Because if that is done, then the DMA agent won't even know about the memory being persistent RAM. It's just a regular struct page, that happens to point to persistent RAM. Same goes for all the high level VM APIs, futexes, etc. Everything will Just Work. It will also be relatively fast: mmap() is a relative slowpath, comparatively. As far as RAID is concerned: that's a relatively easy situation, as there's only a single user of the devices, the RAID context that manages all component devices exclusively. Device to device DMA can use the block layer directly, i.e. most of the patches you've got here in this series, except: 74287 C May 06 Dan Williams( 232) ├─[PATCH v2 09/10] dax: convert to __pfn_t I think DAX mmap()s need struct page backing. I think there's a simple rule: if a page is visible to user-space via the MMU then it needs struct page backing. If it's hidden, like behind a RAID abstraction, it probably doesn't. With the remaining patches a high level RAID driver ought to be able to send pfn-to-sector and sector-to-pfn requests to other block drivers, without any unnecessary struct page allocation overhead, right? As long as the pfn concept remains a clever way to reuse our RAM-to-sector interfaces to implement sector-to-sector IO, in the cases where the IO has no serialization or MMU concerns, not using struct page and using pfn_t looks natural.
The moment it starts reaching user space APIs, like in the DAX case, and especially if it becomes user-MMU visible, it's a mistake to not have struct page backing, I think. (In that sense the current DAX mmap() code is already a partial mistake.) Thanks, Ingo
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 6, 2015 at 5:19 PM, Linus Torvalds wrote: > On Wed, May 6, 2015 at 4:47 PM, Dan Williams wrote: >> >> Conceptually better, but certainly more difficult to audit if the fake >> struct page is initialized in a subtle way that breaks when/if it >> leaks to some unwitting context. > > Maybe. It could go either way, though. In particular, with the > "dynamically allocated struct page" approach, if somebody uses it past > the supposed lifetime of the use, things like poisoning the temporary > "struct page" could be fairly effective. You can't really poison the > pfn - it's just a number, and if somebody uses it later than you think > (and you have re-used that physical memory for something else), you'll > never ever know. True, but there's little need to poison a __pfn_t because it's permanent once discovered via ->direct_access() on the hosting struct block_device. Sure, kmap_atomic_pfn_t() may fail when the pmem driver unbinds from a device, but the __pfn_t is still valid. Obviously, we can only support atomic kmap(s) with this property, and it would be nice to fault if someone continued to use the __pfn_t after the hosting device was disabled. To be clear, DAX has this same problem today. Nothing stops whomever called ->direct_access() from continuing to use the pfn after the backing device has been disabled. > I'd *assume* that most users of the dynamic "struct page" allocation > have very clear lifetime rules. Those things would presumably normally > get looked-up by some extended version of "get_user_pages()", and > there's a clear use of the result, with no longer lifetime. Also, you > do need to have some higher-level locking when you do this, to make > sure that the persistent pages don't magically get re-assigned.
We're > presumably talking about having a filesystem in that persistent > memory, so we cannot be doing IO to the pages (from some other source > - whether RDMA or some special zero-copy model) while the underlying > filesystem is reassigning the storage because somebody deleted the > file. > > IOW, there had better be other external rules about when - and how > long - you can use a particular persistent page. No? So the whole > "when/how to allocate the temporary 'struct page'" is just another > detail in that whole thing. > > And yes, some uses may not ever actually see that. If the whole of > persistent memory is just assigned to a database or something, and the > DB just wants to do a "flush this range of persistent memory to > long-term disk storage", then there may not be much of a "lifetime" > issue for the persistent memory. But even then you're going to have IO > completion callbacks etc to let the DB know that it has hit the disk, > so.. > > What is the primary thing that is driving this need? Do we have a very > concrete example? My pet concrete example is covered by __pfn_t. Referencing persistent memory in an md/dm hierarchical storage configuration. Setting aside the thrash to get existing block users to do "bvec_set_page(page)" instead of "bvec->page = page" the onus is on that md/dm implementation and backing storage device driver to operate on __pfn_t. That use case is simple because there is no use of page locking or refcounting in that path, just dma_map_page() and kmap_atomic(). The more difficult use case is precisely what Al picked up on, O_DIRECT and RDMA. This patchset does nothing to address those use cases outside of not needing a struct page when they eventually craft a bio. I know Matthew Wilcox has explored the idea of "get_user_sg()" and let the scatterlist hold the reference count and locks, but I'll let him speak to that. I still see __pfn_t as generally useful for the simple in-kernel stacked-block-i/o use case. 
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 6, 2015 at 4:47 PM, Dan Williams wrote: > > Conceptually better, but certainly more difficult to audit if the fake > struct page is initialized in a subtle way that breaks when/if it > leaks to some unwitting context. Maybe. It could go either way, though. In particular, with the "dynamically allocated struct page" approach, if somebody uses it past the supposed lifetime of the use, things like poisoning the temporary "struct page" could be fairly effective. You can't really poison the pfn - it's just a number, and if somebody uses it later than you think (and you have re-used that physical memory for something else), you'll never ever know. I'd *assume* that most users of the dynamic "struct page" allocation have very clear lifetime rules. Those things would presumably normally get looked-up by some extended version of "get_user_pages()", and there's a clear use of the result, with no longer lifetime. Also, you do need to have some higher-level locking when you do this, to make sure that the persistent pages don't magically get re-assigned. We're presumably talking about having a filesystem in that persistent memory, so we cannot be doing IO to the pages (from some other source - whether RDMA or some special zero-copy model) while the underlying filesystem is reassigning the storage because somebody deleted the file. IOW, there had better be other external rules about when - and how long - you can use a particular persistent page. No? So the whole "when/how to allocate the temporary 'struct page'" is just another detail in that whole thing. And yes, some uses may not ever actually see that. If the whole of persistent memory is just assigned to a database or something, and the DB just wants to do a "flush this range of persistent memory to long-term disk storage", then there may not be much of a "lifetime" issue for the persistent memory. But even then you're going to have IO completion callbacks etc to let the DB know that it has hit the disk, so.. 
What is the primary thing that is driving this need? Do we have a very concrete example? Linus
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 6, 2015 at 3:10 PM, Linus Torvalds wrote: > On Wed, May 6, 2015 at 1:04 PM, Dan Williams wrote: >> >> The motivation for this change is persistent memory and the desire to >> use it not only via the pmem driver, but also as a memory target for I/O >> (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel. > > I detest this approach. > Hmm, yes, I can't argue against "put the onus on odd behavior where it belongs."... > I'd much rather go exactly the other way around, and do the dynamic > "struct page" instead. > > Add a flag to "struct page" Ok, given I had already precluded 32-bit systems in this __pfn_t approach we should have flag space for this on 64-bit. > to mark it as a fake entry and teach > "page_to_pfn()" to look up the actual pfn some way (that union that > contains "index" looks like a good target to also contain 'pfn', for > example). > > Especially if this is mainly for persistent storage, we'll never have > issues with worrying about writing it back under memory pressure, so > allocating a "struct page" for these things shouldn't be a problem. > There's likely only a few paths that actually generate IO for those > things. > > In other words, I'd really like our basic infrastructure to be for the > *normal* case, and the "struct page" is about so much more than just > "what's the target for IO". For normal IO, "struct page" is also what > serializes the IO so that you have a consistent view of the end > result, and there's obviously the reference count there too. So I > really *really* think that "struct page" is the better entity for > describing the actual IO, because it's the common and the generic > thing, while a "pfn" is not actually *enough* for IO in general, and > you now end up having to look up the "struct page" for the locking and > refcounting etc. > > If you go the other way, and instead generate a "struct page" from the > pfn for the few cases that need it, you put the onus on odd behavior > where it belongs. 
> > Yes, it might not be any simpler in the end, but I think it would be > conceptually much better. Conceptually better, but certainly more difficult to audit if the fake struct page is initialized in a subtle way that breaks when/if it leaks to some unwitting context. The one benefit I may need to concede is a mechanism to opt-in to handle these fake pages to the few paths that know what they are doing. That was easy with __pfn_t, but a struct page can go silently almost anywhere. Certainly nothing is prepared for a given struct page pointer to change the pfn it points to on the fly, which I think is what we would end up doing for something like a raid cache. Keep a pool of struct pages around and point them at persistent memory pfns while I/O is in flight.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 6, 2015 at 1:04 PM, Dan Williams wrote: > > The motivation for this change is persistent memory and the desire to > use it not only via the pmem driver, but also as a memory target for I/O > (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel. I detest this approach. I'd much rather go exactly the other way around, and do the dynamic "struct page" instead. Add a flag to "struct page" to mark it as a fake entry and teach "page_to_pfn()" to look up the actual pfn some way (that union that contains "index" looks like a good target to also contain 'pfn', for example). Especially if this is mainly for persistent storage, we'll never have issues with worrying about writing it back under memory pressure, so allocating a "struct page" for these things shouldn't be a problem. There's likely only a few paths that actually generate IO for those things. In other words, I'd really like our basic infrastructure to be for the *normal* case, and the "struct page" is about so much more than just "what's the target for IO". For normal IO, "struct page" is also what serializes the IO so that you have a consistent view of the end result, and there's obviously the reference count there too. So I really *really* think that "struct page" is the better entity for describing the actual IO, because it's the common and the generic thing, while a "pfn" is not actually *enough* for IO in general, and you now end up having to look up the "struct page" for the locking and refcounting etc. If you go the other way, and instead generate a "struct page" from the pfn for the few cases that need it, you put the onus on odd behavior where it belongs. Yes, it might not be any simpler in the end, but I think it would be conceptually much better. 
Linus
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 06, 2015 at 04:04:53PM -0400, Dan Williams wrote: > Changes since v1 [1]: > > 1/ added include/asm-generic/pfn.h for the __pfn_t definition and helpers. > > 2/ added kmap_atomic_pfn_t() > > 3/ rebased on v4.1-rc2 > > [1]: http://marc.info/?l=linux-kernel&m=142653770511970&w=2 > > --- > > A lead-in note, this looks scarier than it is. Most of the code thrash > is automated via Coccinelle. Also the subtle differences behind an > 'unsigned long pfn' and a '__pfn_t' are mitigated by type-safety and a > Kconfig option (default disabled CONFIG_PMEM_IO) that globally controls > whether a pfn and a __pfn_t are equivalent. > > The motivation for this change is persistent memory and the desire to > use it not only via the pmem driver, but also as a memory target for I/O > (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel. Aside > from the pmem driver and DAX, persistent memory is not able to be used > in these I/O scenarios due to the lack of a backing struct page, i.e. > persistent memory is not part of the memmap. This patchset takes the > position that the solution is to teach I/O paths that want to operate on > persistent memory to do so by referencing a __pfn_t. The alternatives > are discussed in the changelog for "[PATCH v2 01/10] arch: introduce > __pfn_t for persistent memory i/o", copied here: > > Alternatives: > > 1/ Provide struct page coverage for persistent memory in > DRAM. The expectation is that persistent memory capacities make > this untenable in the long term. > > 2/ Provide struct page coverage for persistent memory with > persistent memory. While persistent memory may have near DRAM > performance characteristics it may not have the same > write-endurance of DRAM. Given the update frequency of struct > page objects it may not be suitable for persistent memory. > > 3/ Dynamically allocate struct page. 
This appears to be on > the order of the complexity of converting code paths to use > __pfn_t references instead of struct page, and the amount of > setup required to establish a valid struct page reference is > mostly wasted when the only usage in the block stack is to > perform a page_to_pfn() conversion for dma-mapping. Instances > of kmap() / kmap_atomic() usage appear to be the only occasions > in the block stack where struct page is non-trivially used. A > new kmap_atomic_pfn_t() is proposed to handle those cases. *grumble* What are you going to do with things like iov_iter_get_pages()? Long-term, that is, after you go for "this pfn has no struct page for it"...
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 6, 2015 at 1:04 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
>
> The motivation for this change is persistent memory and the desire to
> use it not only via the pmem driver, but also as a memory target for I/O
> (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.

I detest this approach.

I'd much rather go exactly the other way around, and do the dynamic
struct page instead.

Add a flag to struct page to mark it as a fake entry and teach
page_to_pfn() to look up the actual pfn some way (that union that
contains 'index' looks like a good target to also contain 'pfn', for
example).

Especially if this is mainly for persistent storage, we'll never have
issues with worrying about writing it back under memory pressure, so
allocating a struct page for these things shouldn't be a problem.
There's likely only a few paths that actually generate IO for those
things.

In other words, I'd really like our basic infrastructure to be for the
*normal* case, and the struct page is about so much more than just
"what's the target for IO".  For normal IO, struct page is also what
serializes the IO so that you have a consistent view of the end result,
and there's obviously the reference count there too.

So I really *really* think that struct page is the better entity for
describing the actual IO, because it's the common and the generic thing,
while a pfn is not actually *enough* for IO in general, and you now end
up having to look up the struct page for the locking and refcounting
etc.

If you go the other way, and instead generate a struct page from the pfn
for the few cases that need it, you put the onus on odd behavior where
it belongs.

Yes, it might not be any simpler in the end, but I think it would be
conceptually much better.

                Linus
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 6, 2015 at 3:10 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
> On Wed, May 6, 2015 at 1:04 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
>>
>> The motivation for this change is persistent memory and the desire to
>> use it not only via the pmem driver, but also as a memory target for I/O
>> (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.
>
> I detest this approach.

Hmm, yes, I can't argue against "put the onus on odd behavior where it
belongs".

> I'd much rather go exactly the other way around, and do the dynamic
> struct page instead.  Add a flag to struct page

Ok, given I had already precluded 32-bit systems in this __pfn_t
approach, we should have flag space for this on 64-bit.

> to mark it as a fake entry and teach page_to_pfn() to look up the
> actual pfn some way (that union that contains 'index' looks like a good
> target to also contain 'pfn', for example).
>
> Especially if this is mainly for persistent storage, we'll never have
> issues with worrying about writing it back under memory pressure, so
> allocating a struct page for these things shouldn't be a problem.
> There's likely only a few paths that actually generate IO for those
> things.
>
> In other words, I'd really like our basic infrastructure to be for the
> *normal* case, and the struct page is about so much more than just
> "what's the target for IO".  For normal IO, struct page is also what
> serializes the IO so that you have a consistent view of the end result,
> and there's obviously the reference count there too.
>
> So I really *really* think that struct page is the better entity for
> describing the actual IO, because it's the common and the generic
> thing, while a pfn is not actually *enough* for IO in general, and you
> now end up having to look up the struct page for the locking and
> refcounting etc.
>
> If you go the other way, and instead generate a struct page from the
> pfn for the few cases that need it, you put the onus on odd behavior
> where it belongs.
>
> Yes, it might not be any simpler in the end, but I think it would be
> conceptually much better.

Conceptually better, but certainly more difficult to audit if the fake
struct page is initialized in a subtle way that breaks when/if it leaks
to some unwitting context.  The one benefit I may need to concede is a
mechanism for the few paths that know what they are doing to opt in to
handling these fake pages.  That was easy with __pfn_t, but a struct
page can go silently almost anywhere.

Certainly nothing is prepared for a given struct page pointer to change
the pfn it points to on the fly, which I think is what we would end up
doing for something like a raid cache.  Keep a pool of struct pages
around and point them at persistent memory pfns while I/O is in flight.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 6, 2015 at 5:19 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
> On Wed, May 6, 2015 at 4:47 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
>>
>> Conceptually better, but certainly more difficult to audit if the fake
>> struct page is initialized in a subtle way that breaks when/if it
>> leaks to some unwitting context.
>
> Maybe.  It could go either way, though.
>
> In particular, with the dynamically allocated struct page approach, if
> somebody uses it past the supposed lifetime of the use, things like
> poisoning the temporary struct page could be fairly effective.
>
> You can't really poison the pfn - it's just a number, and if somebody
> uses it later than you think (and you have re-used that physical
> memory for something else), you'll never ever know.

True, but there's little need to poison a __pfn_t because it's permanent
once discovered via ->direct_access() on the hosting struct
block_device.  Sure, kmap_atomic_pfn_t() may fail when the pmem driver
unbinds from a device, but the __pfn_t is still valid.  Obviously, we
can only support atomic kmap(s) with this property, and it would be nice
to fault if someone continued to use the __pfn_t after the hosting
device was disabled.  To be clear, DAX has this same problem today.
Nothing stops whoever called ->direct_access() from continuing to use
the pfn after the backing device has been disabled.

> I'd *assume* that most users of the dynamic struct page allocation
> have very clear lifetime rules.  Those things would presumably
> normally get looked-up by some extended version of get_user_pages(),
> and there's a clear use of the result, with no longer lifetime.
>
> Also, you do need to have some higher-level locking when you do this,
> to make sure that the persistent pages don't magically get
> re-assigned.  We're presumably talking about having a filesystem in
> that persistent memory, so we cannot be doing IO to the pages (from
> some other source - whether RDMA or some special zero-copy model)
> while the underlying filesystem is reassigning the storage because
> somebody deleted the file.
>
> IOW, there had better be other external rules about when - and how
> long - you can use a particular persistent page.  No?  So the whole
> when/how to allocate the temporary 'struct page' is just another
> detail in that whole thing.
>
> And yes, some uses may not ever actually see that.  If the whole of
> persistent memory is just assigned to a database or something, and the
> DB just wants to do a "flush this range of persistent memory to
> long-term disk storage", then there may not be much of a lifetime
> issue for the persistent memory.  But even then you're going to have
> IO completion callbacks etc to let the DB know that it has hit the
> disk, so..
>
> What is the primary thing that is driving this need?  Do we have a
> very concrete example?

My pet concrete example is covered by __pfn_t: referencing persistent
memory in an md/dm hierarchical storage configuration.  Setting aside
the thrash to get existing block users to do bvec_set_page(page) instead
of bvec->page = page, the onus is on that md/dm implementation and
backing storage device driver to operate on __pfn_t.  That use case is
simple because there is no use of page locking or refcounting in that
path, just dma_map_page() and kmap_atomic().

The more difficult use case is precisely what Al picked up on, O_DIRECT
and RDMA.  This patchset does nothing to address those use cases outside
of not needing a struct page when they eventually craft a bio.  I know
Matthew Wilcox has explored the idea of get_user_sg() and letting the
scatterlist hold the reference count and locks, but I'll let him speak
to that.  I still see __pfn_t as generally useful for the simple
in-kernel stacked-block-i/o use case.
Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
On Wed, May 6, 2015 at 4:47 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
>
> Conceptually better, but certainly more difficult to audit if the fake
> struct page is initialized in a subtle way that breaks when/if it leaks
> to some unwitting context.

Maybe.  It could go either way, though.

In particular, with the dynamically allocated struct page approach, if
somebody uses it past the supposed lifetime of the use, things like
poisoning the temporary struct page could be fairly effective.

You can't really poison the pfn - it's just a number, and if somebody
uses it later than you think (and you have re-used that physical memory
for something else), you'll never ever know.

I'd *assume* that most users of the dynamic struct page allocation have
very clear lifetime rules.  Those things would presumably normally get
looked-up by some extended version of get_user_pages(), and there's a
clear use of the result, with no longer lifetime.

Also, you do need to have some higher-level locking when you do this, to
make sure that the persistent pages don't magically get re-assigned.
We're presumably talking about having a filesystem in that persistent
memory, so we cannot be doing IO to the pages (from some other source -
whether RDMA or some special zero-copy model) while the underlying
filesystem is reassigning the storage because somebody deleted the file.

IOW, there had better be other external rules about when - and how long
- you can use a particular persistent page.  No?  So the whole when/how
to allocate the temporary 'struct page' is just another detail in that
whole thing.

And yes, some uses may not ever actually see that.  If the whole of
persistent memory is just assigned to a database or something, and the
DB just wants to do a "flush this range of persistent memory to
long-term disk storage", then there may not be much of a lifetime issue
for the persistent memory.  But even then you're going to have IO
completion callbacks etc to let the DB know that it has hit the disk,
so..

What is the primary thing that is driving this need?  Do we have a very
concrete example?

                Linus